Version 23 of url-encoding

Updated 2007-12-07 11:03:43 by dkf
proc init {} {
    variable map
    variable alphanumeric a-zA-Z0-9
    # Build a %HH escape for every non-alphanumeric byte value (0..255)
    for {set i 0} {$i < 256} {incr i} {
        set c [format %c $i]
        if {![string match \[$alphanumeric\] $c]} {
            set map($c) %[format %.2x $i]
        }
    }
    # These are handled specially
    array set map { " " + \n %0d%0a }
}
init

proc url-encode {string} {
    variable map
    variable alphanumeric

    # The spec says: "non-alphanumeric characters are replaced by '%HH'"
    # 1 leave alphanumerics characters alone
    # 2 Convert every other character to an array lookup
    # 3 Escape constructs that are "special" to the tcl parser
    # 4 "subst" the result, doing all the array substitutions

    regsub -all \[^$alphanumeric\] $string {$map(&)} string
    # This quotes cases like $map([) or $map($) => $map(\[) ...
    regsub -all {[][{})\\]\)} $string {\\&} string
    return [subst -nocommand $string]
}

proc url-decode str {
    # rewrite "+" back to space
    # protect \ from quoting another '\'
    set str [string map [list + { } "\\" "\\\\"] $str]

    # prepare to process all %-escapes
    regsub -all -- {%([A-Fa-f0-9][A-Fa-f0-9])} $str {\\u00\1} str

    # process \u unicode mapped chars
    return [subst -novar -nocommand $str]
}

This is almost exactly the source found in the implementations of http and ncgi (except that http may have moved to a C implementation; for good?). That raises the question of whether ncgi should re-use http's command.


18may05 jcw - With 8.4, this ought to do the same:

proc ue_init {} {
   lappend d + { }
   for {set i 0} {$i < 256} {incr i} {
      set c [format %c $i]
      set x %[format %02x $i]
      if {![string match {[a-zA-Z0-9]} $c]} {
         lappend e $c $x
         lappend d $x $c
      }
   }
   set ::ue_map $e
   set ::ud_map $d
}
ue_init
proc ue {s} { string map $::ue_map $s }
proc ud {s} { string map $::ud_map $s }

puts [ue "wiki.tcl.tk/is fun!"]
puts [ud [ue "wiki.tcl.tk/is fun!"]]
puts [ue "a space and a \n new line :)"]
puts [ud [ue "a space and a \n new line :)"]]
puts [ud "1+1=2"]

[Certain? [ue] appears to me to map ' '->'%20', while [url-encode] sends ' '->'+'.] [Let me add, though, that I very much appreciate the elegance and flexibility of these recodings.]
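To make the difference concrete, here is a minimal sketch of a [string map] encoder that follows [url-encode]'s conventions instead (space becomes "+", newline becomes "%0d%0a"). The names ue_plus_init, ue_plus and ::ue_plus_map are hypothetical and are not part of jcw's code; the sketch relies only on the fact that [string map] checks keys in list order, so the special cases are placed first.

```tcl
# Hypothetical variant of ue_init: same table-driven idea, but with the
# two special cases that url-encode handles ("space" and "newline").
proc ue_plus_init {} {
    # Special cases go first: string map tries keys in list order
    set e [list " " + "\n" %0d%0a]
    for {set i 0} {$i < 256} {incr i} {
        set c [format %c $i]
        if {![string match {[a-zA-Z0-9]} $c]} {
            lappend e $c %[format %02x $i]
        }
    }
    set ::ue_plus_map $e
}
ue_plus_init
proc ue_plus {s} { string map $::ue_plus_map $s }

puts [ue_plus "is fun!"]   ;# prints is+fun%21
```

The existing [ud] already maps "+" back to a space, so only the encoding side needs this change.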


Lars H: This encodes a string of bytes using printable ASCII, but what can/should be done if one wants to use arbitrary Unicode characters in a string? I suppose that question mostly boils down to "which encoding is used for URLs?" Are there any RFCs or the like that specify that?

[Yes, so that's one job: find the correct name of this translation ("x-url-encoding"?) and its official specification.]

Lars H: The person who speaks in [brackets] here seems to have misunderstood my point. The x-url-encoding is, as far as I can tell, what is implemented on this page (although the fact that almost every occurrence of x-url-encoding that turns up in Google is a Tcl manpage speaks against this being an official name). My point was rather how to go beyond anglocentric URLs. If I want to use the string "хлеб" or "борщ" in a web address, how should I encode it?

Lars H, 8 June 2005: I am now able to partially answer my question. The primary reference on URLs appears to be RFC 3986, and the attitude there mainly seems to be that a Uniform Resource Identifier is a sequence of octets (which is mostly RFC-speak for "bytes") -- some octet values correspond to characters via the US-ASCII encoding, and some of those furthermore have a special role in the URI syntax, but that's mostly for the convenience of the users. The manner in which the octet sequences are chosen, and to what extent they may correspond to arbitrary character strings, is up to each individual scheme (http, ftp, etc.) to define. As far as I can tell no scheme specification makes such a definition!

To me, this feels a lot like an antiquated approach, but it probably was what was already implemented in the major browsers. The older (and now obsolete) RFC 2396 instead employed the more modern philosophy that characters (as in Tcl) are the fundamental units and that these are more than mere bytes, but a specification of how that should work was left for a future amendment (which never came) to define.

As to the "x-url-encoding", it seems the proper term is percent-encoded.
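Since a URI is treated as a sequence of octets, one way to handle a string such as "хлеб" is to pick a byte encoding first and then percent-encode the resulting octets. The sketch below assumes UTF-8, which is a common convention rather than something the discussion above establishes for the http scheme. The proc name pct_encode_utf8 is hypothetical, and like the rest of this page it leaves only alphanumerics unescaped.

```tcl
# Sketch: percent-encode arbitrary Unicode text via its UTF-8 octets.
# ASSUMPTION: UTF-8 is the wanted byte encoding; the receiving server
# must agree on it for the URL to be interpreted as intended.
proc pct_encode_utf8 {s} {
    set out ""
    # After [encoding convertto], every element of the string is one octet
    foreach byte [split [encoding convertto utf-8 $s] ""] {
        if {[string match {[a-zA-Z0-9]} $byte]} {
            append out $byte
        } else {
            scan $byte %c code
            append out [format %%%02X $code]
        }
    }
    return $out
}

# "\u0445\u043b\u0435\u0431" is the string "хлеб" written with escapes,
# so the example does not depend on the encoding of the script file.
puts [pct_encode_utf8 "\u0445\u043b\u0435\u0431"]   ;# prints %D1%85%D0%BB%D0%B5%D0%B1
```

Decoding would be the reverse: turn each %HH back into an octet, then pass the octet string through [encoding convertfrom utf-8].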

Lars H, 14 June 2005: More discoveries. RFC 3490 describes an encoding of internationalised domain names as ASCII. This is supposed to happen at the application level, which means Tcl programs will need to do this explicitly.


[ Category Internet ]