url-encoding

proc init {} {
    variable map
    variable alphanumeric a-zA-Z0-9
    for {set i 0} {$i <= 256} {incr i} { 
        set c [format %c $i]
        if {![string match \[$alphanumeric\] $c]} {
            set map($c) %[format %.2x $i]
        }
    }
    # These are handled specially
    array set map { " " + \n %0d%0a }
}
init

proc url-encode {string} {
    variable map
    variable alphanumeric

    # The spec says: "non-alphanumeric characters are replaced by '%HH'"
    # 1 leave alphanumerics characters alone
    # 2 Convert every other character to an array lookup
    # 3 Escape constructs that are "special" to the tcl parser
    # 4 "subst" the result, doing all the array substitutions

    regsub -all \[^$alphanumeric\] $string {$map(&)} string
    # This quotes cases like $map([) or $map($) => $map(\[) ...
    regsub -all {[][{})\\]\)} $string {\\&} string
    return [subst -nocommand $string]
}

proc url-decode str {
    # rewrite "+" back to space
    # protect \ from quoting another '\'
    set str [string map [list + { } "\\" "\\\\"] $str]

    # prepare to process all %-escapes
    regsub -all -- {%([A-Fa-f0-9][A-Fa-f0-9])} $str {\\u00\1} str

    # process \u unicode mapped chars
    return [subst -novar -nocommand $str]
}

This is almost exactly source taken from the implementations of http (except http has moved to a C implementation? for good?) and ncgi (and should ncgi re-use http's command?).


samoc 20140603: The following tweak adds support for UTF-8:

Source code here...

proc utf8 {hex} {
    set hex [string map {% {}} $hex]
    encoding convertfrom utf-8 [binary decode hex $hex]
}

proc url-decode str {
    # rewrite "+" back to space
    # protect \ from quoting another '\'
    set str [string map [list + { } "\\" "\\\\"] $str]

    # Replace UTF-8 sequences with calls to the utf8 decode proc...
    regsub -all {(%[0-9A-Fa-f0-9]{2})+} $str {[utf8 \0]} str
    
    return [subst -novar -noback $str]
}

31th August 2012 he The same as a namespace ensemble. This provides us some online help for free: "unknown or ambiguous subcommand "edecode": must be decode, or encode".

namespace eval url {
        variable map
        variable alphanumeric a-zA-Z0-9._~-
        namespace export encode decode
        namespace ensemble create
}
proc url::init {} {
        variable map
        variable alphanumeric a-zA-Z0-9._~-

        for {set i 0} {$i <= 256} {incr i} { 
                set c [format %c $i]
                if {![string match \[$alphanumeric\] $c]} {
                        set map($c) %[format %.2x $i]
                }
        }
        # These are handled specially
        array set map { " " + \n %0d%0a }
}
url::init
proc url::encode {str} {
        variable map
        variable alphanumeric

        # The spec says: "non-alphanumeric characters are replaced by '%HH'"
        # 1 leave alphanumerics characters alone
        # 2 Convert every other character to an array lookup
        # 3 Escape constructs that are "special" to the tcl parser
        # 4 "subst" the result, doing all the array substitutions

        regsub -all \[^$alphanumeric\] $str {$map(&)} str
        # This quotes cases like $map([) or $map($) => $map(\[) ...
        regsub -all {[][{})\\]\)} $str {\\&} str
        return [subst -nocommand $str]
}
proc url::decode {str} {
        # rewrite "+" back to space
        # protect \ from quoting another '\'
        set str [string map [list + { } "\\" "\\\\"] $str]

        # prepare to process all %-escapes
        regsub -all -- {%([A-Fa-f0-9][A-Fa-f0-9])} $str {\\u00\1} str

        # process \u unicode mapped chars
        return [subst -novar -nocommand $str]
}

18may05 jcw - With 8.4, this ought to do the same:

proc ue_init {} {
   lappend d + { }
   for {set i 0} {$i < 256} {incr i} {
      set c [format %c $i]
      set x %[format %02x $i]
      if {![string match {[a-zA-Z0-9]} $c]} {
         lappend e $c $x
         lappend d $x $c
      }
   }
   set ::ue_map $e
   set ::ud_map $d
}
ue_init
proc ue {s} { string map $::ue_map $s }
proc ud {s} { string map $::ud_map $s }

puts [ue "wiki.tcl.tk/is fun!"]
puts [ud [ue "wiki.tcl.tk/is fun!"]]
puts [ue "a space and a \n new line :)"]
puts [ud [ue "a space and a \n new line :)"]]
puts [ud "1+1=2"]

[Certain? [ue] appears to me to map ' '->'%20', while [url-encode] sends ' '->'+'.] [Let me add, though, that I very much appreciate the elegance and flexibility of these recodings.]

JM 2/17/2012 - Make sure you string tolower your string if your encoding is uppercase when using the proc ud above...


Lars H: This encodes a string of bytes using printable ASCII, but what can/should be done if one wants to use arbitrary Unicode characters in a string? I suppose that question mostly boils down to "which encoding is used for URLs?" Are there any RFCs or the like that specifies that?

[Yes, so that's one job: find the correct name of this translation ("x-url-encoding"?) and its official specification.]

Lars H: The person who speaks in [brackets] here seems to have misunderstood my point. The x-url-encoding is, as far as I can tell (the fact that almost every occurrence of x-url-encoding that turns up in Google is a Tcl manpage speaks against this being an official name), what is implemented on this page, but my point was rather how to go beyond anglocentric URLs. What if I want to use the string "хлеб" or "борщ" in a web address, then how should I encode it?

Lars H, 8 June 2005: I am now able to partially answer my question. The primary reference on URLs appears to be RFC 3986 [L1 ], and the attitude there mainly seems to be that a Uniform Resource Identifier is a sequence of octets (which is mostly RFC-speak for "bytes") -- some octet values correspond to characters via the US-ASCII encoding, and some of those furthermore have a special role in the URI syntax, but that's mostly for the convenience of the users. The manner in which the octet sequences are chosen, and to what extent they may correspond to arbitrary character strings, is up to each individual scheme (http, ftp, etc.) to define. As far as I can tell no scheme specification makes such a definition!

To me, this feels a lot like an antiquated approach, but it probably was what was already implemented in the major browsers. The older (and now obsolete) RFC 2396 [L2 ] instead employed the more modern philosophy that characters (as in Tcl) are the fundamental units and that these are more than mere bytes, but a specification of how that should work was left for a future amendment (which never came) to define.

As to the "x-url-encoding", it seems the proper term is percent-encoded.

Lars H, 14 June 2005: More discoveries. RFC 3490 [L3 ] describes an encoding (punycode) of internationalised domain names as ASCII. This is supposed to happen at the application level, which means Tcl programs will need to do this explicitly.

APN Google seems to use UTF-8. So encoding the query goes something like Unicode string -> utf-8 stream -> url-encode.


HaO 2012-03-06: The following iterative decoder also detects errors, is slower and does not need a table.

proc expandPercent {data} {
    set pos 0
    while { -1 != [set pos [string first "%" $data $pos]]} {
        set hexNumber "0x[string range $data $pos+1 $pos+2]"
        if { 4 != [string length $hexNumber] || ! [string is integer $hexNumber] } {
            # No two hex character - eventual error treatment here
            # at the moment just leave the percent character
        } else {
            set data [string range $data 0 $pos-1][format %c $hexNumber][string range $data $pos+3 end]
        }
        incr pos
    }
    return $data
}

Ciencia Al Poder 2012-12-10: I made this procedure to encode URLs, leaving legit characters unencoded. It also supports UNICODE (multibyte characters). Hope it's useful. I'm still newbie on tcl so forgive me if I made something really weird here, and feel free to ammend the code if it can be improved.

Note: The function is for encoding query string components: For example, in http://example.org/path/sctipt.php?name1=value1&name2=value2 the script can be used to encode name1, value1, name2 and value2 by passing only those strings. Don't pass the entire URL or the entire query string to the function.

proc UrlEncode {str} {
    # The following line is needed for some unicode characters. If you get double-encoded unicode characters remove it
    set str [encoding convertto utf-8 $str]
    set szRet ""
    # Characters that will be encoded -- everything except those
    # Note that we use "+" at the end to match a range of characters if there are several that need
    # encoding next to each other, instead of going one by one. This should make the script faster ;)
    set arIdxToEncode [regexp -inline -indices -all -- {[^a-zA-Z0-9\-_.: /]+} $str]
    if {[llength $arIdxToEncode]} {
        # Next character that needs to be appended to the resutling string
        set nCurCar 0
        # $arIdxToEncode contains a list of characters that need encoding
        foreach arIdxPairs $arIdxToEncode {
            # First element: start character
            set n [lindex $arIdxPairs 0]
            set nLastCar [lindex $arIdxPairs 1]
            # Append previous text that doesn't need to be encoded
            if {$nCurCar < $n} {
                append szRet [string range $str $nCurCar [expr {$n - 1}]]
            }
            # Encode text matched
            while {$n <= $nLastCar} {
                set nCode [scan [string index $str $n] "%c"]
                #append szRet [format "%x" $nCode]
                set szHexStream [string toupper [format "%x" $nCode]]
                # We need a string of even characters. Prepend 0 if odd
                if {[expr {[string length $szHexStream] % 2}]} {
                    set szHexStream "0$szHexStream"
                }
                # Insert % every 2 characters
                for {set nStreamIdx 0} {$nStreamIdx < [string length $szHexStream]} {set nStreamIdx [expr {$nStreamIdx + 2}]} {
                    append szRet "%" [string range $szHexStream $nStreamIdx [expr {$nStreamIdx + 1}]]
                }
                incr n
                set nCurCar $n
            }
        }
        # Append the text after the last match that doesn't need to be encoded
        if {$nCurCar < [string length $str]} {
            append szRet [string range $str $nCurCar [expr {[string length $str] - 1}]]
        }
    } else {
        # No special characters found
        set szRet $str
    }
    return $szRet
}