======
proc init {} {
    variable map
    variable alphanumeric a-zA-Z0-9
    for {set i 0} {$i < 256} {incr i} {
        set c [format %c $i]
        if {![string match \[$alphanumeric\] $c]} {
            set map($c) %[format %.2x $i]
        }
    }
    # These are handled specially
    array set map { " " + \n %0d%0a }
}
init

proc url-encode {string} {
    variable map
    variable alphanumeric

    # The spec says: "non-alphanumeric characters are replaced by '%HH'"
    # 1 leave alphanumeric characters alone
    # 2 convert every other character to an array lookup
    # 3 escape constructs that are "special" to the Tcl parser
    # 4 "subst" the result, doing all the array substitutions

    regsub -all \[^$alphanumeric\] $string {$map(&)} string
    # This quotes cases like $map([) or $map($) => $map(\[) ...
    regsub -all {[][{})\\]\)} $string {\\&} string
    return [subst -nocommand $string]
}

proc url-decode str {
    # rewrite "+" back to space
    # protect \ from quoting another '\'
    set str [string map [list + { } "\\" "\\\\"] $str]

    # prepare to process all %-escapes
    regsub -all -- {%([A-Fa-f0-9][A-Fa-f0-9])} $str {\\u00\1} str

    # process \u unicode mapped chars
    return [subst -novar -nocommand $str]
}
======

This is almost exactly the source taken from the implementations of [http] (except that http has moved to a C implementation? for good?) and [ncgi] (and should ncgi re-use http's command?).

----

18may05 [jcw] - With 8.4, this ought to do the same:

======
proc ue_init {} {
    lappend d + { }
    for {set i 0} {$i < 256} {incr i} {
        set c [format %c $i]
        set x %[format %02x $i]
        if {![string match {[a-zA-Z0-9]} $c]} {
            lappend e $c $x
            lappend d $x $c
        }
    }
    set ::ue_map $e
    set ::ud_map $d
}
ue_init

proc ue {s} { string map $::ue_map $s }
proc ud {s} { string map $::ud_map $s }

puts [ue "wiki.tcl.tk/is fun!"]
puts [ud [ue "wiki.tcl.tk/is fun!"]]
puts [ue "a space and a \n new line :)"]
puts [ud [ue "a space and a \n new line :)"]]
puts [ud "1+1=2"]
======

[[Certain?
[[ue]] appears to me to map ' '->'%20', while [[url-encode]] sends ' '->'+'.]]

[[Let me add, though, that I very much appreciate the elegance and flexibility of these recodings.]]

----

[Lars H]: This encodes a string of bytes using printable [ASCII], but what can/should be done if one wants to use arbitrary Unicode characters in a string? I suppose that question mostly boils down to "which [encoding] is used for URLs?" Are there any RFCs or the like that specify that?

[[Yes, so that's one job: find the correct name of this translation ("x-url-encoding"?) and its official specification.]]

[Lars H]: The person who speaks in [[brackets]] here seems to have misunderstood my point. The x-url-encoding is, as far as I can tell, what is implemented on this page (the fact that almost every occurrence of ''x-url-encoding'' that turns up in Google is a Tcl manpage speaks against this being an official name), but my point was rather how to go beyond anglocentric URLs. If I want to use the string "хлеб" or "борщ" in a web address, how should I encode it?

[Lars H], 8 June 2005: I am now able to partially answer my question. The primary reference on URLs appears to be [RFC] 3986 [http://www.faqs.org/rfcs/rfc3986.html], and the attitude there is mainly that a Uniform Resource Identifier is a sequence of ''octets'' (which is mostly RFC-speak for "bytes") -- some octet values correspond to characters via the US-ASCII encoding, and some of those furthermore have a special role in the URI syntax, but that is mostly for the convenience of the users. The manner in which the octet sequences are chosen, and to what extent they may correspond to arbitrary character strings, is up to each individual scheme (http, ftp, etc.) to define. As far as I can tell, no scheme specification makes such a definition! To me this feels like an antiquated approach, but it probably was what had already been implemented in the major browsers.
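As a concrete illustration of the octet-based view: the convention that was later codified (by RFC 3987, for internationalised resource identifiers) is to serialise the string as UTF-8 and percent-encode every resulting octet outside the RFC 3986 "unreserved" set (ALPHA / DIGIT / "-" / "." / "_" / "~"). A sketch only -- the proc name ''url-encode-utf8'' is made up for this illustration and is not part of the code above:

```tcl
# Sketch: percent-encode an arbitrary Unicode string by first
# converting it to UTF-8 octets, then escaping every octet that is
# not an RFC 3986 "unreserved" character.
proc url-encode-utf8 {str} {
    set out ""
    # encoding convertto yields a string whose characters are the raw octets
    foreach octet [split [encoding convertto utf-8 $str] ""] {
        if {[regexp {^[a-zA-Z0-9._~-]$} $octet]} {
            append out $octet
        } else {
            scan $octet %c code
            append out %[format %02X $code]
        }
    }
    return $out
}

puts [url-encode-utf8 "хлеб"]   ;# prints %D1%85%D0%BB%D0%B5%D0%B1
```

Note that, unlike [[url-encode]] above, this maps a space to %20 rather than "+"; the "+" convention belongs to the application/x-www-form-urlencoded format, not to URIs in general.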
The older (and now obsolete) RFC 2396 [http://www.faqs.org/rfcs/rfc2396.html] instead employed the more modern philosophy that characters (as in Tcl) are the fundamental units and that these are more than mere bytes, but a specification of how that should work was left for a future amendment (which never came) to define. As to the "x-url-encoding", it seems the proper term is '''percent-encoded'''.

[Lars H], 14 June 2005: More discoveries. RFC 3490 [http://www.faqs.org/rfcs/rfc3490.html] describes an encoding ([punycode]) of internationalised domain names as ASCII. This is supposed to happen at the application level, which means Tcl programs will need to do this explicitly.

----
!!!!!! %| [Category Internet] |% !!!!!!