While [DG] was writing an [RSS] reader/formatter for [tclhttpd] to generate news-feed pages dynamically, I bumped into some nasty problems with encodings and thought I'd share. It seems that [tDOM]'s [[dom parse]] command bypasses the string you actually see and hands expat's XML_Parse() routine Tcl's internal string rep. I couldn't for the life of me work out why I was getting garbage characters here and there. Here's an example of a bad character:

    (Desktop) 211 % [dom parse {ä}] asXML
    ��¤

So we have the character ä, which is a member of ISO-8859-1, we are claiming ISO-8859-1 in the declaration, yet we are getting garbage back(!?) If we look at other parts of the core that manipulate encodings, we see that iso8859-1 is a pure pass-thru:

    (Desktop) 144 % proc a {str} {return [encoding convertto iso8859-1 $str]}
    (Desktop) 145 % a [a [a [a [a [a [a [a [a für]]]]]]]]
    für
    (Desktop) 146 % proc b {str} {return [encoding convertfrom iso8859-1 $str]}
    (Desktop) 147 % b [b [b [b [b [b [b [b [b für]]]]]]]]
    für

Since the [encoding] command can't reach a level lower than Tcl's internal rep to fix what its internal rep would be ('''is this level -1?'''), my fix had to go into tDOM's C code. Any Tcl command that moves strings to outside APIs must externalize them with [Tcl_UtfToExternalDString]. Email me for the patch if you want it. I added an '-externalizestring' option to the parse command to get a loss-less mode. After I did this it all made sense:

    (Desktop) 3 % [dom parse -externalizestring {ä}] asXML
    ä

ä == ä. The string I can see is the string sent to XML_Parse(); it has nothing outside the claimed range of ISO-8859-1, and I get back what I sent it. Perfectly loss-less, gain-less, and neutral.

    (Desktop) 3 % [dom parse -externalizestring [encoding convertto unicode "ä"]] asXML
    ä

Even better: I'm turning the string into 16-bit, claiming 16-bit, and it all works without me having to consider how Tcl manages the bits of a string object internally.
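To make the mismatch concrete, here is a minimal sketch in plain Tcl (no tDOM needed). It assumes the internal rep handed to expat is UTF-8, as it normally is in Tcl 8.1 and later. The variable names are only for illustration.

```tcl
# ä is U+00E4; written as an escape so the script's own file encoding is irrelevant.
set ch \u00e4

# Externalized as ISO-8859-1, the character is the single byte E4.
binary scan [encoding convertto iso8859-1 $ch] H* latin1Hex

# Encoded as UTF-8, the same character is the two bytes C3 A4.
binary scan [encoding convertto utf-8 $ch] H* utf8Hex

# A parser told "this is ISO-8859-1" but fed the UTF-8 bytes sees two characters.
set garbled [encoding convertfrom iso8859-1 [encoding convertto utf-8 $ch]]

puts "iso8859-1 bytes: $latin1Hex"
puts "utf-8 bytes:     $utf8Hex"
puts "misread length:  [string length $garbled]"
```

Under that assumption, one character in becomes two characters out, which is the kind of garbage shown in the first example. Once the string is externalized, the bytes expat receives match what the declaration claims.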
Very [WYSIWYG], as what I see is what XML_Parse gets! Ahh.. the purity of it all :)

My 'fetch' routine is the following:

    # returns the DOM object of the RSS feed.
    proc tmlrss::fetchXML {uri {recurse_limit 4}} {
        # lie like a senator for google to stop giving me a 401..
        http::config -useragent {Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5}
        set token [http::geturl $uri]
        upvar #0 $token state
        if {[http::status $token] ne "ok" || [http::ncode $token] != 200} {
            # was the error a redirect?  If so, do it..
            if {[http::ncode $token] == 302 && [incr recurse_limit -1] > 0} {
                array set meta $state(meta)
                set result [fetchXML $meta(Location) $recurse_limit]
                http::cleanup $token
                return $result
            }
            set err [http::code $token]
            http::cleanup $token
            return -code error $err
        }
        set xml [http::data $token]
        array set meta $state(meta)
        http::cleanup $token

        # Do we need to let encoding conversions happen or was it already done in transit?
        if {[info exists meta(Content-Type)] && \
                [regexp -nocase {charset\s*=\s*(\S+)} $meta(Content-Type)]} {
            # socket channel encodings already performed!  So strip the XML
            # declaration, should it exist.
            return [dom parse -baseurl [uriBase $uri] [stripXmlDecl $xml]]
        } else {
            return [dom parse -externalizestring -baseurl [uriBase $uri] $xml]
        }
    }

The http package that comes with the core is very full featured, and when the Content-Type header contains a charset declaration, the stream data is converted for us. But sometimes there won't be one. As the RFCs state, the server's Content-Type: header overrides any XML declaration, so when we know the http package already did a conversion, we stop the XML parser from converting a second time by removing the declaration.
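The charset check in the fetch routine only tests whether a charset parameter is present. If you also want to know which charset the server named, a minimal sketch could look like this. The helper name `charsetOf` is hypothetical and not part of the http package or of the code above, and the pattern is slightly stricter than the one in the fetch routine so that quotes and trailing parameters are left out.

```tcl
# Hypothetical helper: return the charset named in a Content-Type header
# value (lower-cased), or an empty string when no charset parameter exists.
proc charsetOf {contentType} {
    # Accept an optional opening quote, then stop at whitespace, ';' or '"'.
    if {[regexp -nocase {charset\s*=\s*"?([^\s;"]+)} $contentType -> charset]} {
        return [string tolower $charset]
    }
    return ""
}

puts [charsetOf "text/xml; charset=ISO-8859-1"]
puts [charsetOf {application/xml; charset="utf-8"; foo=bar}]
puts [charsetOf "application/rss+xml"]
```

An empty result corresponds to the else branch of the fetch routine, where the XML declaration is the only statement of the encoding and has to be left in place.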
    proc tmlrss::stripXmlDecl {xml} {
        if {![binary scan [string range $xml 0 3] "H8" firstBytes]} {
            # very short (< 4 Bytes) string
            return $xml
        }

        # If the entity has an XML Declaration, the first four characters
        # must be "<?xm" (hex 3c 3f 78 6d).
        switch $firstBytes {
            "3c3f786d" {
                set closeIndex [string first ">" $xml]
                if {$closeIndex == -1} {
                    error "Weird XML data or not XML data at all"
                }
                set xml [string range $xml [expr {$closeIndex+1}] end]
            }
            default {
                # no declaration.
            }
        }
        return $xml
    }

There, all wacky encoding problems fixed :)

----
[Category XML]