DG: While making an RSS reader/formatter for tclhttpd to generate news-feed pages dynamically, I bumped into some nasty problems with encodings and thought I'd share.
It seems that tDOM's [dom parse] command bypasses the string you actually see and hands expat's XML_Parse() routine Tcl's internal string rep (UTF-8) instead. For the life of me, I couldn't work out why I was getting garbage characters here and there. Here's an example of a bad character:
    (Desktop) 211 % [dom parse {<?xml version="1.0" encoding="ISO-8859-1"?><foo>ä</foo>}] asXML
    <foo>Ã¤</foo>
So we have the character ä, which is a member of ISO-8859-1, and we are claiming ISO-8859-1 in the declaration, yet we get garbage back(!?). If we look at other parts of the core that manipulate encodings, we see that iso8859-1 is a pure pass-through:
    (Desktop) 144 % proc a {str} {return [encoding convertto iso8859-1 $str]}
    (Desktop) 145 % a [a [a [a [a [a [a [a [a für]]]]]]]]
    für
    (Desktop) 146 % proc b {str} {return [encoding convertfrom iso8859-1 $str]}
    (Desktop) 147 % b [b [b [b [b [b [b [b [b für]]]]]]]]
    für
Since the encoding command can't reach a level below Tcl's internal rep to correct what gets handed over (is this level -1?), my fix had to end up in tDOM's C code. Any Tcl command that moves strings to outside APIs must externalize them with Tcl_UtfToExternalDString.
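The mismatch can be reproduced in plain Tcl, without tDOM. The sketch below assumes, as described above, that the parser received the UTF-8 bytes of Tcl's internal rep while being told they were ISO-8859-1. The variable names are only illustrative.

```tcl
# \u00e4 is "ä"; the escape keeps the script independent of the
# encoding of the source file.
set ch \u00e4

# Bytes as Tcl's internal (UTF-8) rep would expose them.
binary scan [encoding convertto utf-8 $ch] H* internalHex

# Bytes after externalizing to the encoding the XML declaration claims.
binary scan [encoding convertto iso8859-1 $ch] H* externalHex

# Reading the UTF-8 bytes as if they were ISO-8859-1 yields two characters.
set misread [encoding convertfrom iso8859-1 [encoding convertto utf-8 $ch]]

puts "internal bytes: $internalHex"              ;# c3a4
puts "external bytes: $externalHex"              ;# e4
puts "misread length: [string length $misread]"  ;# 2
```

One character goes in, two bytes are handed over, and a parser that trusts the ISO-8859-1 declaration reads them as two separate characters.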
Email me for the patch if you want it. I added an '-externalizestring' option to the parse command to get a loss-less mode. After I did this, it all made sense:
    (Desktop) 3 % [dom parse -externalizestring {<?xml version="1.0" encoding="ISO-8859-1"?><foo>ä</foo>}] asXML
    <foo>ä</foo>
ä == ä. The string I can see is the string sent to XML_Parse(), has nothing outside the claimed range of ISO-8859-1 and I get back what I sent it. Perfectly loss-less, gain-less, and neutral.
    (Desktop) 3 % [dom parse -externalizestring [encoding convertto unicode "<?xml version=\"1.0\" encoding=\"UTF-16\"?><foo>ä</foo>"]] asXML
    <foo>ä</foo>
Even better, I see I'm turning the string into 16-bit, claiming 16-bit, and it all works. Very WYSIWYG.
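To see what the 16-bit form looks like on the wire, the bytes produced by [encoding convertto unicode] can be dumped as hex. This is only a sketch: it assumes the 'unicode' encoding writes 16-bit units in the machine's native byte order, so the result differs between little-endian and big-endian platforms.

```tcl
# Encode the first two characters of an XML declaration and dump the bytes.
set bytes [encoding convertto unicode "<?"]
binary scan $bytes H* hex

# 3c003f00 on a little-endian machine, 003c003f on a big-endian one.
puts "$::tcl_platform(byteOrder): $hex"
```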
My 'fetch' routine is the following:
    # returns the DOM object of the RSS feed.
    proc tmlrss::fetchXML {uri {recurse_limit 4}} {
        # lie like a senator for google to stop giving me a 401..
        http::config -useragent {Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5}
        set token [http::geturl $uri]
        upvar #0 $token state
        if {[http::status $token] != "ok" || [http::ncode $token] != 200} {
            # was the error a redirect?  If so, do it..
            if {[http::ncode $token] == 302 && [incr recurse_limit -1] > 0} {
                array set meta $state(meta)
                set result [fetchXML $meta(Location) $recurse_limit]
                http::cleanup $token
                return $result
            }
            set err [http::code $token]
            http::cleanup $token
            return -code error $err
        }
        set xml [http::data $token]
        array set meta $state(meta)
        http::cleanup $token
        # Do we need to let encoding conversions happen or was it already done in transit?
        if {[info exist meta(Content-Type)] && \
                [regexp -nocase {charset\s*=\s*(\S+)} $meta(Content-Type)]} {
            # socket channel encodings already performed!  So strip the XML
            # declaration, should it exist.
            return [dom parse -baseurl [uriBase $uri] [stripXmlDecl $xml]]
        } else {
            return [dom parse -externalizestring -baseurl [uriBase $uri] $xml]
        }
    }
The http package that comes with the core is very full-featured: when the Content-Type header contains a charset declaration, the stream data is converted for us. Sometimes, though, there won't be one. As the RFCs state, the server's Content-Type charset overrides any XML declaration, so when we know the http package already did a conversion, we keep the XML parser from doing it a second time by removing the declaration.
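The charset check can be tried on its own, away from the network. The helper below is a hypothetical sketch (charsetOf is not part of the http package or of the fetch routine) that pulls the charset name out of a Content-Type value, so the "was a conversion already done?" decision can be seen in isolation.

```tcl
# Hypothetical helper: return the charset named in a Content-Type value,
# lower-cased, or an empty string when none is given.
proc charsetOf {contentType} {
    # Allow an optional opening quote, then take everything up to
    # whitespace, a semicolon, or a closing quote.
    if {[regexp -nocase {charset\s*=\s*"?([^\s;"]+)} $contentType -> charset]} {
        return [string tolower $charset]
    }
    return ""
}

puts [charsetOf {text/xml; charset=ISO-8859-1}]   ;# iso-8859-1
puts [charsetOf {application/rss+xml}]            ;# prints an empty line
```

An empty result corresponds to the branch where no charset was sent and the XML declaration has to be trusted.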
    proc tmlrss::stripXmlDecl {xml} {
        if {![binary scan [string range $xml 0 3] "H8" firstBytes]} {
            # very short (< 4 Bytes) string
            return $xml
        }
        # If the entity has an XML Declaration, the first four characters
        # must be "<?xm".
        switch $firstBytes {
            "3c3f786d" {
                # Try to find the end of the XML Declaration
                set closeIndex [string first ">" $xml]
                if {$closeIndex == -1} {
                    error "Weird XML data or not XML data at all"
                }
                set xml [string range $xml [expr {$closeIndex+1}] end]
            }
            default {
                # no declaration.
            }
        }
        return $xml
    }
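For comparison, the same stripping can be sketched with a single regsub. This is an alternative technique, not the routine used here, and the name stripXmlDeclRe is made up for the example.

```tcl
# Hypothetical alternative to stripXmlDecl: drop a leading XML declaration,
# if there is one, and return everything else unchanged.
proc stripXmlDeclRe {xml} {
    # "<?xml" must be followed by whitespace, so a leading
    # "<?xml-stylesheet ...?>" processing instruction is left alone.
    return [regsub {^<\?xml\s[^>]*\?>} $xml {}]
}

# prints <foo>bar</foo>
puts [stripXmlDeclRe {<?xml version="1.0" encoding="ISO-8859-1"?><foo>bar</foo>}]
```

The four-character test on "<?xm" would, as far as I can tell, also match a document that starts with an xml-stylesheet processing instruction. Requiring whitespace after "<?xml" is one way to rule that out.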
There, all wacky encoding problems fixed :)