While I was making an RSS reader/formatter for tclhttpd for dynamic page generation of news feeds, I bumped into some nasty problems with encodings and thought I'd share.
My 'fetch' routine is the following:
 # returns the DOM object of the RSS feed.
 proc fetchXML {uri {recurse_limit 4}} {
     # Lie like a senator for google to stop giving me a 401..
     http::config -useragent "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5"
     set token [http::geturl $uri]
     upvar #0 $token state
     if {[http::status $token] ne "ok" || [http::ncode $token] != 200} {
         # was the error a redirect?  If so, follow it..
         if {[http::ncode $token] == 302 && [incr recurse_limit -1] > 0} {
             array set meta $state(meta)
             set result [fetchXML $meta(Location) $recurse_limit]
             http::cleanup $token
             return $result
         }
         set err [http::code $token]
         http::cleanup $token
         return -code error $err
     }
     set xml [http::data $token]
     array set meta $state(meta)
     http::cleanup $token
     # Do we need to do an encoding translation or was it already done?
     # -=[SEE NOTE BELOW]=-
     if {[info exists meta(Content-Type)] && \
             [regexp -nocase {charset\s*=\s*(\S+)} $meta(Content-Type)]} {
         # socket channel encodings already performed!  No additional work needed.
     } else {
         # manually perform encoding translations using XML header info.
         set xml [encoding convertfrom [tmlrss::getXmlEncoding $xml] $xml]
     }
     return [dom parse -baseurl [uriBase $uri] $xml]
 }
The http package that comes with the core is very full featured, and when the Content-Type header contains a charset declaration, the stream data is converted for us. But sometimes there won't be one.
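To see for yourself which case you're in, you can inspect the Content-Type header from the response metadata; a minimal sketch (the feed URL is just a placeholder):

```tcl
package require http

# Fetch a page and report which charset, if any, the server declared.
# When a charset is present, http::geturl has already translated the
# body for us; when it's absent, we have to sniff the XML declaration.
set token [http::geturl "http://example.com/feed.rss"]
upvar #0 $token state
array set meta $state(meta)
if {[info exists meta(Content-Type)] && \
        [regexp -nocase {charset\s*=\s*([^\s;]+)} $meta(Content-Type) -> charset]} {
    puts "server declared charset: $charset"
} else {
    puts "no charset declared; must read the XML declaration instead"
}
http::cleanup $token
```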
The [dom parse] command for tDOM doesn't read the XML document header for the charset either, so I made the following, based on tDOM's [tDOM::xmlOpenFile] found in the package's tdom.tcl:
 # The XML parser doesn't even read the document header that contains the
 # encoding declaration!  So.. do it ourselves, then..
 #
 proc getXmlEncoding {xml} {
     # The autodetection of the encoding follows
     # XML Recommendation, Appendix F
     if {![binary scan [string range $xml 0 3] "H8" firstBytes]} {
         # very short (< 4 Bytes) string
         return utf-8
     }
     # If the entity has an XML Declaration, the first four characters
     # must be "<?xm".
     switch $firstBytes {
         "3c3f786d" {
             # UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS,
             # EUC, or any other 7-bit, 8-bit, or mixed-width encoding which
             # ensures that the characters of ASCII have their normal positions,
             # width and values; the actual encoding declaration must be read to
             # detect which of these applies, but since all of these encodings
             # use the same bit patterns for the ASCII characters, the encoding
             # declaration itself can be read reliably.

             # Try to find the end of the XML Declaration
             set closeIndex [string first ">" $xml]
             if {$closeIndex == -1} {
                 error "Weird XML data or not XML data at all"
             }
             set xmlDeclaration [string range $xml 0 [expr {$closeIndex + 5}]]
             # extract the encoding information
             set pattern {^[^>]+encoding=[\x20\x9\xd\xa]*["']([^ "']+)['"]}
             # emacs: "
             if {![regexp $pattern $xmlDeclaration - encStr]} {
                 # Probably something like <?xml version="1.0"?>.
                 # Without encoding declaration this must be UTF-8
                 set encoding utf-8
             } else {
                 set encoding [tDOM::IANAEncoding2TclEncoding $encStr]
             }
         }
         "0000003c" - "3c000000" - "00003c00" - "003c0000" {
             # UCS-4
             error "UCS-4 not supported"
         }
         "003c003f" - "3c003f00" {
             # UTF-16, big-endian, no BOM
             # UTF-16, little-endian, no BOM
             set encoding identity
         }
         "4c6fa794" {
             # EBCDIC in some flavor
             error "EBCDIC not supported"
         }
         default {
             # UTF-8 without an encoding declaration
             set encoding utf-8
         }
     }
     return $encoding
 }
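Assuming the proc above is loaded (and tDOM is available for the IANA-to-Tcl encoding name mapping), it can be exercised on a few literal declarations:

```tcl
package require tdom

# Explicit declaration: goes through tDOM::IANAEncoding2TclEncoding,
# so ISO-8859-1 should come back as a Tcl encoding name like iso8859-1.
puts [getXmlEncoding {<?xml version="1.0" encoding="ISO-8859-1"?><rss/>}]

# Declaration without an encoding attribute: UTF-8 per the XML spec.
puts [getXmlEncoding {<?xml version="1.0"?><rss/>}]

# No declaration at all: first four bytes aren't "<?xm", so this
# falls through to the default arm.
puts [getXmlEncoding {<rss/>}]
```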
There, all wacky encoding problems fixed :)