While I was making an [RSS] reader/formatter for [tclhttpd] for dynamic page generation of news feeds, I bumped into some nasty problems with encodings and thought I'd share. My 'fetch' routine is the following:

    # returns the DOM object of the RSS feed.
    proc fetchXML {uri {recurse_limit 4}} {
        # Lie like a senator for google to stop giving me a 401..
        http::config -useragent "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5"
        set token [http::geturl $uri]
        upvar #0 $token state
        if {[http::status $token] ne "ok" || [http::ncode $token] != 200} {
            # was the error a redirect?  If so, follow it..
            if {[http::ncode $token] == 302 && [incr recurse_limit -1] > 0} {
                array set meta $state(meta)
                set result [fetchXML $meta(Location) $recurse_limit]
                http::cleanup $token
                return $result
            }
            set err [http::code $token]
            http::cleanup $token
            return -code error $err
        }
        set xml [http::data $token]
        array set meta $state(meta)
        http::cleanup $token
        # Do we need to do an encoding translation or was it already done?
        # -=[SEE NOTE BELOW]=-
        if {[info exists meta(Content-Type)] && \
                [regexp -nocase {charset\s*=\s*(\S+)} $meta(Content-Type)]} {
            # socket channel encoding translation already performed!
            # No additional work needed.
        } else {
            # manually perform the encoding translation using XML header info.
            set xml [encoding convertfrom [tmlrss::getXmlEncoding $xml] $xml]
        }
        return [dom parse -baseurl [uriBase $uri] $xml]
    }

The http package that comes with the core is very full featured, and when the Content-Type header contains a charset declaration, the stream data is converted for us. But sometimes there won't be one. The [[dom parse]] command of [tDOM] doesn't read the XML document header for the charset either, so I made the following, based on tDOM's [[tDOM::xmlOpenFile]] found in the package's tdom.tcl:

    # The XML parser doesn't even read the document header that contains
    # the encoding declaration!  So.. do it ourselves, then..
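As a side note on that charset branch: fetchXML only uses the regexp as a yes/no test, but the same pattern can also capture the charset name from the Content-Type meta header if you ever need it. A quick sketch you can try in a tclsh (the header value here is made up for illustration):

```tcl
# A typical Content-Type header value, as the http package would
# deliver it in the meta array (hypothetical example value):
set contentType "text/xml; charset=ISO-8859-1"

# Same pattern as in fetchXML, but with the capture variable bound:
if {[regexp -nocase {charset\s*=\s*(\S+)} $contentType -> charset]} {
    puts $charset    ;# prints: ISO-8859-1
}
```

When the capture succeeds, the http package has already used that charset to convert the body on the socket channel, which is exactly why fetchXML skips the manual conversion in that case.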
    proc getXmlEncoding {xml} {
        # The autodetection of the encoding follows
        # XML Recommendation, Appendix F
        if {![binary scan [string range $xml 0 3] "H8" firstBytes]} {
            # very short (< 4 Bytes) string
            return utf-8
        }
        # If the entity has an XML declaration, the first four characters
        # must be "<?xm".
        switch $firstBytes {
            "3c3f786d" {
                # UTF-8, ASCII, or any other encoding in which the ASCII
                # characters have their normal positions and values; the
                # encoding declaration itself must be read to detect which.
                set closeIndex [string first "?>" $xml]
                if {$closeIndex == -1} {
                    error "Weird XML data or not XML data at all"
                }
                set xmlDeclaration [string range $xml 0 [expr {$closeIndex + 5}]]
                # extract the encoding information
                set pattern {^[^>]+encoding=[\x20\x9\xd\xa]*["']([^ "']+)['"]}
                # emacs: "
                if {![regexp $pattern $xmlDeclaration - encStr]} {
                    # Probably something like <?xml version="1.0"?>.
                    # Without encoding declaration this must be UTF-8
                    set encoding utf-8
                } else {
                    set encoding [tDOM::IANAEncoding2TclEncoding $encStr]
                }
            }
            "0000003c" -
            "00003c00" -
            "003c0000" -
            "3c000000" {
                # UCS-4
                error "UCS-4 not supported"
            }
            "003c003f" -
            "3c003f00" {
                # UTF-16, big-endian, no BOM
                # UTF-16, little-endian, no BOM
                set encoding identity
            }
            "4c6fa794" {
                # EBCDIC in some flavor
                error "EBCDIC not supported"
            }
            default {
                # UTF-8 without an encoding declaration
                set encoding identity
            }
        }
        return $encoding
    }

There, all wacky encoding problems fixed :)
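To see what the four-byte sniff actually keys on, you can run the same [[binary scan]] by hand in a tclsh. Any document that begins with an XML declaration starts with the characters "<?xm", whose hex digits are exactly the "3c3f786d" branch of the switch:

```tcl
# Sniff the first four bytes of a document, just as getXmlEncoding does.
binary scan [string range {<?xml version="1.0"?><rss/>} 0 3] "H8" firstBytes
puts $firstBytes    ;# prints: 3c3f786d  (the hex codes for "<?xm")
```

The other branch values come about the same way: the same four characters laid out in 16-bit or 32-bit code units produce the UTF-16 and UCS-4 patterns, and "4c6fa794" is "<?xm" in EBCDIC.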