Version 2 of XML/tDOM encoding issues with the http package

Updated 2006-01-24 10:36:00

While I was making an RSS reader/formatter for tclhttpd for dynamic page generation of news feeds, I bumped into some nasty problems with encodings and thought I'd share.

My 'fetch' routine is the following:

 # returns the DOM object of the RSS feed.
 proc fetchXML {uri {recurse_limit 4}} {

    # Lie like a senator for google to stop giving me a 401..
    http::config -useragent "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5"

    set token [http::geturl $uri]
    upvar #0 $token state

    if {[http::status $token] != "ok" || [http::ncode $token] != 200} {
        # was the error a redirect?  If so, do it..
        if {[http::ncode $token] == 302 && [incr recurse_limit -1] > 0} {
            array set meta $state(meta)
            set result [fetchXML $meta(Location) $recurse_limit]
            http::cleanup $token
            return $result
        }
        set err [http::code $token]
        http::cleanup $token
        return -code error $err
    }
    set xml [http::data $token]
    array set meta $state(meta)
    http::cleanup $token

    # Do we need to do an encoding translation or was it already done?
    # -=[SEE NOTE BELOW]=-
    if {[info exist meta(Content-Type)] && \
            [regexp -nocase {charset\s*=\s*(\S+)} $meta(Content-Type)]} {
        # socket channel encodings already performed! No additional work needed.
    } else {
        # manually perform encoding translations using XML header info.
        set xml [encoding convertfrom [tmlrss::getXmlEncoding $xml] $xml]
    }

    return [dom parse -baseurl [uriBase $uri] $xml]
 }

The http package that comes with the core is very full featured, and when the Content-Type header contains a charset declaration, the stream data is converted for us. But sometimes, there won't be one, too.

The [dom parse] command for tDOM doesn't read the XML document header for charset either, so I made the following based on tDOM's [tDOM::xmlOpenFile] found in the package's tdom.tcl

 # The XML parser doesn't even read the document header that contains the encoding
 # declaration!  So..  do it ourselves, then..
 #
 proc getXmlEncoding {xml} {

    # The autodetection of the encoding follows
    # XML Recomendation, Appendix F

    if {![binary scan [string range $xml 0 3] "H8" firstBytes]} {
        # very short (< 4 Bytes) string
        return utf-8
    }

    # If the entity has a XML Declaration, the first four characters
    # must be "<?xm".
    switch $firstBytes {
        "3c3f786d" {
            # UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS,
            # EUC, or any other 7-bit, 8-bit, or mixed-width encoding which 
            # ensures that the characters of ASCII have their normal positions,
            # width and values; the actual encoding declaration must be read to
            # detect which of these applies, but since all of these encodings
            # use the same bit patterns for the ASCII characters, the encoding
            # declaration itself be read reliably.

            # Try to find the end of the XML Declaration
            set closeIndex [string first ">" $xml]
            if {$closeIndex == -1} {
                error "Weird XML data or not XML data at all"
            }

            set xmlDeclaration [string range $xml 0 [expr {$closeIndex + 5}]]
            # extract the encoding information
            set pattern {^[^>]+encoding=[\x20\x9\xd\xa]*["']([^ "']+)['"]}
            # emacs: "
            if {![regexp $pattern $xmlDeclaration - encStr]} {
                # Probably something like <?xml version="1.0"?>. 
                # Without encoding declaration this must be UTF-8
                set encoding utf-8
            } else {
                set encoding [tDOM::IANAEncoding2TclEncoding $encStr]
            }
        }
        "0000003c" -
        "0000003c" -
        "3c000000" -
        "00003c00" {
            # UCS-4
            error "UCS-4 not supported"
        }
        "003c003f" -
        "3c003f00" {
            # UTF-16, big-endian, no BOM
            # UTF-16, little-endian, no BOM
            set encoding identity
        }
        "4c6fa794" {
            # EBCDIC in some flavor
            error "EBCDIC not supported"
        }
        default {
            # UTF-8 without an encoding declaration
            set encoding identity
        }
    }
    return $encoding
 }

There, all wacky encoding problems fixed :)