Version 10 of XML/tDOM encoding issues with the http package

Updated 2006-01-25 23:35:55

While making an RSS reader/formatter for tclhttpd for dynamic page generation of news feeds, I (DG) bumped into some nasty problems with encodings and thought I'd share.

It seems that tDOM's [dom parse] command bypasses the string you actually see and hands expat's XML_Parse() routine the bytes of Tcl's internal string rep (UTF-8) instead. I couldn't for the life of me work out why I was getting garbage characters here and there. Here's an example of a bad character:

 (Desktop) 211 % [dom parse {<?xml version="1.0" encoding="ISO-8859-1"?><foo>ä</foo>}] asXML
 <foo>Ã¤</foo>

So we have the character ä, which is a member of ISO-8859-1, and we are claiming ISO-8859-1 in the declaration, yet we are getting garbage back(!?) If we look at other parts of the core that manipulate encodings, we see that iso8859-1 is a pure pass-thru for such characters:

 (Desktop) 144 % proc a {str} {return [encoding convertto iso8859-1 $str]}
 (Desktop) 145 % a [a [a [a [a [a [a [a [a für]]]]]]]]
 für
 (Desktop) 146 % proc b {str} {return [encoding convertfrom iso8859-1 $str]}
 (Desktop) 147 % b [b [b [b [b [b [b [b [b für]]]]]]]]
 für

It isn't possible to use the encoding command to get at a level lower than Tcl's internal rep (is this level -1?) and correct what gets handed over, so my fix had to end up in tDOM's C code. Any Tcl command that passes strings to outside APIs must externalize them with Tcl_UtfToExternalDString.
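The garbage can be reproduced in pure Tcl, without tDOM. This is only a rough sketch of what I believe happens: it assumes the parser receives the UTF-8 bytes of the string and decodes them as ISO-8859-1 because the declaration says so. The variable names are mine, for illustration.

```tcl
# Simulate the mismatch: take the UTF-8 bytes of the string and
# decode them as if they were ISO-8859-1.
set original  "\u00e4"                                    ;# the character ä
set utf8Bytes [encoding convertto utf-8 $original]        ;# two bytes: C3 A4
set misread   [encoding convertfrom iso8859-1 $utf8Bytes] ;# two characters: U+00C3 U+00A4

binary scan $utf8Bytes H* hex
puts "bytes handed over: $hex"
puts "characters after decoding as ISO-8859-1: [string length $misread]"
```

One character goes in, two come out, which matches the kind of garbage shown above.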

Email me for the patch if you want it. I added an '-externalizestring' option to the parse command to get a lossless mode. After I did this it all made sense:

 (Desktop) 3 % [dom parse -externalizestring {<?xml version="1.0" encoding="ISO-8859-1"?><foo>ä</foo>}] asXML
 <foo>ä</foo>

ä == ä. The string I can see is the string sent to XML_Parse(), it has nothing outside the claimed range of ISO-8859-1, and I get back what I sent it. Perfectly lossless, gainless, and neutral.

 (Desktop) 3 % [dom parse -externalizestring [encoding convertto unicode "<?xml version=\"1.0\" encoding=\"UTF-16\"?><foo>ä</foo>"]] asXML
 <foo>ä</foo>

Even better: I'm turning the string into 16-bit, claiming 16-bit, and it all works. Very WYSIWYG.

My 'fetch' routine is the following:

 # returns the DOM object of the RSS feed.
 proc tmlrss::fetchXML {uri {recurse_limit 4}} {

    # lie like a senator for google to stop giving me a 401..
    http::config -useragent {Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5}

    set token [http::geturl $uri]
    upvar #0 $token state
    if {[http::status $token] ne "ok" || [http::ncode $token] != 200} {
        # was the error a redirect?  If so, do it..
        if {[http::ncode $token] == 302 && [incr recurse_limit -1] > 0} {
            array set meta $state(meta)
            set result [fetchXML $meta(Location) $recurse_limit]
            http::cleanup $token
            return $result
        }
        set err [http::code $token]
        http::cleanup $token
        return -code error $err
    }
    set xml [http::data $token]
    array set meta $state(meta)
    http::cleanup $token

    # Do we need to let encoding conversions happen or was it already done in transit?
    if {[info exists meta(Content-Type)] && \
            [regexp -nocase {charset\s*=\s*(\S+)} $meta(Content-Type)]} {
        # socket channel encodings already performed!  So strip the XML
        # declaration, should it exist.
        return [dom parse -baseurl [uriBase $uri] [stripXmlDecl $xml]]
    } else {
        return [dom parse -externalizestring -baseurl [uriBase $uri] $xml]
    }
 }
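The Content-Type test in the routine above only asks whether a charset is present. If you also want to know which one, something like the following might do. It is a minimal sketch: charsetOf is a hypothetical helper of mine, not part of the http package, and it assumes the charset value is a single token, optionally in double quotes.

```tcl
# Hypothetical helper: return the charset named in a Content-Type
# header value, lower-cased, or an empty string if there is none.
proc charsetOf {contentType} {
    # Stop at whitespace, ';' or '"' so trailing parameters are not captured.
    if {[regexp -nocase {charset\s*=\s*"?([^\s;"]+)} $contentType -> charset]} {
        return [string tolower $charset]
    }
    return ""
}

puts [charsetOf "text/xml; charset=ISO-8859-1"]
puts [charsetOf "application/rss+xml"]
```

An empty result is the case where fetchXML falls through to the -externalizestring branch.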

The http package that comes with the core is very full featured: when the Content-Type header contains a charset declaration, the stream data is converted for us. Sometimes there won't be one, though. As the RFCs state, the server's Content-Type: charset overrides any XML declaration, so when we know the http package already did a conversion, we stop the XML parser from doing it a second time by removing the declaration.

 proc tmlrss::stripXmlDecl {xml} {
    if {![binary scan [string range $xml 0 3] "H8" firstBytes]} {
        # very short (< 4 Bytes) string
        return $xml
    }

    # If the entity has an XML Declaration, the first four characters
    # must be "<?xm".
    switch $firstBytes {
        "3c3f786d" {
            # Try to find the end of the XML Declaration
            set closeIndex [string first ">" $xml]
            if {$closeIndex == -1} {
                error "Weird XML data or not XML data at all"
            }
            set xml [string range $xml [expr {$closeIndex+1}] end]
        }
        default {
            # no declaration.
        }

    }
    return $xml
 }
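For comparison, here is a shorter regexp-based take on the same idea. It is only a sketch under my own assumptions, not a drop-in replacement: stripXmlDeclRe is a hypothetical name, and unlike the routine above it requires whitespace after "<?xml", so a leading processing instruction such as <?xml-stylesheet ...?> is left alone.

```tcl
# Hypothetical regexp-based variant: remove an XML declaration only
# when it sits at the very start of the string.
proc stripXmlDeclRe {xml} {
    # "<?xml", one whitespace character, then everything up to the first "?>".
    regsub {^<\?xml\s[^>]*\?>} $xml {} xml
    return $xml
}

puts [stripXmlDeclRe {<?xml version="1.0" encoding="ISO-8859-1"?><foo/>}]
puts [stripXmlDeclRe {<foo/>}]
```

Both calls print <foo/>. Like the original, this does nothing about a byte order mark or whitespace in front of the declaration.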

There, all wacky encoding problems fixed :)


Category XML