While [DG] was making an [RSS] reader/formatter for [tclhttpd] for dynamic page generation of news feeds, I bumped into some nasty problems with encodings and thought I'd share.
It seems that [tDOM]'s [[dom parse]] command subverts the actual string you see
and hands expat's XML_Parse() routine Tcl's internal string rep.
I couldn't for the life of me debug why I was getting garbage characters here and there.
Here's an example of a bad character:
(Desktop) 211 % [dom parse {ä}] asXML
Ã¤
So we have the character ä, which is a member of ISO-8859-1, we are claiming ISO-8859-1
in the declaration, yet we are getting garbage back(!?) If we look at other aspects
of the core that manipulate encodings, we see that iso8859-1 is a pure pass-thru:
(Desktop) 144 % proc a {str} {return [encoding convertto iso8859-1 $str]}
(Desktop) 145 % a [a [a [a [a [a [a [a [a für]]]]]]]]
für
(Desktop) 146 % proc b {str} {return [encoding convertfrom iso8859-1 $str]}
(Desktop) 147 % b [b [b [b [b [b [b [b [b für]]]]]]]]
für
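The garbage is what you would expect if the UTF-8 bytes of ä were read as two ISO-8859-1 characters. The following is only a minimal sketch in pure Tcl, not tDOM's code path: it assumes Tcl's internal rep for this character matches standard UTF-8, and `codepoints` is a hypothetical helper added for display.

```tcl
# \u00e4 is "ä"; the escape keeps this snippet independent of the
# encoding the script file itself is saved in.
set original "\u00e4"

# Encode to UTF-8: one character becomes the two bytes 0xC3 0xA4.
set utf8Bytes [encoding convertto utf-8 $original]

# Mis-decode those bytes as ISO-8859-1: each byte becomes its own character.
set garbled [encoding convertfrom iso8859-1 $utf8Bytes]

# Hypothetical helper: list the code points of a string in hex.
proc codepoints {s} {
    set out {}
    foreach c [split $s ""] {
        lappend out [format %04X [scan $c %c]]
    }
    return $out
}

puts [string length $garbled]   ;# 2
puts [codepoints $garbled]      ;# 00C3 00A4
```

U+00C3 and U+00A4 are Ã and ¤, so one character in has become two characters out.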
Since the [encoding] command can't reach a level lower than Tcl's internal rep
to control what bytes that rep holds ('''is this level -1?'''), my fix for this had to end up
in tDOM's C code. Any Tcl command that hands strings to an outside API must externalize them with [Tcl_UtfToExternalDString].
Email me for the patch if you want it. I added an '-externalizestring' option to the parse command
to get a lossless mode. After I did this it all made sense:
(Desktop) 3 % [dom parse -externalizestring {ä}] asXML
ä
ä == ä. The string I see is the string sent to XML_Parse(); it has nothing outside the claimed range of ISO-8859-1, and I get back what I sent. Perfectly lossless, gainless, and neutral.
(Desktop) 3 % [dom parse -externalizestring [encoding convertto unicode "ä"]] asXML
ä
Even better: I convert the string to 16-bit, declare 16-bit, and it all works without my having to consider how Tcl manages the bits of a string object internally. Very [WYSIWYG], as what I see is what XML_Parse() gets! Ahh.. the purity of it all :)
My 'fetch' routine is the following:
 package require http
 package require tdom

 # returns the DOM object of the RSS feed.
 proc tmlrss::fetchXML {uri {recurse_limit 4}} {
     # lie like a senator for google to stop giving me a 401..
     http::config -useragent {Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5}
     set token [http::geturl $uri]
     upvar #0 $token state
     if {[http::status $token] ne "ok" || [http::ncode $token] != 200} {
         # was the error a redirect?  If so, do it..
         if {[http::ncode $token] == 302 && [incr recurse_limit -1] > 0} {
             array set meta $state(meta)
             set result [fetchXML $meta(Location) $recurse_limit]
             http::cleanup $token
             return $result
         }
         set err [http::code $token]
         http::cleanup $token
         return -code error $err
     }
     set xml [http::data $token]
     array set meta $state(meta)
     http::cleanup $token
     # Do we need to let encoding conversions happen or was it already done in transit?
     if {[info exists meta(Content-Type)] && \
             [regexp -nocase {charset\s*=\s*(\S+)} $meta(Content-Type)]} {
         # socket channel encodings already performed!  So strip the XML
         # declaration, should it exist.
         return [dom parse -baseurl [uriBase $uri] [stripXmlDecl $xml]]
     } else {
         # no charset in transit: let the parser honor the XML declaration.
         return [dom parse -externalizestring -baseurl [uriBase $uri] $xml]
     }
 }
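The Content-Type test at the end of fetchXML can be tried in isolation. This is only a sketch: `charsetOf` is a hypothetical helper, not part of the http package, and its pattern is a slight variation on the one above that also tolerates a quoted value and a trailing semicolon.

```tcl
# Hypothetical helper: return the charset named in a Content-Type
# header value (lower-cased), or the empty string if there is none.
proc charsetOf {contentType} {
    if {[regexp -nocase {charset\s*=\s*"?([^\s;"]+)} $contentType -> charset]} {
        return [string tolower $charset]
    }
    return ""
}

puts [charsetOf {text/xml; charset=ISO-8859-1}]   ;# iso-8859-1
puts [charsetOf {text/xml; charset="utf-8"}]      ;# utf-8
puts [charsetOf {application/rss+xml}]            ;# (empty line)
```

An empty result corresponds to the else branch of fetchXML, where the XML declaration is left in place for the parser.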
The http package that comes with the core is very full featured: when the
Content-Type header carries a charset parameter, the response body is converted for us.
Sometimes there won't be one, though, and then the XML declaration is all we have to go on.
As the RFCs state, the server's Content-Type header overrides any encoding named in the XML declaration.
So when we know the http package already did the conversion, we strip the declaration
to keep the XML parser from converting a second time.
 proc tmlrss::stripXmlDecl {xml} {
     if {![binary scan [string range $xml 0 3] "H8" firstBytes]} {
         # very short (< 4 Bytes) string
         return $xml
     }
     # If the entity has an XML Declaration, the first four characters
     # must be "<?xm" (hex 3c 3f 78 6d).
     switch -- $firstBytes {
         "3c3f786d" {
             # the declaration ends at the first ">"
             set closeIndex [string first ">" $xml]
             if {$closeIndex == -1} {
                 error "Weird XML data or not XML data at all"
             }
             set xml [string range $xml [expr {$closeIndex+1}] end]
         }
         default {
             # no declaration.
         }
     }
     return $xml
 }
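The same stripping step can also be sketched with a single regsub. This is an alternative under stated assumptions, not the routine above: `stripDeclRe` is a hypothetical name, the input is assumed to be an already-decoded Tcl string, and a leading byte order mark is not handled.

```tcl
# Hypothetical alternative: remove a leading <?xml ... ?> declaration
# (plus any whitespace right after it) and leave everything else alone.
proc stripDeclRe {xml} {
    # "\s" after "xml" keeps processing instructions such as
    # <?xml-stylesheet ...?> from being mistaken for the declaration.
    regsub {^<\?xml\s[^>]*\?>\s*} $xml {} xml
    return $xml
}

puts [stripDeclRe {<?xml version="1.0" encoding="ISO-8859-1"?><a>x</a>}]  ;# <a>x</a>
puts [stripDeclRe {<a>x</a>}]                                             ;# <a>x</a>
```

Unlike the version that raises an error on a missing ">", this one silently returns the input unchanged when no well-formed declaration is found.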
There, all wacky encoding problems fixed :)
----
[Category XML]