[Richard Suchenwirth] 2002-08-20 - XML (the eXtensible Markup Language),
shortly demonstrated as
textcontent
is not the most beautiful or
efficient way to package data, but it's fashionable and "standard".
Often it's better to design a file format in XML (and rely on the
available tools, e.g. [tDOM] and, based on it, [starDOM]) than to design
a proprietary format, and have to write matching parsers, and document the
format, etc., oneself. After all, computer memory and CPU power are
getting ever cheaper these years than brain cells...
Another way to represent complex data is as a nested Tcl [list]. If you just
get your braces balanced, Tcl will parse a string into a list and let you
process it - which gets easier with the "multi-dimensional" [lindex] and [lset]
commands since Tcl 8.4. [tDOM] has the ''$node asList'' command which, after
parsing the input XML string into a DOM structure in memory, returns
a well-formed Tcl list with these element properties:
* if the first element is ''#text'', the second is the text content;
* else the first element is the tag, the second the attributes alternating ''name value...'', and the third is a list of the child elements.
Comparing XML input and ''toList'' output, I thought that many
differences can be handled by string manipulations in local context, for
which Tcl offers powerful tools (for instance,
''[string] map, [regexp]'' and ''[regsub]''). So one morning I decided
to try a direct conversion of an XML string to a string that makes
a well-formed list equivalent to the ''toList'' output. The following
code does some XML well-formedness checking with a stack for matching
start/end tags, but besides it's of course weaker than the power of
[tDOM]. Still, it was a nice little evening project, so here goes:
proc xml2list xml {
regsub -all {>\s*<} [string trim $xml " \n\t<>"] "\} \{" xml
set xml [string map {> "\} \{#text \{" < "\}\} \{"} $xml]
set res "" ;# string to collect the result
set stack {} ;# track open tags
set rest {}
set braces2close 0
foreach item "{$xml}" {
switch -regexp -- $item {
^# {append res "{[lrange $item 0 end]} " ; #text item}
^/ {
regexp {/(.+)} $item -> tagname ;# end tag
set expected [lindex $stack end]
if {$tagname!=$expected} {error "$item != $expected"}
set stack [lrange $stack 0 end-1]
append res "\}\} "
incr braces2close -2
}
/$ { # singleton - start and end in one <> group
regexp {([^ ]+)( (.+))?/$} $item -> tagname - rest
set rest [lrange [string map {= " "} $rest] 0 end]
append res "{$tagname [list $rest] {}} "
}
default {
set tagname [lindex $item 0] ;# start tag
set rest [lrange [string map {= " "} $item] 1 end]
lappend stack $tagname
append res "\{$tagname [list $rest] \{"
incr braces2close 2
}
}
if {[llength $rest]%2} {error "att's not paired: $rest"}
}
if [llength $stack] {error "unresolved: $stack"}
append res [string repeat "\}" $braces2close]
string map {"\} \}" "\}\}"} [lindex $res 0]
}
#---- Now that this went so well, I'll throw in the converse:
proc list2xml list {
switch -- [llength $list] {
2 {lindex $list 1}
3 {
foreach {tag attributes children} $list break
set res <$tag
foreach {name value} $attributes {
append res " $name=\"$value\""
}
if [llength $children] {
append res >
foreach child $children {
append res [list2xml $child]
}
append res $tag>
} else {append res />}
}
default {error "could not parse $list"}
}
}
#-------------------------------------------- now testing:
set test {bar and}
proc tdomlist x {[[dom parse $x] documentElement root] asList} ;# reference
proc lequal {a b} {
if {[llength $a] != [llength $b]} {return 0}
if {[lindex $a 0] == $a} {return [string equal $a $b]}
foreach i $a j $b {if {![lequal $i $j]} {return 0}}
return 1
}
proc try x {
puts [set a [tdomlist $x]]
puts [set b [xml2list $x]]
puts list:[lequal $a $b],string:[string equal $a $b]
}
puts [set res [xml2list $test]]
if 0 {
foo {a b} {{#text {bar and}} {grill {x:c d e {f g}} {} {baz {x y} {}}}} {room {} {}}
which is equal to the ''toList'' result. This may not be the most
readable code I ever wrote (it strongly illustrates that unbalanced
braces have to be escaped in strings ;-), but it demonstrates the
mind-boggling power of hopping between the list and the string
representation, which is one of Tcl's unique features.
Having the two converters above, one may start to think about a Tcl list DOM variation, where accesses go via [lindex]/[lset] in the nested list, and finally well-formed XML comes out again... But this smells like more work than fun for an evening ;-)
----
[Arts and crafts of Tcl-Tk programming]
}