[Richard Suchenwirth] 2002-08-20 - [XML] (the eXtensible Markup Language),
briefly demonstrated as
 <tag attribute="value">text content</tag>
is not the most beautiful or
efficient way to package data, but it's fashionable and "standard".
Often it's better to design a file format in XML (and rely on the
available tools, e.g. [tDOM] and, based on it, [starDOM]) than to design
a proprietary format, and have to write matching parsers, and document the
format, etc., oneself. After all, computer memory and CPU power keep
getting cheaper these years, while brain cells do not...
Another way to represent complex data is as a nested Tcl [list]. If you just
get your braces balanced, Tcl will parse a string into a list and let you
process it - which gets easier with the "multi-dimensional" [lindex] and [lset]
commands since Tcl 8.4. [tDOM] has the ''$node asList'' command which, after
parsing the input XML string into a DOM structure in memory, returns
a well-formed Tcl list with these element properties:
* if the first element is ''#text'', the second is the text content;
* else the first element is the tag, the second the attributes alternating ''name value...'', and the third is a list of the child elements.
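The two cases above can be told apart with a few lines of plain Tcl. The following is only a sketch: the proc name ''describe'' is made up for illustration, and the sample nodes are taken from the test output further down this page.

```tcl
# Hypothetical helper: summarize one node of the nested-list structure.
proc describe {node} {
    # Text node: {#text content}
    if {[lindex $node 0] eq "#text"} {
        return "text: [lindex $node 1]"
    }
    # Element node: {tag {name value ...} {child ...}}
    foreach {tag attrs children} $node break
    set out "element $tag"
    foreach {name value} $attrs {
        append out " $name=$value"
    }
    append out " ([llength $children] children)"
    return $out
}

puts [describe {#text {bar and}}]
puts [describe {grill {x:c d e {f g}} {{baz {x y} {}}}}]
```

This should print ''text: bar and'' and ''element grill x:c=d e=f g (1 children)''.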
Comparing XML input and ''asList'' output, I thought that many
differences can be handled by string manipulations in local context, for
which Tcl offers powerful tools (for instance,
''[string] map, [regexp]'' and ''[regsub]''). So one morning I decided
to try a direct conversion of an XML string to a string that makes
a well-formed list equivalent to the ''asList'' output. The following
code does some XML well-formedness checking with a stack for matching
start/end tags, but apart from that it is of course much weaker than
[tDOM]. Still, it was a nice little evening project, so here goes:
proc xml2list xml {
    regsub -all {>\s*<} [string trim $xml " \n\t<>"] "\} \{" xml
    set xml [string map {> "\} \{#text \{" < "\}\} \{"} $xml]
    set res ""   ;# string to collect the result
    set stack {} ;# track open tags
    set rest {}
    foreach item "{$xml}" {
        switch -regexp -- $item {
            ^# {append res "{[lrange $item 0 end]} " ; #text item}
            ^/ {
                regexp {/(.+)} $item -> tagname ;# end tag
                set expected [lindex $stack end]
                if {$tagname!=$expected} {error "$item != $expected"}
                set stack [lrange $stack 0 end-1]
                append res "\}\} "
            }
            /$ { # singleton - start and end in one <> group
                regexp {([^ ]+)( (.+))?/$} $item -> tagname - rest
                set rest [lrange [string map {= " "} $rest] 0 end]
                append res "{$tagname [list $rest] {}} "
            }
            default {
                set tagname [lindex $item 0] ;# start tag
                set rest [lrange [string map {= " "} $item] 1 end]
                lappend stack $tagname
                append res "\{$tagname [list $rest] \{"
            }
        }
        if {[llength $rest]%2} {error "att's not paired: $rest"}
    }
    if [llength $stack] {error "unresolved: $stack"}
    string map {"\} \}" "\}\}"} [lindex $res 0]
}
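The first two commands of ''xml2list'' do most of the work. Run in isolation on a small made-up input (the sample string and the variable name ''tmp'' are assumptions for this sketch), they show the intermediate string that the ''foreach'' loop then walks over:

```tcl
# Made-up sample input, not from the original page
set xml {<foo a="b">bar<baz/></foo>}

# Step 1: drop the outermost < and >, turn each ">...<" tag boundary
# (with optional whitespace between) into "} {"
regsub -all {>\s*<} [string trim $xml " \n\t<>"] "\} \{" tmp

# Step 2: a remaining ">" starts a text item, a remaining "<" ends one
set tmp [string map {> "\} \{#text \{" < "\}\} \{"} $tmp]

# Wrapped in one more pair of braces, the string parses as a list
# of start tags, text items, singletons and end tags
foreach item "{$tmp}" {puts $item}
```

For this input the loop should see four items: the start tag with its attribute, the text item, the singleton ''baz/'' and the end tag ''/foo''.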
#---- Now that this went so well, I'll throw in the converse:
proc list2xml list {
    switch -- [llength $list] {
        2 {lindex $list 1}
        3 {
            foreach {tag attributes children} $list break
            set res <$tag
            foreach {name value} $attributes {
                append res " $name=\"$value\""
            }
            if [llength $children] {
                append res >
                foreach child $children {
                    append res [list2xml $child]
                }
                append res </$tag>
            } else {append res />}
        }
        default {error "could not parse $list"}
    }
}
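Note that ''list2xml'' emits text and attribute values verbatim. If the data may contain markup characters, they have to be escaped on the way out. A minimal sketch, assuming a helper named ''xmlEscape'' (a made-up name, not part of the code above) that would be applied to text content and attribute values:

```tcl
# Hypothetical helper: replace the characters that are special in XML
# text and double-quoted attribute values by their predefined entities.
proc xmlEscape {s} {
    string map {& &amp; < &lt; > &gt; \" &quot;} $s
}

puts [xmlEscape {a<b & "c"}]
```

Since ''string map'' does not rescan its own replacements, the ampersands it introduces are not escaped a second time.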
#-------------------------------------------- now testing:
package require tdom ;# needed only for the tDOM reference below
set test {<foo a="b">bar and<grill x:c="d" e="f g"><baz x="y"/></grill><room/></foo>}
proc tdomlist x {[[dom parse $x] documentElement root] asList} ;# reference
proc lequal {a b} {
    if {[llength $a] != [llength $b]} {return 0}
    if {[lindex $a 0] == $a} {return [string equal $a $b]}
    foreach i $a j $b {if {![lequal $i $j]} {return 0}}
    return 1
}
proc try x {
    puts [set a [tdomlist $x]]
    puts [set b [xml2list $x]]
    puts list:[lequal $a $b],string:[string equal $a $b]
}
puts [set res [xml2list $test]]
if 0 {
foo {a b} {{#text {bar and}} {grill {x:c d e {f g}} {{baz {x y} {}}}} {room {} {}}}
which is equal to the ''asList'' result. This may not be the most
readable code I ever wrote (it strongly illustrates that unbalanced
braces have to be escaped in strings ;-), but it demonstrates the
mind-boggling power of hopping between the list and the string
representation, which is one of Tcl's unique features.
Having the two converters above, one may start to think about a Tcl list DOM variation, where accesses go via [lindex]/[lset] in the nested list, and finally well-formed XML comes out again... But this smells like more work than fun for an evening ;-)
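To give an idea of what such list-DOM access could look like, here is a small sketch using the test result from above. An index of 2 descends into the child list, so index vectors alternate between 2 and a child position (multiple indices need Tcl 8.4 or later):

```tcl
set dom {foo {a b} {{#text {bar and}} {grill {x:c d e {f g}} {{baz {x y} {}}}} {room {} {}}}}

puts [lindex $dom 2 1 0]       ;# tag of the second child: grill
puts [lindex $dom 2 1 2 0 1]   ;# attributes of its first child: x y

# Modify in place: new text for the first child, new attribute for baz
lset dom 2 0 1 "changed text"
lset dom 2 1 2 0 1 {x z}

puts [lindex $dom 2 0 1]       ;# changed text
puts [lindex $dom 2 1 2 0 1]   ;# x z
```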
----
RS writes:
"the idea that index vectors, as can now be used with [lindex]/[lset], are paths in a tree (which the XML or DOM is) is fascinating - navigating nested lists can go very fast with this." - Here is a little general-purpose depth-first traverser that you can run over a listDOM:
proc forall {varName list body} {
    set $varName $list
    eval $body
    foreach child [lindex $list 2] {
        forall $varName $child $body
    }
} ;# RS
% forall i $a {puts [lrange $i 0 1]}
foo {a b}
#text {bar and}
grill {x:c d e {f g}}
baz {x y}
room {}
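The "index vectors as paths" idea can be made explicit with a variation of the traverser that also reports where each node lives. This is only a sketch; the proc name ''paths'' is made up, and the sample list is the test result from above.

```tcl
# Hypothetical helper: return a list of {indexVector tag} pairs,
# depth-first, for a node and all its descendants.
proc paths {node {path {}}} {
    set res [list [list $path [lindex $node 0]]]
    if {[lindex $node 0] ne "#text"} {
        set i 0
        foreach child [lindex $node 2] {
            # children live at index 2 of their parent
            foreach p [paths $child [concat $path 2 $i]] {lappend res $p}
            incr i
        }
    }
    return $res
}

set dom {foo {a b} {{#text {bar and}} {grill {x:c d e {f g}} {{baz {x y} {}}}} {room {} {}}}}
foreach p [paths $dom] {
    foreach {path tag} $p break
    puts "{$path} $tag"
}
```

Each reported vector can be handed back to [lindex] as a single list argument, e.g. ''lindex $dom {2 1 2 0}'' retrieves the ''baz'' node; the empty vector stands for the root.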
----
05Apr03 [Brian Theado] - See [http://www.cs.sfu.ca/~cameron/REX.html] for a paper describing shallow parsing of XML using only a regular expression. The regular expression is about 30 lines long, but the paper documents it well. The Appendix includes sample implementations in Perl, JavaScript and Flex/Lex. The Appendix also includes an interactive demo (apparently using the JavaScript implementation). The demo helped me understand what they meant by "shallow parsing".
I'm guessing translation from the Perl implementation to a Tcl implementation should be pretty straightforward. ...the next day: Translation to Tcl was straightforward. See [XML Shallow Parsing with Regular Expressions].
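To give a rough feel for what shallow parsing means, here is a much cruder sketch than the expression from the paper. It is not the REX regular expression: it merely splits a string into markup and text tokens, and it would be fooled by comments, CDATA sections or a ">" inside an attribute value. The sample input is made up.

```tcl
# Made-up sample input
set xml {<a x="1">hi<b/></a>}

# Either one <...> markup item, or a run of text up to the next "<"
set tokens [regexp -all -inline {<[^>]*>|[^<]+} $xml]
foreach t $tokens {puts $t}
```

No nesting is checked at this stage; the result is just a flat list of tokens, here four of them.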
----
05Jan10 [George Jempty] I tried to use this routine on a small xml file but got the following error:
list element in quotes followed by "?" instead of space
while executing
"lrange [string map {= " "} $item] 1 end"
(procedure "xml2list" line 25)
invoked from within
"xml2list $file_data"
Input was:
----
%|[Arts and crafts of Tcl-Tk Programming]|[Category Package] |[Category XML] |[Category Parsing]|%