[Richard Suchenwirth] 2002-08-20 - [XML] (the eXtensible Markup Language), shortly demonstrated as textcontent is not the most beautiful or efficient way to package data, but it's fashionable and "standard". Often it's better to design a file format in XML (and rely on the available tools, e.g. [tDOM] and, based on it, [starDOM]) than to design a proprietary format, and have to write matching parsers, and document the format, etc., oneself. After all, computer memory and CPU power are getting ever cheaper these years than brain cells... Another way to represent complex data is as a nested Tcl [list]. If you just get your braces balanced, Tcl will parse a string into a list and let you process it - which gets easier with the "multi-dimensional" [lindex] and [lset] commands since Tcl 8.4. [tDOM] has the ''$node asList'' command which, after parsing the input XML string into a DOM structure in memory, returns a well-formed Tcl list with these element properties: * if the first element is ''#text'', the second is the text content; * else the first element is the tag, the second the attributes alternating ''name value...'', and the third is a list of the child elements. Comparing XML input and ''toList'' output, I thought that many differences can be handled by string manipulations in local context, for which Tcl offers powerful tools (for instance, ''[string] map, [regexp]'' and ''[regsub]''). So one morning I decided to try a direct conversion of an XML string to a string that makes a well-formed list equivalent to the ''toList'' output. The following code does some XML well-formedness checking with a stack for matching start/end tags, but besides it's of course weaker than the power of [tDOM]. Still, it was a nice little evening project, so here goes: proc xml2list xml { regsub -all {>\s*<} [string trim $xml " \n\t<>"] "\} \{" xml set xml [string map {> "\} \{#text \{" < "\}\} \{"} $xml] set res "" ;# string to collect the result set stack {} ;# track open tags set rest {} foreach item "{$xml}" { switch -regexp -- $item { ^# {append res "{[lrange $item 0 end]} " ; #text item} ^/ { regexp {/(.+)} $item -> tagname ;# end tag set expected [lindex $stack end] if {$tagname!=$expected} {error "$item != $expected"} set stack [lrange $stack 0 end-1] append res "\}\} " } /$ { # singleton - start and end in one <> group regexp {([^ ]+)( (.+))?/$} $item -> tagname - rest set rest [lrange [string map {= " "} $rest] 0 end] append res "{$tagname [list $rest] {}} " } default { set tagname [lindex $item 0] ;# start tag set rest [lrange [string map {= " "} $item] 1 end] lappend stack $tagname append res "\{$tagname [list $rest] \{" } } if {[llength $rest]%2} {error "att's not paired: $rest"} } if [llength $stack] {error "unresolved: $stack"} string map {"\} \}" "\}\}"} [lindex $res 0] } #---- Now that this went so well, I'll throw in the converse: proc list2xml list { switch -- [llength $list] { 2 {lindex $list 1} 3 { foreach {tag attributes children} $list break set res <$tag foreach {name value} $attributes { append res " $name=\"$value\"" } if [llength $children] { append res > foreach child $children { append res [list2xml $child] } append res } else {append res />} } default {error "could not parse $list"} } } #-------------------------------------------- now testing: set test {bar and} proc tdomlist x {[[dom parse $x] documentElement root] asList} ;# reference proc lequal {a b} { if {[llength $a] != [llength $b]} {return 0} if {[lindex $a 0] == $a} {return [string equal $a $b]} foreach i $a j $b {if {![lequal $i $j]} {return 0}} return 1 } proc try x { puts [set a [tdomlist $x]] puts [set b [xml2list $x]] puts list:[lequal $a $b],string:[string equal $a $b] } puts [set res [xml2list $test]] if 0 { foo {a b} {{#text {bar and}} {grill {x:c d e {f g}} {{baz {x y} {}}}} {room {} {}}} which is equal to the ''toList'' result. This may not be the most readable code I ever wrote (it strongly illustrates that unbalanced braces have to be escaped in strings ;-), but it demonstrates the mind-boggling power of hopping between the list and the string representation, which is one of Tcl's unique features. Having the two converters above, one may start to think about a Tcl list DOM variation, where accesses go via [lindex]/[lset] in the nested list, and finally well-formed XML comes out again... But this smells like more work than fun for an evening ;-) ---- RS writes: " the idea that index vectors, as can now be used with [lindex]/ [lset], are paths in a tree (which the XML or DOM is) is fascinating - navigating nested lists can go very fast with this." - Here is a little general-purpose depth-first traverser that you can run over a listDOM: proc forall {varName list body} { set $varName $list eval $body foreach child [lindex $list 2] { forall $varName $child $body } } ;# RS % forall i $a {puts [lrange $i 0 1]} foo {a b} #text {bar and} grill {x:c d e {f g}} baz {x y} room {} ---- 05Apr03 [Brian Theado] - See [http://www.cs.sfu.ca/~cameron/REX.html] for a paper describing shallow parsing of XML using only a regular expression. The regular expression is about 30 lines long, but the paper documents it well. The Appendix includes sample implementation in Perl, Javascript and Flex/Lex. The Appendix also includes an interactive demo (using the Javascript implementation apparently). The demo helped me understand what they meant by "shallow parsing". I'm guessing translation from the Perl implementation to a Tcl implementation should be pretty straightforward. ...the next day: Translation to Tcl was straightforward. See [XML Shallow Parsing with Regular Expressions]. ---- 05Jan10 [George Jempty] I tried to use this routine on a small xml file but got the following error: list element in quotes followed by "?" instead of space while executing "lrange [string map {= " "} $item] 1 end" (procedure "xml2list" line 25) invoked from within "xml2list $file_data" Input was: ---- %|[Arts and crafts of Tcl-Tk Programming]|[Category Package] |[Category XML] |[Category Parsing]|%