TAX was inspired by [Stephen Uhler's HTML parser in 10 lines]. In fact, the code is almost exactly the same, with just a couple of extra bells and whistles.

TAX, the Tiny API for [XML], is vaguely similar to [SAX] in that both handle XML by defining a handler for tags. Otherwise, TAX and SAX have more differences than similarities. The most important difference is that with TAX, both the XML and the processed XML reside in memory, so TAX makes inefficient use of memory. For this reason, it's best for small XML files. In contrast, SAX is an event-driven parser that operates on a stream, so the XML doesn't all have to be loaded into memory. (Of course, another important difference is that [SAX] is a mature, full-featured, well-documented, and well-supported API.)

As with Stephen Uhler's gem, TAX takes an XML document (as a string) and converts it into a Tcl script. Each tag becomes a call to the handler command. The resulting script is then executed with [eval].

Here's the essential code:

    ############################################################
    #
    # Based heavily on Stephen Uhler's HTML parser in 10 lines
    # Modified by Eric Kemp-Benedict for XML
    #
    # Turn XML into Tcl commands
    #   cmd     A command to run for each tag found
    #   xml     A string containing an XML document
    #   start   The name of the dummy start/stop tag
    #
    # Namespace "tax" stands for "Tiny API for XML"
    #

    namespace eval tax {}

    proc tax::parse {cmd xml {start docstart}} {
        # Hide literal braces so they cannot unbalance the generated script
        regsub -all \{ $xml {\&ob;} xml
        regsub -all \} $xml {\&cb;} xml
        # \1 = "/" for a closing tag, \2 = tag name, \3 = attributes,
        # \4 = "/" for a self-closing tag
        set exp {<(/?)([^\s/>]+)\s*([^/>]*)(/?)>}
        # Each tag closes the previous body and starts a new handler call
        set sub "\}\n$cmd {\\2} \[expr \{{\\1} ne \"\"\}\] \[expr \{{\\4} ne \"\"\}\] \[regsub -all -- \{\\s+|(\\s*=\\s*)\} {\\3} \" \"\] \{"
        regsub -all $exp $xml $sub xml
        eval "$cmd {$start} 0 0 {} \{ $xml \}"
        # Closing dummy tag: cl = 1, selfcl = 0
        eval "$cmd {$start} 1 0 {} {}"
    }

To use it, create a parser command, ''cmd'', that will handle any tag found in the string ''xml''. The parser calls cmd in the following way:

    cmd tag cl selfcl props body

where

   * ''tag'' is the tag (e.g., p, br, h1, etc. from HTML) or the special tag "docstart"
   * ''cl'' is a boolean saying if this is a closing tag (e.g., like </p>)
   * ''selfcl'' is a boolean saying if this is a self-closing tag (e.g., <br/> for XHTML)
   * ''props'' is a list of name/value pairs that can be passed to an array using [array set]
   * ''body'' is text following the tag that is not enclosed in a tag (e.g., for <p>My text</p>, "My text" is the body)

Here's an example of use. It also uses [snit] to build the parser: there's one snit method for each tag.

    package require snit

    ############################################################
    #
    # Based heavily on Stephen Uhler's HTML parser in 10 lines
    # Modified by Eric Kemp-Benedict for XML
    #
    # Turn XML into Tcl commands
    #   cmd     A command to run for each tag found
    #   xml     A string containing an XML document
    #   start   The name of the dummy start/stop tag
    #
    # Namespace "tax" stands for "Tiny API for XML"
    #

    namespace eval tax {}

    proc tax::parse {cmd xml {start docstart}} {
        regsub -all \{ $xml {\&ob;} xml
        regsub -all \} $xml {\&cb;} xml
        set exp {<(/?)([^\s/>]+)\s*([^/>]*)(/?)>}
        set sub "\}\n$cmd {\\2} \[expr \{{\\1} ne \"\"\}\] \[expr \{{\\4} ne \"\"\}\] \[regsub -all -- \{\\s+|(\\s*=\\s*)\} {\\3} \" \"\] \{"
        regsub -all $exp $xml $sub xml
        eval "$cmd {$start} 0 0 {} \{ $xml \}"
        eval "$cmd {$start} 1 0 {} {}"
    }

    snit::type parser {
        # Trim the string and collapse runs of whitespace to single spaces
        proc compactws {s} {
            return [regsub -all -- {\s+} [string trim $s] " "]
        }

        method docstart {cl args} {
            if $cl {
                puts "...End document"
            } else {
                puts "Start document..."
            }
        }

        method para {cl selfcl props body} {
            array set temp $props
            if {!$cl} {
                set outstring [compactws $body]
                if [info exists temp(indent)] {
                    set outstring "[string repeat { } $temp(indent)]$outstring"
                }
                puts $outstring
            }
        }

        method meta {cl selfcl props body} {
            array set temp $props
            foreach item [array names temp] {
                puts "[string totitle $item]: $temp($item)"
            }
            if {!$selfcl} {
                puts [compactws $body]
            } else {
                puts ""
            }
        }
    }

    parser myparser
    # NOTE: the tags in this sample are reconstructed from the handlers
    # above and the output below; the indent value is illustrative.
    tax::parse myparser {
    <meta author="Anne Onymous"/>
    <meta>
        Composed in haste for purposes of demonstration.
    </meta>
    <para indent="3">
      This is an indented paragraph. Only the first line
      is indented, which you can tell if the paragraph goes
      on long enough.
    </para>
    <para>
      This is an ordinary paragraph. No line is indented. Not one. None at all,
      which you can tell if the paragraph goes
      on long enough.
    </para>
    }

It gives this output:

    Start document...
    Author: Anne Onymous

    Composed in haste for purposes of demonstration.

       This is an indented paragraph. Only the first line is indented, which you can tell if the paragraph goes on long enough.
    This is an ordinary paragraph. No line is indented. Not one. None at all, which you can tell if the paragraph goes on long enough.
    ...End document

----
[EKB] wrote this, right?

''[EKB] Yes, I admit it!''

----
[EF] A slightly revisited version of this, with support for accessing the XML tag tree from within the callback, is available at [TAX Revisited].

[EKB] - I looked at TAX Revisited and it's nifty! I was just keeping track of state with my own variables; it's nice to have it done for you. I also agree that handling self-closing tags separately from normal tags is awkward: a self-closing tag and an empty open/close pair should look identical.

[EF] I have made the revisited implementation [http://wiki.tcl.tk/15293] part of the [TIL]. I hope that you don't mind. The source makes full mention of the Wiki reference.

[EKB] I don't mind at all. Great!

----
Replace those regsubs with string map and you'll probably see a performance boost...

[EKB] Are they all replaceable? It looks like:

    regsub -all \{ $xml {\&ob;} xml
    regsub -all \} $xml {\&cb;} xml

can be replaced by

    set xml [string map {\{ &ob; \} &cb;} $xml]

but I ''think'' the others have to be regexps.

----
[Category Internet] | [Category Word and Text Processing] | [Category XML]
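----
One way to check the string map suggestion from the discussion above is to run both brace-escaping variants on the same input and compare the results. This is only a sketch, not part of TAX: the proc names ''escapeRegsub'' and ''escapeMap'' and the sample string are made up for illustration, and it makes no claim about the size of any speedup.

```tcl
# Hypothetical helper: brace escaping with two regsub calls
proc escapeRegsub {xml} {
    regsub -all \{ $xml {\&ob;} xml
    regsub -all \} $xml {\&cb;} xml
    return $xml
}

# Hypothetical helper: the same escaping with a single string map call.
# No backslash is needed before & here, because string map has no
# special meaning for it (unlike the regsub substitution spec).
proc escapeMap {xml} {
    return [string map {\{ &ob; \} &cb;} $xml]
}

# Made-up sample input containing literal braces
set sample "<code>if (x) \{ y(); \}</code>"

puts [escapeRegsub $sample]
puts [escapeMap $sample]
# Prints 1 when both variants agree on this input
puts [expr {[escapeRegsub $sample] eq [escapeMap $sample]}]
```

If the two agree on representative input, the first two lines of ''tax::parse'' could be swapped for the single string map call; the tag-matching regsub still needs a regular expression.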