[Keith Vetter] 2004-03-01 : Here's yet another way to parse an [XML] or HTML file. See also [Parsing HTML], [A little XML parser], [XML Shallow Parsing with Regular Expressions], [Playing SAX] and [Regular Expressions Are Not A Good Idea for Parsing XML, HTML, or e-mail Addresses].

This one, however, is written in pure Tcl and needs no extensions. It probably doesn't handle all the XML corner cases, but it has worked on all the valid XML I've thrown at it, including CDATA sections.

It's a SAX-like interface where every call returns three values: '''type''', '''value''' and '''attr'''. ''type'' is one of "XML", "TXT" or "EOF"; ''value'' is either the XML entity's name or the entity's text; and ''attr'' is the raw attribute string, if any, of the current XML entity.

----

 namespace eval ::XML { variable XML "" loc 0}

 proc ::XML::Init {xmlData} {
    variable XML
    variable loc

    set XML [string trim $xmlData]
    regsub -all {<!--.*?-->} $XML {} XML        ;# Remove all comments
    set loc 0
 }

 # Returns the list {type value attr}. With peek set, the read
 # position is left unchanged.
 proc ::XML::NextToken {{peek 0}} {
    variable XML
    variable loc

    set n [regexp -start $loc -indices {(.*?)\s*?<(.*?)/?>} $XML all txt tok]
    if {! $n} {return [list EOF "" ""]}
    foreach {all0 all1} $all {txt0 txt1} $txt {tok0 tok1} $tok break

    if {$txt1 >= $txt0} {                       ;# Got text
        set txt [string range $XML $txt0 $txt1]
        if {! $peek} {set loc [expr {$txt1 + 1}]}
        return [list TXT $txt ""]
    }

    set token [string range $XML $tok0 $tok1]   ;# Got something in brackets
    if {! $peek} {set loc [expr {$all1 + 1}]}
    if {[regexp {^!\[CDATA\[(.*)\]\]} $token => txt]} { ;# Is it CDATA stuff?
        return [list TXT $txt ""]
    }
    set attr ""
    regexp {^(.*?)\s+(.*?)$} $token => token attr
    return [list XML $token $attr]
 }

 # Demo code
 set xml {
    Geocache
    http://www.geocaching.com/seek/cache_details.aspx?wp=GCGPXK
    Geocache
    http://www.geocaching.com/seek/cache_details.aspx?wp=GC19DF
 }

 ::XML::Init $xml
 while {1} {
    foreach {type val attr} [::XML::NextToken] break
    puts "looking at: $type '$val' '$attr'"
    if {$type eq "EOF"} break
 }

----

[Steve Ball]: Comment #1: This looks like an xmlTextReader-style interface (i.e. looping and reading one token at a time). That's interesting to me because I'm now using libxml2's xmlTextReader interface in TclXML and am considering introducing it at the scripting level.

Comment #2: Your code above will work fine as long as the input XML is well-formed. It has absolutely no error checking at all! Error checking is where all the hard work is...

[Category XML] | [Category Internet] | [Category Package]
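----

The ''attr'' value that NextToken returns is just the raw text after the tag name, so callers still have to pick the individual attributes apart. One possible way to do that is sketched below. The proc name `AttrsToList` is hypothetical and not part of the code above, and the sketch assumes attributes are always written as name="value" or name='value'. Like the parser itself, it does no error checking and does not expand character references such as `&amp;`.

```tcl
# Hypothetical helper (not part of the original ::XML code): turn a raw
# attribute string such as {lat="40.15" lon="-82.52"} into a flat
# name/value list suitable for [array set] or [foreach].
proc AttrsToList {attr} {
    set result {}
    # Each match yields four items: the whole match, the attribute name,
    # the value if it was double-quoted, and the value if it was
    # single-quoted. Whichever quote style did not match comes back empty.
    set re {([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')}
    foreach {all name dq sq} [regexp -all -inline $re $attr] {
        lappend result $name $dq$sq
    }
    return $result
}

# Usage: anything that is not a quoted name=value pair, such as the
# trailing slash of a self-closing tag, is simply skipped.
foreach {name value} [AttrsToList {lat="40.15" lon='-82.52' /}] {
    puts "$name = $value"
}
```

Returning a flat list rather than an array keeps the helper usable with plain `foreach` and `array set`, in the same style as the parser.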