Parsing XML

Yet another way to parse an XML or HTML file.

Description

Keith Vetter 2004-03-01:

This one, however, is written in pure Tcl without needing any extensions. It probably doesn't handle all the XML corner cases but it's worked on all the valid XML I've thrown at it--including handling CDATA data.

It's a SAX-like interface where every call to it returns four values: type, value, attr, and etype where type is one of "XML", "PI", "TXT" or "EOF"; value is either the xml entity value or the entities' text; attr is the value of any attributes associated with the current XML entity; and etype is the type of entity--either "START", "END" or "EMPTY".

JE: Just took a cursory glance at this code and spotted three bugs already. Parsing XML with regexps is hard. This routine might be OK to use if avoiding external dependencies is more important than correctness, but if you want to do things right you're better off with tDOM or TclXML.

KPV: Curious, what kinds of valid input will cause the parser to fail?

DKF: Try this:

<?xml version="1.0" encoding="utf-8"?>
<element><![CDATA[<not.an.element/>]]></element>

PYK 2021-03-05: The parser also fails on the occurrence of > in the value of an attribute:

<?xml version="1.0" encoding="utf-8"?>
<element attribute1="some > value" ></element>

############################################################################
#
# xml.tcl -- Simple XML parser
# by Keith Vetter, March 2004
#

namespace eval ::XML {variable XML "" loc 0}


proc ::XML::Init {xmlData} {
    variable XML
    variable loc

    set XML [string trim $xmlData];
    # Remove all comments
    regsub -all {<!--.*?-->} $XML {} XML
    set loc 0
}


# Returns {XML|TXT|EOF|PI value attributes START|END|EMPTY}
proc ::XML::NextToken {{peek 0}} {
    variable XML
    variable loc

    set n [regexp -start $loc -indices {(.*?)\s*?<(/?)(.*?)(/?)>} \
        $XML all txt stok tok etok]
    if {! $n} {return [list EOF]}
    foreach {all0 all1} $all {txt0 txt1} $txt \
        {stok0 stok1} $stok {tok0 tok1} $tok {etok0 etok1} $etok break

    if {$txt1 >= $txt0} {
        ;# Got text
        set txt [string range $XML $txt0 $txt1]
        if {! $peek} {set loc [expr {$txt1 + 1}]}
        return [list TXT $txt]
    }

    # Got something in brackets
    set token [string range $XML $tok0 $tok1]
    if {! $peek} {set loc [expr {$all1 + 1}]}
    if {[regexp {^!\[CDATA\[(.*)\]\]} $token => txt]} {
         # Is it CDATA stuff?
         return [list TXT $txt]
    }

    # Check for Processing Instruction <?...?>
    set type XML
    if {[regexp {^\?(.*)\?$} $token => token]} {
         set type PI
    }
    set attr ""
    regexp {^(.*?)\s+(.*?)$} $token => token attr

    # Entity type
    set etype START
    if {$etok0 <= $etok1} {
       if {$stok0 <= $stok1} {
           # Bad XML
           set token /$token
       }
       set etype EMPTY
    } elseif {$stok0 <= $stok1} {
         set etype END
    }
    return [list $type $token $attr $etype]
}


# ::XML::IsWellFormed
#  checks if the XML is well-formed )http://www.w3.org/TR/1998/REC-xml-19980210)
#
# Returns "" if well-formed, error message otherwise
# missing:
#  characters: doesn't check valid extended characters
#  attributes: doesn't check anything: quotes, equals, unique, etc.
#  text stuff: references, entities, parameters, etc.
#  doctype internal stuff
#  
proc ::XML::IsWellFormed {} {
    set result [::XML::_IsWellFormed]
    set ::XML::loc 0
    return $result
}


proc ::XML::_IsWellFormed {} {
    array set emsg {
        XMLDECLFIRST {The XML declaration must come first}
        MULTIDOCTYPE {Only one DOCTYPE is allowed}
        INVALID {Invalid document structure}
        MISMATCH "Ending tag '$val' doesn't match starting tag"
        BADELEMENT "Bad element name '$val'"
        EOD {Only processing instructions allowed at end of document}
        BADNAME "Bad name '$val'"
        BADPI {No processing instruction starts with 'xml'}
    }
    
    # [1] document ::= prolog element Misc*
    # [22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
    # [27] Misc ::= Comment | PI | S
    # [28] doctypedecl ::= <!DOCTYPE...>
    # [16] PI ::= <? Name ...?>

    # 1 xml, 2 pi, 4 doctype
    set seen 0
    while 1 {
        foreach {type val attr etype} [::XML::NextToken] break
        if {$type eq {PI}} {
            if {![regexp {^[a-zA-Z_:][a-zA-Z0-9.-_:\xB7]+$} $val]} {
                return [subst $emsg(BADNAME)]
            }
            if {$val eq {xml}} {
                ;# XMLDecl
                if {$seen != 0} { return $emsg(XMLDECLFIRST) }
                  # TODO: check version number exist and only encoding and
                  # standalone attributes are allowed

                  # Mark as seen XMLDecl
                  incr seen
                  continue
            }
            if {[string equal -nocase xml $val]} {
                return $emsg(BADPI)
            }
            # Mark as seen PI
            set seen [expr {$seen | 2}]
            continue
        } elseif {$type eq {XML} && $val eq {!DOCTYPE}} {
            ;# Doctype
            if {$seen & 4} {
                return $emsg(MULTIDOCTYPE)
            }
            set seen [expr {$seen | 4}]
            continue
        }
        break
    }

    # [39] element ::= EmptyElemTag | STag content ETag
    # [40] STag ::= < Name (S Attribute)* S? >
    # [42] ETag ::= </ Name S? >
    # [43] content ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
    # [44] EmptyElemTag ::= < Name (S Attribute)* S? />
    # 

    set stack {}
    set first 1
    while 1 {
        if {! $first} {
            # Skip first time in
            foreach {type val attr etype} [::XML::NextToken] break
        } else {
            if {$type ne {XML} && $type ne {EOF}} {
                return $emsg(INVALID)
            }
            set first 0
        }

        if {$type eq {EOF}} break
        # TODO: check attributes: quotes, equals and unique

        if {$type eq {TXT}} continue
        if {! [regexp {^[a-zA-Z_:][a-zA-Z0-9.-_:\xB7]+$} $val]} {
            return [subst $emsg(BADNAME)]
        }

        if {$type eq {PI}} {
            if {[string equal -nocase xml $val]} {
                 return $emsg(BADPI)
            }
            continue
        }
        if {$etype eq {START}} {
            # Starting tag
            lappend stack $val
        } elseif {$etype eq {END}} {
            # </tag>
            if {$val ne [lindex $stack end]} { return [subst $emsg(MISMATCH)] }
            set stack [lrange $stack 0 end-1]
            if {[llength $stack] == 0} {
                # Empty stack
                break
            }
        } elseif {$etype eq {EMPTY}} {
            # <tag/>
        }
    }

    # End-of-Document can only contain processing instructions
    while 1 {
        foreach {type val attr etype} [::XML::NextToken] break
        if {$type eq {EOF}} break
        if {$type eq {PI}} {
            if {[string equal -nocase xml $val]} { return $emsg(BADPI) }
            continue
        }
        return $emsg(EOD)
    }
    return {} 
}


################################################################
#
# Demo code
#
set xml {<?xml version="1.0" encoding="ISO-8859-1"?>
    <loc version="1.0" src="Groundspeak">
    <waypoint>
    <name id="GCGPXK"><![CDATA[Playing Poker with the Squirrels by Rino 'n Rinette]]></name>
    <coord lat="40.1548166" lon="-82.5202833"/>
    <type>Geocache</type>
    <link text="Cache Details">http://www.geocaching.com/seek/cache_details.aspx?wp=GCGPXK</link>
    </waypoint><waypoint>
    <name id="GC19DF"><![CDATA[Great Playground Caper by Treasure Hunters Inc.]]></name>
    <coord lat="40.0667166666667" lon="-82.5358"/>
    <type>Geocache</type>
    <link text="Cache Details">http://www.geocaching.com/seek/cache_details.aspx?wp=GC19DF</link>
    </waypoint>
    </loc>
}


::XML::Init $xml
set wellFormed [::XML::IsWellFormed]
if {$wellFormed ne ""} {
    puts "The xml is not well-formed: $wellFormed"
} else {
    puts "The xml is well-formed"
    while {1} {
        foreach {type val attr etype} [::XML::NextToken] break
        puts "looking at: $type '$val' '$attr' '$etype'"
        if {$type == "EOF"} break
    }
}

Steve Ball: Comment #1: This looks like an xmlTextReader-style interface (ie. looping and reading one token at a time). That's interesting to me because I'm now using libxml2's xmlTextReader interface in TclXML and am considering introducing it at the scripting level.

Comment #2: Your code above will work fine as long as the input XML is well-formed. It has absolutely no error checking at all! Error checking is where all the hard work is...

KPV: True on both accounts. I was working with validated XML so error checking was not important. But what do you expect from 25 lines of code?

JE: This is a reasonable approach to take. It's OK for a simple parser to omit error checking as long as you only feed it XML that's known to be good; in the context of a complete system, the well-formedness checks and validation can be Somebody Else's Problem. That said, the above code fails on valid input too.

MSW: It's tcl, so I expect a fully featured all-in-one device suitable for every purpose (aka ``eierlegende Wollmilchsau'' :)


KPV: In response to comment #2 above, I added a routine to check for well-formedness. It checks for most of the major well-formed constraints such as properly nesting tags (that's easy), proper prolog, valid names, etc. It doesn't handle: names with extended unicode characters, anything to do with attributes (quotes, unique, character sets, etc.), text stuff like references, entities, parameters, etc., nor doctype internal stuff.

Now I remember how much I hate XML. At first blush, XML seems so simple, but once you dig deep there's so many weird gotchas. Checking for well-formedness quadrupled the size of the code and even so it's not complete. I love XML's design goal #6 XML documents should be human-legible and reasonably clear [L1 ].


AM 2009-09-08: Once you have parsed an XML file successfully, you can regard it as another way to specify source code ... Using XML files for source code