Version 10 of Regexp HTML Attribute Parsing

Updated 2005-06-02 15:42:37 by NEM

20040711 CMcC: Here's a little HTML/XML/SGML attribute parser. It loops over the attribute string, peeling off one name=value pair at a time, and relies on regexps to do the matching.

    # One anchored pattern per quoting style; each captures name, value, and the unparsed rest.
    array set match {
        quote {^([a-zA-Z0-9_-]+)[ \t]*=[ \t]*["]([^"]+)["][ \t]*(.*)$}
        squote {^([a-zA-Z0-9_-]+)[ \t]*=[ \t]*[']([^']+)['][ \t]*(.*)$}
        uquote {^([a-zA-Z0-9_-]+)[ \t]*=[ \t]*([^ \t'"]+)[ \t]*(.*)$}
    }

    proc parseAttr {astring} {
        global match
        array set attr {}
        set astring [string trim $astring]
        if {$astring eq ""} {
            return {}
        }

        while {$astring ne ""} {
            # Remember the input before this pass so a lack of progress can be detected.
            set org $astring
            foreach m {quote squote uquote} {
                if {[regexp $match($m) $astring all var val suffix]} {
                    set attr($var) $val
                    set astring [string trimleft $suffix]
                }
            }
            # No pattern consumed anything in this pass: the remainder is malformed.
            if {$astring eq $org} {
                error "parseAttr: can't parse $astring - not a properly formed attribute string"
            }
        }
        return [array get attr]
    }
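
For comparison, here is a minimal sketch of a single-pass variant that folds the three quoting styles into one alternation and lets regexp -all -inline walk the string. This is a different technique from the loop above, not a rewrite of it. The name parseAttrOnce is made up for illustration, and the sketch assumes Tcl 8.5 or later for dict. Two behaviours differ from parseAttr: it accepts empty quoted values such as alt="" (which the + quantifiers in the patterns above reject), and it silently skips text it cannot match instead of raising an error.

```tcl
# parseAttrOnce is a hypothetical name, not part of the original page.
# Assumes Tcl 8.5 or later (for dict).
proc parseAttrOnce {astring} {
    # name = "double quoted" | 'single quoted' | bare value
    set re {([a-zA-Z0-9_-]+)[ \t]*=[ \t]*(?:"([^"]*)"|'([^']*)'|([^ \t'"]+))}
    set attrs [dict create]
    # -all -inline yields five elements per match: the whole match,
    # the name, and one slot for each of the three value alternatives.
    foreach {all name dq sq bare} [regexp -all -inline $re $astring] {
        # At most one value slot is non-empty, so concatenation picks it.
        dict set attrs $name $dq$sq$bare
    }
    return $attrs
}

set attrs [parseAttrOnce {href="index.html" class='nav top' width=80 alt=""}]
puts [dict get $attrs class]
```

Because unmatched text is skipped rather than reported, this variant suits lenient scraping better than validation. If malformed input must be detected, the looping parser above is the safer choice.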

Since you are considering the dark side of markup parsing, you might also enjoy XML Shallow Parsing with Regular Expressions.

LES: Regexes are too often criticized by people who simply don't know them or don't like them. "If only you knew the power of the dark side..."

NEM: Regular Expressions Are Not A Good Idea for Parsing XML, HTML, or e-mail Addresses. Regular expressions can be immensely useful -- I use them frequently for pulling apart simple (regular) strings. However, there are genuine limits to the power of regexps, and people should be aware of them, especially in situations (such as parsing XML/HTML) where several excellent full parsers already exist.


Tcllib also contains a module, htmlparse, for parsing HTML code.


[ Category HTML | Category XML ]