Version 9 of Stephen Uhler's HTML parser in 10 lines

Updated 2008-07-18 08:50:01 by lars_h

EKB This is a follow-up to the discussion on Is Tcl Different!.

Here's the wonderful HTML parser in 10 lines as posted on that page, and originally suggested by Stephen Uhler:


    ############################################
    # Turn HTML into TCL commands
    #   html    A string containing an html document
    #   cmd                A command to run for each html tag found
    #   start        The name of the dummy html start/stop tags

    proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
        regsub -all \{ $html {\&ob;} html
        regsub -all \} $html {\&cb;} html
        set w " \t\r\n"        ;# white space
        proc HMcl x {return "\[$x\]"}
        set exp <(/?)([HMcl ^$w>]+)[HMcl $w]*([HMcl ^>]*)>
        set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
        regsub -all $exp $html $sub html
        eval "$cmd {$start} {} {} \{ $html \}"
        eval "$cmd {$start} / {} {}"
   }

 # But it was missing the default value for ''cmd'', ''HMtest_parse'', 
 # so I wrote one and applied it to a sample bit of HTML:

   proc HMtest_parse {tag state props body} {
    if {$state == ""} {
        set msg "Start $tag"
        if {$props != ""} {
            set msg "$msg with args: $props"
        }
        set msg "$msg\n$body"
    } else {
        set msg "End $tag"
    }
    puts $msg
   }

   HMparse_html {
      <html>
        <p class="bubba">
        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.
        </p>
        <p class="louielouie">
        This is my second paragraph, which is OK,
        but not as nice as my first one.
        </p>
      </html>
   }

This gives the following output:

 Start hmstart


 Start html


 Start p with args: class="bubba"

        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.

 End p
 Start p with args: class="louielouie"

        This is my second paragraph, which is OK,
        but not as nice as my first one.

 End p
 End html
 End hmstart

In fact, the code is not HTML-specific, and can handle simple XML code (e.g., that doesn't use the self-closing <tag/> format). It's like a mini-SAX. (Actually, it isn't quite like SAX. It's only like it because you define handlers for each tag. But unlike SAX it operates on a string in memory and doesn't execute until everything has been converted.) I've created a small XML parser based on this code and put it in TAX: A Tiny API for XML.

In spite of its incredible (to me) brevity, the code can actually be shortened somewhat. The proc HMcl is introduced in order to avoid trouble with [ ]'s. But it can also be avoided by enclosing the value of exp in { }'s. Also, the variable w doesn't need to be defined (at least in recent Tcl versions): \s can be used instead. Here's the new HMparse_html proc:


    proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
        regsub -all \{ $html {\&ob;} html
        regsub -all \} $html {\&cb;} html
        set exp {<(/?)([^\s>]+)\s*([^>]*)>}
        set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
        regsub -all $exp $html $sub html
        eval "$cmd {$start} {} {} \{ $html \}"
        eval "$cmd {$start} / {} {}"
   }

OK, one more thing... If the cmd is an ensemble, then the different tags can be sub-procs within the ensemble. For example, just like string length is a command, where string is the ensemble, and length is the sub-proc, it should be possible to set up cmd so that cmd p would invoke the proc for parsing p tags, cmd html would invoke the command for parsing html tags, etc.

It's pretty easy to create ensembles in snit, so here's a snit version:


    package require snit

    ############################################
    # Turn HTML into TCL commands
    #   html    A string containing an html document
    #   cmd                A command to run for each html tag found
    #   start        The name of the dummy html start/stop tags

    proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
        regsub -all \{ $html {\&ob;} html
        regsub -all \} $html {\&cb;} html
        set exp {<(/?)([^\s>]+)\s*([^>]*)>}
        set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
        regsub -all $exp $html $sub html
        eval "$cmd {$start} {} {} \{ $html \}"
        eval "$cmd {$start} / {} {}"
   }

   snit::type parser {
        proc isend {state} {
            if {$state == ""} {
                return false
            } else {
                return true
            }
        }
        method hmstart {args} {}
        method html {state args} {
            if [isend $state] {
                puts "That's all, folks!"
            } else {
                puts "Let's get going!"
            }
        }
        method p {state props body} {
            if {![isend $state]} {puts $body}
        }
   }

   parser HMtest_parse

   HMparse_html {
      <html>
        <p class="bubba">
        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.
        </p>
        <p class="louielouie">
        This is my second paragraph, which is OK,
        but not as nice as my first one.
        </p>
      </html>
   }

This is the output:

 Let's get going!

        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.


        This is my second paragraph, which is OK,
        but not as nice as my first one.

 That's all, folks!