Version 13 of Stephen Uhler's HTML parser in 10 lines

Updated 2012-12-08 19:14:38 by RLE

EKB This is a follow-up to the discussion on Is Tcl Different!.

Here's the wonderful HTML parser in 10 lines as posted on that page, and originally suggested by Stephen Uhler:


    ############################################
    # Turn HTML into TCL commands
    #   html    A string containing an html document
    #   cmd                A command to run for each html tag found
    #   start        The name of the dummy html start/stop tags

    proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
        regsub -all \{ $html {\&ob;} html
        regsub -all \} $html {\&cb;} html
        set w " \t\r\n"        ;# white space
        proc HMcl x {return "\[$x\]"}
        set exp <(/?)([HMcl ^$w>]+)[HMcl $w]*([HMcl ^>]*)>
        set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
        regsub -all $exp $html $sub html
        eval "$cmd {$start} {} {} \{ $html \}"
        eval "$cmd {$start} / {} {}"
   }

 # But it was missing the default value for ''cmd'', ''HMtest_parse'', 
 # so I wrote one and applied it to a sample bit of HTML:

   proc HMtest_parse {tag state props body} {
    if {$state == ""} {
        set msg "Start $tag"
        if {$props != ""} {
            set msg "$msg with args: $props"
        }
        set msg "$msg\n$body"
    } else {
        set msg "End $tag"
    }
    puts $msg
   }
   
   HMparse_html {
      <html>
        <p class="bubba">
        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.
        </p>
        <p class="louielouie">
        This is my second paragraph, which is OK,
        but not as nice as my first one.
        </p>
      </html>
   }

This gives the following output:

 Start hmstart


 Start html


 Start p with args: class="bubba"

        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.

 End p
 Start p with args: class="louielouie"

        This is my second paragraph, which is OK,
        but not as nice as my first one.

 End p
 End html
 End hmstart

In fact, the code is not HTML-specific, and can handle simple XML code (e.g., that doesn't use the self-closing <tag/> format). It's like a mini-SAX. (Actually, it isn't quite like SAX. It's only like it because you define handlers for each tag. But unlike SAX it operates on a string in memory and doesn't execute until everything has been converted.) I've created a small XML parser based on this code and put it in TAX: A Tiny API for XML.

In spite of its incredible (to me) brevity, the code can actually be shortened somewhat. The proc HMcl is introduced in order to avoid trouble with [ ]'s. But it can also be avoided by enclosing the value of exp in { }'s. Also, the variable w doesn't need to be defined (at least in recent Tcl versions): \s can be used instead. Here's the new HMparse_html proc:


    proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
        regsub -all \{ $html {\&ob;} html
        regsub -all \} $html {\&cb;} html
        set exp {<(/?)([^\s>]+)\s*([^>]*)>}
        set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
        regsub -all $exp $html $sub html
        eval "$cmd {$start} {} {} \{ $html \}"
        eval "$cmd {$start} / {} {}"
   }

OK, one more thing... If the cmd is an ensemble, then the different tags can be sub-procs within the ensemble. For example, just like string length is a command, where string is the ensemble, and length is the sub-proc, it should be possible to set up cmd so that cmd p would invoke the proc for parsing p tags, cmd html would invoke the command for parsing html tags, etc.

It's pretty easy to create ensembles in snit, so here's a snit version:


    package require snit
    
    ############################################
    # Turn HTML into TCL commands
    #   html    A string containing an html document
    #   cmd                A command to run for each html tag found
    #   start        The name of the dummy html start/stop tags

    proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
        regsub -all \{ $html {\&ob;} html
        regsub -all \} $html {\&cb;} html
        set exp {<(/?)([^\s>]+)\s*([^>]*)>}
        set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
        regsub -all $exp $html $sub html
        eval "$cmd {$start} {} {} \{ $html \}"
        eval "$cmd {$start} / {} {}"
   }
   
   snit::type parser {
        proc isend {state} {
            if {$state == ""} {
                return false
            } else {
                return true
            }
        }
        method hmstart {args} {}
        method html {state args} {
            if [isend $state] {
                puts "That's all, folks!"
            } else {
                puts "Let's get going!"
            }
        }
        method p {state props body} {
            if {![isend $state]} {puts $body}
        }
   }
   
   parser HMtest_parse
   
   HMparse_html {
      <html>
        <p class="bubba">
        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.
        </p>
        <p class="louielouie">
        This is my second paragraph, which is OK,
        but not as nice as my first one.
        </p>
      </html>
   }

This is the output:

 Let's get going!

        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.


        This is my second paragraph, which is OK,
        but not as nice as my first one.

 That's all, folks!

The problem with using snit (or incr tcl is you have to declare handlers for all tags or you will end up with a runtime error (for example "method body not found"). I myself use the following mechanism with some success:

    proc HMtest_parse {tag state props body} {
        if {[info proc handle_$tag] != ""} {
            handle_$tag $state $props $body
        }
    }

    proc handle_a {state props body} { ... }
    proc handle_img {state props body} { ... }

This way, you only have to declare handlers for the tags that you care about.

Hai Vu


WHD: Actually, Snit allows you to define a method that receives all unknown methods:

    delegate method * using {%s UnknownMethod %m}

    method UnknownMethod {methodName args} { ... }