Version 3 of Stephen Uhler's HTML parser in 10 lines

Updated 2005-07-27 07:55:21 by lwv

EKB This is a follow-up to the discussion on Is Tcl Different!.

Here's the wonderful HTML parser in 10 lines as posted on that page:

    ############################################
    # Turn HTML into TCL commands
    #   html    A string containing an html document
    #   cmd                A command to run for each html tag found
    #   start        The name of the dummy html start/stop tags

    proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
        regsub -all \{ $html {\&ob;} html
        regsub -all \} $html {\&cb;} html
        set w " \t\r\n"        ;# white space
        proc HMcl x {return "\[$x\]"}
        set exp <(/?)([HMcl ^$w>]+)[HMcl $w]*([HMcl ^>]*)>
        set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
        regsub -all $exp $html $sub html
        eval "$cmd {$start} {} {} \{ $html \}"
        eval "$cmd {$start} / {} {}"
   }

But it was missing the default value for cmd, HMtest_parse, so I wrote one and applied it to a sample bit of HTML:

   proc HMtest_parse {tag state props body} {
    if {$state == ""} {
        set msg "Start $tag"
        if {$props != ""} {
            set msg "$msg with args: $props"
        }
        set msg "$msg\n$body"
    } else {
        set msg "End $tag"
    }
    puts $msg
   }

   HMparse_html {
      <html>
        <p class="bubba">
        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.
        </p>
        <p class="louielouie">
        This is my second paragraph, which is OK,
        but not as nice as my first one.
        </p>
      </html>
   }

This gives the following output:

 Start hmstart


 Start html


 Start p with args: class="bubba"

        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.

 End p
 Start p with args: class="louielouie"

        This is my second paragraph, which is OK,
        but not as nice as my first one.

 End p
 End html
 End hmstart

In fact, the code is not HTML-specific, and can handle simple XML code (e.g., that doesn't use the self-closing <tag/> format). It's like a mini-SAX.

In spite of its incredible (to me) brevity, the code can actually be shortened somewhat. The proc HMcl is introduced in order to avoid trouble with [ ]'s. But it can also be avoided by enclosing the value of exp in { }'s. Also, the variable w doesn't need to be defined (at least in recent Tcl versions): \s can be used instead. Here's the new HMparse_html proc:

    proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
        regsub -all \{ $html {\&ob;} html
        regsub -all \} $html {\&cb;} html
        set exp {<(/?)([^\s>]+)\s*([^>]*)>}
        set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
        regsub -all $exp $html $sub html
        eval "$cmd {$start} {} {} \{ $html \}"
        eval "$cmd {$start} / {} {}"
   }

OK, one more thing... If the cmd is an ensemble, then the different tags can be sub-procs within the ensemble. For example, just like string length is a command, where string is the ensemble, and length is the sub-proc, it should be possible to set up cmd so that cmd p would invoke the proc for parsing p tags, cmd html would invoke the command for parsing html tags, etc.

It's pretty easy to create ensembles in snit, so here's a snit version:

    package require snit

    ############################################
    # Turn HTML into TCL commands
    #   html    A string containing an html document
    #   cmd                A command to run for each html tag found
    #   start        The name of the dummy html start/stop tags

    proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
        regsub -all \{ $html {\&ob;} html
        regsub -all \} $html {\&cb;} html
        set exp {<(/?)([^\s>]+)\s*([^>]*)>}
        set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
        regsub -all $exp $html $sub html
        eval "$cmd {$start} {} {} \{ $html \}"
        eval "$cmd {$start} / {} {}"
   }

   snit::type parser {
        proc isend {state} {
            if {$state == ""} {
                return false
            } else {
                return true
            }
        }
        method hmstart {args} {}
        method html {state args} {
            if [isend $state] {
                puts "That's all, folks!"
            } else {
                puts "Let's get going!"
            }
        }
        method p {state props body} {
            if {![isend $state]} {puts $body}
        }
   }

   parser HMtest_parse

   HMparse_html {
      <html>
        <p class="bubba">
        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.
        </p>
        <p class="louielouie">
        This is my second paragraph, which is OK,
        but not as nice as my first one.
        </p>
      </html>
   }

This is the output:

 Let's get going!

        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.


        This is my second paragraph, which is OK,
        but not as nice as my first one.

 That's all, folks!

Category XML Category Word and Text Processing Category Internet