'''Stephen Uhler's HTML parser in 10 lines''', originally by [Stephen Uhler], previously known as '''HTML parser in 8 lines of Tcl''', and currently known as '''HTML parser in 4 lines of Tcl''', is a small toy [HTML] parser. It is not a correct parser: angle brackets in attribute values and unbalanced braces in the HTML content can confuse it. It is an interesting code snippet nonetheless.

** Attributes **

   location (defunct):   http://freegis.org/cgi-bin/viewcvs.cgi/grass51/lib/form/html_library.tcl?rev=1.1&content-type=text/vnd.viewcvs-markup

   [https://groups.google.com/d/msg/comp.lang.tcl/3TtLuDPal9k/OUJEaHieGB0J%|%Didn't anyone go to last week's Tcl/Tk Workshop?], [comp.lang.tcl], 1995-07-10

   [https://groups.google.com/d/msg/comp.lang.tcl/_PFtAe-o2so/EcWfxP_3QO4J%|%Tcl/Tck and HTML], [comp.lang.tcl], 1995-07-20

** Description **

[EKB] et al: The parser posted on that page, which derives from [Stephen Uhler]'s [HTML] parser in 8 lines, is now actually down to 4 lines. Here is the current version:

======
############################################
# Turn HTML into TCL commands
#   html    A string containing an html document
#   cmd     A command to run for each html tag found
#   start   The name of the dummy html start/stop tags

proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    set exp {<(/?)([^ \t\r\n>]+)[ \t\r\n]*([^>]*)>}
    set sub "\}\n[list $cmd] {\\2} {\\1} {\\3} \{"
    regsub -all $exp [string map {\{ \&ob; \} \&cb;} $html] $sub html
    eval "$cmd {$start} {} {} \{ $html \}; $cmd {$start} / {} {}"
}
======

The default value of ''cmd'', ''HMtest_parse'', was not defined anywhere, so I wrote one and applied it to a sample bit of HTML:

======
proc HMtest_parse {tag state props body} {
    if {$state eq {}} {
        set msg "Start $tag"
        if {$props ne {}} {
            set msg "$msg with args: $props"
        }
        set msg "$msg\n$body"
    } else {
        set msg "End $tag"
    }
    puts $msg
}

HMparse_html {

<html><p class="bubba">
This is my very first paragraph. How do you like it? I think it has a lot to recommend it.
</p><p class="louielouie">
This is my second paragraph, which is OK, but not as nice as my first one.
</p></html>

}
======

'''Output''':

======none
Start hmstart
Start html
Start p with args: class="bubba"
This is my very first paragraph. How do you like it? I think it has a lot to recommend it.
End p
Start p with args: class="louielouie"
This is my second paragraph, which is OK, but not as nice as my first one.
End p
End html
End hmstart
======

In fact, the code is not HTML-specific, and can handle simple [XML] code (for example, XML that doesn't use self-closing tags). It's like a mini-[SAX]. (Actually, it isn't quite like SAX. It's only like it because you define handlers for each tag. But unlike SAX it operates on a string in memory and doesn't execute until everything has been converted.)

I've created a small XML parser based on this code and put it in [TAX: A Tiny API for XML].

[PYK]: '''the following comment refers to a previous, longer version of the parser which can be found in the history for this page'''

In spite of its incredible (to me) brevity, the code can actually be shortened somewhat. The proc ''HMcl'' is introduced in order to avoid trouble with [[ ]]'s. That trouble can also be avoided by enclosing the value of ''exp'' in { }'s. Also, the variable ''w'' doesn't need to be defined (at least in recent Tcl versions): \s can be used instead. Here's the new ''HMparse_html'' proc:

======
proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    regsub -all \{ $html {\&ob;} html
    regsub -all \} $html {\&cb;} html
    set exp {<(/?)([^\s>]+)\s*([^>]*)>}
    set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
    regsub -all $exp $html $sub html
    eval "$cmd {$start} {} {} \{ $html \}"
    eval "$cmd {$start} / {} {}"
}
======

----

OK, one more thing... If the ''cmd'' is an [ensemble], then the different tags can be sub-procs within the ensemble.
For example, just as ''string length'' is a command in which [string] is the ensemble and ''length'' is the sub-proc, it should be possible to set up ''cmd'' so that ''cmd p'' would invoke the proc for parsing p tags, ''cmd html'' would invoke the command for parsing html tags, etc.

It's pretty easy to create ensembles in [snit], so here's a snit version:

======
package require snit

############################################
# Turn HTML into TCL commands
#   html    A string containing an html document
#   cmd     A command to run for each html tag found
#   start   The name of the dummy html start/stop tags

proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    regsub -all \{ $html {\&ob;} html
    regsub -all \} $html {\&cb;} html
    set exp {<(/?)([^\s>]+)\s*([^>]*)>}
    set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
    regsub -all $exp $html $sub html
    eval "$cmd {$start} {} {} \{ $html \}"
    eval "$cmd {$start} / {} {}"
}

snit::type parser {
    proc isend {state} {
        if {$state eq {}} {
            return false
        } else {
            return true
        }
    }

    method hmstart {args} {}

    method html {state args} {
        if {[isend $state]} {
            puts {That's all, folks!}
        } else {
            puts {Let's get going!}
        }
    }

    method p {state props body} {
        if {![isend $state]} {puts $body}
    }
}

parser HMtest_parse

HMparse_html {

<html><p class="bubba">
This is my very first paragraph. How do you like it? I think it has a lot to recommend it.
</p><p class="louielouie">
This is my second paragraph, which is OK, but not as nice as my first one.
</p></html>

}
======

'''Output''':

======none
Let's get going!
This is my very first paragraph. How do you like it? I think it has a lot to recommend it.
This is my second paragraph, which is OK, but not as nice as my first one.
That's all, folks!
======

----

The problem with using snit (or [incr tcl]) is that you have to declare handlers for all tags, or you will end up with a runtime error (for example "method body not found"). I myself use the following mechanism with some success:

======
proc HMtest_parse {tag state props body} {
    # Dispatch only when a handler proc exists for this tag.
    if {[info procs handle_$tag] ne {}} {
        handle_$tag $state $props $body
    }
}

proc handle_a {state props body} { ... }
proc handle_img {state props body} { ... }
======

This way, you only have to declare handlers for the tags that you care about.

Hai Vu

----

[WHD]: Actually, Snit allows you to define a method that receives all unknown methods:

======
delegate method * using {%s UnknownMethod %m}

method UnknownMethod {methodName args} { ... }
======

<<categories>> HTML | XML | Parsing | Word and Text Processing | String Processing | Internet
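
----

The catch-all idea from the snit discussion above can probably also be had with a core ''namespace ensemble'' (Tcl 8.5 or later), without extra packages. What follows is only a minimal sketch and not part of the original page: the names ''htmltags'', ''Ignore'', and ''Unknown'' are hypothetical, and the parser is repeated from the current version so that the block runs on its own. The ensemble's ''-unknown'' handler routes every tag without a dedicated handler to a do-nothing proc, and ''-prefixes 0'' keeps a tag such as ''h'' from being taken as an abbreviation of a longer handler name.

======
# Parser repeated from the current version on this page.
proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    set exp {<(/?)([^ \t\r\n>]+)[ \t\r\n]*([^>]*)>}
    set sub "\}\n[list $cmd] {\\2} {\\1} {\\3} \{"
    regsub -all $exp [string map {\{ \&ob; \} \&cb;} $html] $sub html
    eval "$cmd {$start} {} {} \{ $html \}; $cmd {$start} / {} {}"
}

# Hypothetical handler namespace; none of these names come from the page.
namespace eval htmltags {
    variable seen {}

    # Handler for p tags: record the trimmed body of each opening tag.
    proc p {state props body} {
        variable seen
        if {$state eq {}} {
            lappend seen [string trim $body]
        }
    }

    # Do-nothing handler used for every tag without its own proc.
    proc Ignore {args} {}

    # Invoked by the ensemble for an unrecognized subcommand (tag).
    # The returned list replaces the ensemble name and subcommand,
    # so the remaining arguments are passed on to Ignore.
    proc Unknown {ensemble tag args} {
        return [list [namespace current]::Ignore]
    }

    # Only exported procs become subcommands of the ensemble.
    namespace export p
    namespace ensemble create -prefixes 0 \
        -unknown [namespace current]::Unknown
}

HMparse_html {<html><p class="a">One</p><img src="x.png"><p>Two</p></html>} htmltags
puts $htmltags::seen
======

Run as written, this should print ''One Two'': the html, img, and hmstart tags all fall through to ''Ignore''. Compared with the ''info procs'' check, the dispatch decision lives in the ensemble itself, at the cost of requiring Tcl 8.5.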