[EKB] This is a follow-up to the discussion on [Is Tcl Different!]. Here's the wonderful [HTML] parser in 10 lines as posted on that page, and originally suggested by [Stephen Uhler]: ---- ############################################ # Turn HTML into TCL commands # html A string containing an html document # cmd A command to run for each html tag found # start The name of the dummy html start/stop tags proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} { regsub -all \{ $html {\&ob;} html regsub -all \} $html {\&cb;} html set w " \t\r\n" ;# white space proc HMcl x {return "\[$x\]"} set exp <(/?)([HMcl ^$w>]+)[HMcl $w]*([HMcl ^>]*)> set sub "\}\n$cmd {\\2} {\\1} {\\3} \{" regsub -all $exp $html $sub html eval "$cmd {$start} {} {} \{ $html \}" eval "$cmd {$start} / {} {}" } # But it was missing the default value for ''cmd'', ''HMtest_parse'', # so I wrote one and applied it to a sample bit of HTML: proc HMtest_parse {tag state props body} { if {$state == ""} { set msg "Start $tag" if {$props != ""} { set msg "$msg with args: $props" } set msg "$msg\n$body" } else { set msg "End $tag" } puts $msg } HMparse_html {

This is my very first paragraph. How do you like it? I think it has a lot to recommend it.

This is my second paragraph, which is OK, but not as nice as my first one.

} ---- This gives the following output: Start hmstart Start html Start p with args: class="bubba" This is my very first paragraph. How do you like it? I think it has a lot to recommend it. End p Start p with args: class="louielouie" This is my second paragraph, which is OK, but not as nice as my first one. End p End html End hmstart In fact, the code is not HTML-specific, and can handle simple [XML] code (e.g., that doesn't use the self-closing format). It's like a mini-[SAX]. (Actually, it isn't quite like SAX. It's only like it because you define handlers for each tag. But unlike SAX it operates on a string in memory and doesn't execute until everything has been converted.) I've created a small XML parser based on this code and put it in [TAX: A Tiny API for XML]. In spite of its incredible (to me) brevity, the code can actually be shortened somewhat. The proc ''HMcl'' is introduced in order to avoid trouble with [[ ]]'s. But it can also be avoided by enclosing the value of ''exp'' in { }'s. Also, the variable ''w'' doesn't need to be defined (at least in recent Tcl versions): \s can be used instead. Here's the new ''HMparse_html'' proc: ---- proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} { regsub -all \{ $html {\&ob;} html regsub -all \} $html {\&cb;} html set exp {<(/?)([^\s>]+)\s*([^>]*)>} set sub "\}\n$cmd {\\2} {\\1} {\\3} \{" regsub -all $exp $html $sub html eval "$cmd {$start} {} {} \{ $html \}" eval "$cmd {$start} / {} {}" } ---- OK, one more thing... If the ''cmd'' is an [ensemble], then the different tags can be sub-procs within the ensemble. For example, just like ''string length'' is a command, where [string] is the ensemble, and ''length'' is the sub-proc, it should be possible to set up ''cmd'' so that ''cmd p'' would invoke the proc for parsing p tags, ''cmd html'' would invoke the command for parsing html tags, etc. It's pretty easy to create ensembles in [snit], so here's a snit version: ---- package require snit ############################################ # Turn HTML into TCL commands # html A string containing an html document # cmd A command to run for each html tag found # start The name of the dummy html start/stop tags proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} { regsub -all \{ $html {\&ob;} html regsub -all \} $html {\&cb;} html set exp {<(/?)([^\s>]+)\s*([^>]*)>} set sub "\}\n$cmd {\\2} {\\1} {\\3} \{" regsub -all $exp $html $sub html eval "$cmd {$start} {} {} \{ $html \}" eval "$cmd {$start} / {} {}" } snit::type parser { proc isend {state} { if {$state == ""} { return false } else { return true } } method hmstart {args} {} method html {state args} { if [isend $state] { puts "That's all, folks!" } else { puts "Let's get going!" } } method p {state props body} { if {![isend $state]} {puts $body} } } parser HMtest_parse HMparse_html {

This is my very first paragraph. How do you like it? I think it has a lot to recommend it.

This is my second paragraph, which is OK, but not as nice as my first one.

} ---- This is the output: Let's get going! This is my very first paragraph. How do you like it? I think it has a lot to recommend it. This is my second paragraph, which is OK, but not as nice as my first one. That's all, folks! ---- The problem with using snit (or [incr tcl] is you have to declare handlers for all tags or you will end up with a runtime error (for example "method body not found"). I myself use the following mechanism with some success: proc HMtest_parse {tag state props body} { if {[info proc handle_$tag] != ""} { handle_$tag $state $props $body } } proc handle_a {state props body} { ... } proc handle_img {state props body} { ... } This way, you only have to declare handlers for the tags that you care about. Hai Vu !!!!!! %| [Category HTML] | [Category XML] | [Category Parsing] | [Category Word and Text Processing] | [Category String Processing] | [Category Internet] |% !!!!!!