Version 5 of htmlparse

Updated 2004-08-13 08:08:12

htmlparse is a module in the tcllib library of Tcl code.

The htmlparse package provides commands that allow libraries and applications to parse HTML in a string into a representation of their choice. (From the man page.)

Documentation can be found at http://tcllib.sourceforge.net/doc/htmlparse.html
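
As a rough illustration of the "representation of their choice" idea, here is a minimal sketch that hands htmlparse::parse a callback through -cmd. The callback name showTag, the global list seen, and the sample HTML are made up for this example; the four callback arguments (tag name, "/" for a closing tag, attribute string, text following the tag) are as described in the man page.

```tcl
package require htmlparse

# Hypothetical callback: htmlparse::parse invokes it once per tag with the
# tag name, "/" for a closing tag (empty otherwise), the attribute string,
# and the text that follows the tag up to the next one.
set seen {}
proc showTag {tag slash param text} {
    global seen
    lappend seen $slash$tag
    puts "tag=$slash$tag attrs={$param} text={$text}"
}

# Sample input, invented for this sketch.
set html {<p class="x">hello <b>world</b></p>}
htmlparse::parse -cmd showTag $html
```

The callback is free to build whatever structure it likes; htmlparse::2tree, used further down this page, is the variant that fills a struct::tree instead.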


MSW: Either it's me, or htmlparse gets the structure of an HTML document wrong.

 $ cat wrong.tcl
 package require struct::tree
 package require htmlparse
 set indent 0
 set t [struct::tree]
 set html {<html><head></head><body><h1>heading</h1><p>ayaken!</p></body></html>}
 # Print each node's type on enter (>) and leave (<), indented by tree depth
 proc painter {tree act node} {
        global indent
        if {$act eq "enter"} {
                incr indent
                puts ">[string repeat - [expr {$indent-1}]][$tree get $node type]"
        } else {
                puts "<[string repeat - [expr {$indent-1}]][$tree get $node type]"
                incr indent -1
        }
 }

 htmlparse::2tree $html $t
 $t walk root -order both -type dfs {act node} {painter $t $act $node}

 $ tclsh wrong.tcl
 >root
 >-hmstart
 >--html
 >---head
 <---head
 >---body
 >----h1
 >-----PCDATA
 <-----PCDATA
 >-----p
 >------PCDATA
 <------PCDATA
 <-----p
 <----h1
 <---body
 <--html
 <-hmstart
 <root

Look at where the p ended up in the tree: it is a child of the h1. It should be a child of body, i.e. a sibling of the h1.
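
To check this without reading the whole dump, a small sketch along the following lines reports only the parent of each p node. It assumes the same struct::tree 2.x API the script above uses (walk with a variable list, get with a key, parent); the variable name parents is made up for this example, and what it prints may depend on the tcllib version installed.

```tcl
package require struct::tree
package require htmlparse

set t [struct::tree]
set html {<html><head></head><body><h1>heading</h1><p>ayaken!</p></body></html>}
htmlparse::2tree $html $t

# Collect the type of the parent of every p node in the tree.
set parents {}
$t walk root -order pre -type dfs {act node} {
    if {[$t get $node type] eq "p"} {
        lappend parents [$t get [$t parent $node] type]
    }
}
puts "p nodes hang under: $parents"
$t destroy
```

For a correctly nested tree one would expect this to name body rather than h1.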


Category Package, subset Tcllib