Version 18 of htmlparse

Updated 2007-03-16 18:55:58

htmlparse is a module in the tcllib library of Tcl code.

The htmlparse package provides commands that allow libraries and applications to parse HTML in a string into a representation of their choice. (From the man page.)

Documentation can be found at http://tcllib.sourceforge.net/doc/htmlparse.html


escargo 8 Aug 2005 - Once you have parsed the HTML file and have it in a tree (thanks to htmlparse::2tree), is there a convenient way to write the resulting tree back out as HTML? Or is that supposed to be obvious?

schlenk - As the tree is implemented via the struct::tree data structure, you should be able to simply call its walk method with a small formatting proc to serialize the tree back to HTML. The html package may be helpful there.

escargo - I'll have to dig into it a bit more, but it appears that if I want to generate opening and closing tags, only one of the eight possible traversal policies would be right: order both with type dfs (depth-first search). That combination provides enter and leave actions and visits each parent before and after its children, which gives the opportunity to wrap subtrees in the proper opening and closing tags. Of course, I was hoping that htmlparse could perform round-trip operations, so that you could use 2tree, removeVisualFluff, removeFormDefs, and then output the modified tree. No such luck.
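
A minimal sketch of the walk-based serializer discussed above, not a tested round-trip solution. It assumes the struct::tree 2.x API (walk with a loop variable list, get node key) and that every node created by htmlparse::2tree carries a type attribute (the lowercase tag name, or PCDATA for text) and a data attribute (the raw tag attributes, or the text itself). The procs tree2html and formatNode and the short list of void tags are hypothetical helpers invented for this example; they are not part of htmlparse.

```tcl
package require struct::tree
package require htmlparse

# Hypothetical helper: return the HTML fragment for one visit of one node.
# Assumes the "type" and "data" node attributes described in the lead-in.
proc formatNode {t action node} {
    set type [$t get $node type]
    set data ""
    if {[$t keyexists $node data]} {
        set data [$t get $node data]
    }

    if {$type eq "PCDATA"} {
        # Text nodes are emitted once, on enter. htmlparse decodes
        # entities while parsing, so the basic ones are re-escaped here.
        if {$action ne "enter"} { return "" }
        return [string map {& &amp; < &lt; > &gt;} $data]
    }

    if {$action eq "enter"} {
        set attrs [string trim $data]
        if {$attrs ne ""} { return "<$type $attrs>" }
        return "<$type>"
    }

    # Leave action: emit a closing tag, except for tags that have none.
    # This list is an assumption and far from complete.
    if {[lsearch -exact {br hr img input meta link} $type] >= 0} {
        return ""
    }
    return "</$type>"
}

# Hypothetical helper: depth-first walk with both enter and leave
# actions, so that every subtree is wrapped in its opening and closing tag.
proc tree2html {t} {
    set out ""
    set root [$t rootname]
    $t walk $root -order both -type dfs {action node} {
        # The root node is only a container, so it is skipped.
        if {$node ne $root} {
            append out [formatNode $t $action $node]
        }
    }
    return $out
}

# Usage: parse, optionally clean up, then write the tree back out.
set html {<p class="note">Tom &amp; Jerry<br>next line</p>}
set t [struct::tree]
htmlparse::2tree $html $t
htmlparse::removeVisualFluff $t
puts [tree2html $t]
```

The result is not expected to be byte-identical to the input: whitespace around text may be lost during parsing, and comments, doctype declarations, and script contents are not handled specially here.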


MSW Either it's me or htmlparse gets the structure of an HTML document wrong.

 (description deleted)

schlenk Put a bug report on tcllib at SF for this.

MSW Done, #1008619.

HJG 2006-12-03: Issue is still open

escargo 3 Dec 2006 - There is also an issue parsing <script> </script> blocks where it gets confused by JavaScript expressions that are allowed by the standard (like i<j). That issue is still open as well.


Does anybody know where to find an online copy of the HTML DTD?

Try the W3C: http://www.w3.org/TR/html4/sgml/dtd.html


Is this related to HTML display?


Category Package, subset Tcllib