htmlparse is a module in the tcllib library of Tcl code.
Documentation can be found at https://core.tcl.tk/tcllib/doc/trunk/embedded/www/tcllib/files/modules/htmlparse/htmlparse.html
escargo 8 Aug 2005 - Once you have parsed the HTML file and have it in a tree (thanks to htmlparse::2tree), is there a convenient way to write the resulting tree back out as HTML? Or is that supposed to be obvious?
schlenk - As the tree is implemented via the struct::tree data structure, you should be able to simply call its walk method with a simple formatting proc to serialize the tree back to HTML. The html package may be helpful there.
escargo - I'll have to dig into it a bit more, but it appears that if I want to generate opening and closing tags, only one of the eight possible traversal policies would be right: order both with type dfs (depth-first search). That combination provides enter and leave actions, so each parent is visited both before and after its children, which gives the opportunity to wrap subtrees in the proper opening and closing tags. Of course, I was hoping that htmlparse could perform round-trip operations, so that you could use 2tree, removeVisualFluff, removeFormDefs, and then output the modified tree. No such luck.
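A minimal sketch of the walk-based serialization discussed above, assuming struct::tree version 2 and the "type" and "data" node attributes that htmlparse::2tree documents. The proc name tree2html and the list of void elements are hypothetical and not part of htmlparse; attribute strings and text are written back verbatim, so this is not a faithful round trip for every document.

```tcl
package require struct::tree
package require htmlparse

# Hypothetical helper (not part of htmlparse): serialize a tree built by
# htmlparse::2tree back to HTML text.
proc tree2html {t} {
    # Assumption: elements that never take a closing tag; extend as needed.
    set void {br hr img input meta link}
    set out ""
    # -order both with -type dfs visits every node twice: once on "enter"
    # (before its children) and once on "leave" (after them).
    $t walk [$t rootname] -order both -type dfs {action node} {
        # 2tree stores the tag name under "type" and the attribute string
        # (or the text, for PCDATA nodes) under "data".
        if {[$t keyexists $node type]} {
            set type [$t get $node type]
            set data ""
            if {[$t keyexists $node data]} {
                set data [$t get $node data]
            }
            if {$type eq "PCDATA"} {
                # Text nodes are emitted once, on entry.
                if {$action eq "enter"} { append out $data }
            } elseif {$type ne "root"} {
                if {$action eq "enter"} {
                    append out "<$type"
                    if {$data ne ""} { append out " $data" }
                    append out ">"
                } elseif {[string tolower $type] ni $void} {
                    append out "</$type>"
                }
            }
        }
    }
    return $out
}

# Usage: parse a fragment into a fresh, empty tree, then write it back out.
set t [::struct::tree]
htmlparse::2tree {<p>Hello <b>world</b></p>} $t
set html [tree2html $t]
puts $html
$t destroy
```

Calls such as htmlparse::removeVisualFluff or htmlparse::removeFormDefs would go between 2tree and tree2html, which is the modify-then-output workflow asked about above.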
MSW Either it's me, or htmlparse gets the structure of an HTML doc wrong.
schlenk File a bug report against tcllib at SF for this.
HJG 2006-12-03: Issue is still open
Are there other HTML parsers in Tcl (or portable across Windows and Linux) that might be adequate to the task?
APN 2019-05-19 tDOM in its current incarnation can parse HTML using Google's Gumbo parser (option -html5). I would expect it to have no issues parsing anything that is accepted by a standard browser.
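A minimal sketch of the tDOM route, assuming the tdom package is installed. It uses the -html option, which selects tDOM's built-in forgiving HTML parser; the -html5 option mentioned above depends on tDOM having been built with Gumbo support, so it is left out here. The sample fragment and variable names are illustrative only.

```tcl
package require tdom

set html {<p>Hello <b>world</b></p>}

# Parse the HTML into a DOM document; no window or rendering is involved.
set doc  [dom parse -html $html]
set root [$doc documentElement]

# Query the parsed tree with XPath ...
set bold [$root selectNodes {string(//b)}]
puts $bold

# ... and serialize it back out as HTML.
set out [$root asHTML]
puts $out

# DOM documents are not garbage collected; free them explicitly.
$doc delete
```

Unlike the htmlparse tree, the DOM here can be written back out directly with asHTML, so no hand-written walk is needed.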
schlenk 18 Feb 2008 - There is tclwebtest, but I doubt it's much better than htmlparse; still, take a look. Another option could be code from the hv3 application, the web browser based on the tkhtml3 package. A less attractive option would be a wrapper around Mozilla's XML parser to extract the DOM tree from there.
escargo - I posted a message to the tkhtml Google group asking the question. One would like to think that there is a way to get the parsed HTML out without requiring a window to do the rendering. (One thing that was mentioned is that the code is in an alpha state until they freeze the interfaces. Maybe this is an interface request that will require rethinking existing interfaces.) They might not have considered that other applications might want HTML processing without visual rendering.
In the short term I was able to preprocess the HTML to remove the problematic portions (at least for the one specific instance I'm working with now).
Does anybody know where to find an online copy of the HTML DTD?
Try the W3C: http://www.w3.org/TR/html4/sgml/dtd.html
Is this related to HTML display?
See also: Parsing HTML