htmlparse

htmlparse is a module in the tcllib library of Tcl code.

The htmlparse package provides commands that allow libraries and applications to parse HTML in a string into a representation of their choice. (From the man page [1])

Documentation can be found at https://core.tcl-lang.org/tcllib/doc/trunk/embedded/md/tcllib/files/modules/htmlparse/htmlparse.md (location as of 21 Jul 2019; formerly https://core.tcl.tk/tcllib/doc/trunk/embedded/www/tcllib/files/modules/htmlparse/htmlparse.html)


escargo 8 Aug 2005 - Once you have parsed the HTML file and have it in a tree (thanks to htmlparse::2tree), is there a convenient way to write the resulting tree back out as HTML? Or is that supposed to be obvious?

schlenk - As the tree is implemented via the struct::tree data structure, you should be able to call its walk method with a simple formatting proc to serialize the tree back to HTML. The html package may be helpful there.

escargo - I'll have to dig into it a bit more, but it appears that if I want to generate opening and closing tags, only one of the eight possible traversal policies would be right: depth-first search (dfs) with order both. That combination provides enter and leave actions, visiting each parent both before and after its children, which gives the opportunity to wrap each subtree in the proper opening and closing tags. Of course, I was hoping that htmlparse could perform round-trip operations. That way you could use 2tree, removeVisualFluff, removeFormDefs, and then output the modified tree. No such luck.
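
The dfs/both traversal described above might be sketched as follows. This is an illustration, not something htmlparse provides: the proc name tree2html and the voidTags list are hypothetical, and it assumes the node attributes documented for htmlparse::2tree (type holding the tag name or PCDATA, data holding the tag parameters or the text). The parser may have trimmed whitespace or decoded entities in the text it stored, so the output is not guaranteed to match the original input.

```tcl
package require struct::tree
package require htmlparse

# Hypothetical list of elements that take no closing tag.
set ::voidTags {area base br col hr img input link meta param}

# Serialize a tree built by htmlparse::2tree back to HTML text.
proc tree2html {tree} {
    set out ""
    # -order both yields an "enter" and a "leave" visit for every node.
    $tree walk root -order both -type dfs {action node} {
        set type [$tree get $node type]
        if {$type in {root hmstart}} {
            # Wrapper nodes (skipped defensively): nothing to emit.
        } elseif {$type eq "PCDATA"} {
            # Text node: emit the stored text once, on entry.
            if {$action eq "enter"} {
                append out [$tree get $node data]
            }
        } elseif {$action eq "enter"} {
            # Opening tag, with its parameters if it has any.
            set attrs [string trim [$tree get $node data]]
            if {$attrs ne ""} {
                append out "<$type $attrs>"
            } else {
                append out "<$type>"
            }
        } elseif {$type ni $::voidTags} {
            # Leaving an element that needs a closing tag.
            append out "</$type>"
        }
    }
    return $out
}

set t [struct::tree]
htmlparse::2tree {<p>Hello <b>world</b></p>} $t
puts [tree2html $t]
$t destroy
```

Calls such as htmlparse::removeVisualFluff or htmlparse::removeFormDefs could be made on the tree between the 2tree call and the serialization step.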


MSW Either it's me or htmlparse gets the structure of an HTML document wrong.

 (description deleted)

schlenk Put a bug report on tcllib at SF for this.

MSW Done, #1008619 [2].

HJG 2006-12-03: Issue is still open

escargo 3 Dec 2006 - There is also an issue parsing <script> </script> blocks where it gets confused by JavaScript expressions that are allowed by the standard (like i<j). That issue is still open as well.

escargo 15 Feb 2008 - My previous issue is still open, and a worse problem (for me) has surfaced as well. If the JavaScript includes document.write statements that write HTML (including <SCRIPT> tags inside of the literal strings), htmlparse gets seriously confused about the parse tree of the result.


escargo 18 Feb 2008 - After digging around in the code for htmlparse, I think it is inadequate for the task I need to perform. The HTML I need to parse has JavaScript inside of <SCRIPT> tags, and that JavaScript contains string literals which themselves contain <SCRIPT> and other tags. htmlparse gets seriously confused using its present parsing mechanism. Possible solutions include changing ::htmlparse::PrepareHtml to elide those SCRIPT tags before the rest of the parse can see them (and get confused by them), or even switching to a different HTML parser. (I did try tDOM, but it gets confused by the input as well.)

Are there other HTML parsers in Tcl (or portable across Windows and Linux) that might be adequate to the task?

APN 2019-05-19 tDOM in its current incarnation can parse HTML using Google's Gumbo parser (option -html5). I would expect it to have no issues parsing anything that is accepted by a standard browser.
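
A minimal sketch of that approach, assuming a tDOM build that includes the HTML5 (Gumbo) parser. The proc name parseHtmlLenient is hypothetical. If -html5 is not available in the installed build, the sketch falls back to tDOM's older -html parser, which is less forgiving.

```tcl
package require tdom

# Parse HTML with the HTML5 parser when the build supports it,
# otherwise fall back to the older, simpler -html parser.
proc parseHtmlLenient {html} {
    if {[catch {dom parse -html5 $html} doc]} {
        set doc [dom parse -html $html]
    }
    return $doc
}

set doc [parseHtmlLenient {<p>Hello <b>world</b></p>}]
# Write the parsed document back out as HTML.
puts [[$doc documentElement] asHTML]
$doc delete
```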

schlenk 18 Feb 2008 - There is tclwebtest, but I doubt it's much better than htmlparse; take a look anyway. Another option could be code from the hv3 application, the web browser based on the tkhtml3 package. A less attractive option would be a wrapper around Mozilla's XML parser to extract the DOM tree from there.

escargo - I posted a message to the tkhtml Google group asking the question. One would like to think that there is a way to get the parsed HTML out without requiring a window to do the rendering. (One thing that was mentioned is that the code is in an alpha state until they freeze the interfaces. Maybe this is an interface request that will require rethinking existing interfaces.) They might not have considered that other applications might want HTML processing without visual rendering.

In the short term I was able to preprocess the HTML to remove the problematic portions (at least for the one specific instance I'm working with now).
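
A preprocessing step of this kind might look like the sketch below. It is not the code actually used above. The proc name elideScripts is hypothetical, and it assumes that dropping every script element, contents included, is acceptable. All quantifiers are deliberately non-greedy, because Tcl regular expressions that mix greedy and non-greedy quantifiers can match more than intended.

```tcl
# Remove every <script>...</script> element before the HTML
# reaches the parser. Each match ends at the first closing
# script tag found after an opening one.
proc elideScripts {html} {
    regsub -all -nocase {<script\M[^>]*?>.*?</script\s*?>} $html {} stripped
    return $stripped
}

set sample {<p>a</p><SCRIPT>document.write('<b>x</b>'); if (i<j) {}</SCRIPT><p>b</p>}
puts [elideScripts $sample]
```

The stripped string can then be handed to htmlparse::2tree or htmlparse::parse as usual.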


Does anybody know where to find an online copy of the HTML DTD?

Try the W3C: http://www.w3.org/TR/html4/sgml/dtd.html


Is this related to HTML display?

See also: Parsing HTML