htmlparse

htmlparse is a module in the [tcllib] library of Tcl code. 

The htmlparse package provides commands that allow libraries and applications 
to parse [HTML] in a string into a representation of their choice. 
(From the man page [http://tcllib.sourceforge.net/doc/htmlparse.html])
Documentation can be found at https://core.tcl.tk/tcllib/doc/trunk/embedded/www/tcllib/files/modules/htmlparse/htmlparse.html<<br>>
7/21/2019 Now here: https://core.tcl-lang.org/tcllib/doc/trunk/embedded/md/tcllib/files/modules/htmlparse/htmlparse.md
----
''[escargo] 8 Aug 2005'' - Once you have parsed the [HTML] file and have it in a tree 
(thanks to ''htmlparse::2tree''), is there a convenient way to write the resulting tree 
back out as HTML?
Or is that supposed to be obvious?

[schlenk] - As the tree is implemented via the [struct]::tree data structure, 
you should be able to simply [[call]] its walk method with a simple formatting proc 
to serialize the tree back to HTML.
The [html] package may be helpful there.

[escargo] - I'll have to dig into it a bit more, but it appears that if I want to generate opening
and closing tags, only one of the eight possible traversal policies would be right.  
It looks like ''order == both'' with ''dfs'' (depth-first search) is the one: 
it provides enter and leave actions, visiting each parent both before and after its children, 
which gives the opportunity to wrap subtrees in the proper opening and closing tags.  
Of course, I was hoping that htmlparse could perform round-trip operations.  
That way you could use 2tree, removeVisualFluff, removeFormDefs, 
and then output the modified tree.
No such luck.
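
The walk described above can be sketched as follows. This is only a minimal illustration, not part of htmlparse: `tree2html` is a hypothetical helper name, and it assumes what the htmlparse man page describes for ''2tree'' output, namely that each node carries a `type` attribute (the tag name, or `PCDATA` for text) and a `data` attribute (the raw tag attributes, or the text). Skipping wrapper nodes of type `root` and `hmstart`, and the short list of void elements, are also assumptions.

```tcl
package require Tcl 8.5
package require struct::tree

# Hypothetical helper (not provided by htmlparse): serialize a tree whose
# nodes carry "type" and "data" attributes back into an HTML string.
proc tree2html {t} {
    # Assumed list of elements that never get a closing tag.
    set void {area base br col hr img input link meta}
    set out ""
    # -order both with -type dfs runs the script twice per node:
    # once with action "enter" (before the children), once with "leave".
    $t walk [$t rootname] -order both -type dfs {action node} {
        set type ""
        if {[$t keyexists $node type]} {
            set type [$t get $node type]
        }
        if {$type in {{} root hmstart}} {
            # Wrapper node, or node without a type: emit nothing.
        } elseif {$type eq "PCDATA"} {
            if {$action eq "enter"} {
                # Re-escape the characters that are special in HTML text.
                append out [string map {& &amp; < &lt; > &gt;} [$t get $node data]]
            }
        } elseif {$action eq "enter"} {
            set attrs ""
            if {[$t keyexists $node data]} {
                set attrs [string trim [$t get $node data]]
            }
            if {$attrs ne ""} {
                append out "<$type $attrs>"
            } else {
                append out "<$type>"
            }
        } elseif {$type ni $void} {
            append out "</$type>"
        }
    }
    return $out
}

# Possible usage with htmlparse (requires tcllib; $html holds the document):
#   package require htmlparse
#   set t [::struct::tree]
#   ::htmlparse::2tree $html $t
#   ::htmlparse::removeVisualFluff $t
#   puts [tree2html $t]
```

The output will not be byte-identical to the input (whitespace, attribute quoting and optional closing tags are not preserved), so this is a re-serialization rather than a true round trip.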
----
[MSW] Either it's me or htmlparse gets the structure of an HTML doc wrong.
 (description deleted)

[schlenk] Put a bug report on tcllib at SF for this.

[MSW] Done, #1008619 [https://sourceforge.net/tracker/index.php?func=detail&aid=1008619&group_id=12883&atid=112883].

[HJG] 2006-12-03: Issue is still open

''[escargo] 3 Dec 2006'' - There is also an issue parsing <script> </script> blocks 
where it gets confused by JavaScript expressions that are allowed by the standard (like i<j). 
That issue is still open as well.

''[escargo] 15 Feb 2008'' - My previous issue is still open, and a worse problem (for me) has
surfaced as well. If the [JavaScript] includes document.write statements that write HTML (including
<SCRIPT> tags inside of the literal strings), htmlparse gets seriously confused about the parse tree
of the result.
----
''[escargo] 18 Feb 2008'' - After digging around in the code for htmlparse, I think it is
inadequate for the task I need to perform. The HTML I need to parse has JavaScript inside
of <SCRIPT> tags that contain string literals that contain <SCRIPT> and other tags. htmlparse
gets seriously confused using its present parsing mechanism. Possible solutions include changing
::htmlparse::PrepareHtml to elide those SCRIPT tags before the rest of the parse can see them
(and get confused by them), or even switching to a different HTML parser. (I did try tDOM, but it gets confused by the input as well.) 

Are there other HTML parsers in Tcl (or portable across Windows and Linux) that might be adequate
to the task?

[APN] 2019-05-19 tDOM in its current incarnation can parse HTML using Google's Gumbo parser (option -html5). 
I would expect it to have no issues parsing anything that is accepted by a standard browser.

''[schlenk] 18 Feb 2008'' - There is [tclwebtest], but I doubt it's much better than htmlparse; still, take a look. Another option could be code from the [hv3] application, the web browser based on the [tkhtml3] package. Another, not so nice, option would be a wrapper around Mozilla's XML parser to extract the DOM tree from there.

''escargo'' - I posted a message to the [tkhtml] Google group asking the question. One would like
to think that there is a way to get the parsed HTML out without requiring a window to do the rendering.
(One thing that was mentioned is that the code is in an alpha state until they freeze the interfaces.
Maybe this is an interface request that will require rethinking existing interfaces.) They might not
have considered that other applications might want HTML processing without visual rendering.

In the short term I was able to preprocess the HTML to remove the problematic portions (at least for
the one specific instance I'm working with now).
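
A preprocessing step of this kind might look like the sketch below. It is not the code actually used above and does not touch ::htmlparse::PrepareHtml; `stripScripts` is a hypothetical helper that simply deletes every script element before the string is handed to htmlparse. It assumes, as browsers do, that a script block ends at the first closing script tag, so a literal `</SCRIPT>` inside a JavaScript string would still end the match early.

```tcl
# Hypothetical helper: remove <script>...</script> blocks from an HTML string.
# All quantifiers are non-greedy on purpose: in Tcl regular expressions a
# branch takes its greediness from its first quantifier, so mixing greedy
# and non-greedy quantifiers would not give the shortest match.
# \M matches at the end of a word, so "<scripted>" is left alone.
proc stripScripts {html} {
    regsub -all -nocase {<script\M[^>]*?>.*?</script\s*?>} $html {} html
    return $html
}

# Possible usage (requires tcllib; $html holds the document):
#   package require htmlparse
#   set t [::struct::tree]
#   ::htmlparse::2tree [stripScripts $html] $t
```

This throws the scripts away entirely, which is fine when only the document structure matters; if the script text is needed later, it would have to be captured separately before the substitution.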

----
Anybody know where to find an online document for the HTML [DTD] ?

Try the W3C: http://www.w3.org/TR/html4/sgml/dtd.html
----
Is this related to [HTML display] ?


See also: [Parsing HTML]

<<categories>> Package | Tcllib | Web