html

HTML, or HyperText Markup Language, is the markup language used to structure documents on the World Wide Web.

Parsing Tools

Tcllib html: a module for generating HTML

htmlparse: tools to parse HTML

tkHTML: an extension that parses and renders HTML, compiled for use without Tk

tcltidy: a wrapper to Tidy

tkhtml3: the successor to tkHTML

tDOM's XPath-oriented parser: can be used to manipulate HTML

TclXML: includes xmlgen for generating HTML or XML

Tclgumbo: An interface to the Gumbo HTML5 parsing library

Generation Tools

html form generator, by CMcC: Generate HTML forms from Tcl lists.

MajaMaja: structures and lays out a static collection of HTML pages, arranging a wide variety of materials

Wub: includes a utility for structured HTML tag generation

Wiki format to HTML

Description

For extracting data from HTML, it's generally more robust to parse the page into a document model, perhaps using tDOM, and then use XPath to find the data, than to hack at the page with regular expressions.

If the task is to pull some data out of an HTML page, I'm a strong believer in the 'parse the HTML page into a tree and query that tree' approach. For real-life problems, I claim that this approach is much simpler and easier to maintain than any regexp approach. And you will have to maintain such a thing, because the layout of HTML pages tends to change frequently. Sure, you have to learn another query language, XPath in this case. But if you are really in the web business, chances are you have to learn XPath anyway.
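The parse-then-query approach described above might look roughly like the following with tDOM. This is only a minimal sketch: the sample page, the `prices` table id, and the `price` class are made up for illustration, and a real page would of course be fetched or read from a file.

```tcl
package require tdom

# Hypothetical sample page, invented for this example.
set html {
<html><body>
  <table id="prices">
    <tr><td class="item">Widget</td><td class="price">9.99</td></tr>
    <tr><td class="item">Gadget</td><td class="price">19.50</td></tr>
  </table>
</body></html>
}

# Parse with tDOM's forgiving HTML parser rather than the strict XML one.
set doc  [dom parse -html [string trim $html]]
set root [$doc documentElement]

# Find the price cells by structure and attributes, not by text patterns.
set prices {}
foreach node [$root selectNodes {//table[@id='prices']//td[@class='price']}] {
    lappend prices [$node text]
}
puts $prices

# DOM trees are not garbage collected; free the document explicitly.
$doc delete
```

If the page layout changes, usually only the XPath expression needs adjusting, which is the maintenance advantage argued for above.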
