What: HTML2text
Where: From the contact
Description: Tcl script which reads an HTML document and outputs the plain text of the document. Designed to make it relatively easy for the user to configure how the program should mark specific HTML tags.
Updated: 10/1997
Contact: mailto:[email protected] (Joe Moss)
CM May 14th 03 - Get the source from Joe Moss's current home page: http://www.psg.com/~joem/tcl/
Roi Dayan posted, on the Tcl'ers Chat, a method for stripping HTML with an option to ignore (that is, keep) specific tags:
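    # Return the tag unchanged if it matches any regular expression in
    # $ignore; otherwise return an empty string, which removes the tag.
    proc strip-html-ignore {text {ignore {}}} {
        foreach i $ignore {
            if {[regexp -- $i $text]} {return $text}
        }
        return ""
    }

    # Replace every tag with a call to strip-html-ignore, then let subst
    # evaluate those calls.
    proc strip-html {html {ignore {}}} {
        regsub -all -- {<[^>]*>} $html "\[strip-html-ignore \[list &\] [list $ignore]\]" html
        set html [subst $html]
        return $html
    }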
Syntax: strip-html text [list ignore1 ignore2], where each ignore entry is a regular expression; a tag that matches any of them is kept.
Example:
    set a {<pre><a href=bla>roi<hr></a></pre><br>}
    puts [strip-html $a [list <br> <a.*>]]
will output:
<a href=bla>roi<br>
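On larger inputs this version tends to raise errors, because subst also tries to evaluate special characters such as [, ] and $ that appear in the page text :) For big strings (such as a whole page fetched with the http package), use this version instead, which escapes those characters before calling subst: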
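    proc strip-html {html {ignore {}}} {
        # Escape the characters subst treats specially (brackets, dollar,
        # backslash), plus the double quote that delimits the tag below.
        regsub -all -- {[][\\$"]} $html {\\&} html
        # Quote each tag as a single word so that spaces, semicolons and
        # newlines inside a tag do not break the generated command.
        regsub -all -- {<[^>]*>} $html "\[strip-html-ignore \"&\" [list $ignore]\]" html
        return [subst $html]
    }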
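The special-character problem comes from passing page text through subst, so one way around it is to avoid subst altogether. The following is only a sketch of that idea, not Roi Dayan's code: the proc name strip-html-nosubst and its keep parameter are made up for illustration, and it assumes Tcl 8.5 or later (for lassign). It walks the tags with regexp -indices and copies a tag through only if it matches one of the keep patterns.

```tcl
# Hypothetical alternative: strip tags without calling subst, so that
# [, ], $, ; and backslashes in the page text are never evaluated.
proc strip-html-nosubst {html {keep {}}} {
    set out ""
    set pos 0
    # Find the next tag at or after $pos; span holds its first/last index.
    while {[regexp -indices -start $pos -- {<[^>]*>} $html span]} {
        lassign $span first last
        # Copy the plain text that precedes this tag.
        append out [string range $html $pos [expr {$first - 1}]]
        set tag [string range $html $first $last]
        # Keep the tag only if one of the keep patterns matches it.
        foreach pattern $keep {
            if {[regexp -- $pattern $tag]} {
                append out $tag
                break
            }
        }
        set pos [expr {$last + 1}]
    }
    # Copy whatever follows the last tag.
    append out [string range $html $pos end]
    return $out
}

# Same example as above: prints <a href=bla>roi<br>
puts [strip-html-nosubst {<pre><a href=bla>roi<hr></a></pre><br>} [list <br> <a.*>]]

# Special characters in the text pass through untouched:
# prints price: $5 [approx]; a\b
puts [strip-html-nosubst {<b>price: $5 [approx]; a\b</b>}]
```

Because the patterns are unanchored regular expressions, <a.*> would also keep tags such as <abbr>; a stricter pattern like {^<a(\s[^>]*)?>$} is one way to limit the match to anchor tags.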