HTML2text

What: HTML2text

 Where: From the contact
 Description: Tcl script which reads an HTML document and outputs the plain
        text of the document.  Designed to make it relatively easy for the
        user to configure how the program should mark specific HTML tags.
 Updated: 10/1997
 Contact: mailto:[email protected] (Joe Moss)

CM May 14th 03 - Get the source from Joe Moss's current home page: http://www.psg.com/~joem/tcl/


Roi Dayan posted, on the Tcl'ers Chat, a method for stripping HTML with an option to ignore (that is, keep) specific tags:

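      # Return the tag unchanged if it matches any pattern in ignore,
      # otherwise return the empty string so that the tag is dropped.
      proc strip-html-ignore {text {ignore {}}} {
          foreach i $ignore {if {[regexp -- $i $text]} {return $text}}
          return ""
      }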

      proc strip-html {html {ignore {}}} {
          regsub -all -- {<[^>]*>} $html "\[strip-html-ignore \[list &\] [list $ignore]\]" html
          set html [subst $html]
          return $html
      }

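Syntax: strip-html text [list ignore1 ignore2]

Each ignore entry is a regular expression that is matched against every tag; tags that match one of them are kept, all other tags are removed.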

Example:

      set a {<pre><a href=bla>roi<hr></a></pre><br>}
      puts [strip-html $a [list <br> <a.*>]]

will output:

      <a href=bla>roi<br> 
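
To see why this works, it can help to look at the intermediate string. The sketch below is only an illustration and not part of Roi Dayan's code: it uses a made-up callback, demo-tag, in place of strip-html-ignore, and prints the string before and after subst.

```tcl
# Hypothetical callback for illustration only: upper-case the tag it is given.
proc demo-tag {tag} {
    return [string toupper $tag]
}

set html {<b>hi</b> there}

# Step 1: rewrite every tag into a bracketed call to the callback.
# The "&" in the replacement stands for the matched tag.
regsub -all -- {<[^>]*>} $html {[demo-tag {&}]} step1
puts $step1

# Step 2: subst evaluates the bracketed calls and splices in their results.
set step2 [subst $step1]
puts $step2
```

The first puts shows [demo-tag {<b>}]hi[demo-tag {</b>}] there, and the second shows <B>hi</B> there. Note that subst also evaluates anything else in the text that looks like Tcl syntax, which is where untrusted or arbitrary input becomes a problem.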

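On large inputs this is likely to raise an error, because characters in the text that are special to Tcl (such as brackets, dollar signs and backslashes) are interpreted by subst. For big strings, like a whole page fetched with the http package, use this: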

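    proc strip-html {html {ignore {}}} {
        # Escape the characters that subst (and the quoted tag below) would
        # otherwise interpret: brackets, backslash, dollar sign, double quote.
        regsub -all {[][\\$"]} $html {\\&} html
        # Turn each tag into a call to strip-html-ignore, passing the tag as a
        # double-quoted word so that spaces, semicolons and newlines are safe.
        regsub -all -- {<[^>]*>} $html "\[strip-html-ignore \"&\" [list $ignore]\]" html
        # Evaluate the calls; the escaped characters come back unchanged.
        return [subst $html]
    }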
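
A different way to sidestep the quoting problems is to avoid subst altogether and walk the string tag by tag. The following is a minimal sketch of that idea, not Roi Dayan's code. The name strip-html-scan is made up for this example, it assumes Tcl 8.5 or later (for lassign), and it keeps the same convention that a tag matching any pattern in ignore is left in place.

```tcl
# Hypothetical alternative: strip tags without evaluating any of the input.
proc strip-html-scan {html {ignore {}}} {
    set out ""
    set pos 0
    # Find each tag in turn, starting the search where the last one ended.
    while {[regexp -indices -start $pos -- {<[^>]*>} $html range]} {
        lassign $range first last
        # Copy the plain text that precedes the tag.
        append out [string range $html $pos [expr {$first - 1}]]
        set tag [string range $html $first $last]
        # Keep the tag only if it matches one of the ignore patterns.
        foreach pattern $ignore {
            if {[regexp -- $pattern $tag]} {
                append out $tag
                break
            }
        }
        set pos [expr {$last + 1}]
    }
    # Copy whatever follows the last tag.
    append out [string range $html $pos end]
    return $out
}

puts [strip-html-scan {<pre><a href=bla>roi<hr></a></pre><br>} [list <br> <a.*>]]
puts [strip-html-scan {<p>cost: $5 [approx]; see C:\tmp</p>}]
```

Because nothing from the input is ever evaluated, brackets, dollar signs, semicolons and backslashes in the page pass through untouched, and the result is a plain string.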


Joe Moss