Version 2 of jsi

Updated 2005-02-23 22:51:58 by jsi

J. Siekmeyer - up to now still in the reading/fixing typos/fighting wiki-spam state of "wikit-activism".

Title: char2ent Date: 23 Feb 2005 01:09:10 GMT Site: LES,200.171.159.14

Posted by LES on November 18, 2004

 #!/usr/local/bin/tclsh

 # char2ent.tcl - Opens an HTML or XML file with special characters 
 #                         (diacritics) written in plain text, replaces these special 
 #                         characters with appropriate HTML or XML entities and writes 
 #                        the output to a new file. 
 #
 # Author: Luciano Espirito Santo 
 #
 # History 
 # 
 #         Version 1.0        2004-11-18        Luciano Espirito Santo
 #                 First version. Alpha stage.
 #
 #                 KNOWN ISSUES: 
 #                - No user-proof measures, no error or exception handling, no nothing! 
 #                   No guarantees! Use it at your own risk!
 #                - Tested on Windows 98 only. Permission issues are likely to come up 
 #                  in other operating systems. Permission issues are harmless. The 
 #                  program will just not be able to read the input file and/or write 
 #                  to the output file.
 #
 #                TODO: 
 #                - Add the ability to do the INVERSE operation (that would include 
 #                  the ability to tell non-escaped entities from escaped entities and 
 #                  NOT replace the escaped entities).
 #                - Make it handle STDIN.
 #
 #                LICENSE: BSD
 #
 # How to use it:
 # 
 # char2ent.tcl  --help

 # ----------------------------------------------------------------
 # Do not change anything below this point unless you know what you're doing.


 # Print help text and exit if '--help' is the only argument 
 if          { $argc == 1  &&  [ lindex $argv 0 ] == "--help" }          {
         puts  ""
         puts  "char2ent, by Luciano Espirito Santo - 2004"
         puts  ""
         puts  {Usage: char2ent -[option]  "input file"  "output file"}
         puts  ""
         puts  "Possible options:"
         puts  "-h: convert special characters to HTML entities"
         puts  "-x: convert special characters to XML entities"
         puts  ""
         puts  {"input file" MUST exist}
         puts  {"output file" is created automatically if it does not exist}
         puts  {"input file" and "output file" MUST NOT be the same file}
         puts  ""
         puts  {Example: char2ent -x  "sample.xml"  "converted.xml"}
         puts  ""
         exit
 }

 # Complain and exit if option is neither '-h' nor '-x' 
 if          { [ lindex $argv 0 ] != "-h"  &&  [ lindex $argv 0 ] != "-x" }          {
         puts  "Error! Try 'char2ent --help' to see how to use this program.\n"
         exit
 }

 # Complain and exit if not exactly 3 arguments (option, input, output) found 
 if          { $argc  !=  3 }          {
         puts  "Error! Try 'char2ent --help' to see how to use this program.\n"
         exit
 }

 # Complain and exit if input file does not exist 
 if          {! [ file exists [ lindex $argv 1 ] ] }          {
         puts  "Error! File \"[ lindex $argv 1 ]\" not found!\n"
         exit
 }

 # Complain and exit if input file is not readable 
 if          {! [ file readable [ lindex $argv 1 ] ] }          {
         puts  "Error! Permission denied to read [ lindex $argv 1 ]!\n"
         exit
 }

 # Complain if input file and output file are the same 
 if          { [ lindex $argv 1 ]  ==  [ lindex $argv 2 ] }          {
         puts  "Error! \"input file\" and \"output file\" must not be the same.\n"
         exit
 }

 # Try to open input file for reading. 
 # Complain and exit in case of errors. 
 if          { [ catch { set myIF [ open [ lindex $argv 1 ] r ] } myIFError ] }  {
         puts  "Error! $myIFError\n"
         exit
 }

 # Try to open output file for writing. 
 # Complain, close input file and exit in case of errors. 
 if          { [ catch { set myOF [ open [ lindex $argv 2 ] w ] } myOFError ] }  {
         close  $myIF
         puts  "Error! $myOFError\n"
         exit
 }


 # ===============================================
 # Two files open. No errors so far. Let's replace. 

 set  myChars  {
         ª        º        À        Á        Â        Ã        Ä        Å        Æ        Ç        
         È        É        Ê        Ë        Ì        Í        Î        Ï        Ð        Ñ        
         Ò        Ó        Ô        Õ        Ö        Ø        Ù        Ú        Û        Ü        
         Ý        Þ        ß        à        á        â        ã        ä        å        æ        
        ç        è        é        ê        ë        ì        í        î        ï        ð        
         ñ        ò        ó        ô        õ        ö        ø        ù        ú        û        
         ü        ý        þ        ÿ        Œ        œ        Ÿ
 }

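 set  myHTML  {
         &ordf;        &ordm;        &Agrave;        &Aacute;        &Acirc;        &Atilde;
         &Auml;        &Aring;        &AElig;        &Ccedil;        &Egrave;        &Eacute;
         &Ecirc;        &Euml;        &Igrave;        &Iacute;        &Icirc;        &Iuml;
         &ETH;        &Ntilde;        &Ograve;        &Oacute;        &Ocirc;
         &Otilde;        &Ouml;        &Oslash;        &Ugrave;        &Uacute;
         &Ucirc;        &Uuml;        &Yacute;        &THORN;        &szlig;        &agrave;
         &aacute;        &acirc;        &atilde;        &auml;        &aring;        &aelig;
         &ccedil;        &egrave;        &eacute;        &ecirc;        &euml;        &igrave;
         &iacute;        &icirc;        &iuml;        &eth;        &ntilde;        &ograve;
         &oacute;        &ocirc;        &otilde;        &ouml;        &oslash;
         &ugrave;        &uacute;        &ucirc;        &uuml;        &yacute;
         &thorn;        &yuml;        &OElig;        &oelig;        &Yuml;
 }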

 set  myXML  {
         ª        º        À        Á        Â        Ã        Ä        Å        Æ        
         Ç        È        É        Ê        Ë        Ì        Í        Î        Ï        
         Ð        Ñ        Ò        Ó        Ô        Õ        Ö        Ø        Ù        
         Ú        Û        Ü        Ý        Þ        ß        à        á        â        
         ã        ä        å        æ        ç        è        é        ê        ë        
         ì        í        î        ï        ð        ñ        ò        ó        ô        
         õ        ö        ø        ù        ú        û        ü        ý        þ        
         ÿ        Œ        œ        Ÿ
 }


 set  myText  [ read $myIF ]

 for          { set i 0 }   { $i  < [ llength  $myChars ] }   { incr i }          {

         switch  --  [ lindex $argv 0 ]          {
                 "-h"        { set  myReplace  [ lindex  $myHTML  $i ] }
                 "-x"        { set  myReplace  [ lindex  $myXML  $i ] }
         }

         set  myText [ string map "[ lindex $myChars $i ] $myReplace" $myText ]
 }


 puts -nonewline  $myOF  $myText
 close  $myIF
 close  $myOF

 exit

LES: Unfortunately, I can't make wikit display the numeric codes in the "set myXML" line... Jean-Claude?


RS Note that read includes the trailing newline; to write such a file-string out it is best to use puts -nonewline so you don't get an extra one. Also, string map is happy with long maps, so you can avoid the for loop by just coding

 set XMLmap {& &amp; < &lt; > &gt; ...}
 set HTMLmap {Ä &Auml; ...}
 ...
 set myMap $XMLmap
 ...
 set myText [string map $myMap $myText]
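
The single-map approach suggested above could be fleshed out roughly like this. It is only a sketch: the maps carry three characters instead of the full table, the characters are written as \u escapes so that the listing survives any editor or browser, and the proc name convert is made up for illustration.

```tcl
# Sketch only: a three-character excerpt stands in for the full tables.
# The proc name "convert" is hypothetical, not part of char2ent.
set HTMLmap [list \u00C4 "&Auml;" \u00E9 "&eacute;" \u00FC "&uuml;"]
set XMLmap  [list \u00C4 "&#196;" \u00E9 "&#233;" \u00FC "&#252;"]

proc convert {option text} {
    global HTMLmap XMLmap
    # Pick the map once, then let string map do every replacement in one call.
    switch -- $option {
        -h      { set myMap $HTMLmap }
        -x      { set myMap $XMLmap }
        default { error "unknown option \"$option\"" }
    }
    return [string map $myMap $text]
}

puts [convert -h "caf\u00E9"]   ;# prints caf&eacute;
puts [convert -x "caf\u00E9"]   ;# prints caf&#233;
```

Because the map is chosen before the replacement starts, the switch runs once per file rather than once per character, and the two tables can no longer drift out of step with a separate character list.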

JSI 23feb05 The numeric codes in the "set myXML" line are preserved when you request http://mini.net/tcl/13008.txt instead of http://mini.net/tcl/13008 ; the .txt form delivers the raw text unchanged. This doesn't exactly solve the problem, but it is my best idea so far. Unformatted text is used here 1) for Tcl code (where any "help" from the wikit parser is in most cases unwanted) and 2) for a lot of lists, like Tcl editors, where for example the automagic link generation is of good use. There has been discussion about moving Tcl code elsewhere or using a special markup for Tcl code. I assume and agree that finding and converting all existing pages that contain Tcl code would generate a lot of work compared to adding ".txt" to a URL for those few pages which suffer from "parser-damaging" ;-)
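
A different way around the display problem, offered as a sketch rather than as what char2ent does: the numeric codes need not appear in the source at all, because they can be computed from the characters with scan. The proc name char2numref is hypothetical. Note one assumption that differs from the script above: this version converts every character above U+007F, not only the ones in the myChars table.

```tcl
# Hypothetical helper: replace every character above U+007F with a decimal
# numeric character reference and leave plain ASCII untouched.
proc char2numref {text} {
    set out ""
    foreach ch [split $text ""] {
        # scan %c yields the Unicode code point of the character
        scan $ch %c code
        if {$code > 127} {
            append out "&#$code;"
        } else {
            append out $ch
        }
    }
    return $out
}

puts [char2numref "Ol\u00E1, \u0152uvre"]   ;# prints Ol&#225;, &#338;uvre
```

Since no literal entity appears in the code, there is nothing left for the wiki parser to swallow.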

LES: Oh, boy. Someone's browser is not too good with i18n and weird characters. It messed up all of them in the set myChars line. I might have fixed them now. No editor in my Linux installation seems to be capable of displaying the last three characters correctly, so they could be wrong now (DANGER! DANGER!). Boy, I'd be glad to support any "death to internationalization-challenged browsers/applications/pages" movement or campaign! This is the 21st century, for chrissake. Thank God Tcl has no problems with that.
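
One way to keep the character table out of reach of browsers and editors, sketched here as a suggestion and not as part of the posted script, is to write it in pure ASCII: either as \u escapes or generated from code point ranges with format. The variable names lastThree and range are made up for the example.

```tcl
# The last three entries of the table, written as escapes instead of literals:
# U+0152 (OE ligature), U+0153 (oe ligature), U+0178 (Y with diaeresis).
set lastThree [list \u0152 \u0153 \u0178]

# A contiguous run can be generated rather than typed.
# Code points 192..214 (U+00C0..U+00D6) cover A-grave through O-diaeresis.
set range {}
for {set code 192} {$code <= 214} {incr code} {
    lappend range [format %c $code]
}

puts [llength $range]   ;# prints 23
```

The source file then contains nothing but ASCII, so it no longer matters what the browser or editor makes of diacritics.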

Anyway, JSI made interesting remarks. Perhaps we could/should start a new page called [parser-damaging] to discuss/expose this .txt option? Feedback appreciated. I am not so fond of going nuts and creating new pages like there is no tomorrow, like some notable neophytes.



Category Person