Version 34 of char2ent

Updated 2022-09-15 03:19:52 by LES
What: char2ent.tcl
Where: https://pastebin.com/c2dPigpd
Description: Read an HTML or XML file containing literal special characters, then replace specials with appropriate SGML entities and output.
Currently at version 0.0.
Updated: 11/2004

Contact: LES


LES on November 18, 2004: Get the code from the link given under "Where" above.

Or here:

#!/usr/bin/env tclsh

# char2ent.tcl - Opens an HTML or XML file with special characters 
#                 (diacritics) written in plain text, replaces these special 
#                 characters with appropriate HTML or XML entities and writes the 
#                output to a new file. 
#
# Author: Luciano Espirito Santo 
#
# History 
# 
#         Version 1.0        2004-11-18        Luciano Espirito Santo
#                 First version. Alpha stage.
#
#                 KNOWN ISSUES: 
#                - No user-proof measures, no error or exception handling, no nothing! 
#                   No guarantees! Use it at your own risk!
#                - Tested on Windows 98 and Linux only.
#
#                TODO: 
#                 - Extend it so it can handle all possible characters (Unicode).
#                - Add the ability to do the INVERSE operation (that would include the 
#                  ability to tell non-escaped entities from escaped entities and NOT 
#                  replace the escaped entities).
#                - Make it handle STDIN.
#
#                LICENSE: BSD
#
# How to use it:
# 
# char2ent.tcl  --help

# ----------------------------------------------------------------
# Do not change anything below this point unless you know what you're doing.


# Print help text and exit if '--help' is the only argument 
if  {$argc == 1 && [lindex $argv 0] == "--help"}        {
    puts  ""
    puts  "char2ent, by Luciano Espirito Santo - 2004"
    puts  ""
    puts  {Usage: char2ent -[option]  "input file"  "output file"}
    puts  ""
    puts  "Possible options:"
    puts  "-h: convert special characters to HTML entities"
    puts  "-x: convert special characters to XML entities"
    puts  ""
    puts  {"input file" MUST exist}
    puts  {"output file" is created automatically if it does not exist}
    puts  {"input file" and "output file" MUST NOT be the same file}
    puts  ""
    puts  {Example: char2ent -x "sample.xml" "converted.xml"}
    puts  ""
    exit
}

# Complain and exit if option is neither '-h' nor '-x' 
if  {[lindex $argv 0] != "-h" && [lindex $argv 0] != "-x"}  {
    puts  "Error! Try 'char2ent --help' to see how to use this program.\n"
    exit
}

# Complain and exit if not exactly 3 arguments (option, input, output) are found 
if  {$argc != 3}  {
    puts  "Error! You must use exactly 3 arguments.\n"
    puts  "$argv\n"
    puts  "Error! Try 'char2ent --help' to see how to use this program.\n"
    exit
}

# Complain and exit if input file does not exist 
if  {! [file exists [lindex $argv 1]]}  {
    puts  "Error! File \"[lindex $argv 1]\" not found!\n"
    exit
}

# Complain and exit if input file is not readable 
if  {! [file readable [lindex $argv 1]]}  {
    puts  "Error! Permission denied to read [ lindex $argv 1 ]!\n"
    exit
}

# Complain if input file and output file are the same 
if  {[lindex $argv 1] == [lindex $argv 2]}  {
    puts  "Error! \"input file\" and \"output file\" must not be the same.\n"
    exit
}

# Try to open input file for reading. 
# Complain and exit in case of errors. 
if  {[catch {set IF [open [lindex $argv 1] r]} IFerror]}  {
    puts  "Error! $IFerror\n"
    exit
}

# Try to open output file for writing. 
# Complain, close input file and exit in case of errors. 
if  {[catch {set OF [open [lindex $argv 2] w]} OFerror]}        {
    close $IF
    puts  "Error! $OFerror\n"
    exit
}
        

# ================================================
# Two files open. No errors this far. Let's replace. 

set  CHARS  {
    ª    º    À    Á    Â    Ã    Ä    Å    Æ    Ç        
    È    É    Ê    Ë    Ì    Í    Î    Ï    Ð    Ñ        
    Ò    Ó    Ô    Õ    Ö    Ø    Ù    Ú    Û    Ü        
    Ý    Þ    ß    à    á    â    ã    ä    å    æ        
    ç    è    é    ê    ë    ì    í    î    ï    ð        
    ñ    ò    ó    ô    õ    ö    ø    ù    ú    û        
    ü    ý    þ    ÿ    Œ    œ    Ÿ
}

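set  HTML  {
    &ordf;    &ordm;    &Agrave;    &Aacute;    &Acirc;    &Atilde;    &Auml;
    &Aring;   &AElig;   &Ccedil;    &Egrave;    &Eacute;   &Ecirc;     &Euml;
    &Igrave;  &Iacute;  &Icirc;     &Iuml;      &ETH;      &Ntilde;    &Ograve;
    &Oacute;  &Ocirc;   &Otilde;    &Ouml;      &Oslash;   &Ugrave;    &Uacute;
    &Ucirc;   &Uuml;    &Yacute;    &THORN;     &szlig;    &agrave;    &aacute;
    &acirc;   &atilde;  &auml;      &aring;     &aelig;    &ccedil;    &egrave;
    &eacute;  &ecirc;   &euml;      &igrave;    &iacute;   &icirc;     &iuml;      &eth;
    &ntilde;  &ograve;  &oacute;    &ocirc;     &otilde;   &ouml;      &oslash;
    &ugrave;  &uacute;  &ucirc;     &uuml;      &yacute;   &thorn;     &yuml;
    &OElig;   &oelig;   &Yuml;
}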

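set  XML  {
    &#170;    &#186;    &#192;    &#193;    &#194;    &#195;    &#196;    &#197;    &#198;
    &#199;    &#200;    &#201;    &#202;    &#203;    &#204;    &#205;    &#206;    &#207;
    &#208;    &#209;    &#210;    &#211;    &#212;    &#213;    &#214;    &#216;    &#217;
    &#218;    &#219;    &#220;    &#221;    &#222;    &#223;    &#224;    &#225;    &#226;
    &#227;    &#228;    &#229;    &#230;    &#231;    &#232;    &#233;    &#234;    &#235;
    &#236;    &#237;    &#238;    &#239;    &#240;    &#241;    &#242;    &#243;    &#244;
    &#245;    &#246;    &#248;    &#249;    &#250;    &#251;    &#252;    &#253;    &#254;
    &#255;    &#338;    &#339;    &#376;
}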


set TEXT [read $IF]

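for  {set i 0}  {$i < [llength $CHARS]}  {incr i}  {
    # Pick the replacement that matches the command-line option
    switch -- [lindex $argv 0]  {
        "-h"    {set REPL [lindex $HTML $i]}
        "-x"    {set REPL [lindex $XML  $i]}
    }
    # Build the one-pair map as a proper list instead of an interpolated string
    set TEXT [string map [list [lindex $CHARS $i] $REPL] $TEXT]
}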

puts -nonewline $OF $TEXT
close $IF
close $OF

exit
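
The TODO list in the script header mentions extending the program to handle all possible characters (Unicode). One possible direction, sketched here under the assumption that numeric character references are acceptable in the output: instead of keeping a lookup table, turn every character above the ASCII range into a decimal reference. The proc name toNumericEntities is made up for this example and is not part of char2ent.tcl.

```tcl
# toNumericEntities is a hypothetical helper, not part of char2ent.tcl.
# It replaces every character whose code point is above 127 with a
# decimal numeric character reference and leaves ASCII untouched.
proc toNumericEntities {text} {
    set out ""
    foreach ch [split $text ""] {
        # scan with %c stores the character's code point in "code"
        scan $ch %c code
        if {$code > 127} {
            append out "&#$code;"
        } else {
            append out $ch
        }
    }
    return $out
}

# \u00E9 is e-acute and \u0152 is the OE ligature.
# This prints: caf&#233; &#338;uvre
puts [toNumericEntities "caf\u00E9 \u0152uvre"]
```

Because ASCII passes through unchanged, entities that already exist in the input are not escaped a second time. The sketch does not escape &, < or >, and it has only been reasoned through for characters in the Basic Multilingual Plane.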

The code was posted here entirely at first, but I never succeeded in having all characters displayed correctly. Today (May 2007), I saw that many other characters were displayed incorrectly: a whole series of them displayed as question marks. I spent almost half an hour trying to correct it, but I couldn't make it work. Wikit keeps complaining about some "bogus" character and blames my browser (Firefox). Konqueror didn't work either. I give up. Just download the thing and move on.

LES in 2022: it seems the Wiki displays all the characters correctly now. Yay!


RS Note that read includes the trailing newline; to write such a file-string out it is best to use puts -nonewline so you don't get an extra one. Also, string map is happy with long maps, so you can avoid the for loop by just coding

set XMLmap {& &amp; < &lt; > &gt; ...}
set HTMLmap {Ä &Auml; ...}
...
set myMap $XMLmap
...
set myText [string map $myMap $myText]
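
A slightly fuller sketch of the single-map approach suggested above, in runnable form. The maps are deliberately shortened to a handful of pairs, the proc name mapFor is invented for this example, and the sample string stands in for the contents of the input file. Since string map works through the input in one pass and does not rescan text it has already replaced, it should be safe to include & in the same map as the characters whose replacements begin with &.

```tcl
# Shortened, illustrative maps: only a few pairs are listed here.
set XMLmap  {& &amp; < &lt; > &gt; Ä &#196; é &#233;}
set HTMLmap {& &amp; < &lt; > &gt; Ä &Auml; é &eacute;}

# mapFor is a hypothetical helper: it returns the map that matches
# the command-line option (-x or -h).
proc mapFor {option} {
    global XMLmap HTMLmap
    switch -- $option {
        -x      {return $XMLmap}
        -h      {return $HTMLmap}
        default {error "unknown option \"$option\""}
    }
}

# Stand-in for the text read from the input file.
set sample "Ärger & café <b>"

# One call replaces every mapped character; no loop is needed.
set converted [string map [mapFor -h] $sample]

# This prints: &Auml;rger &amp; caf&eacute; &lt;b&gt;
puts $converted
```

With the one-pair-at-a-time loop, adding & to the table would only work if it were processed first; otherwise the ampersands of entities inserted earlier would be escaped again.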