Version 33 of char2ent

Updated 2022-09-15 03:07:55 by LES
What: char2ent.tcl
Where: https://pastebin.com/c2dPigpd
Description: Read an HTML or XML file containing literal special characters, then replace specials with appropriate SGML entities and output.
Currently at version 0.0.
Updated: 11/2004

Contact: LES


LES on November 18, 2004: Get the code here: [L1 ]

Or here:

#!/usr/bin/env tclsh

# char2ent.tcl - Opens an HTML or XML file with special characters
#                (diacritics) written in plain text, replaces these special
#                characters with appropriate HTML or XML entities and writes the
#                output to a new file.
#
# Author: Luciano Espirito Santo
#
# History
#
#         Version 1.0        2004-11-18        Luciano Espirito Santo
#                 First version. Alpha stage.
#
#                 KNOWN ISSUES:
#                 - No user-proof measures, no error or exception handling, no nothing!
#                   No guarantees! Use it at your own risk!
#                 - Tested on Windows 98 and Linux only.
#
#                 TODO:
#                 - Extend it so it can handle all possible characters (Unicode).
#                 - Add the ability to do the INVERSE operation (that would include the
#                   ability to tell non-escaped entities from escaped entities and NOT
#                   replace the escaped entities).
#                 - Make it handle STDIN.
#
#                 LICENSE: BSD
#
# How to use it:
# 
# char2ent.tcl  --help

# ----------------------------------------------------------------
# Do not change anything below this point unless you know what you're doing.


# Print help text and exit if '--help' is the only argument 
if        {$argc == 1 && [lindex $argv 0] == "--help"}        {
        puts  ""
        puts  "char2ent, by Luciano Espirito Santo - 2004"
        puts  ""
        puts  {Usage: char2ent -[option]  "input file"  "output file"}
        puts  ""
        puts  "Possible options:"
        puts  "-h: convert special characters to HTML entities"
        puts  "-x: convert special characters to XML entities"
        puts  ""
        puts  {"input file" MUST exist}
        puts  {"output file" is created automatically if it does not exist}
        puts  {"input file" and "output file" MUST NOT be the same file}
        puts  ""
        puts  {Example: char2ent -x "sample.xml" "converted.xml"}
        puts  ""
        exit
}

# Complain and exit if option is neither '-h' nor '-x' 
if        {[lindex $argv 0] != "-h" && [lindex $argv 0] != "-x"}          {
        puts  "Error! Try 'char2ent --help' to see how to use this program.\n"
        exit
}

# Complain and exit if not exactly 3 arguments (option, input, output) are found 
if        {$argc != 3}        {
        puts  "Error! You must use exactly 3 arguments.\n"
        puts  "$argv\n"
        puts  "Error! Try 'char2ent --help' to see how to use this program.\n"
        exit
}

# Complain and exit if input file does not exist 
if        {! [file exists [lindex $argv 1]]}          {
        puts  "Error! File \"[lindex $argv 1]\" not found!\n"
        exit
}

# Complain and exit if input file is not readable 
if        {! [file readable [lindex $argv 1]]}        {
        puts  "Error! Permission denied to read [ lindex $argv 1 ]!\n"
        exit
}

# Complain if input file and output file are the same 
if        {[lindex $argv 1] == [lindex $argv 2]}        {
        puts  "Error! \"input file\" and \"output file\" must not be the same.\n"
        exit
}

# Try to open input file for reading. 
# Complain and exit in case of errors. 
if        {[catch {set IF [open [lindex $argv 1] r]} IFerror]}  {
        puts  "Error! $IFerror\n"
        exit
}

# Try to open output file for writing. 
# Complain, close input file and exit in case of errors. 
if        {[catch {set OF [open [lindex $argv 2] w]} OFerror]}        {
        close $IF
        puts  "Error! $OFerror\n"
        exit
}
        

# ================================================
# Two files open. No errors this far. Let's replace. 

set  CHARS  {
        ª        º        À        Á        Â        Ã        Ä        Å        Æ        Ç        
        È        É        Ê        Ë        Ì        Í        Î        Ï        Ð        Ñ        
        Ò        Ó        Ô        Õ        Ö        Ø        Ù        Ú        Û        Ü        
        Ý        Þ        ß        à        á        â        ã        ä        å        æ        
        ç        è        é        ê        ë        ì        í        î        ï        ð        
        ñ        ò        ó        ô        õ        ö        ø        ù        ú        û        
        ü        ý        þ        ÿ        Œ        œ        Ÿ
}

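set  HTML  {
        &ordf;        &ordm;        &Agrave;        &Aacute;        &Acirc;        &Atilde;        &Auml;
        &Aring;        &AElig;        &Ccedil;        &Egrave;        &Eacute;        &Ecirc;        &Euml;
        &Igrave;        &Iacute;        &Icirc;        &Iuml;        &ETH;        &Ntilde;        &Ograve;
        &Oacute;        &Ocirc;        &Otilde;        &Ouml;        &Oslash;        &Ugrave;        &Uacute;
        &Ucirc;        &Uuml;        &Yacute;        &THORN;        &szlig;        &agrave;        &aacute;
        &acirc;        &atilde;        &auml;        &aring;        &aelig;        &ccedil;        &egrave;
        &eacute;        &ecirc;        &euml;        &igrave;        &iacute;        &icirc;        &iuml;        &eth;
        &ntilde;        &ograve;        &oacute;        &ocirc;        &otilde;        &ouml;        &oslash;
        &ugrave;        &uacute;        &ucirc;        &uuml;        &yacute;        &thorn;        &yuml;
        &OElig;        &oelig;        &Yuml;
}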

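# XML predefines no named entities for these characters, so numeric
# character references are used (decimal form here).
set  XML  {
        &#170;        &#186;        &#192;        &#193;        &#194;        &#195;        &#196;        &#197;        &#198;
        &#199;        &#200;        &#201;        &#202;        &#203;        &#204;        &#205;        &#206;        &#207;
        &#208;        &#209;        &#210;        &#211;        &#212;        &#213;        &#214;        &#216;        &#217;
        &#218;        &#219;        &#220;        &#221;        &#222;        &#223;        &#224;        &#225;        &#226;
        &#227;        &#228;        &#229;        &#230;        &#231;        &#232;        &#233;        &#234;        &#235;
        &#236;        &#237;        &#238;        &#239;        &#240;        &#241;        &#242;        &#243;        &#244;
        &#245;        &#246;        &#248;        &#249;        &#250;        &#251;        &#252;        &#253;        &#254;
        &#255;        &#338;        &#339;        &#376;
}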


set TEXT [read $IF]

for        {set i 0}        {$i < [llength $CHARS]}        {incr i}          {

        switch -- [lindex $argv 0]        {
                "-h"        {set REPL [lindex $HTML $i]}
                "-x"        {set REPL [lindex $XML  $i]}
        }

        set TEXT [string map "[lindex $CHARS $i] $REPL" $TEXT]
}


puts -nonewline $OF $TEXT
close $IF
close $OF

exit

The code was posted here entirely at first, but I never succeeded in having all characters displayed correctly. Today (May 2007), I saw that many other characters were displayed incorrectly: a whole series of them displayed as question marks. I spent almost half an hour trying to correct it, but I couldn't make it work. Wikit keeps complaining about some "bogus" character and blames my browser (Firefox). Konqueror didn't work either. I give up. Just download the thing and move on.
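
The display trouble described above comes from literal non-ASCII characters in the source text. One possible way around it, sketched here under assumptions and not taken from the downloadable script, is to keep the source pure ASCII and compute decimal numeric character references from each character's code point. The proc name toNumericRefs is hypothetical. This also goes some way toward the Unicode item in the TODO list, since it needs no lookup table.

```tcl
# Hypothetical helper (not part of char2ent.tcl): replace every character
# above U+007F with a decimal numeric character reference such as &#233;.
# Characters outside the Basic Multilingual Plane may need extra care,
# depending on the Tcl version in use.
proc toNumericRefs {text} {
    set out ""
    foreach ch [split $text ""] {
        # %c yields the character's code point as an integer
        scan $ch %c code
        if {$code > 127} {
            append out "&#$code;"
        } else {
            append out $ch
        }
    }
    return $out
}

# The sample string is written with \u escapes, so this file stays ASCII.
puts [toNumericRefs "caf\u00E9 \u0152uvre"]
```

Numeric references of this kind are valid in both HTML and XML, so a single routine could serve both the -h and -x options if named entities are not required.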


RS Note that read includes the trailing newline; to write such a file-string out it is best to use puts -nonewline so you don't get an extra one. Also, string map is happy with long maps, so you can avoid the for loop by just coding

set XMLmap {& &amp; < &lt; > &gt; ...}
set HTMLmap {Ä &Auml; ...}
...
set myMap $XMLmap
...
set myText [string map $myMap $myText]
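
A minimal sketch of the single-map idea, assuming parallel lists like the script's CHARS and HTML (cut down here to three entries). The proc name buildMap and the sample data are hypothetical. It interleaves the two lists into one flat map, so a single string map call does all the replacements in one pass.

```tcl
# Hypothetical helper: interleave two parallel lists into a flat
# {char entity char entity ...} list suitable for [string map].
proc buildMap {chars entities} {
    set map {}
    foreach c $chars e $entities {
        lappend map $c $e
    }
    return $map
}

# Abbreviated sample tables; the real script would pass its full lists.
# The characters are written as \u escapes to keep this file ASCII.
set chars    [list \u00C4 \u00E9 \u00F1]
set entities [list "&Auml;" "&eacute;" "&ntilde;"]

set htmlMap [buildMap $chars $entities]
set text "\u00C4pfel, caf\u00E9, se\u00F1or"

# One call replaces every mapped character.
puts [string map $htmlMap $text]
```

Because string map scans the input once and never re-examines text it has already replaced, this also avoids any risk of an earlier replacement being altered by a later loop iteration.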