char2ent

|What:| char2ent.tcl|
| Where:| https://pastebin.com/c2dPigpd|
| Description:| Read an HTML or XML file containing literal special characters, then replace specials with appropriate SGML entities and output.|
||        Currently at version 0.0 .|
| Updated:| 11/2004|

Contact: [LES]

----
[LES] on November 18, 2004: Get the code here: [https://pastebin.com/c2dPigpd]

Or here:


======
#!/usr/bin/env tclsh

# char2ent.tcl - Opens an HTML or XML file with special characters 
#                 (diacritics) written in plain text, replaces these special 
#                 characters with appropriate HTML or XML entities and writes the 
#                output to a new file. 
#
# Author: Luciano Espirito Santo 
#
# History 
# 
#         Version 1.0        2004-11-18        Luciano Espirito Santo
#                 First version. Alpha stage.
#
#                 KNOWN ISSUES: 
#                - No user-proof measures, no error or exception handling, no nothing! 
#                   No guarantees! Use it at your own risk!
#                - Tested on Windows 98 and Linux only.
#
#                TODO: 
#                 - Extend it so it can handle all possible characters (Unicode).
#                - Add the ability to do the INVERSE operation (that would include the 
#                  ability to tell non-escaped entities from escaped entities and NOT 
#                  replace the escaped entities).
#                - Make it handle STDIN.
#
#                LICENSE: BSD
#
# How to use it:
# 
# char2ent.tcl  --help

# ----------------------------------------------------------------
# Do not change anything below this point unless you know what you're doing.


# Print help text and exit if '--help' is the only argument
if {$argc == 1 && [lindex $argv 0] == "--help"} {
        puts  ""
        puts  "char2ent, by Luciano Espirito Santo - 2004"
        puts  ""
        puts  {Usage: char2ent -[option]  "input file"  "output file"}
        puts  ""
        puts  "Possible options:"
        puts  "-h: convert special characters to HTML entities"
        puts  "-x: convert special characters to XML entities"
        puts  ""
        puts  {"input file" MUST exist}
        puts  {"output file" is created automatically if it does not exist}
        puts  {"input file" and "output file" MUST NOT be the same file}
        puts  ""
        puts  {Example: char2ent -x "sample.xml" "converted.xml"}
        puts  ""
        exit
}

# Complain and exit if option is neither '-h' nor '-x'
if {[lindex $argv 0] != "-h" && [lindex $argv 0] != "-x"} {
        puts  "Error! Try 'char2ent --help' to see how to use this program.\n"
        exit
}

# Complain and exit if not exactly 3 arguments (option, input, output) are found
if {$argc != 3} {
        puts  "Error! You must use exactly 3 arguments.\n"
        puts  "$argv\n"
        puts  "Error! Try 'char2ent --help' to see how to use this program.\n"
        exit
}

# Complain and exit if input file does not exist
if {! [file exists [lindex $argv 1]]} {
        puts  "Error! File \"[lindex $argv 1]\" not found!\n"
        exit
}

# Complain and exit if input file is not readable
if {! [file readable [lindex $argv 1]]} {
        puts  "Error! Permission denied to read [ lindex $argv 1 ]!\n"
        exit
}

# Complain if input file and output file are the same
if {[lindex $argv 1] == [lindex $argv 2]} {
        puts  "Error! \"input file\" and \"output file\" must not be the same.\n"
        exit
}

# Try to open input file for reading. 
# Complain and exit in case of errors.
if {[catch {set IF [open [lindex $argv 1] r]} IFerror]} {
        puts  "Error! $IFerror\n"
        exit
}

# Try to open output file for writing. 
# Complain, close input file and exit in case of errors.
if {[catch {set OF [open [lindex $argv 2] w]} OFerror]} {
        close $IF
        puts  "Error! $OFerror\n"
        exit
}
        

# ================================================
# Two files open. No errors this far. Let's replace. 

set  CHARS  {        ª        º        À        Á        Â        Ã        Ä        Å        Æ        Ç        
        È        É        Ê        Ë        Ì        Í        Î        Ï        Ð        Ñ        
        Ò        Ó        Ô        Õ        Ö        Ø        Ù        Ú        Û        Ü        
        Ý        Þ        ß        à        á        â        ã        ä        å        æ        
        ç        è        é        ê        ë        ì        í        î        ï        ð        
        ñ        ò        ó        ô        õ        ö        ø        ù        ú        û        
        ü        ý        þ        ÿ        Œ        œ        Ÿ
}

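set  HTML  {        &ordf;        &ordm;        &Agrave;        &Aacute;        &Acirc;        &Atilde;        &Auml;        
        &Aring;        &AElig;        &Ccedil;        &Egrave;        &Eacute;        &Ecirc;        &Euml;        
        &Igrave;        &Iacute;        &Icirc;        &Iuml;        &ETH;        &Ntilde;        &Ograve;        
        &Oacute;        &Ocirc;        &Otilde;        &Ouml;        &Oslash;        &Ugrave;        &Uacute;        
        &Ucirc;        &Uuml;        &Yacute;        &THORN;        &szlig;        &agrave;        &aacute;        
        &acirc;        &atilde;        &auml;        &aring;        &aelig;        &ccedil;        &egrave;        
        &eacute;        &ecirc;        &euml;        &igrave;        &iacute;        &icirc;        &iuml;        &eth;        
        &ntilde;        &ograve;        &oacute;        &ocirc;        &otilde;        &ouml;        &oslash;        
        &ugrave;        &uacute;        &ucirc;        &uuml;        &yacute;        &thorn;        &yuml;        
        &OElig;        &oelig;        &Yuml;
}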

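# Decimal numeric character references, in the same order as CHARS
# (XML predefines no named entities for these characters).
set  XML  {        &#170;        &#186;        &#192;        &#193;        &#194;        &#195;        &#196;        &#197;        &#198;        &#199;        
        &#200;        &#201;        &#202;        &#203;        &#204;        &#205;        &#206;        &#207;        &#208;        &#209;        
        &#210;        &#211;        &#212;        &#213;        &#214;        &#216;        &#217;        &#218;        &#219;        &#220;        
        &#221;        &#222;        &#223;        &#224;        &#225;        &#226;        &#227;        &#228;        &#229;        &#230;        
        &#231;        &#232;        &#233;        &#234;        &#235;        &#236;        &#237;        &#238;        &#239;        &#240;        
        &#241;        &#242;        &#243;        &#244;        &#245;        &#246;        &#248;        &#249;        &#250;        &#251;        
        &#252;        &#253;        &#254;        &#255;        &#338;        &#339;        &#376;
}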


set TEXT [read $IF]
for        {set i 0}        {$i < [llength $CHARS]}        {incr i}          {

        switch -- [lindex $argv 0]        {
                "-h"        {set REPL [lindex $HTML $i]}
                "-x"        {set REPL [lindex $XML  $i]}
        }

        set TEXT [string map "[lindex $CHARS $i] $REPL" $TEXT]
}

puts -nonewline $OF $TEXT
close $IF
close $OF

exit
======



The code was posted here entirely at first, but I never succeeded in having all characters displayed correctly. Today (May 2007), I saw that many other characters were displayed incorrectly: a whole series of them displayed as question marks. I spent almost half an hour trying to correct it, but I couldn't make it work. Wikit keeps complaining about some "bogus" character and blames my browser (Firefox). Konqueror didn't work either. I give up. Just download the thing and move on.
[LES] in 2022: it seems the Wiki displays all the characters correctly now. Yay!
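
One possible way to approach the "handle all possible characters (Unicode)" TODO item, offered as a sketch and not as part of the original script: instead of a fixed character table, replace every character above the ASCII range with a decimal numeric character reference, which both HTML and XML accept. The proc name `toNumericRefs` is hypothetical, and characters outside the Basic Multilingual Plane may need extra care depending on the Tcl version.

======
# Hypothetical helper (not in the original script): turn every non-ASCII
# character into a decimal numeric character reference such as &#231;
proc toNumericRefs {text} {
    set out ""
    foreach ch [split $text ""] {
        # %c stores the character's code point as an integer
        scan $ch %c code
        if {$code > 127} {
            append out "&#$code;"
        } else {
            append out $ch
        }
    }
    return $out
}

# \u escapes keep this example independent of the script file's encoding
puts [toNumericRefs "A\u00e7\u00e3o"]
# prints: A&#231;&#227;o
======

The trade-off is readability of the output: numeric references work everywhere, but named HTML entities are easier to recognize when reading the converted file by eye.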

----
[RS] Note that [read] includes the trailing newline; to write such a file-string out it is best to use ''[puts] -nonewline'' so you don't get an extra one. Also, [string map] is happy with long maps, so you can avoid the [for] loop by just coding

======
set XMLmap {& &amp; < &lt; > &gt; ...}
set HTMLmap {Ä &Auml; ...}
...
set myMap $XMLmap
...
set myText [string map $myMap $myText]
======
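
A minimal runnable sketch of that suggestion follows. The proc name `char2ent` and its `mode` argument are assumptions made for illustration, and the two maps hold only three sample pairs each; a real map would list every character the script handles.

======
# Illustrative maps only: three sample pairs each, written with \u escapes
# so the example does not depend on the script file's encoding.
set HTMLmap [list \u00c4 "&Auml;" \u00e9 "&eacute;" \u00e7 "&ccedil;"]
set XMLmap  [list \u00c4 "&#196;" \u00e9 "&#233;"   \u00e7 "&#231;"]

# Hypothetical helper: pick a map by option, then convert in one pass.
proc char2ent {text mode} {
    global HTMLmap XMLmap
    switch -- $mode {
        -h      {set map $HTMLmap}
        -x      {set map $XMLmap}
        default {error "unknown mode \"$mode\": expected -h or -x"}
    }
    # A single [string map] call replaces the per-character [for] loop
    return [string map $map $text]
}

puts [char2ent "caf\u00e9" -h]
# prints: caf&eacute;
======

Besides being shorter, a single [string map] call scans the text once, whereas the loop rescans the whole text for every character in the table.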


<<categories>> Application | Characters | Dev. Tools | XML