char2ent

What: char2ent.tcl
Where: https://pastebin.com/c2dPigpd
Description: Reads an HTML or XML file containing literal special characters, replaces them with the appropriate SGML entities, and writes the result to a new file.
Currently at version 0.0.
Updated: 11/2004

Contact: LES


LES on November 18, 2004: Get the code here: [L1 ]

Or here:

#!/usr/bin/env tclsh

# char2ent.tcl - Opens an HTML or XML file with special characters
#                (diacritics) written in plain text, replaces these special
#                characters with appropriate HTML or XML entities and writes the
#                output to a new file.
#
# Author: Luciano Espirito Santo 
#
# History 
# 
#         Version 1.0        2004-11-18        Luciano Espirito Santo
#                First version. Alpha stage.
#
#                KNOWN ISSUES:
#                - No user-proof measures, no error or exception handling, no nothing!
#                  No guarantees! Use it at your own risk!
#                - Tested on Windows 98 and Linux only.
#
#                TODO:
#                - Extend it so it can handle all possible characters (Unicode).
#                - Add the ability to do the INVERSE operation (that would include the
#                  ability to tell non-escaped entities from escaped entities and NOT
#                  replace the escaped entities).
#                - Make it handle STDIN.
#
#                LICENSE: BSD
#
# How to use it:
# 
# char2ent.tcl  --help

# ----------------------------------------------------------------
# Do not change anything below this point unless you know what you're doing.


# Print help text and exit if '--help' is the only argument 
if  {$argc == 1 && [lindex $argv 0] == "--help"}  {
    puts  ""
    puts  "char2ent, by Luciano Espirito Santo - 2004"
    puts  ""
    puts  {Usage: char2ent -[option]  "input file"  "output file"}
    puts  ""
    puts  "Possible options:"
    puts  "-h: convert special characters to HTML entities"
    puts  "-x: convert special characters to XML entities"
    puts  ""
    puts  {"input file" MUST exist}
    puts  {"output file" is created automatically if it does not exist}
    puts  {"input file" and "output file" MUST NOT be the same file}
    puts  ""
    puts  {Example: char2ent -x "sample.xml" "converted.xml"}
    puts  ""
    exit
}

# Complain and exit if option is neither '-h' nor '-x' 
if  {[lindex $argv 0] != "-h" && [lindex $argv 0] != "-x"}  {
    puts  "Error! Try 'char2ent --help' to see how to use this program.\n"
    exit
}

# Complain and exit if not exactly 3 arguments (option, input, output) are found 
if  {$argc != 3}  {
    puts  "Error! You must use exactly 3 arguments.\n"
    puts  "$argv\n"
    puts  "Error! Try 'char2ent --help' to see how to use this program.\n"
    exit
}

# Complain and exit if input file does not exist 
if  {! [file exists [lindex $argv 1]]}  {
    puts  "Error! File \"[lindex $argv 1]\" not found!\n"
    exit
}

# Complain and exit if input file is not readable 
if  {! [file readable [lindex $argv 1]]}  {
    puts  "Error! Permission denied to read [ lindex $argv 1 ]!\n"
    exit
}

# Complain if input file and output file are the same 
if  {[lindex $argv 1] == [lindex $argv 2]}  {
    puts  "Error! \"input file\" and \"output file\" must not be the same.\n"
    exit
}

# Try to open input file for reading. 
# Complain and exit in case of errors. 
if  {[catch {set IF [open [lindex $argv 1] r]} IFerror]}  {
    puts  "Error! $IFerror\n"
    exit
}

# Try to open output file for writing. 
# Complain, close input file and exit in case of errors. 
if  {[catch {set OF [open [lindex $argv 2] w]} OFerror]}        {
    close $IF
    puts  "Error! $OFerror\n"
    exit
}

# ================================================
# Two files open. No errors this far. Let's replace. 

set  CHARS  {
    ª    º    À    Á    Â    Ã    Ä    Å
    Æ    Ç    È    É    Ê    Ë    Ì    Í
    Î    Ï    Ð    Ñ    Ò    Ó    Ô    Õ
    Ö    Ø    Ù    Ú    Û    Ü    Ý    Þ
    ß    à    á    â    ã    ä    å    æ
    ç    è    é    ê    ë    ì    í    î
    ï    ð    ñ    ò    ó    ô    õ    ö
    ø    ù    ú    û    ü    ý    þ    ÿ
    Œ    œ    Ÿ
}

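set  HTML  {
    &ordf;    &ordm;    &Agrave;  &Aacute;  &Acirc;   &Atilde;  &Auml;    &Aring;
    &AElig;   &Ccedil;  &Egrave;  &Eacute;  &Ecirc;   &Euml;    &Igrave;  &Iacute;
    &Icirc;   &Iuml;    &ETH;     &Ntilde;  &Ograve;  &Oacute;  &Ocirc;   &Otilde;
    &Ouml;    &Oslash;  &Ugrave;  &Uacute;  &Ucirc;   &Uuml;    &Yacute;  &THORN;
    &szlig;   &agrave;  &aacute;  &acirc;   &atilde;  &auml;    &aring;   &aelig;
    &ccedil;  &egrave;  &eacute;  &ecirc;   &euml;    &igrave;  &iacute;  &icirc;
    &iuml;    &eth;     &ntilde;  &ograve;  &oacute;  &ocirc;   &otilde;  &ouml;
    &oslash;  &ugrave;  &uacute;  &ucirc;   &uuml;    &yacute;  &thorn;   &yuml;
    &OElig;   &oelig;   &Yuml;
}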

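set  XML  {
    &#170;    &#186;    &#192;    &#193;    &#194;    &#195;    &#196;    &#197;
    &#198;    &#199;    &#200;    &#201;    &#202;    &#203;    &#204;    &#205;
    &#206;    &#207;    &#208;    &#209;    &#210;    &#211;    &#212;    &#213;
    &#214;    &#216;    &#217;    &#218;    &#219;    &#220;    &#221;    &#222;
    &#223;    &#224;    &#225;    &#226;    &#227;    &#228;    &#229;    &#230;
    &#231;    &#232;    &#233;    &#234;    &#235;    &#236;    &#237;    &#238;
    &#239;    &#240;    &#241;    &#242;    &#243;    &#244;    &#245;    &#246;
    &#248;    &#249;    &#250;    &#251;    &#252;    &#253;    &#254;    &#255;
    &#338;    &#339;    &#376;
}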

set TEXT [read $IF]

for  {set i 0}  {$i < [llength $CHARS]}  {incr i}  {
     switch -- [lindex $argv 0]  {
         "-h"    {set REPL [lindex $HTML $i]}
         "-x"    {set REPL [lindex $XML  $i]}
     }
    set TEXT [string map [list [lindex $CHARS $i] $REPL] $TEXT]
}

puts -nonewline $OF $TEXT
close $IF
close $OF

exit

The code was posted here entirely at first, but I never succeeded in having all characters displayed correctly. Today (May 2007), I saw that many other characters were displayed incorrectly: a whole series of them displayed as question marks. I spent almost half an hour trying to correct it, but I couldn't make it work. Wikit keeps complaining about some "bogus" character and blames my browser (Firefox). Konqueror didn't work either. I give up. Just download the thing and move on.

LES in 2022: it seems the Wiki displays all the characters correctly now. Yay! But it kills all the XML characters when editing. No yay! :-(


RS Note that read includes the trailing newline; to write such a file-string out it is best to use puts -nonewline so you don't get an extra one. Also, string map is happy with long maps, so you can avoid the for loop by just coding

set XMLmap {& &amp; < &lt; > &gt; ...}
set HTMLmap {Ä &Auml; ...}
...
set myMap $XMLmap
...
set myText [string map $myMap $myText]
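
A minimal sketch of the single-map approach suggested above, using a deliberately shortened character list. The names `chars`, `htmlEnts`, `htmlMap`, `sample` and `result` are illustrative and do not come from the original script, and the file is assumed to be saved in an encoding that tclsh reads correctly (for example UTF-8 on a UTF-8 system). It pairs each character with its entity once and then does all replacements in a single `string map` call.

```tcl
# Illustrative subset only; the full script lists many more characters.
set chars    {À Á Ç é ñ}
set htmlEnts {&Agrave; &Aacute; &Ccedil; &eacute; &ntilde;}

# Build the flat {char entity char entity ...} list that string map expects.
set htmlMap {}
foreach c $chars e $htmlEnts {
    lappend htmlMap $c $e
}

# One pass over the text replaces every mapped character.
set sample "Ça va, señor José?"
set result [string map $htmlMap $sample]
puts $result
```

Because `string map` walks the input only once and does not rescan text it has already replaced, a map that also contains `& &amp;` should not re-escape the ampersands of entities inserted by the same call.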

arjen - 2022-09-15 07:05:39

We will be looking into this issue: the &...; entities are displayed as the characters they represent, rather than the entities themselves.
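
The TODO item in the script header about handling all possible characters could be approached without any lookup table, by emitting a decimal numeric character reference for every character above U+007F. The sketch below is only one possible approach: `toNumericEntities` is a hypothetical helper, not part of char2ent.tcl, and characters outside the Basic Multilingual Plane may need extra care depending on the Tcl version.

```tcl
# Hypothetical helper, not part of char2ent.tcl: replace every character
# above U+007F with a decimal numeric character reference (&#NNN;).
proc toNumericEntities {text} {
    set out ""
    foreach ch [split $text ""] {
        # scan %c yields the character's code point as an integer.
        scan $ch %c code
        if {$code > 127} {
            append out "&#$code;"
        } else {
            append out $ch
        }
    }
    return $out
}

puts [toNumericEntities "Œuvre, café"]
```

Numeric references are valid in both HTML and XML, so a helper like this would mainly be a candidate for the `-x` mode; the `-h` mode would still need the named-entity table.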