char2ent

|What:| char2ent.tcl|
| Where:| https://pastebin.com/c2dPigpd|
| Description:| Read an HTML or XML file containing literal special characters, then replace specials with appropriate SGML entities and output.|
|| Currently at version 0.0.|
| Updated:| 11/2004|

Contact: [LES]

----
[LES] on November 18, 2004: Get the code here: [https://pastebin.com/c2dPigpd]

Or here:


======
#!/usr/bin/env tclsh

# char2ent.tcl - Opens an HTML or XML file with special characters 
#                 (diacritics) written in plain text, replaces these special 
#                 characters with appropriate HTML or XML entities and writes the 
#                output to a new file. 
#
# Author: Luciano Espirito Santo 
#
# History 
# 
#         Version 1.0        2004-11-18        Luciano Espirito Santo
#                 First version. Alpha stage.
#
#                 KNOWN ISSUES: 
#                - No user-proof measures, no error or exception handling, no nothing! 
#                   No guarantees! Use it at your own risk!
#                - Tested on Windows 98 and Linux only.
#
#                TODO: 
#                 - Extend it so it can handle all possible characters (Unicode).
#                - Add the ability to do the INVERSE operation (that would include the 
#                  ability to tell non-escaped entities from escaped entities and NOT 
#                  replace the escaped entities).
#                - Make it handle STDIN.
#
#                LICENSE: BSD
#
# How to use it:
# 
# char2ent.tcl  --help

# ----------------------------------------------------------------
# Do not change anything below this point unless you know what you're doing.


# Print help text and exit if '--help' is the only argument 
if  {$argc == 1 && [lindex $argv 0] == "--help"}  {
    puts  ""
    puts  "char2ent, by Luciano Espirito Santo - 2004"
    puts  ""
    puts  {Usage: char2ent -[option]  "input file"  "output file"}
    puts  ""
    puts  "Possible options:"
    puts  "-h: convert special characters to HTML entities"
    puts  "-x: convert special characters to XML entities"
    puts  ""
    puts  {"input file" MUST exist}
    puts  {"output file" is created automatically if it does not exist}
    puts  {"input file" and "output file" MUST NOT be the same file}
    puts  ""
    puts  {Example: char2ent -x "sample.xml" "converted.xml"}
    puts  ""
    exit
}

# Complain and exit if option is neither '-h' nor '-x' 
if  {[lindex $argv 0] != "-h" && [lindex $argv 0] != "-x"}  {
    puts  "Error! Try 'char2ent --help' to see how to use this program.\n"
    exit
}

# Complain and exit if not exactly 3 arguments (option, input, output) are found 
if  {$argc != 3}  {
    puts  "Error! You must use exactly 3 arguments.\n"
    puts  "$argv\n"
    puts  "Error! Try 'char2ent --help' to see how to use this program.\n"
    exit
}

# Complain and exit if input file does not exist 
if  {! [file exists [lindex $argv 1]]}  {
    puts  "Error! File \"[lindex $argv 1]\" not found!\n"
    exit
}

# Complain and exit if input file is not readable 
if  {! [file readable [lindex $argv 1]]}  {
    puts  "Error! Permission denied to read [ lindex $argv 1 ]!\n"
    exit
}

# Complain if input file and output file are the same 
if  {[lindex $argv 1] == [lindex $argv 2]}  {
    puts  "Error! \"input file\" and \"output file\" must not be the same.\n"
    exit
}

# Try to open input file for reading. 
# Complain and exit in case of errors. 
if  {[catch {set IF [open [lindex $argv 1] r]} IFerror]}  {
    puts  "Error! $IFerror\n"
    exit
}

# Try to open output file for writing. 
# Complain, close input file and exit in case of errors. 
if  {[catch {set OF [open [lindex $argv 2] w]} OFerror]}        {
    close $IF
    puts  "Error! $OFerror\n"
    exit
}        

# ================================================
# Two files open. No errors this far. Let's replace. 

set  CHARS  {    ª    º    À    Á    Â    Ã    Ä    Å
    Æ    Ç        
    È    É    Ê    Ë    Ì    Í
    Î    Ï    Ð    Ñ        
    Ò    Ó    Ô    Õ
    Ö    Ø    Ù    Ú    Û    Ü        
    Ý    Þ
    ß    à    á    â    ã    ä    å    æ        
    ç    è    é    ê    ë    ì    í    î
    ï    ð        
    ñ    ò    ó    ô    õ    ö
    ø    ù    ú    û        
    ü    ý    þ    ÿ
    Œ    œ    Ÿ
}

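set  HTML  {    &ordf;    &ordm;    &Agrave;    &Aacute;    &Acirc;    &Atilde;    &Auml;    &Aring;
    &AElig;    &Ccedil;
    &Egrave;    &Eacute;    &Ecirc;    &Euml;    &Igrave;    &Iacute;
    &Icirc;    &Iuml;    &ETH;    &Ntilde;
    &Ograve;    &Oacute;    &Ocirc;    &Otilde;
    &Ouml;    &Oslash;    &Ugrave;    &Uacute;    &Ucirc;    &Uuml;
    &Yacute;    &THORN;
    &szlig;    &agrave;    &aacute;    &acirc;    &atilde;    &auml;    &aring;    &aelig;
    &ccedil;    &egrave;    &eacute;    &ecirc;    &euml;    &igrave;    &iacute;    &icirc;
    &iuml;    &eth;
    &ntilde;    &ograve;    &oacute;    &ocirc;    &otilde;    &ouml;
    &oslash;    &ugrave;    &uacute;    &ucirc;
    &uuml;    &yacute;    &thorn;    &yuml;
    &OElig;    &oelig;    &Yuml;
}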

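# Numeric character references in the same order as CHARS.
# Decimal form is assumed here; the wiki rendering lost the original values.
set  XML  {    &#170;    &#186;    &#192;    &#193;    &#194;    &#195;    &#196;    &#197;
    &#198;    &#199;
    &#200;    &#201;    &#202;    &#203;    &#204;    &#205;
    &#206;    &#207;    &#208;    &#209;
    &#210;    &#211;    &#212;    &#213;
    &#214;    &#216;    &#217;    &#218;    &#219;    &#220;
    &#221;    &#222;
    &#223;    &#224;    &#225;    &#226;    &#227;    &#228;    &#229;    &#230;
    &#231;    &#232;    &#233;    &#234;    &#235;    &#236;    &#237;    &#238;
    &#239;    &#240;
    &#241;    &#242;    &#243;    &#244;    &#245;    &#246;
    &#248;    &#249;    &#250;    &#251;
    &#252;    &#253;    &#254;    &#255;
    &#338;    &#339;    &#376;
}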

set TEXT [read $IF]

for  {set i 0}  {$i < [llength $CHARS]}  {incr i}  {
     switch -- [lindex $argv 0]  {
         "-h"    {set REPL [lindex $HTML $i]}
         "-x"    {set REPL [lindex $XML  $i]}
     }
    set TEXT [string map "[lindex $CHARS $i] $REPL" $TEXT]
}

puts -nonewline $OF $TEXT
close $IF
close $OF

exit
======



The code was originally posted here in full, but I never managed to get all the characters displayed correctly. Today (May 2007), I saw that many other characters were displayed incorrectly: a whole series of them showed up as question marks. I spent almost half an hour trying to correct it, but couldn't make it work. Wikit keeps complaining about some "bogus" character and blames my browser (Firefox). Konqueror didn't work either. I give up. Just download the thing and move on.

[LES] in 2022: it seems the Wiki displays all the characters correctly now. Yay! But it kills all the XML characters when editing. No yay! :-(
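
The TODO item in the script about handling all possible characters might not need a lookup table at all, at least for the XML case. What follows is only a sketch under that assumption: the proc name `nonAsciiToNumeric` is made up for illustration and is not part of char2ent.tcl. It replaces every character above the ASCII range with a decimal numeric character reference.

======
# Hypothetical helper, not part of char2ent.tcl: turn every character
# above the ASCII range into a decimal numeric character reference.
proc nonAsciiToNumeric {text} {
    set out ""
    foreach ch [split $text ""] {
        # %c stores the character's Unicode value as an integer
        # and does not skip whitespace.
        scan $ch %c code
        if {$code > 127} {
            append out "&#$code;"
        } else {
            append out $ch
        }
    }
    return $out
}

puts [nonAsciiToNumeric "caf\u00e9 \u0152uvre"]
# prints: caf&#233; &#338;uvre
======

This would not help with the -h option, where named entities are wanted, and characters outside the Basic Multilingual Plane may need extra care depending on the Tcl version.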

----
[RS] Note that [read] includes the trailing newline; to write such a file-string out it is best to use ''[puts] -nonewline'' so you don't get an extra one. Also, [string map] is happy with long maps, so you can avoid the [for] loop by just coding

======
set XMLmap {& &amp; < &lt; > &gt; ...}
set HTMLmap {Ä &Auml; ...}
...
set myMap $XMLmap
...
set myText [string map $myMap $myText]
======
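
A complete, runnable version of that idea could look like the sketch below. The map is shortened and the names `demoMap` and `demoText` are invented for the example; in the real script the text would come from [read] on the input file.

======
# Hypothetical, shortened map: a flat list of "character entity" pairs.
set demoMap {
    &  &amp;
    <  &lt;
    >  &gt;
    Ä  &Auml;
    é  &eacute;
}

set demoText "<p>Ärger & café</p>\n"
set converted [string map $demoMap $demoText]

# The text already ends in a newline, as a string from read would,
# so write it without adding another one.
puts -nonewline $converted
# prints: &lt;p&gt;&Auml;rger &amp; caf&eacute;&lt;/p&gt;
======

Since [string map] walks the string only once and never rescans what it has already replaced, the & inside a freshly inserted entity is not escaped a second time. With one [string map] call per character, as in the [for] loop, that would depend on the order of the pairs.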

----
'''[arjen] - 2022-09-15 07:05:39'''

We will be looking into this issue: the &...; entities are displayed as the characters they represent, rather than the entities themselves.

<<categories>> Application | Characters | Dev. Tools | XML