J. Siekmeyer - up to now still in the reading/fixing typos/fighting wiki-spam state of "wikit-activism".

Title: char2ent
Date: 23 Feb 2005 01:09:10 GMT
Site: LES,200.171.159.14

Posted by [LES] on November 18, 2004

 #!/usr/local/bin/tclsh
 # char2ent.tcl - Opens an HTML or XML file with special characters
 # (diacritics) written in plain text, replaces these special
 # characters with appropriate HTML or XML entities and writes
 # the output to a new file.
 #
 # Author: Luciano Espirito Santo
 #
 # History
 #
 # Version 1.0   2004-11-18   Luciano Espirito Santo
 # First version. Alpha stage.
 #
 # KNOWN ISSUES:
 # - No user-proof measures, no error or exception handling, no nothing!
 #   No guarantees! Use it at your own risk!
 # - Tested on Windows 98 only. Permission issues are likely to come up
 #   in other operating systems. Permission issues are harmless. The
 #   program will just not be able to read the input file and/or write
 #   to the output file.
 #
 # TODO:
 # - Add the ability to do the INVERSE operation (that would include
 #   the ability to tell non-escaped entities from escaped entities and
 #   NOT replace the escaped entities).
 # - Make it handle STDIN.
 #
 # LICENSE: BSD
 #
 # How to use it:
 #
 # char2ent.tcl --help
 # ----------------------------------------------------------------
 # Do not change anything below this point unless you know what you're doing.
 # Print help text and exit if '--help' is the only argument
 if { $argc == 1 && [ lindex $argv 0 ] == "--help" } {
     puts ""
     puts "char2ent, by Luciano Espirito Santo - 2004"
     puts ""
     puts {Usage: char2ent -[option] "input file" "output file"}
     puts ""
     puts "Possible options:"
     puts "-h: convert special characters to HTML entities"
     puts "-x: convert special characters to XML entities"
     puts ""
     puts {"input file" MUST exist}
     puts {"output file" is created automatically if it does not exist}
     puts {"input file" and "output file" MUST NOT be the same file}
     puts ""
     puts {Example: char2ent -x "sample.xml" "converted.xml"}
     puts ""
     exit
 }

 # Complain and exit if option is neither '-h' nor '-x'
 if { [ lindex $argv 0 ] != "-h" && [ lindex $argv 0 ] != "-x" } {
     puts "Error! Try 'char2ent --help' to see how to use this program.\n"
     exit
 }

 # Complain and exit if not exactly 3 arguments (option, input, output) found
 if { $argc != 3 } {
     puts "Error! Try 'char2ent --help' to see how to use this program.\n"
     exit
 }

 # Complain and exit if input file does not exist
 if { ! [ file exists [ lindex $argv 1 ] ] } {
     puts "Error! File \"[ lindex $argv 1 ]\" not found!\n"
     exit
 }

 # Complain and exit if input file is not readable
 if { ! [ file readable [ lindex $argv 1 ] ] } {
     puts "Error! Permission denied to read [ lindex $argv 1 ]!\n"
     exit
 }

 # Complain if input file and output file are the same
 if { [ lindex $argv 1 ] == [ lindex $argv 2 ] } {
     puts "Error! \"input file\" and \"output file\" must not be the same.\n"
     exit
 }

 # Try to open input file for reading.
 # Complain and exit in case of errors.
 if { [ catch { set myIF [ open [ lindex $argv 1 ] r ] } myIFError ] } {
     puts "Error! $myIFError\n"
     exit
 }

 # Try to open output file for writing.
 # Complain, close input file and exit in case of errors.
 if { [ catch { set myOF [ open [ lindex $argv 2 ] w ] } myOFError ] } {
     close $myIF
     puts "Error! $myOFError\n"
     exit
 }

 # ===============================================
 # Two files open.
 # No errors so far. Let's replace.

 set myChars {
     ª º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß
     à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ
     Œ œ Ÿ
 }

 set myHTML {
     &ordf; &ordm; &Agrave; &Aacute; &Acirc; &Atilde; &Auml; &Aring; &AElig; &Ccedil;
     &Egrave; &Eacute; &Ecirc; &Euml; &Igrave; &Iacute; &Icirc; &Iuml; &ETH; &Ntilde;
     &Ograve; &Oacute; &Ocirc; &Otilde; &Ouml; &Oslash; &Ugrave; &Uacute; &Ucirc; &Uuml;
     &Yacute; &THORN; &szlig;
     &agrave; &aacute; &acirc; &atilde; &auml; &aring; &aelig; &ccedil;
     &egrave; &eacute; &ecirc; &euml; &igrave; &iacute; &icirc; &iuml; &eth; &ntilde;
     &ograve; &oacute; &ocirc; &otilde; &ouml; &oslash; &ugrave; &uacute; &ucirc; &uuml;
     &yacute; &thorn; &yuml;
     &OElig; &oelig; &Yuml;
 }

 set myXML {
     &#170; &#186; &#192; &#193; &#194; &#195; &#196; &#197; &#198; &#199;
     &#200; &#201; &#202; &#203; &#204; &#205; &#206; &#207; &#208; &#209;
     &#210; &#211; &#212; &#213; &#214; &#216; &#217; &#218; &#219; &#220;
     &#221; &#222; &#223;
     &#224; &#225; &#226; &#227; &#228; &#229; &#230; &#231;
     &#232; &#233; &#234; &#235; &#236; &#237; &#238; &#239; &#240; &#241;
     &#242; &#243; &#244; &#245; &#246; &#248; &#249; &#250; &#251; &#252;
     &#253; &#254; &#255;
     &#338; &#339; &#376;
 }

 set myText [ read $myIF ]

 # Replace each special character with the entity at the same list index
 for { set i 0 } { $i < [ llength $myChars ] } { incr i } {
     switch -- [ lindex $argv 0 ] {
         "-h" { set myReplace [ lindex $myHTML $i ] }
         "-x" { set myReplace [ lindex $myXML $i ] }
     }
     set myText [ string map "[ lindex $myChars $i ] $myReplace" $myText ]
 }

 puts -nonewline $myOF $myText
 close $myIF
 close $myOF
 exit

----
[LES]: Unfortunately, I can't make [wikit] display the numeric codes in the "set myXML" line... Jean-Claude?
----
[RS] Note that [read] includes the trailing newline; to write such a file-string out it is best to use ''[puts] -nonewline'' so you don't get an extra one. Also, [string map] is happy with long maps, so you can avoid the [for] loop by just coding

 set XMLmap {& &amp; < &lt; > &gt; ...}
 set HTMLmap {Ä &Auml; ...}
 ...
 set myMap $XMLmap
 ...
 set myText [string map $myMap $myText]

----
[JSI] 23feb05 The numeric codes in the "set myXML" line are preserved when requesting http://mini.net/tcl/13008.txt instead of http://mini.net/tcl/13008; this delivers the raw text unchanged. It doesn't exactly solve the problem, but it is my best idea so far. Unformatted text is used here 1) for Tcl-code (where any "help" from the wikit-parser is in most cases unwanted) and 2) for a lot of lists like [Tcl editors], where for example the automagic link generation is of good use. There has been discussion about moving Tcl-code elsewhere or using a special markup for Tcl-code.
I assume and agree that searching and converting all existing pages containing Tcl-code would generate a lot of work compared to adding ".txt" to a URL for those few pages which suffer from "parser-damaging" ;-)

[LES]: Oh, boy. Someone's browser is not too good with i18n and weird characters. It messed up all of them in the '''set myChars''' line. I might have fixed it now. No editor in my Linux installation seems to be capable of displaying the last three characters correctly, so they could be wrong now (DANGER! DANGER!). Boy, I'd be glad to support any "death to internationalization-challenged browsers/applications/pages" movement or campaign! This is the 21st century, for chrissake. Thank God Tcl has no problems with that. Anyway, [JSI] made interesting remarks. Perhaps we could/should start a new page called '''[[parser-damaging]]''' to discuss/expose this '''.txt''' option? Feedback appreciated. I am not so fond of going nuts and creating new pages like there is no tomorrow, like some notable neophytes.
----
[Category Person]
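----
A minimal sketch of [RS]'s suggestion above: build one map and call [string map] once instead of looping over the lists. It assumes decimal numeric character references are what the XML case wants, and it covers only the code points the script lists (ª, º, À through ÿ without × and ÷, plus Œ, œ and Ÿ). The names '''buildXmlMap''' and '''xmlMap''' are made up for this illustration and are not part of char2ent.tcl.

```tcl
# Sketch only: buildXmlMap is a hypothetical helper, not part of char2ent.tcl.
# It returns a flat {char entity char entity ...} list for [string map].
proc buildXmlMap {} {
    # Code points outside the contiguous Latin-1 block: ª º Œ œ Ÿ
    set codes {170 186 338 339 376}

    # À (192) through ÿ (255), skipping × (215) and ÷ (247)
    for {set c 192} {$c <= 255} {incr c} {
        if {$c != 215 && $c != 247} {
            lappend codes $c
        }
    }

    set map {}
    foreach c $codes {
        # [format %c] turns the code point into the character itself;
        # the replacement is its decimal numeric character reference.
        lappend map [format %c $c] "&#$c;"
    }
    return $map
}

set xmlMap [buildXmlMap]

# One pass over the text, no loop needed
set sample "Luciano Esp[format %c 237]rito Santo"
puts [string map $xmlMap $sample]
```

Generating the map from code points also sidesteps the display trouble discussed above, since the source then contains no literal special characters and no literal entities for a browser or the wikit parser to mangle.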