aricb: HTML character entity references are defined as part of the HTML specification: http://www.w3.org/TR/REC-html40/sgml/entities.html
The following code defines two commands, htmlentities::insertentities and htmlentities::removeentities. The first command replaces Unicode characters with HTML character entity references; the second command replaces HTML character entity references with the literal Unicode characters they represent. There's nothing fancy about the code, but hopefully it's useful to somebody.
namespace eval htmlentities {
    variable map [list \
        \u00a0 &nbsp\; \u00a1 &iexcl\; \u00a2 &cent\; \u00a3 &pound\; \
        \u00a4 &curren\; \u00a5 &yen\; \u00a6 &brvbar\; \u00a7 &sect\; \
        \u00a8 &uml\; \u00a9 &copy\; \u00aa &ordf\; \u00ab &laquo\; \
        \u00ac &not\; \u00ad &shy\; \u00ae &reg\; \u00af &macr\; \
        \u00b0 &deg\; \u00b1 &plusmn\; \u00b2 &sup2\; \u00b3 &sup3\; \
        \u00b4 &acute\; \u00b5 &micro\; \u00b6 &para\; \u00b7 &middot\; \
        \u00b8 &cedil\; \u00b9 &sup1\; \u00ba &ordm\; \u00bb &raquo\; \
        \u00bc &frac14\; \u00bd &frac12\; \u00be &frac34\; \u00bf &iquest\; \
        \u00c0 &Agrave\; \u00c1 &Aacute\; \u00c2 &Acirc\; \u00c3 &Atilde\; \
        \u00c4 &Auml\; \u00c5 &Aring\; \u00c6 &AElig\; \u00c7 &Ccedil\; \
        \u00c8 &Egrave\; \u00c9 &Eacute\; \u00ca &Ecirc\; \u00cb &Euml\; \
        \u00cc &Igrave\; \u00cd &Iacute\; \u00ce &Icirc\; \u00cf &Iuml\; \
        \u00d0 &ETH\; \u00d1 &Ntilde\; \u00d2 &Ograve\; \u00d3 &Oacute\; \
        \u00d4 &Ocirc\; \u00d5 &Otilde\; \u00d6 &Ouml\; \u00d7 &times\; \
        \u00d8 &Oslash\; \u00d9 &Ugrave\; \u00da &Uacute\; \u00db &Ucirc\; \
        \u00dc &Uuml\; \u00dd &Yacute\; \u00de &THORN\; \u00df &szlig\; \
        \u00e0 &agrave\; \u00e1 &aacute\; \u00e2 &acirc\; \u00e3 &atilde\; \
        \u00e4 &auml\; \u00e5 &aring\; \u00e6 &aelig\; \u00e7 &ccedil\; \
        \u00e8 &egrave\; \u00e9 &eacute\; \u00ea &ecirc\; \u00eb &euml\; \
        \u00ec &igrave\; \u00ed &iacute\; \u00ee &icirc\; \u00ef &iuml\; \
        \u00f0 &eth\; \u00f1 &ntilde\; \u00f2 &ograve\; \u00f3 &oacute\; \
        \u00f4 &ocirc\; \u00f5 &otilde\; \u00f6 &ouml\; \u00f7 &divide\; \
        \u00f8 &oslash\; \u00f9 &ugrave\; \u00fa &uacute\; \u00fb &ucirc\; \
        \u00fc &uuml\; \u00fd &yacute\; \u00fe &thorn\; \u00ff &yuml\; \
        \u0192 &fnof\; \
        \u0391 &Alpha\; \u0392 &Beta\; \u0393 &Gamma\; \u0394 &Delta\; \
        \u0395 &Epsilon\; \u0396 &Zeta\; \u0397 &Eta\; \u0398 &Theta\; \
        \u0399 &Iota\; \u039a &Kappa\; \u039b &Lambda\; \u039c &Mu\; \
        \u039d &Nu\; \u039e &Xi\; \u039f &Omicron\; \u03a0 &Pi\; \
        \u03a1 &Rho\; \u03a3 &Sigma\; \u03a4 &Tau\; \u03a5 &Upsilon\; \
        \u03a6 &Phi\; \u03a7 &Chi\; \u03a8 &Psi\; \u03a9 &Omega\; \
        \u03b1 &alpha\; \u03b2 &beta\; \u03b3 &gamma\; \u03b4 &delta\; \
        \u03b5 &epsilon\; \u03b6 &zeta\; \u03b7 &eta\; \u03b8 &theta\; \
        \u03b9 &iota\; \u03ba &kappa\; \u03bb &lambda\; \u03bc &mu\; \
        \u03bd &nu\; \u03be &xi\; \u03bf &omicron\; \u03c0 &pi\; \
        \u03c1 &rho\; \u03c2 &sigmaf\; \u03c3 &sigma\; \u03c4 &tau\; \
        \u03c5 &upsilon\; \u03c6 &phi\; \u03c7 &chi\; \u03c8 &psi\; \
        \u03c9 &omega\; \u03d1 &thetasym\; \u03d2 &upsih\; \u03d6 &piv\; \
        \u2022 &bull\; \u2026 &hellip\; \u2032 &prime\; \u2033 &Prime\; \
        \u203e &oline\; \u2044 &frasl\; \u2118 &weierp\; \u2111 &image\; \
        \u211c &real\; \u2122 &trade\; \u2135 &alefsym\; \
        \u2190 &larr\; \u2191 &uarr\; \u2192 &rarr\; \u2193 &darr\; \
        \u2194 &harr\; \u21b5 &crarr\; \u21d0 &lArr\; \u21d1 &uArr\; \
        \u21d2 &rArr\; \u21d3 &dArr\; \u21d4 &hArr\; \
        \u2200 &forall\; \u2202 &part\; \u2203 &exist\; \u2205 &empty\; \
        \u2207 &nabla\; \u2208 &isin\; \u2209 &notin\; \u220b &ni\; \
        \u220f &prod\; \u2211 &sum\; \u2212 &minus\; \u2217 &lowast\; \
        \u221a &radic\; \u221d &prop\; \u221e &infin\; \u2220 &ang\; \
        \u2227 &and\; \u2228 &or\; \u2229 &cap\; \u222a &cup\; \
        \u222b &int\; \u2234 &there4\; \u223c &sim\; \u2245 &cong\; \
        \u2248 &asymp\; \u2260 &ne\; \u2261 &equiv\; \u2264 &le\; \
        \u2265 &ge\; \u2282 &sub\; \u2283 &sup\; \u2284 &nsub\; \
        \u2286 &sube\; \u2287 &supe\; \u2295 &oplus\; \u2297 &otimes\; \
        \u22a5 &perp\; \u22c5 &sdot\; \u2308 &lceil\; \u2309 &rceil\; \
        \u230a &lfloor\; \u230b &rfloor\; \u2329 &lang\; \u232a &rang\; \
        \u25ca &loz\; \u2660 &spades\; \u2663 &clubs\; \u2665 &hearts\; \
        \u2666 &diams\; \
        \u0022 &quot\; \u0026 &amp\; \u003c &lt\; \u003e &gt\; \
        \u0152 &OElig\; \u0153 &oelig\; \u0160 &Scaron\; \u0161 &scaron\; \
        \u0178 &Yuml\; \u02c6 &circ\; \u02dc &tilde\; \
        \u2002 &ensp\; \u2003 &emsp\; \u2009 &thinsp\; \u200c &zwnj\; \
        \u200d &zwj\; \u200e &lrm\; \u200f &rlm\; \u2013 &ndash\; \
        \u2014 &mdash\; \u2018 &lsquo\; \u2019 &rsquo\; \u201a &sbquo\; \
        \u201c &ldquo\; \u201d &rdquo\; \u201e &bdquo\; \u2020 &dagger\; \
        \u2021 &Dagger\; \u2030 &permil\; \u2039 &lsaquo\; \u203a &rsaquo\; \
        \u20ac &euro\; \
    ]
    variable reversemap [lreverse $map]

    # Replace Unicode characters with HTML character entity references.
    proc insertentities {string} {
        variable map
        return [string map $map $string]
    }

    # Replace HTML character entity references with the literal characters.
    proc removeentities {string} {
        variable reversemap
        return [string map $reversemap $string]
    }
}
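To try the two commands without pasting the whole table, here is a cut-down sketch. The namespace name entdemo and its four-entry map are made up for illustration; the procs mirror the ones in the full code, and [lreverse] assumes Tcl 8.5 or later.

```tcl
# Hypothetical cut-down copy of the htmlentities namespace, for illustration only.
namespace eval entdemo {
    # Four pairs only; the real table has a couple of hundred.
    variable map [list \
        \u0026 &amp\; \
        \u003c &lt\; \
        \u003e &gt\; \
        \u00e9 &eacute\; \
    ]
    variable reversemap [lreverse $map]

    proc insertentities {string} {
        variable map
        return [string map $map $string]
    }

    proc removeentities {string} {
        variable reversemap
        return [string map $reversemap $string]
    }
}

set plain "caf\u00e9 <b> & more"
set encoded [entdemo::insertentities $plain]
puts $encoded                                   ;# caf&eacute; &lt;b&gt; &amp; more
set roundtrip [entdemo::removeentities $encoded]
puts [expr {$roundtrip eq $plain}]              ;# 1
```

Note that the "&" inserted by one replacement is not re-encoded by the "&" rule, because [string map] makes a single pass over the input.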
AMG: Here's the same code, written in a somewhat more compressed manner. I took the liberty of using namespace ensembles and renaming the commands.
namespace eval htmlentities {
    namespace ensemble create -subcommands {encode decode}
    variable encode {
        \u00a0 &nbsp\; \u00a1 &iexcl\; \u00a2 &cent\; \u00a3 &pound\;
        \u00a4 &curren\; \u00a5 &yen\; \u00a6 &brvbar\; \u00a7 &sect\;
        \u00a8 &uml\; \u00a9 &copy\; \u00aa &ordf\; \u00ab &laquo\;
        \u00ac &not\; \u00ad &shy\; \u00ae &reg\; \u00af &macr\;
        \u00b0 &deg\; \u00b1 &plusmn\; \u00b2 &sup2\; \u00b3 &sup3\;
        \u00b4 &acute\; \u00b5 &micro\; \u00b6 &para\; \u00b7 &middot\;
        \u00b8 &cedil\; \u00b9 &sup1\; \u00ba &ordm\; \u00bb &raquo\;
        \u00bc &frac14\; \u00bd &frac12\; \u00be &frac34\; \u00bf &iquest\;
        \u00c0 &Agrave\; \u00c1 &Aacute\; \u00c2 &Acirc\; \u00c3 &Atilde\;
        \u00c4 &Auml\; \u00c5 &Aring\; \u00c6 &AElig\; \u00c7 &Ccedil\;
        \u00c8 &Egrave\; \u00c9 &Eacute\; \u00ca &Ecirc\; \u00cb &Euml\;
        \u00cc &Igrave\; \u00cd &Iacute\; \u00ce &Icirc\; \u00cf &Iuml\;
        \u00d0 &ETH\; \u00d1 &Ntilde\; \u00d2 &Ograve\; \u00d3 &Oacute\;
        \u00d4 &Ocirc\; \u00d5 &Otilde\; \u00d6 &Ouml\; \u00d7 &times\;
        \u00d8 &Oslash\; \u00d9 &Ugrave\; \u00da &Uacute\; \u00db &Ucirc\;
        \u00dc &Uuml\; \u00dd &Yacute\; \u00de &THORN\; \u00df &szlig\;
        \u00e0 &agrave\; \u00e1 &aacute\; \u00e2 &acirc\; \u00e3 &atilde\;
        \u00e4 &auml\; \u00e5 &aring\; \u00e6 &aelig\; \u00e7 &ccedil\;
        \u00e8 &egrave\; \u00e9 &eacute\; \u00ea &ecirc\; \u00eb &euml\;
        \u00ec &igrave\; \u00ed &iacute\; \u00ee &icirc\; \u00ef &iuml\;
        \u00f0 &eth\; \u00f1 &ntilde\; \u00f2 &ograve\; \u00f3 &oacute\;
        \u00f4 &ocirc\; \u00f5 &otilde\; \u00f6 &ouml\; \u00f7 &divide\;
        \u00f8 &oslash\; \u00f9 &ugrave\; \u00fa &uacute\; \u00fb &ucirc\;
        \u00fc &uuml\; \u00fd &yacute\; \u00fe &thorn\; \u00ff &yuml\;
        \u0192 &fnof\;
        \u0391 &Alpha\; \u0392 &Beta\; \u0393 &Gamma\; \u0394 &Delta\;
        \u0395 &Epsilon\; \u0396 &Zeta\; \u0397 &Eta\; \u0398 &Theta\;
        \u0399 &Iota\; \u039a &Kappa\; \u039b &Lambda\; \u039c &Mu\;
        \u039d &Nu\; \u039e &Xi\; \u039f &Omicron\; \u03a0 &Pi\;
        \u03a1 &Rho\; \u03a3 &Sigma\; \u03a4 &Tau\; \u03a5 &Upsilon\;
        \u03a6 &Phi\; \u03a7 &Chi\; \u03a8 &Psi\; \u03a9 &Omega\;
        \u03b1 &alpha\; \u03b2 &beta\; \u03b3 &gamma\; \u03b4 &delta\;
        \u03b5 &epsilon\; \u03b6 &zeta\; \u03b7 &eta\; \u03b8 &theta\;
        \u03b9 &iota\; \u03ba &kappa\; \u03bb &lambda\; \u03bc &mu\;
        \u03bd &nu\; \u03be &xi\; \u03bf &omicron\; \u03c0 &pi\;
        \u03c1 &rho\; \u03c2 &sigmaf\; \u03c3 &sigma\; \u03c4 &tau\;
        \u03c5 &upsilon\; \u03c6 &phi\; \u03c7 &chi\; \u03c8 &psi\;
        \u03c9 &omega\; \u03d1 &thetasym\; \u03d2 &upsih\; \u03d6 &piv\;
        \u2022 &bull\; \u2026 &hellip\; \u2032 &prime\; \u2033 &Prime\;
        \u203e &oline\; \u2044 &frasl\; \u2118 &weierp\; \u2111 &image\;
        \u211c &real\; \u2122 &trade\; \u2135 &alefsym\;
        \u2190 &larr\; \u2191 &uarr\; \u2192 &rarr\; \u2193 &darr\;
        \u2194 &harr\; \u21b5 &crarr\; \u21d0 &lArr\; \u21d1 &uArr\;
        \u21d2 &rArr\; \u21d3 &dArr\; \u21d4 &hArr\;
        \u2200 &forall\; \u2202 &part\; \u2203 &exist\; \u2205 &empty\;
        \u2207 &nabla\; \u2208 &isin\; \u2209 &notin\; \u220b &ni\;
        \u220f &prod\; \u2211 &sum\; \u2212 &minus\; \u2217 &lowast\;
        \u221a &radic\; \u221d &prop\; \u221e &infin\; \u2220 &ang\;
        \u2227 &and\; \u2228 &or\; \u2229 &cap\; \u222a &cup\;
        \u222b &int\; \u2234 &there4\; \u223c &sim\; \u2245 &cong\;
        \u2248 &asymp\; \u2260 &ne\; \u2261 &equiv\; \u2264 &le\;
        \u2265 &ge\; \u2282 &sub\; \u2283 &sup\; \u2284 &nsub\;
        \u2286 &sube\; \u2287 &supe\; \u2295 &oplus\; \u2297 &otimes\;
        \u22a5 &perp\; \u22c5 &sdot\; \u2308 &lceil\; \u2309 &rceil\;
        \u230a &lfloor\; \u230b &rfloor\; \u2329 &lang\; \u232a &rang\;
        \u25ca &loz\; \u2660 &spades\; \u2663 &clubs\; \u2665 &hearts\;
        \u2666 &diams\;
        \u0022 &quot\; \u0026 &amp\; \u003c &lt\; \u003e &gt\;
        \u0152 &OElig\; \u0153 &oelig\; \u0160 &Scaron\; \u0161 &scaron\;
        \u0178 &Yuml\; \u02c6 &circ\; \u02dc &tilde\;
        \u2002 &ensp\; \u2003 &emsp\; \u2009 &thinsp\; \u200c &zwnj\;
        \u200d &zwj\; \u200e &lrm\; \u200f &rlm\; \u2013 &ndash\;
        \u2014 &mdash\; \u2018 &lsquo\; \u2019 &rsquo\; \u201a &sbquo\;
        \u201c &ldquo\; \u201d &rdquo\; \u201e &bdquo\; \u2020 &dagger\;
        \u2021 &Dagger\; \u2030 &permil\; \u2039 &lsaquo\; \u203a &rsaquo\;
        \u20ac &euro\;
    }
    variable decode [lreverse $encode]

    # Replace Unicode characters with HTML character entity references.
    proc encode {string} {
        variable encode
        string map $encode $string
    }

    # Replace HTML character entity references with the literal characters.
    proc decode {string} {
        variable decode
        string map $decode $string
    }
}
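A small sketch of how the ensemble form is called, again with a made-up namespace (entmini) and a four-entry table rather than the full one. It also shows why the braced table works: braces suppress substitution when the script is parsed, but the \uXXXX and \; escapes are still processed when [lreverse] and [string map] parse the value as a list.

```tcl
# Hypothetical miniature of the ensemble version, for illustration only.
namespace eval entmini {
    namespace ensemble create -subcommands {encode decode}
    # Braced string; the backslash escapes are resolved by list parsing.
    variable encode {
        \u0022 &quot\;
        \u0026 &amp\;
        \u003c &lt\;
        \u003e &gt\;
    }
    variable decode [lreverse $encode]

    proc encode {string} {
        variable encode
        string map $encode $string
    }

    proc decode {string} {
        variable decode
        string map $decode $string
    }
}

set html [entmini encode {<a href="x">R&D</a>}]
puts $html                   ;# &lt;a href=&quot;x&quot;&gt;R&amp;D&lt;/a&gt;
set back [entmini decode $html]
puts $back                   ;# <a href="x">R&D</a>
```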
By the way, that's a very interesting use for [lreverse]!
aricb: Regarding [lreverse], it's probably an unorthodox usage, but it works in this code due to some special properties of the mapping between Unicode characters and character entity references:
The practical consequence of these properties is that, as long as paired list elements are adjacent to each other, the order of the pairs is unimportant as far as [string map] is concerned. [lreverse] reorders everything, but it preserves the adjacency of paired elements, and it sticks all of the members of the original domain in odd-indexed slots, and all the members of the original codomain in even-indexed slots, which happens to be just what is needed for the reverse mapping.
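A toy illustration of the point about adjacency, using made-up maps. The second half shows a map that lacks the convenient properties: when one key is a prefix of another, [string map] checks keys in list order, so reordering the pairs changes the result. The entity table never hits this case, since its keys are distinct single characters one way and entity references each ending in ";" the other way.

```tcl
# Toy map: pairs are (a,1) (b,2) (c,3).
set map {a 1 b 2 c 3}
set reversed [lreverse $map]
puts $reversed                      ;# 3 c 2 b 1 a
# Pairs are now (3,c) (2,b) (1,a): each old value sits just before its old key.
set forward  [string map $map abc]
set backward [string map $reversed 123]
puts "$forward $backward"           ;# 123 abc

# Counter-example (not from the entity table): key "a" is a prefix of key "ab",
# so the order of the pairs decides which replacement wins.
set shortFirst [string map {a 1 ab 2} ab]
set longFirst  [string map {ab 2 a 1} ab]
puts "$shortFirst $longFirst"       ;# 1b 2
```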
AK - 2010-06-01 18:15:58
Notes:
AMG: [list {*}{...}] and {...} are not equivalent. The former produces a pure list, whereas the latter produces a string which may or may not be a valid list. In this case, I could have done either, but I figured there's a chance the pure list may be more efficient. I admit I didn't profile timing.
The creative writing problem? I wanted the variables to be created if they didn't already exist, and overwritten if they did already exist. What am I missing?
Lars H: It is only with concat, eval, and the like that pure-listness makes a difference (since these are primarily defined to operate on strings and have optimisations for the pure list case; most list operations instead work with the list internal representation, generating that from the string representation if it isn't already available). Regarding creative writing (which is perhaps a misleading name for the concept), the problem is that set won't create an htmlentities::encode variable if an ::encode variable already exists, but will instead overwrite the latter.
AMG: Yeah, you're right, $encode only gets fed to [string map] which won't care if there's a string representation or not. After thinking about it a bit more, I remembered two other reasons I had used [list]. One, it was in aricb's original code. Two, in my testing I was printing out $encode to confirm all the Unicode stuff was working correctly, and I wanted all the line breaks and extra whitespace stripped from my printout; then when I was done with this testing, I didn't think to remove the call to [list]. Still, this leaves a string representation lying around in memory that won't ever get used, but sometimes that's the price we pay for dual-ported values.
I didn't know about this characteristic of namespaces. So let me get this straight: first [set] tries to find a variable called "encode", and it might find it in ::. If it finds a variable, it overwrites it rather than creating it, regardless of the namespace in which it was found. I will have to keep that in mind!
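A minimal sketch of the difference, with made-up namespace names (nsdemo, nsdemo2). The [variable] half behaves the same everywhere; the [set] half shows the lookup described above as it works in Tcl 8.5 and 8.6. I believe Tcl 9 changed this name resolution, so no output is claimed for that part.

```tcl
# Assumed scenario: a global variable named "encode" already exists.
set ::encode "global value"

namespace eval nsdemo {
    # [variable] always creates the variable in the current namespace.
    variable encode "namespace value"
}
set globalAfterVariable $::encode
puts $globalAfterVariable   ;# global value
puts $nsdemo::encode        ;# namespace value

# Plain [set] inside [namespace eval]: on Tcl 8.5/8.6 the name "encode"
# resolves to the existing global, so ::encode is overwritten and
# nsdemo2::encode is never created. Other versions may differ.
namespace eval nsdemo2 {
    set encode "written by set"
}
puts $::encode
puts [info exists nsdemo2::encode]
```

This is why the code on this page declares its tables with [variable] rather than [set].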
AK - 2010-06-02 11:58:14
Thanks for pointing out the non-equivalence of {...} and [list {*}{...}]. Even so, I stand by my point that using [list {*}{...}] is overcomplicated. In a context like this I am not really concerned with having a string rep lying around after [string map] has generated the list rep it needed from it. It's not as if this is a multi-MB data structure. That is, it feels like a micro-optimization.
AMG: Actually the list representation will be generated by [lreverse]. ;^) As I said before, I did have reasons for wanting a pure list, but they were all either invalid or only relevant during initial development.
AMG: If you look in the page history, you'll find my failed attempts to upload a UTF-8 version which avoids using \uXXXX escapes. But neither Firefox 3.6.3 nor IE8 could hack it, and I got mojibake. I'm not sure where the problem lies. I triple-checked that I had UTF-8 encoding set in my browser, both when viewing and when editing. Funny: even though that didn't work, I'm able to paste this text in without problems: 文字化け.
Does anyone have a clue why this happened? My suspicion is that the Wiki did some kind of encoding autodetection (or sanity check) and decided to second-guess the encoding identification sent by my browser. One or more of the characters must have confused it.
HaO 2012-03-14: This implementation looks identical to tcllib's html::html_entities.