HTML character entity references

aricb: HTML character entity references are defined as part of the HTML specification: http://www.w3.org/TR/REC-html40/sgml/entities.html

The following code defines two commands, htmlentities::insertentities and htmlentities::removeentities. The first command replaces Unicode characters with HTML character entity references; the second command replaces HTML character entity references with the literal Unicode characters they represent. There's nothing fancy about the code, but hopefully it's useful to somebody.

  namespace eval htmlentities {
      variable map [list \
          \u00a0 &nbsp\;    \
          \u00a1 &iexcl\;   \
          \u00a2 &cent\;    \
          \u00a3 &pound\;   \
          \u00a4 &curren\;  \
          \u00a5 &yen\;     \
          \u00a6 &brvbar\;  \
          \u00a7 &sect\;    \
          \u00a8 &uml\;     \
          \u00a9 &copy\;    \
          \u00aa &ordf\;    \
          \u00ab &laquo\;   \
          \u00ac &not\;     \
          \u00ad &shy\;     \
          \u00ae &reg\;     \
          \u00af &macr\;    \
          \u00b0 &deg\;     \
          \u00b1 &plusmn\;  \
          \u00b2 &sup2\;    \
          \u00b3 &sup3\;    \
          \u00b4 &acute\;   \
          \u00b5 &micro\;   \
          \u00b6 &para\;    \
          \u00b7 &middot\;  \
          \u00b8 &cedil\;   \
          \u00b9 &sup1\;    \
          \u00ba &ordm\;    \
          \u00bb &raquo\;   \
          \u00bc &frac14\;  \
          \u00bd &frac12\;  \
          \u00be &frac34\;  \
          \u00bf &iquest\;  \
          \u00c0 &Agrave\;  \
          \u00c1 &Aacute\;  \
          \u00c2 &Acirc\;   \
          \u00c3 &Atilde\;  \
          \u00c4 &Auml\;    \
          \u00c5 &Aring\;   \
          \u00c6 &AElig\;   \
          \u00c7 &Ccedil\;  \
          \u00c8 &Egrave\;  \
          \u00c9 &Eacute\;  \
          \u00ca &Ecirc\;   \
          \u00cb &Euml\;    \
          \u00cc &Igrave\;  \
          \u00cd &Iacute\;  \
          \u00ce &Icirc\;   \
          \u00cf &Iuml\;    \
          \u00d0 &ETH\;     \
          \u00d1 &Ntilde\;  \
          \u00d2 &Ograve\;  \
          \u00d3 &Oacute\;  \
          \u00d4 &Ocirc\;   \
          \u00d5 &Otilde\;  \
          \u00d6 &Ouml\;    \
          \u00d7 &times\;   \
          \u00d8 &Oslash\;  \
          \u00d9 &Ugrave\;  \
          \u00da &Uacute\;  \
          \u00db &Ucirc\;   \
          \u00dc &Uuml\;    \
          \u00dd &Yacute\;  \
          \u00de &THORN\;   \
          \u00df &szlig\;   \
          \u00e0 &agrave\;  \
          \u00e1 &aacute\;  \
          \u00e2 &acirc\;   \
          \u00e3 &atilde\;  \
          \u00e4 &auml\;    \
          \u00e5 &aring\;   \
          \u00e6 &aelig\;   \
          \u00e7 &ccedil\;  \
          \u00e8 &egrave\;  \
          \u00e9 &eacute\;  \
          \u00ea &ecirc\;   \
          \u00eb &euml\;    \
          \u00ec &igrave\;  \
          \u00ed &iacute\;  \
          \u00ee &icirc\;   \
          \u00ef &iuml\;    \
          \u00f0 &eth\;     \
          \u00f1 &ntilde\;  \
          \u00f2 &ograve\;  \
          \u00f3 &oacute\;  \
          \u00f4 &ocirc\;   \
          \u00f5 &otilde\;  \
          \u00f6 &ouml\;    \
          \u00f7 &divide\;  \
          \u00f8 &oslash\;  \
          \u00f9 &ugrave\;  \
          \u00fa &uacute\;  \
          \u00fb &ucirc\;   \
          \u00fc &uuml\;    \
          \u00fd &yacute\;  \
          \u00fe &thorn\;   \
          \u00ff &yuml\;    \
          \u0192 &fnof\;    \
          \u0391 &Alpha\;   \
          \u0392 &Beta\;    \
          \u0393 &Gamma\;   \
          \u0394 &Delta\;   \
          \u0395 &Epsilon\; \
          \u0396 &Zeta\;    \
          \u0397 &Eta\;     \
          \u0398 &Theta\;   \
          \u0399 &Iota\;    \
          \u039a &Kappa\;   \
          \u039b &Lambda\;  \
          \u039c &Mu\;      \
          \u039d &Nu\;      \
          \u039e &Xi\;      \
          \u039f &Omicron\; \
          \u03a0 &Pi\;      \
          \u03a1 &Rho\;     \
          \u03a3 &Sigma\;   \
          \u03a4 &Tau\;     \
          \u03a5 &Upsilon\; \
          \u03a6 &Phi\;     \
          \u03a7 &Chi\;     \
          \u03a8 &Psi\;     \
          \u03a9 &Omega\;   \
          \u03b1 &alpha\;   \
          \u03b2 &beta\;    \
          \u03b3 &gamma\;   \
          \u03b4 &delta\;   \
          \u03b5 &epsilon\; \
          \u03b6 &zeta\;    \
          \u03b7 &eta\;     \
          \u03b8 &theta\;   \
          \u03b9 &iota\;    \
          \u03ba &kappa\;   \
          \u03bb &lambda\;  \
          \u03bc &mu\;      \
          \u03bd &nu\;      \
          \u03be &xi\;      \
          \u03bf &omicron\; \
          \u03c0 &pi\;      \
          \u03c1 &rho\;     \
          \u03c2 &sigmaf\;  \
          \u03c3 &sigma\;   \
          \u03c4 &tau\;     \
          \u03c5 &upsilon\; \
          \u03c6 &phi\;     \
          \u03c7 &chi\;     \
          \u03c8 &psi\;     \
          \u03c9 &omega\;   \
          \u03d1 &thetasym\; \
          \u03d2 &upsih\;   \
          \u03d6 &piv\;     \
          \u2022 &bull\;    \
          \u2026 &hellip\;  \
          \u2032 &prime\;   \
          \u2033 &Prime\;   \
          \u203e &oline\;   \
          \u2044 &frasl\;   \
          \u2118 &weierp\;  \
          \u2111 &image\;   \
          \u211c &real\;    \
          \u2122 &trade\;   \
          \u2135 &alefsym\; \
          \u2190 &larr\;    \
          \u2191 &uarr\;    \
          \u2192 &rarr\;    \
          \u2193 &darr\;    \
          \u2194 &harr\;    \
          \u21b5 &crarr\;   \
          \u21d0 &lArr\;    \
          \u21d1 &uArr\;    \
          \u21d2 &rArr\;    \
          \u21d3 &dArr\;    \
          \u21d4 &hArr\;    \
          \u2200 &forall\;  \
          \u2202 &part\;    \
          \u2203 &exist\;   \
          \u2205 &empty\;   \
          \u2207 &nabla\;   \
          \u2208 &isin\;    \
          \u2209 &notin\;   \
          \u220b &ni\;      \
          \u220f &prod\;    \
          \u2211 &sum\;     \
          \u2212 &minus\;   \
          \u2217 &lowast\;  \
          \u221a &radic\;   \
          \u221d &prop\;    \
          \u221e &infin\;   \
          \u2220 &ang\;     \
          \u2227 &and\;     \
          \u2228 &or\;      \
          \u2229 &cap\;     \
          \u222a &cup\;     \
          \u222b &int\;     \
          \u2234 &there4\;  \
          \u223c &sim\;     \
          \u2245 &cong\;    \
          \u2248 &asymp\;   \
          \u2260 &ne\;      \
          \u2261 &equiv\;   \
          \u2264 &le\;      \
          \u2265 &ge\;      \
          \u2282 &sub\;     \
          \u2283 &sup\;     \
          \u2284 &nsub\;    \
          \u2286 &sube\;    \
          \u2287 &supe\;    \
          \u2295 &oplus\;   \
          \u2297 &otimes\;  \
          \u22a5 &perp\;    \
          \u22c5 &sdot\;    \
          \u2308 &lceil\;   \
          \u2309 &rceil\;   \
          \u230a &lfloor\;  \
          \u230b &rfloor\;  \
          \u2329 &lang\;    \
          \u232a &rang\;    \
          \u25ca &loz\;     \
          \u2660 &spades\;  \
          \u2663 &clubs\;   \
          \u2665 &hearts\;  \
          \u2666 &diams\;   \
          \u0022 &quot\;    \
          \u0026 &amp\;     \
          \u003c &lt\;      \
          \u003e &gt\;      \
          \u0152 &OElig\;   \
          \u0153 &oelig\;   \
          \u0160 &Scaron\;  \
          \u0161 &scaron\;  \
          \u0178 &Yuml\;    \
          \u02c6 &circ\;    \
          \u02dc &tilde\;   \
          \u2002 &ensp\;    \
          \u2003 &emsp\;    \
          \u2009 &thinsp\;  \
          \u200c &zwnj\;    \
          \u200d &zwj\;     \
          \u200e &lrm\;     \
          \u200f &rlm\;     \
          \u2013 &ndash\;   \
          \u2014 &mdash\;   \
          \u2018 &lsquo\;   \
          \u2019 &rsquo\;   \
          \u201a &sbquo\;   \
          \u201c &ldquo\;   \
          \u201d &rdquo\;   \
          \u201e &bdquo\;   \
          \u2020 &dagger\;  \
          \u2021 &Dagger\;  \
          \u2030 &permil\;  \
          \u2039 &lsaquo\;  \
          \u203a &rsaquo\;  \
          \u20ac &euro\;    ]

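      # Reversing the flat list swaps each key/value pair, giving the decode map.
      variable reversemap [lreverse $map]

      # Replace Unicode characters with HTML character entity references.
      proc insertentities {string} {
          variable map
          return [string map $map $string]
      }

      # Replace HTML character entity references with the characters they represent.
      proc removeentities {string} {
          variable reversemap
          return [string map $reversemap $string]
      }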
  }
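
A short usage sketch may help here. To stay self-contained it rebuilds the two commands in a separate namespace, htmlentities_demo (a hypothetical name chosen for this example), with only a handful of entries from the table above; the sample text is likewise made up.

```tcl
# Abbreviated stand-in for the full table above (hypothetical namespace name).
namespace eval htmlentities_demo {
    variable map [list \
        \u0026 &amp\;    \
        \u003c &lt\;     \
        \u003e &gt\;     \
        \u00e9 &eacute\; \
        \u00a9 &copy\;   ]

    variable reversemap [lreverse $map]

    proc insertentities {string} {
        variable map
        return [string map $map $string]
    }

    proc removeentities {string} {
        variable reversemap
        return [string map $reversemap $string]
    }
}

set text "caf\u00e9 <b>R&D</b> \u00a9"
set encoded [htmlentities_demo::insertentities $text]
puts $encoded   ;# caf&eacute; &lt;b&gt;R&amp;D&lt;/b&gt; &copy;

# Decoding the encoded text should give back the original string.
puts [expr {[htmlentities_demo::removeentities $encoded] eq $text}]   ;# 1
```

Because [string map] makes a single pass and never rescans its own output, the & inside a freshly inserted entity is not encoded a second time.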

AMG: Here's the same code, written in a somewhat more compressed manner. I took the liberty of using namespace ensembles and renaming the commands.

namespace eval htmlentities {
    namespace ensemble create -subcommands {encode decode}
    variable encode {
        \u00a0 &nbsp\; \u00a1 &iexcl\; \u00a2 &cent\; \u00a3 &pound\; \u00a4
        &curren\; \u00a5 &yen\; \u00a6 &brvbar\; \u00a7 &sect\; \u00a8 &uml\;
        \u00a9 &copy\; \u00aa &ordf\; \u00ab &laquo\; \u00ac &not\; \u00ad
        &shy\; \u00ae &reg\; \u00af &macr\; \u00b0 &deg\; \u00b1 &plusmn\;
        \u00b2 &sup2\; \u00b3 &sup3\; \u00b4 &acute\; \u00b5 &micro\; \u00b6
        &para\; \u00b7 &middot\; \u00b8 &cedil\; \u00b9 &sup1\; \u00ba &ordm\;
        \u00bb &raquo\; \u00bc &frac14\; \u00bd &frac12\; \u00be &frac34\;
        \u00bf &iquest\; \u00c0 &Agrave\; \u00c1 &Aacute\; \u00c2 &Acirc\;
        \u00c3 &Atilde\; \u00c4 &Auml\; \u00c5 &Aring\; \u00c6 &AElig\; \u00c7
        &Ccedil\; \u00c8 &Egrave\; \u00c9 &Eacute\; \u00ca &Ecirc\; \u00cb
        &Euml\; \u00cc &Igrave\; \u00cd &Iacute\; \u00ce &Icirc\; \u00cf
        &Iuml\; \u00d0 &ETH\; \u00d1 &Ntilde\; \u00d2 &Ograve\; \u00d3
        &Oacute\; \u00d4 &Ocirc\; \u00d5 &Otilde\; \u00d6 &Ouml\; \u00d7
        &times\; \u00d8 &Oslash\; \u00d9 &Ugrave\; \u00da &Uacute\; \u00db
        &Ucirc\; \u00dc &Uuml\; \u00dd &Yacute\; \u00de &THORN\; \u00df
        &szlig\; \u00e0 &agrave\; \u00e1 &aacute\; \u00e2 &acirc\; \u00e3
        &atilde\; \u00e4 &auml\; \u00e5 &aring\; \u00e6 &aelig\; \u00e7
        &ccedil\; \u00e8 &egrave\; \u00e9 &eacute\; \u00ea &ecirc\; \u00eb
        &euml\; \u00ec &igrave\; \u00ed &iacute\; \u00ee &icirc\; \u00ef
        &iuml\; \u00f0 &eth\; \u00f1 &ntilde\; \u00f2 &ograve\; \u00f3
        &oacute\; \u00f4 &ocirc\; \u00f5 &otilde\; \u00f6 &ouml\; \u00f7
        &divide\; \u00f8 &oslash\; \u00f9 &ugrave\; \u00fa &uacute\; \u00fb
        &ucirc\; \u00fc &uuml\; \u00fd &yacute\; \u00fe &thorn\; \u00ff &yuml\;
        \u0192 &fnof\; \u0391 &Alpha\; \u0392 &Beta\; \u0393 &Gamma\; \u0394
        &Delta\; \u0395 &Epsilon\; \u0396 &Zeta\; \u0397 &Eta\; \u0398 &Theta\;
        \u0399 &Iota\; \u039a &Kappa\; \u039b &Lambda\; \u039c &Mu\; \u039d
        &Nu\; \u039e &Xi\; \u039f &Omicron\; \u03a0 &Pi\; \u03a1 &Rho\; \u03a3
        &Sigma\; \u03a4 &Tau\; \u03a5 &Upsilon\; \u03a6 &Phi\; \u03a7 &Chi\;
        \u03a8 &Psi\; \u03a9 &Omega\; \u03b1 &alpha\; \u03b2 &beta\; \u03b3
        &gamma\; \u03b4 &delta\; \u03b5 &epsilon\; \u03b6 &zeta\; \u03b7 &eta\;
        \u03b8 &theta\; \u03b9 &iota\; \u03ba &kappa\; \u03bb &lambda\; \u03bc
        &mu\; \u03bd &nu\; \u03be &xi\; \u03bf &omicron\; \u03c0 &pi\; \u03c1
        &rho\; \u03c2 &sigmaf\; \u03c3 &sigma\; \u03c4 &tau\; \u03c5 &upsilon\;
        \u03c6 &phi\; \u03c7 &chi\; \u03c8 &psi\; \u03c9 &omega\; \u03d1
        &thetasym\; \u03d2 &upsih\; \u03d6 &piv\; \u2022 &bull\; \u2026
        &hellip\; \u2032 &prime\; \u2033 &Prime\; \u203e &oline\; \u2044
        &frasl\; \u2118 &weierp\; \u2111 &image\; \u211c &real\; \u2122
        &trade\; \u2135 &alefsym\; \u2190 &larr\; \u2191 &uarr\; \u2192 &rarr\;
        \u2193 &darr\; \u2194 &harr\; \u21b5 &crarr\; \u21d0 &lArr\; \u21d1
        &uArr\; \u21d2 &rArr\; \u21d3 &dArr\; \u21d4 &hArr\; \u2200 &forall\;
        \u2202 &part\; \u2203 &exist\; \u2205 &empty\; \u2207 &nabla\; \u2208
        &isin\; \u2209 &notin\; \u220b &ni\; \u220f &prod\; \u2211 &sum\;
        \u2212 &minus\; \u2217 &lowast\; \u221a &radic\; \u221d &prop\; \u221e
        &infin\; \u2220 &ang\; \u2227 &and\; \u2228 &or\; \u2229 &cap\; \u222a
        &cup\; \u222b &int\; \u2234 &there4\; \u223c &sim\; \u2245 &cong\;
        \u2248 &asymp\; \u2260 &ne\; \u2261 &equiv\; \u2264 &le\; \u2265 &ge\;
        \u2282 &sub\; \u2283 &sup\; \u2284 &nsub\; \u2286 &sube\; \u2287
        &supe\; \u2295 &oplus\; \u2297 &otimes\; \u22a5 &perp\; \u22c5 &sdot\;
        \u2308 &lceil\; \u2309 &rceil\; \u230a &lfloor\; \u230b &rfloor\;
        \u2329 &lang\; \u232a &rang\; \u25ca &loz\; \u2660 &spades\; \u2663
        &clubs\; \u2665 &hearts\; \u2666 &diams\; \u0022 &quot\; \u0026 &amp\;
        \u003c &lt\; \u003e &gt\; \u0152 &OElig\; \u0153 &oelig\; \u0160
        &Scaron\; \u0161 &scaron\; \u0178 &Yuml\; \u02c6 &circ\; \u02dc
        &tilde\; \u2002 &ensp\; \u2003 &emsp\; \u2009 &thinsp\; \u200c &zwnj\;
        \u200d &zwj\; \u200e &lrm\; \u200f &rlm\; \u2013 &ndash\; \u2014
        &mdash\; \u2018 &lsquo\; \u2019 &rsquo\; \u201a &sbquo\; \u201c
        &ldquo\; \u201d &rdquo\; \u201e &bdquo\; \u2020 &dagger\; \u2021
        &Dagger\; \u2030 &permil\; \u2039 &lsaquo\; \u203a &rsaquo\; \u20ac
        &euro\;
    }
    variable decode [lreverse $encode]
    proc encode {string} {
        variable encode
        string map $encode $string
    }
    proc decode {string} {
        variable decode
        string map $decode $string
    }
}
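
For illustration, here is how the ensemble form might be called. The sketch uses a cut-down copy of the table in a namespace called htmlentities_mini (a hypothetical name, so as not to clash with the real one); everything else follows the code above.

```tcl
# Abbreviated stand-in for the ensemble above (hypothetical namespace name).
namespace eval htmlentities_mini {
    namespace ensemble create -subcommands {encode decode}
    # Braces suppress substitution here; the \uXXXX and \; escapes are
    # resolved when the value is first parsed as a list.
    variable encode {
        \u0022 &quot\; \u0026 &amp\; \u003c &lt\; \u003e &gt\;
        \u00fc &uuml\; \u00a9 &copy\;
    }
    variable decode [lreverse $encode]
    proc encode {string} {
        variable encode
        string map $encode $string
    }
    proc decode {string} {
        variable decode
        string map $decode $string
    }
}

set text "M\u00fcller \u00a9 \"a < b\""
set encoded [htmlentities_mini encode $text]
puts $encoded   ;# M&uuml;ller &copy; &quot;a &lt; b&quot;

# The ensemble dispatches "decode" to htmlentities_mini::decode.
puts [expr {[htmlentities_mini decode $encoded] eq $text}]   ;# 1
```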

By the way, that's a very interesting use for [lreverse]!

aricb: Regarding [lreverse], it's probably an unorthodox usage, but it works in this code due to some special properties of the mapping between Unicode characters and character entity references:

  • the mapping is a one-to-one binary relation;
  • no member of the domain is a substring of any other member of the domain; and
  • no member of the codomain is a substring of any other member of the codomain.

The practical consequence of these properties is that, as long as paired list elements are adjacent to each other, the order of the pairs is unimportant as far as [string map] is concerned. [lreverse] reorders everything, but it preserves the adjacency of paired elements, and it sticks all of the members of the original domain in odd-indexed slots, and all the members of the original codomain in even-indexed slots, which happens to be just what is needed for the reverse mapping.
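
The effect can be seen on a three-pair map. This is only a sketch with made-up sample data; the variable names are not from the code above. Strictly speaking, [string map] tries keys in list order at each position, so what it needs is that no key is a prefix of another, which the substring condition guarantees.

```tcl
# Small forward map: keys (characters) at even indices, values at odd ones.
set forward [list \u00e9 &eacute\; \u00fc &uuml\; \u0026 &amp\;]

# Reversing an even-length list swaps the roles within each pair:
# the old values become keys and the old keys become values.
set backward [lreverse $forward]

foreach {key value} $backward {
    puts "$key -> $value"
}

# Encoding and then decoding should return the original text.
set text "\u00fcber caf\u00e9 & co"
set roundtrip [string map $backward [string map $forward $text]]
puts [expr {$roundtrip eq $text}]   ;# 1
```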


AK - 2010-06-01 18:15:58

Notes:

  1. [list {*}{...}] is equivalent to {...}, i.e. an overcomplication.
  2. The commands 'set encode ...' and 'set decode ...' are subject to the creative writing problem. Use 'variable' instead of 'set'.

AMG: [list {*}{...}] and {...} are not equivalent. The former produces a pure list, whereas the latter produces a string which may or may not be a valid list. In this case, I could have done either, but I figured there's a chance the pure list may be more efficient. I admit I didn't profile timing.

The creative writing problem? I wanted the variables to be created if they didn't already exist, and overwritten if they did already exist. What am I missing?

Lars H: It is only with concat, eval, and the like that pure-listness makes a difference (since these are primarily defined to operate on strings and have optimisations for the pure list case; most list operations rather work with the list internal representation, generating that from the string representation if it isn't already available). Regarding creative writing (which is perhaps a misleading name for the concept), the problem is that set won't create an htmlentities::encode variable if an ::encode variable already exists, but will instead overwrite the latter.

AMG: Yeah, you're right, $encode only gets fed to [string map] which won't care if there's a string representation or not. After thinking about it a bit more, I remembered two other reasons I had used [list]. One, it was in aricb's original code. Two, in my testing I was printing out $encode to confirm all the Unicode stuff was working correctly, and I wanted all the line breaks and extra whitespace stripped from my printout; then when I was done with this testing, I didn't think to remove the call to [list]. Still, this leaves a string representation lying around in memory that won't ever get used, but sometimes that's the price we pay for dual-ported values.
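
A small sketch of the difference between the braced literal and the list built from it, using a one-pair sample (the variable names raw and pure are made up for this example):

```tcl
# A braced literal is stored verbatim: no backslash substitution happens yet.
set raw {\u00e9 &eacute\;}
puts [string length $raw]              ;# 16

# Parsing it as a list resolves the escapes in each element.
puts [string length [lindex $raw 0]]   ;# 1
puts [lindex $raw 1]                   ;# &eacute;

# [list {*}...] builds a fresh list value from the parsed elements.
set pure [list {*}$raw]
puts [expr {[lindex $pure 0] eq "\u00e9"}]   ;# 1
```

Either form gives [string map] and [lreverse] the same elements, which is why the choice does not affect the results here.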

I didn't know about this characteristic of namespaces. So let me get this straight: first [set] tries to find a variable called "encode", and it might find it in ::. If it finds a variable, it overwrites it rather than creating it, regardless of the namespace in which it was found. I will have to keep that in mind!
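
The following sketch contrasts the two commands. The namespace names demo_set and demo_variable are hypothetical. The fallback to the global variable is the behaviour described above for the Tcl 8.x releases current at the time of this discussion; other Tcl versions may resolve the name differently, so the script prints what it finds rather than assuming it.

```tcl
# A global variable that happens to share the name used inside the namespaces.
set ::encode "global value"

namespace eval demo_set {
    # The unqualified name does not exist in this namespace yet, so under the
    # resolution rule described above it may refer to ::encode instead.
    set encode "written with set"
}

namespace eval demo_variable {
    # [variable] always creates or refers to the namespace's own variable.
    variable encode "written with variable"
}

puts "::encode is now: $::encode"
puts "demo_set::encode exists? [info exists ::demo_set::encode]"
puts "demo_variable::encode is: $::demo_variable::encode"
```

Whatever [set] does in the first namespace, the second one ends up with its own encode variable and does not touch ::encode.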


AK - 2010-06-02 11:58:14

Thanks for pointing out the non-equivalency regarding {...} and [list {*}{...}]. Even so, I stand by my point that using [list {*}{...}] is overcomplicated. In a context like this I am not really concerned with having a string rep lying around after [string map] has generated the list rep it needed from it. It's not as if this is a multi-MB data structure. I.e. it feels like a micro-optimization.

AMG: Actually the list representation will be generated by [lreverse]. ;^) As I said before, I did have reasons for wanting a pure list, but they were all either invalid or only relevant during initial development.


AMG: If you look in the page history, you'll find my failed attempts to upload a UTF-8 version which avoids using \uXXXX escapes. But neither Firefox 3.6.3 nor IE8 could hack it, and I got mojibake. I'm not sure where the problem lies. I triple-checked that I had UTF-8 encoding set in my browser, both when viewing and when editing. Funny: even though that didn't work, I'm able to paste this text in without problems: 文字化け.

Does anyone have a clue why this happened? My suspicion is that the Wiki did some kind of encoding autodetection (or sanity check) and decided to second-guess the encoding identification sent by my browser. One or more of the characters must have confused it.


HaO 2012-03-14: This implementation looks identical to tcllib html::html_entities.


See also: Entities, char2ent