'''[http://www.unicode.org/%|%Unicode]''' is a standard for coding multilingual text.  ISO has standardized a portion of Unicode as ISO/IEC 10646.


** Reference **

   [http://www.wikipedia.org/wiki/unicode%|%Wikipedia]:   

   [http://www.joelonsoftware.com/articles/Unicode.html%|%The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)], Joel Spolsky, 2003-10-08:   

   [http://www.unicode.org/faq/char_combmark.html%|%Characters and Combining Marks], [http://www.unicode.org/%|%The Unicode Consortium]:   If a programmer is going to read only one document summarizing Unicode, this is the one to read.  It presents the key terms, such as '''text element''', '''character''', '''code unit''', '''code point''', '''grapheme cluster''', '''canonical equivalence''', and '''compatibility decomposition'''. 

   Unicode Standard Annex #29, [http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries%|%Unicode Text Segmentation]:   another Unicode document that is particularly relevant to programmers.

   ISO Standard [http://standards.iso.org/ittf/PubliclyAvailableStandards/c056921_ISO_IEC_10646_2012_Electronic_Inserts.zip%|%10646:2012] (and [http://standards.iso.org/ittf/PubliclyAvailableStandards/c061712_ISO_IEC_10646_2012_Amd_1_2013_Electronic_Inserts.zip%|%electronic inserts]), Information technology -- Universal Coded Character Set (UCS):   The ISO standard based on the character coding portions of Unicode 

   ISO Amendment [http://standards.iso.org/ittf/PubliclyAvailableStandards/c061712_ISO_IEC_10646_2012_Amd_1_2013.zip%|%10646:2012/Amd 1:2013] (and [http://standards.iso.org/ittf/PubliclyAvailableStandards/c061712_ISO_IEC_10646_2012_Amd_1_2013_Electronic_Inserts.zip%|%electronic inserts]):   The first amendment to ISO/IEC 10646:2012

   [http://www.drdobbs.com/book-review-unicode-explained/199103004%|%Book Review:  Unicode Explained, by Jukka K. Korpela] ([https://web.archive.org/web/00000000000000/http://www.unixreview.com/documents/s=10102/ur0611d/ur0610d.htm%|%alternate]), [Cameron Laird]:   


** See Also **

   [Unicode and UTF-8]:   

   [Unicode file reader]:   

   [A little Unicode editor]:   

   [i18n tester]:   quickly shows what parts of Unicode are supported by your fonts
   [dead keys for accents]:   a tiny package allowing easier entering of accented characters

   [i18n - writing for the world]:   

   [some random korean text]:   


** Resources **

   [http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html%|%Unicode fonts and tools for X11]:   The classic [X] bitmap fonts in an ISO 10646-1/Unicode extension

   [http://www.slovo.info/unifonts.htm%|%Multilingual Unicode TrueType Fonts on the Internet], Slavic Text Processing and Typography:   links to free TrueType fonts for larger or smaller subsets of Unicode.

   [http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=%|%Welcome to Computers and Writing Systems], [http://www.sil.org/%|%SIL International]:   source of free Unicode fonts in various languages, with a particular focus on more obscure languages and local dialects.

   [http://www.i18nguy.com/%|%I18n Guy]:   a website dedicated to program internationalization

   [http://www.i18nguy.com/unicode/codepages.html%|%Character Sets And Code Pages At The Push Of A Button], Tex Texin:   code charts from all over.

   [http://billposer.org/Software/uni2ascii.html%|%ascii2uni and uni2ascii]:   bidirectional conversion between Unicode and more than thirty 7-bit [ASCII] equivalents, including RFC 2396 URI, RFC 2045 Quoted Printable format, and the \uXXXX notation used in Tcl

   [http://dejavu-fonts.org/wiki/Main_Page%|%Deja Vu Fonts]:   a set of free fonts based on the [http://ftp.gnome.org/pub/GNOME/sources/ttf-bitstream-vera/1.10/%|%Vera] fonts, and providing a wider range of characters


** Description **


Unicode is complex.  Whereas [ASCII] defines 128 characters, Unicode defines
1,114,112 code points (17 planes of 65,536 each), and a character may be
composed of one or more code points.  Unicode provides code charts, but doesn't
stop there.  The following things are also specified by Unicode:

   * character classes, such as case, and sort order

   * rendering hints

   * composition of characters from individual code points, and decomposition of characters into individual code points

   * normalization of code-point sequences
   
   * hyphenation and line-breaking
   
   * boundaries of words and sentences
   
   * user interaction for processes such as text deletion and highlighting


----

[RS]: Until version 3.0, 16 bits (\u0000-\uFFFD: the "Basic Multilingual Plane", BMP) were sufficient for any Unicode. From 3.1, we must expect longer codes - up to 31 bits long, as specified in ISO 10646. Why 31 bits? Because that is the maximum that can be expressed in [UTF-8]: 6 bytes, omitting the taboo values \xFE and \xFF.

======none
1111110a 10aaaaaa 10bbbbbb 10bbcccc 10ccccdd 10dddddd
======

where the lowercase letters stand for the "payload" bits of bytes a through d; the most significant byte, a, contributes only 7 bits.
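
The bit layout above can be checked with a few lines of Tcl.  This is only a sketch of the original six-byte scheme: the proc name `sixByteForm` is made up for this example, and the six-byte form is the canonical one only for values of 0x4000000 and above.  Current UTF-8 as defined in RFC 3629 stops at four bytes and U+10FFFF, so a modern decoder will not accept these sequences.

======tcl
# Pack a 31-bit value into the six-byte layout shown above.
# sixByteForm is a hypothetical helper, not part of Tcl.
proc sixByteForm {value} {
    if {$value < 0 || $value > 0x7FFFFFFF} {
        error "value does not fit in 31 bits"
    }
    # Lead byte: marker bits 1111110 plus the topmost payload bit.
    set bytes [list [expr {0xFC | ($value >> 30)}]]
    # Five continuation bytes: marker bits 10 plus six payload bits each.
    foreach shift {24 18 12 6 0} {
        lappend bytes [expr {0x80 | (($value >> $shift) & 0x3F)}]
    }
    return [binary format c* $bytes]
}

binary scan [sixByteForm 0x7FFFFFFF] H* hex
puts $hex    ;# fdbfbfbfbfbf
======

Neither \xFE nor \xFF can appear: the lead byte is at most \xFD and every continuation byte is at most \xBF.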

----

[comp.lang.tcl] 2008-04:

===
Newsgroups: comp.lang.tcl
From: r_haer...@gmx.de
Date: Sat, 26 Apr 2008 11:55:45 -0700 (PDT)
Local: Sat, Apr 26 2008 2:55 pm 
Subject: unicode - get character representation from \uxxx notation

Hello, 
to show my problem see the following example: 

> set tcl_patchLevel 
8.5.3b1 

> set str "n\u00E4mlich" 
nämlich 

> set ch 0xE4 
> set str "n\\u[format %04.4X $ch]mlich" 
n\u00E4mlich 

How do I get the \u00E4 in the character representation let's say 
iso8859-1 ? 

> encoding convertto iso8859-1 $str 


Newsgroups: comp.lang.tcl
From: billpo...@alum.mit.edu
Date: Sat, 26 Apr 2008 14:21:27 -0700 (PDT)
Local: Sat, Apr 26 2008 5:21 pm 
Subject: Re: unicode - get character representation from \uxxx notation

To convert the hex number expressed as a string 0x00e4 to a Unicode 
character, use: 

format "%c" 0x00e4 


You can then use encoding convertto to convert this to another 
encoding, e.g.: 


encoding convertto iso8859-1 [format "%c" 0x00e4]
===
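
Putting the answer together with the original question gives the following sketch, which assumes the goal is the ISO 8859-1 bytes of the whole word.  `subst` is offered here as an alternative for text that already contains a literal `\uXXXX` escape; the variable names are only illustrative.

======tcl
# Build the string from a numeric code point instead of a textual escape.
set ch 0xE4
set str "n[format %c $ch]mlich"

# Convert to ISO 8859-1 and show the resulting bytes in hex.
set bytes [encoding convertto iso8859-1 $str]
binary scan $bytes H* hex
puts $hex    ;# 6ee46d6c696368

# If the text already holds a literal \u00E4, let subst interpret the
# backslash sequence while leaving $ and [] untouched.
set escaped {n\u00E4mlich}
set decoded [subst -nocommands -novariables $escaped]
puts [string equal $decoded $str]    ;# 1
======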

----
[LV] 2008-07-08:

I've a request from a developer concerning whether Tcl is capable of handling characters larger than the Unicode BMP. His application was using [tdom] and it encountered the `&Ascr;` character, which is a script-A, unicode value 0x1D49C, which tdom reports it can't handle because it is limited to UTF-8 chars up to 3 bytes in length.

What do Tcl programmers do to properly process the longer characters?

Note this is in an enterprise setting. Finding a solution is critical in the publishing (web or print) arena.

[RS] 2008-07-09: Unicode outside the BMP (> U+FFFF) requires a deeper rework of Tcl and Tk: we'd need 32-bit chars and/or surrogate pairs. [UTF-8] at least can deal with 31-bit Unicodes in principle.

[LV] During July, 2008, there was some discussion in the [TCT] mailing list [http://sourceforge.net/mailarchive/forum.php?thread_name=5868906b0807090420h720d13aaxc678c9f8b1bcc045%40mail.gmail.com&forum_name=tcl-core] (let's see how long that URL lasts...) about ways that the Tcl code itself could evolve to handle things better. But for right now, users have to either handle their wide Unicode characters in a different programming language or work around them in some way (converting wide characters to some other similar character, using some sort of ''macro'' representation, etc.).
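
For reference, the surrogate pairs mentioned above are plain integer arithmetic.  The sketch below computes the UTF-16 pair for the script-A character from this discussion; `surrogatePair` is a name invented for this example, and the code works on integers only, so it does not depend on how a given Tcl build stores characters.

======tcl
# Split a code point above U+FFFF into a UTF-16 high/low surrogate pair.
# surrogatePair is a hypothetical helper, not part of Tcl.
proc surrogatePair {cp} {
    if {$cp < 0x10000 || $cp > 0x10FFFF} {
        error "code point is not in the range U+10000..U+10FFFF"
    }
    set v [expr {$cp - 0x10000}]          ;# 20 bits remain
    set hi [expr {0xD800 | ($v >> 10)}]   ;# top 10 bits
    set lo [expr {0xDC00 | ($v & 0x3FF)}] ;# bottom 10 bits
    return [list $hi $lo]
}

lassign [surrogatePair 0x1D49C] hi lo
puts [format "%04X %04X" $hi $lo]    ;# D835 DC9C
======

The same character needs four bytes in UTF-8, which is why a parser limited to three-byte sequences rejects it.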

----
[AMG], 2015: It's been seven years since the above discussion.  What progress has been made?

tcl.h contains the comment:

    :   "Tcl is currently UCS-2 and planning UTF-16 for the Unicode string rep that `Tcl_UniChar` represents.  Changing the size of `Tcl_UniChar` is ''not'' supported."

Fast random access to characters is quite important, e.g. for regular expressions, so I don't see how standard UTF-16 meets Tcl's needs unless augmented by some kind of indexing mechanism.  Maybe the thought is reduced performance is acceptable for strings outside the BMP due to their assumed rarity, though I hope for logarithmic rather than linear, perhaps with some caching to further optimize the common situation of the sought-for character indexes being near each other.

But this is kind of a worst-of-both-worlds sort of deal.  If you're going to have to pay for variable-width representation, might as well go with UTF-8 rather than -16.
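
To make the random-access concern concrete, here is a sketch of character indexing over UTF-16 code units without any index structure.  It models the string as a Tcl list of integers purely for illustration; `codePointAt` is a made-up name, and this is not how Tcl's internals are written.  Because a surrogate pair occupies two units, the only way to find character n is to walk from the start, which is the linear cost described above.

======tcl
# Return the code point at character index n in a list of UTF-16 code units.
# codePointAt is a hypothetical helper; it scans linearly from the start.
proc codePointAt {units n} {
    set i 0
    set len [llength $units]
    while {$i < $len} {
        set u [lindex $units $i]
        set cp $u
        set width 1
        # A high surrogate followed by a low surrogate forms one character.
        if {$u >= 0xD800 && $u <= 0xDBFF && $i + 1 < $len} {
            set next [lindex $units [expr {$i + 1}]]
            if {$next >= 0xDC00 && $next <= 0xDFFF} {
                set cp [expr {0x10000 + (($u - 0xD800) << 10) + ($next - 0xDC00)}]
                set width 2
            }
        }
        if {$n == 0} {
            return $cp
        }
        incr n -1
        incr i $width
    }
    error "index out of range"
}

# "A", U+1D49C as a surrogate pair, "B"
set units {0x41 0xD835 0xDC9C 0x42}
puts [format %X [codePointAt $units 1]]    ;# 1D49C
puts [format %X [codePointAt $units 2]]    ;# 42
======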

----
[AMG]: How are combining characters handled?  They seem to be treated as individual characters, and they're only combined in the display.  Trouble with this is that the cursor can go between combining characters, along with similar problems like cutting a string in the middle of what's called a grapheme cluster.
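
The behaviour described above is easy to reproduce with core string commands.  The sketch below compares a precomposed character with its decomposed equivalent; the sample word is borrowed from the newsgroup thread earlier on this page.

======tcl
set precomposed "\u00E4"     ;# a with diaeresis as one code point
set decomposed  "a\u0308"    ;# a followed by COMBINING DIAERESIS

puts [string length $precomposed]                ;# 1
puts [string length $decomposed]                 ;# 2
puts [string equal $precomposed $decomposed]     ;# 0

# Indexing works on code points, so a range can end between the base
# letter and its combining mark, splitting the grapheme cluster.
set word "na\u0308mlich"
puts [string length $word]                       ;# 8
set head [string range $word 0 1]                ;# "na", diaeresis left behind
======

The two forms are canonically equivalent in Unicode terms, but they compare as unequal here because no normalization is applied.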

<<categories>> Characters | Glossary | Arts and crafts of Tcl-Tk programming