Unicode

'''[http://www.unicode.org/%|%Unicode]''' is a standard for the serial
representation of text, defining a set of code points that correspond to
components that can be composed into nearly any character used in any human
language.  [ISO] standard
[https://standards.iso.org/ittf/PubliclyAvailableStandards/c069119_ISO_IEC_10646_2017.zip%|%10646]
specifes what it calls the '''Universal Coded Character set''', or '''UCS''',
which is the identical to the Unicode character set, but Unicode
[https://www.unicode.org/faq/unicode_iso.html%|%includes additional processing
constraints] that are not found in ISO 10646.


** Terminology **

On this page, '''code point''' refers to the thing the Unicode standard calls a
'''character''', and '''character''' in turn refers to a completely-formed
character that may be composed of any number of Unicode code points, and which
in some external explanations is awkwardly referred to as a "grapheme cluster".


** See Also **

   [Unicode and UTF-8]:   

   [Unicode file reader]:   

   [A little Unicode editor]:   

   [i18n tester]:   Quickly shows what parts of Unicode are supported by a font.

   [dead keys for accents]:   A tiny package allowing easier entry of accented characters.

   [i18n - writing for the world]:   

   [some random korean text]:   



** Reference **

   [http://www.wikipedia.org/wiki/unicode%|%Wikipedia]:   

   [http://www.joelonsoftware.com/articles/Unicode.html%|%The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)], Joel Spolsky, 2003-10-08:   

   [http://www.unicode.org/faq/char_combmark.html%|%Characters and Combining Marks], [http://www.unicode.org/%|%The Unicode Consortium]:   If one has to choose only one document describing Unicode, this is the one to read.  It presents the key terms such as '''text element''', '''character''', '''code unit''', '''code point''', and '''grapheme cluster''', '''canonical equivalence''', and '''compatibility decomposition'''. 

   Unicode Standard Annex #29, [http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries%|%Unicode Text Segmentation]:   Another Unicode document that is particularly relevant to programmers.

   ISO Standard [http://standards.iso.org/ittf/PubliclyAvailableStandards/c056921_ISO_IEC_10646_2012_Electronic_Inserts.zip%|%10646:2012] (and [http://standards.iso.org/ittf/PubliclyAvailableStandards/c061712_ISO_IEC_10646_2012_Amd_1_2013_Electronic_Inserts.zip%|%electronic inserts]), Information technology -- Universal Coded Character Set (UCS):   The ISO standard that mirrors the code points in Unicode.

   ISO Amendment [http://standards.iso.org/ittf/PubliclyAvailableStandards/c061712_ISO_IEC_10646_2012_Amd_1_2013.zip%|%10646:2012/Amd 1:2013] (and [http://standards.iso.org/ittf/PubliclyAvailableStandards/c061712_ISO_IEC_10646_2012_Amd_1_2013_Electronic_Inserts.zip%|%electronic inserts]):   The first amendment to ISO646:2012.

   [http://www.drdobbs.com/book-review-unicode-explained/199103004%|%Book Review:  Unicode Explained, by Jukka K. Korpela] ([https://web.archive.org/web/00000000000000/http://www.unixreview.com/documents/s=10102/ur0611d/ur0610d.htm%|%alternate]), [Cameron Laird]:   

   [https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme%|%What's the difference between a character, a code point, a glyph and a grapheme?], stackoverflow, 2014-2019:   



** Resources **

   [http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html%|%Unicode fonts and tools for X11]:   The classic [X] bitmap fonts in an ISO 10646-1/Unicode extension.

   [http://www.slovo.info/unifonts.htm%|%Multilingual Unicode TrueType Fonts on the Internet], Slavic Text Processing and Typography:   Links to free TrueType fonts for larger or smaller subsets of the Unicode.

   [http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=%|%Welcome to Computers and Writing Systems], [http://www.sil.org/%|%SIL International]:   Source of free Unicode fonts in various languages, with a particular focus on more obscure languages and local dialects.

   [http://www.i18nguy.com/%|%I18n Guy]:   A website dedicated to program internationalization.

   [http://www.i18nguy.com/unicode/codepages.html%|%Character Sets And Code Pages At The Push Of A Button], Tex Texin:   Code charts from all over.

   [http://billposer.org/Software/uni2ascii.html%|%ascii2uni and uni2ascii]:   Bidirectional conversion between Unicode and more than thirty 7-bit [ASCII] equivalents, including RFC 2396 URI, RFC 2045 Quoted Printable format, and the `\uXXXX` notation used in Tcl.

   [http://dejavu-fonts.org/wiki/Main_Page%|%Deja Vu Fonts]:   A set of free fonts based on the [http://ftp.gnome.org/pub/GNOME/sources/ttf-bitstream-vera/1.10/%|%Vera] fonts, and providing a wider range of characters.



** Tcl Design Considerations **

The following issue reports, [TIP%|%TIPs], and discussions provide information on how
Unicode character support is implemented in Tcl.

   [https://core.tcl-lang.org/tcl/tktview?name=0d61d3a2bb%|%Tcl interprets two adjacent surrogate code points as a character encoded using UTF-16], 2021-03-06:   An argument for defining Tcl strings as any seqence of Unicode code points, rather than limiting the alphabet to, e.g. a Unicode scalar value.

   [TIP] 597:  "string is unicode" and better utf-8/utf-16/cesu-8 encodings:   Describes the a Tcl string as a sequence of Unicode Code points, `[string is%|%string is unicode]` to determine whether a Tcl string can be encoded non-problematically into a standard Unicode encoding such as UTF-8, UTF-16, or UTF-32, and '''cesu-8''' as an encoding that, in contrast with UTF-8, UTF-16, or UTF-32, can encode any Tcl string.

   Issue report: [https://core.tcl-lang.org/tcl/tktview/821620d417%|%Different handling of surrogates in utf-8 between Tcl8 and Tcl9]:  Provides a description of the difference between Tcl 8.6.Tcl 8.7 and Tcl 9 in handling surrogate pairs.



** Description **

'''Unicode''' is a superset of
[https://en.wikipedia.org/wiki/ISO/IEC_8859-1%|%iso8859-1], which is a superset
of [ASCII].  Whereas [ASCII] defines 128 characters, and ISO8859-1 defines 191
characters,  Unicode allows code points from 0 to 0x10ffff (1,114,111), and
version 12.0 assigns code points to 137929 of them.  Unfortunately, Unicode
'''got the terminology wrong''', calling the thing represented by a code point
a '''character''' when in fact a code point might may represent things that by
themselves are not complete characters, as well as things that aren't
characters at all. That is to say, each code point represents a '''grapheme'''.
This has implications for Tcl:  A string is not a string of characters, but a
string of graphemes.

A character may be composed of any number of code points, and the number of
producible characters is larger than the number of available code points.  A
code points may be a letter, ligature, citation mark, diacritic symbol, syllable,
accent, sign, symbol used in technical transcription, some other mark,
surrogate, sentinel (control code), or even some privately defined unit of
text.

Code points are organized into '''planes''',  each containing up to 65536 code
points.  The majority of characters used in the human languages of the world
have code points between 0 and 65535, which together make up the
[https://www.unicode.org/versions/Unicode13.0.0/ch01.pdf%|%Basic Multilingual
Plane] (BMP).  Currently a default build of Tcl is only capable of handling
these code points, but work is underway to change that, and workarounds
requiring non-default build-time configuration options exist.

There are various byte-serializations for representing sequence of code points,
including [utf-8], [utf-16], and [utf-32], where 8, 16, and 32 respectively are
the minimal storage requirements for a code point.



The following things are also specified by Unicode:

   * Classes such as capitalization, and sort order.

   * Rendering hints

   * Composition of a character from individual code points, and decomposed into individual code points.

   * Normalization of code-point sequences.
   
   * Hyphenation and line-breaking.
   
   * Boundaries of words and sentences.
   
   * User interaction for processes such as text deletion and highlighting.


----

[RS], [PYK] :  Unicode versions 3.0 and earlier specified code points in the `u0000`-`uFFFD` range, known as the "basic multilingual plane" (BMP) so any code point could be stored in 16 bits.  Version 3.1 added code points in the supplementary multilingual plane (SMP, `u10000`-`u1FFFF`), supplementary ideographic plane (SIP `u20000`-`u2FFFF`), and the supplementary special purpose plane (SMP, `uE0000`-`uEFFFF`).  A code point encoded in [UTF-8] requires one to four bytes of storage:

======none
0ppppppp
110ppppp 10pppppp
1110pppp 10pppppp 10pppppp
11110ppp 10pppppp 10pppppp 10pppppp
======

where "p" is a "payload" bit.

[AMG], [PYK]: RFC 3629 [https://tools.ietf.org/html/rfc3629%|%3269] limits Unicode to \u'''0''' through \U'''10ffff''', so it's only necessary to encode 21 bits.  Consequently, a valid UTF-8 sequence can only range from 1 through 4 bytes in length.  Joel Spolsky's [https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/%|%article] is wrong about this.

Thus, bytes 0x'''f8''' and greater are illegal.  0x'''f8''' through 0x'''fb''' would have introduced a five-byte sequence.  0x'''fc''' and 0x'''fd''' would have introduced a six-byte sequence.  0x'''fe''' and 0x'''ff''' have always been forbidden

0x'''c0''' and 0x'''c1''' are also forbidden since they could only appear as the first byte of an ASCII character being encoded using two bytes, though UTF-8 requires that the shortest available encoding always be used.  Tcl intentionally breaks this rule by encoding ASCII NUL as 0x'''c0''' 0x'''80''' so that a NUL can appear in text without being interpreted as a terminator.  Even in Tcl, 0x'''c1''' is never used.

UTF-8 has 10 illegal bytes out of 256, or 3.9%.  Presumably applications (such as Tcl!) can (and do!) assign custom meaning to these bytes, but the resultant string would not be valid for data interchange.

----

[comp.lang.tcl] 2008-04:

===
Newsgroups: comp.lang.tcl
From: [email protected]
Date: Sat, 26 Apr 2008 11:55:45 -0700 (PDT)
Local: Sat, Apr 26 2008 2:55 pm 
Subject: unicode - get character representation from \uxxx notation

Hello, 
to show my problem see the following example: 

> set tcl_patchLevel 
8.5.3b1 

> set str "n\u00E4mlich" 
nämlich 

> set c 0xE4 
> set str "n\\u[format %04.4X $ch]mlich" 
n\u00E4mlich 

How do I get the \u00E4 in the character representation let's say 
iso8859-1 ? 

> encoding convertto iso8859-1 $str 


Newsgroups: comp.lang.tcl
From: [email protected]
Date: Sat, 26 Apr 2008 14:21:27 -0700 (PDT)
Local: Sat, Apr 26 2008 5:21 pm 
Subject: Re: unicode - get character representation from \uxxx notation

To convert the hex number expressed as a string 0x00e4 to a Unicode 
character, use: 

format "%c" 0x00e4 


You can then use encoding convertto to convert this to another 
encoding, e.g.: 


encoding convertto iso8859-1 [format "%c" 0x00e4]
===



** Handling code points beyond the BMP **

[LV] 2008-07-08 [PYK] 2022-07-02:

I have a request from a developer concerning whether Tcl is capable of handling code points larger than the Unicode BMP. His application was using [tdom] and it encountered the `&Ascr;` character, which is a script-A, unicode value `0x1D49C`, which tdom reports it can't handle because it is limited to UTF-8 code-unit sequences up to 3 bytes in length.

What do Tcl programmers do to properly process the code points whose utf-8 representation is larger than 3 bytes?

Note this is in an enterprise setting. Finding a solution is critical in the publishing (web or print) arena.

[RS] 2008-07-09 [pyk] 2022-07-02: Unicode out of BMP (> U+FFFF) requires a deeper rework of Tcl and Tk: We'd need 32 bits for each code point, and/or surrogate pairs. [UTF-8] at least can deal with 31-bit code points by principle.

[LV]: In July 2008 there was [http://sourceforge.net/mailarchive/forum.php?thread_name=5868906b0807090420h720d13aaxc678c9f8b1bcc045%40mail.gmail.com&forum_name=tcl-core%|%some discussion] on the [TCT] mailing list about ways that the Tcl code itself could evolve to handle things better. But for right now, users have to face either dealing with their wide Unicode via a different programming language in some way (whether converting wide characters to some other similar character, using some sort of ''macro'' representation, etc.)

[AMG] 2015 [PYK] 2022-07-02: It's been seven years since the above discussion.  What progress has been made?

tcl.h contains the comment:

    :   "Tcl is currently UCS-2 and planning UTF-16 for the Unicode string rep that `Tcl_UniChar` represents.  Changing the size of `Tcl_UniChar` is ''not'' supported."

Fast random access to code points is quite important, e.g. for regular expressions, so I don't see how standard UTF-16 meets Tcl's needs unless augmented by some kind of indexing mechanism.  Maybe the thinking is that reduced performance is acceptable for strings outside the BMP due to their assumed rarity, though I hope for logarithmic rather than linear, perhaps with some caching to further optimize the common situation of the sought-for code-point indexes being near each other.

But this is kind of a worst-of-both-worlds sort of deal.  If you're going to have to pay for variable-width representation, might as well go with UTF-8 rather than -16.

[nemanov_oo] 2022: IMHO, there is no special benefit of using even utf-32 due to combining code points. utf-8 is enough and practical choice.


** Proposed fast indexing scheme for UTF-8 **

[AMG] [PYK]: I invented a (hopefully) fast indexing scheme for UTF-8 strings, though it could certainly be adapted for UTF-16.

Instead of the current linear time UTF-16 conversion step, make an array storing the byte index of every `n`th code point other than the first.  During lookup, divide the sought-for code-point index `c` by `n`, then subtract one to get the array slot which stores the byte index for the start of the segment.  (No need to store the start of the first segment; it's always zero!)  Then scan forward one code point at a time, covering at most `n-1` code points.

For best performance, let `n` be a compile-time constant power of 2.  This allows all division and modulo operations to be implemented in terms of shifts and masks.

The most obvious optimization is to discard the indexing array if the byte count and the code point count turn out to be equal.  This means the string is ASCII, so no UTF-8 magic is required.

For compression, instead of `b` (byte index), have the array store `b-c`, i.e. the number of UTF-8 continuation bytes preceding the segment.  Upon lookup, add `c-(c%n)`.  This can reduce memory usage by letting strings with fewer continuation bytes use unsigned 8- or 16-bit array slots.  This subtraction also makes the next optimization simpler.

Take the difference between the current and next array slot values (additionally subtract `n` if not doing the above compression) to get the number of UTF-8 continuation bytes in the segment.  During the scan, decrement this value for each continuation byte encountered.  When the remaining count is zero, it's known that the remaining code points are one byte each, so jump straight to the code point.  If a segment is all ASCII to begin with, this optimization kicks in immediately.

Another potential optimization is to scan backwards from the start of the next segment (or end of string) if the code poin index modulo `n` is greater than some threshold.  Probably should put the threshold at `3*n/4` since backward UTF-8 scans have to be done one byte at a time, whereas forward scans are one whole code point at a time.  This can be combined with the above optimization.

Yet another optimization is to remember the `b` result of the previous index lookup and scan relative to it if within `n` code points forward or whatever threshold backwards.

The maximum number of array slots is `(byte_count/n)-1`, so allocation can be done right away if the byte count is known but not the code point count.  Though if the string is merely NUL-terminated and not (byte)length-counted, then it's necessary to either make two passes through the string or to allocate an initial guess then geometrically grow the allocation if the guess is short.

   * Let `s` = "`Ég get etið gler án þess að meiða mig.`"
   * Let `e` be the UTF-8 encoding of `s`.  `e` = {`c3 89 67 20 67 65 74 20 65 74 69 c3 b0 20 67 6c 65 72 20 c3 a1 6e 20 c3 be 65 73 73 20 61 c3 b0 20 6d 65 69 c3 b0 61 20 6d 69 67 2e`}
   * Let `n` = `8`.
   * Since `n` is a power of two, let `n` = `1<<p` so `p` = `3`.  Also let `m` = `(1<<p)-1` = `7`.
   * The goal is to find code point at index `c` = `22`.
   * Apply the subtraction optimization to get `f` = {`1 2 4 6`}.  This array is obtained by scanning through the string one code point at a time and recording `b-c` when `!(c&m)&&c`.
   * First, find `i` = `c>>p` = `2`.
   ** This is nonzero, so do array lookup.
   ** `b` = `f[[i-1]]` = `f[[1]]` = `2`.
   ** `c` = `c&m` = `6`.
   ** If `i` is zero, instead set `b` = `0`.
   * Next, scan forward one code point at a time until `c` is zero.
   ** At each step, decrement `c` and increment `b` by the number of leading 1-bits of `e[[b]]`.
   * This leaves `b` = `26`, which is the byte index of code point `c` = `22`.

Using the backwards scan optimization, the above could have instead started at `b` = `f[[i]]` = `4` and `c` = `n-(c&m)` = `2`, then scanned backwards.  Decrement `b` at each byte.  At each non-continuation byte (i.e. `(e[[b]]&0xc0)!=0x80`), decrement `c`.  (Be careful near the end of the string.)

If the continuation byte counting optimization is used, it's known that the segment contains only two continuation bytes because `f[[i]]-f[[i-1]]` = `4-2` = `2`.  When code points `17` "`á`" and `20` "`þ`" are found, it's also known that no continuation bytes remain, so after "`þ`", simply add the remaining `c` to `b` to get the final `b`.

I haven't thought as much about updating the fast seek index when modifying the string.  I don't see why appending to the string would invalidate the already-computed index; simply add to the end.  But inserting/replacing/deleting a substring would probably take linear time due to recomputing the index from the start of the edit.  I don't really mind that though because this operation ''already'' takes linear time due to `memmove()`.


** Unicode combining code points **

[AMG] [PYK]: How are combining code points handled?  They seem to be treated as individual code points, and they're only combined in the display.  Trouble with this is that the cursor can go between combining code points, along with similar problems like cutting a string in the middle of a character.

I wish for a way to treat a code point along with all its combining code points, i.e. a character, as a logical unit.  For instance, `[string length]` would return the number of characters, not the number of code points.  New commands would have to be defined to pick apart a character into constituent code points.  I imagine most programs will want characteers to be atomic.

Behind the scenes, Tcl could even [http://www.unicode.org/reports/tr15/tr15-41.html%|%normalize] strings, though I'm not sure whether this should be automatic, manual, or configurable.


*** ICU For Tcl ***

[SW] [PYK]: I'm working on a package exporting the [http://site.icu-project.org/%|%ICU] library's Unicode functions to Tcl. [https://github.com/shawnw/icu4tcl%|%icu4tcl] currently supports:


   * Handling code points outside of the BMP if you can get them into a Tcl string. Most of the string ensemble rewritten to work with code points for code points outside the BMP.

   * code point and character classification according to the ICU library's unicode version.

   * Locale-dependent string collation.

   * String normalization.

   * Breaking strings up into characters, words, sentences, and line break points.

   * And more. Eventually I want to provide most of ICU's functionality.

[LES] on Dec 31 2023: I would like to ask a couple of questions (or the same question in two different ways):

I have a file that is a screenshot, its filename was something I copied and pasted straight from the Web page:

   Mary_Ringwald_%F0%9F%87%BA%F0%9F%87%B8_Look_at_this_beauty_2023-11-03_11-29-56.png

I don't know if the wiki is going to display it correctly. Rigth after the Mary_Ringwald name there is a USA flag. It really looked like the USA flag on the Web page. Oh, hey, it does look like the USA flag as I paste it into the Edit box! So anyway, In my file managers, including 'ls -a' on the terminal, that is displayed as two little boxes that look like [[U]][[S]]. And that is what it looks like when I paste that into a Tcl/Tk text widget. However, if I use Tcl to operate on that file, say [[file mtime $filename]], Tcl says the file can't be found. Obviously, it chokes on the Unicode filename although it can see the file is there. Why does that happen?

Secondly, I can't insert this into a Tk text widget:

   $widget insert end [subst "\\u1F4C4"]

However, I can copy the symbol from elsewhere and paste it into a Tk text widget, and it looks correct. If Tk can display it, why can't I insert it?

[APN] There are several issues here.

   * The U+1F4C4 character is not a US flag, as per the https://www.compart.com/en/unicode/U+1F4C4%|%Unicode chart%|%. It is a "page" or "document" icon
   * You need the upper case U, not the lower case u for code points above the BMP. Unless you were trying to display U+1F4C followed by the digit 4
   * I am not clear on why you are using subst.
   * I presume you are using 8.6, not 9.0. 8.6 does not support characters above BMP without a little bit of hackery. You need to input the code point as the equivalent surrogate pair as below. This works for me on Windows Tcl 8.6.13 displaying the document icon as '''expected''', NOT the US flag.

======
$widget insert end \ud83d\udcc4
======

The surrogate pair `\ud83d\udcc4` is the UTF-16 equivalent of U+1F4C4. Note there is no subst required. As to why you could cut and paste, that is because the cut and paste operation on Windows will do the surrogate conversion for you (on Windows, not sure about other platforms). I'm still curious about the US flag you are seeing. How did you map that to 1F4C4?

On Tcl/Tk 9, no need for surrogates, you can directly use the code point

======
$widget insert end \U1F4C4
======[LES] I'm sorry I caused some confusion. The US flag and the "page" or "document" icon are two separate cases. I did mean each one separately, I didn't mean them to be the same. Your "surrogate" method doesn't work for me, but I have Tcl 8.6.6. Yes, it's old. Thank you.

<<categories>> Character | Glossary | Arts and crafts of Tcl-Tk programming | Characters | String Processing | i18n - writing for the world