Version 75 of Unicode

Unicode is a standard for coding multilingual text. ISO has standardized a portion of Unicode as ISO646

Reference

Wikipedia

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) , Joel Spolsky, 2003-10-08

Characters and Combining Marks , The Unicode Consortium: If a one has to choose only one document describing Unicode, this is the one to read. It presents the key terms such as text element, character, code unit, code point, and grapheme cluster, canonical equivalence, and compatibility decomposition.

Unicode Standard Annex #29, Unicode Text Segmentation: Another Unicode document that is particularly relevant to programmers.

ISO Standard 10646:2012 (and electronic inserts ), Information technology -- Universal Coded Character Set (UCS): The ISO standard based on the character coding portions of Unicode.

ISO Amendment 10646:2012/Amd 1:2013 (and electronic inserts ): The first amendment to ISO646:2012.

Book Review: Unicode Explained, by Jukka K. Korpela (alternate ), Cameron Laird

Resources

Unicode fonts and tools for X11: The classic X bitmap fonts in an ISO 10646-1/Unicode extension.

Multilingual Unicode TrueType Fonts on the Internet , Slavic Text Processing and Typography: Links to free TrueType fonts for larger or smaller subsets of the Unicode.

Welcome to Computers and Writing Systems , SIL International: Source of free Unicode fonts in various languages, with a particular focus on more obscure languages and local dialects.

I18n Guy: A website dedicated to program internationalization.

Character Sets And Code Pages At The Push Of A Button , Tex Texin: Code charts from all over.

ascii2uni and uni2ascii: Bidirectional conversion between Unicode and more than thirty 7-bit ASCII equivalents, including RFC 2396 URI, RFC 2045 Quoted Printable format, and the \uXXXX notation used in Tcl.

Deja Vu Fonts: A set of free fonts based on the Vera fonts, and providing a wider range of characters.

Description

Unicode is a superset of iso8859-1 , which is a superset of ASCII. Whereas ASCII defines 128 characters, and ISO8859-1 defines 191 characters, Unicode defines character codes from 0 to 0x10ffff (1,114,111), and version 12.0 assigns characters to 137929 of them. Unicode calls every thing it assigns a code to a "character", including letters, ligatures, citation marks, diacritic symbols, syllables, symbols used in technical transcription, accents, signs, and any other marks that appear in text and seem to be in need of their own character code. Because characters represented by character codes can be composed, the repertoire of producible perceived characters is far greater.

The majority of characters used in the human languages of the world have character codes between 0 and 65535, and are known as the Basic Multilingual Plane (BMP). Currently a default build of Tcl is only capable of handling these characters, but work is underway to change that, and workarounds requiring non-default build-time configuration options exist.

There are various byte-serializations for representing Unicode character sequences, including utf-8, utf-16, and utf-32, where 8, 16, and 32 are the minimal storage requirements, respectively, for each character code.

The following things are also specified by Unicode:

Character classes such as capitalization, and sort order.
Rendering hints
Composition of characters from individual codepoints, and decomposed into individual code points.
Normalization of code-point sequences.
Hyphenation and line-breaking.
Boundaries of words and sentences.
User interaction for processes such as text deletion and highlighting.

RS, PYK : Unicode versions 3.0 and earlier specified characters in the u0000-uFFFD range, known as the "basic multilingual plane" (BMP) so any character could be stored in 16 bits. Version 3.1 added characters in the supplementary multilingual plane (SMP, u10000-u1FFFF), supplementary ideographic plane (SIP u20000-u2FFFF), and the supplementary special purpose plane (SMP, uE0000-uEFFFF). A character encoded in UTF-8 requires one to four bytes of storage:

0ppppppp
110ppppp 10pppppp
1110pppp 10pppppp 10pppppp
11110ppp 10pppppp 10pppppp 10pppppp

where "p" is a "payload" bit.

AMG, PYK: RFC 3629 3269 limits Unicode to \u0 through \U10ffff, so it's only necessary to encode 21 bits. Consequently, a valid UTF-8 sequence can only range from 1 through 4 bytes in length. Joel Spolsky's article is wrong about this.

Thus, bytes 0xf8 and greater are illegal. 0xf8 through 0xfb would have introduced a five-byte sequence. 0xfc and 0xfd would have introduced a six-byte sequence. 0xfe and 0xff have always been forbidden

0xc0 and 0xc1 are also forbidden since they could only appear as the first byte of an ASCII character being encoded using two bytes, though UTF-8 requires that the shortest available encoding always be used. Tcl intentionally breaks this rule by encoding ASCII NUL as 0xc0 0x80 so that a NUL can appear in text without being interpreted as a terminator. Even in Tcl, 0xc1 is never used.

UTF-8 has 10 illegal bytes out of 256, or 3.9%. Presumably applications (such as Tcl!) can (and do!) assign custom meaning to these bytes, but the resultant string would not be valid for data interchange.

comp.lang.tcl 2008-04:

Newsgroups: comp.lang.tcl
From: [email protected]
Date: Sat, 26 Apr 2008 11:55:45 -0700 (PDT)
Local: Sat, Apr 26 2008 2:55 pm 
Subject: unicode - get character representation from \uxxx notation

Hello, 
to show my problem see the following example: 

> set tcl_patchLevel 
8.5.3b1 

> set str "n\u00E4mlich" 
nämlich 

> set c 0xE4 
> set str "n\\uformat %04.4X $chmlich" 
n\u00E4mlich 

How do I get the \u00E4 in the character representation let's say 
iso8859-1 ? 

> encoding convertto iso8859-1 $str 


Newsgroups: comp.lang.tcl
From: [email protected]
Date: Sat, 26 Apr 2008 14:21:27 -0700 (PDT)
Local: Sat, Apr 26 2008 5:21 pm 
Subject: Re: unicode - get character representation from \uxxx notation

To convert the hex number expressed as a string 0x00e4 to a Unicode 
character, use: 

format "%c" 0x00e4 


You can then use encoding convertto to convert this to another 
encoding, e.g.: 


encoding convertto iso8859-1 format "%c" 0x00e4

Handling characters beyond the BMP

LV 2008-07-08:

I've a request from a developer concerning whether Tcl is capable of handling characters larger than the Unicode BMP. His application was using tdom and it encountered the &Ascr; character, which is a script-A, unicode value 0x1D49C, which tdom reports it can't handle because it is limited to UTF-8 chars up to 3 bytes in length.

What do Tcl programmers do to properly process the longer characters?

Note this is in an enterprise setting. Finding a solution is critical in the publishing (web or print) arena.

RS 2008-07-09: Unicode out of BMP (> U+FFFF) requires a deeper rework of Tcl and Tk: we'd need 32 bit characterss and/or surrogate pairs. UTF-8 at least can deal with 31-bit Unicodes by principle.

LV In July 2008 there was some discussion on the TCT mailing list about ways that the Tcl code itself could evolve to handle things better. But for right now, users have to face either dealing with their wide unicode via a different programming language in some way (whether converting wide characters to some other similar character, using some sort of macro representation, etc.)

AMG, 2015: It's been seven years since the above discussion. What progress has been made?

tcl.h contains the comment:

: "Tcl is currently UCS-2 and planning UTF-16 for the Unicode string rep that Tcl_UniChar represents. Changing the size of Tcl_UniChar is not supported."

Fast random access to characters is quite important, e.g. for regular expressions, so I don't see how standard UTF-16 meets Tcl's needs unless augmented by some kind of indexing mechanism. Maybe the thought is reduced performance is acceptable for strings outside the BMP due to their assumed rarity, though I hope for logarithmic rather than linear, perhaps with some caching to further optimize the common situation of the sought-for character indexes being near each other.

But this is kind of a worst-of-both-worlds sort of deal. If you're going to have to pay for variable-width representation, might as well go with UTF-8 rather than -16.

Proposed fast indexing scheme for UTF-8

AMG: I invented a (hopefully) fast indexing scheme for UTF-8 strings, though it could certainly be adapted for UTF-16.

Instead of the current linear time UTF-16 conversion step, make an array storing the byte index of every nth character other than the first. During lookup, divide the sought-for character index c by n, then subtract one to get the array slot which stores the byte index for the start of the segment. (No need to store the start of the first segment; it's always zero!) Then scan forward one character at a time, covering at most n-1 characters.

For best performance, let n be a compile-time constant power of 2. This allows all division and modulo operations to be implemented in terms of shifts and masks.

The most obvious optimization is to discard the indexing array if the byte count and the character count turn out to be equal. This means the string is ASCII, so no UTF-8 magic is required.

For compression, instead of b (byte index), have the array store b-c, i.e. the number of UTF-8 continuation bytes preceding the segment. Upon lookup, add c-(c%n). This can reduce memory usage by letting strings with fewer continuation bytes use unsigned 8- or 16-bit array slots. This subtraction also makes the next optimization simpler.

Take the difference between the current and next array slot values (additionally subtract n if not doing the above compression) to get the number of UTF-8 continuation bytes in the segment. During the scan, decrement this value for each continuation byte encountered. When the remaining count is zero, it's known that the remaining characters are one byte each, so jump straight to the character. If a segment is all ASCII to begin with, this optimization kicks in immediately.

Another potential optimization is to scan backwards from the start of the next segment (or end of string) if the character index modulo n is greater than some threshold. Probably should put the threshold at 3*n/4 since backward UTF-8 scans have to be done one byte at a time, whereas forward scans are one whole character at a time. This can be combined with the above optimization.

Yet another optimization is to remember the b result of the previous index lookup and scan relative to it if within n characters forward or whatever threshold backwards.

The maximum number of array slots is (byte_count/n)-1, so allocation can be done right away if the byte count is known but not the character count. Though if the string is merely NUL-terminated and not (byte)length-counted, then it's necessary to either make two passes through the string or to allocate an initial guess then geometrically grow the allocation if the guess is short.

Let s = "Ég get etið gler án þess að meiða mig."
Let e be the UTF-8 encoding of s. e = {c3 89 67 20 67 65 74 20 65 74 69 c3 b0 20 67 6c 65 72 20 c3 a1 6e 20 c3 be 65 73 73 20 61 c3 b0 20 6d 65 69 c3 b0 61 20 6d 69 67 2e}
Let n = 8.
Since n is a power of two, let n = 1<<p so p = 3. Also let m = (1<<p)-1 = 7.
The goal is to find character at index c = 22.
Apply the subtraction optimization to get f = {1 2 4 6}. This array is obtained by scanning through the string one character at a time and recording b-c when !(c&m)&&c.
First, find i = c>>p = 2.
- This is nonzero, so do array lookup.
- b = f[i-1] = f[1] = 2.
- c = c&m = 6.
- If i is zero, instead set b = 0.
Next, scan forward one character at a time until c is zero.
- At each step, decrement c and increment b by the number of leading 1-bits of e[b].
This leaves b = 26, which is the byte index of character c = 22.

Using the backwards scan optimization, the above could have instead started at b = f[i] = 4 and c = n-(c&m) = 2, then scanned backwards. Decrement b at each byte. At each non-continuation byte (i.e. (e[b]&0xc0)!=0x80), decrement c. (Be careful near the end of the string.)

If the continuation byte counting optimization is used, it's known that the segment contains only two continuation bytes because f[i]-f[i-1] = 4-2 = 2. When characters 17 "á" and 20 "þ" are found, it's also known that no continuation bytes remain, so after "þ", simply add the remaining c to b to get the final b.

I haven't thought as much about updating the fast seek index when modifying the string. I don't see why appending to the string would invalidate the already-computed index; simply add to the end. But inserting/replacing/deleting a substring would probably take linear time due to recomputing the index from the start of the edit. I don't really mind that though because this operation already takes linear time due to memmove().

Unicode combining characters

AMG: How are combining characters handled? They seem to be treated as individual characters, and they're only combined in the display. Trouble with this is that the cursor can go between combining characters, along with similar problems like cutting a string in the middle of what's called a grapheme cluster.

I wish for a way to treat a character along with all its combining characters (a grapheme cluster) as a logical unit. For instance, [string length] would return the number of grapheme clusters, not the number of code points. New commands would have to be defined to pick apart a grapheme cluster into constituent code points. I imagine most programs will want grapheme clusters to be atomic.

Behind the scenes, Tcl could even normalize strings, though I'm not sure whether this should be automatic, manual, or configurable.

character

Category Glossary

Arts and Crafts of Tcl-Tk Programming