Purpose: Draft of a paper that may some day be published, put here for community discussion - later evolved into i18n - Tcl for the world
Richard Suchenwirth 2001-06-22 - Wherever you live, most of the world's population are foreigners. They speak other languages, and often also use different characters from yours to write their language. In many situations you can still get by with English, but the globalization of the economy, and the Internet, raise a growing demand for internationalization of software. That "i" word is used so often that it's typically abbreviated to "i18n" - "i", 18 characters in between, and an "n" at the end. Likewise, but more rarely, "localization" (adapting to one local writing system, which may also involve number, date, or currency formatting, or sort ordering of special characters) is "l10n".
A quick survey of the world's languages shows that a considerable percentage of their speakers use Chinese, with its notable ideographic writing. I've seen forecasts that in a few years Chinese will be the dominant language on the Web. Thousands of Chinese characters, sometimes in different forms, are also part of the Japanese and Korean writing systems, so they're often referred to as CJK characters. In addition, Japan (Kana) and Korea (Hangul) have native writing systems which also exceed the size of the Latin (Roman, English) alphabet.
English is of course also used very widely; even in India it has national language status. Add the US, (most parts of) Canada, Australia, and of course it still dominates communications on the Web. But talking of writing: even in England, good old ASCII is not enough - they have one frequent character, the Pound Sterling sign, and more recently the Euro currency symbol, which localizations (e.g. of keyboards) have to provide. The Latin alphabet is used in many European countries, but virtually each language adds special characters with accents, umlaut dots, or other diacritics - together more than the upper half of an 8-bit code table can hold. Thus, a number of extensions to the 7-bit ASCII encoding (an encoding is a mapping from characters to code numbers) were defined by the International Organization for Standardization (ISO), all reusing the byte values 80..FF for different characters. Those include the "other alphabets" - Greek, Cyrillic (as used in Russia and other countries), Hebrew, Arabic (also used for Persian, Urdu, etc.) - but mostly variations on the Latin alphabet, encoded in ISO 8859-1 to -15. These encodings all use 8-bit bytes, but the encoding which applies to a given text has to be assumed or communicated.
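Tcl handles these 8-bit encodings through its encoding command; which names [encoding names] reports depends on your installation. A minimal sketch of turning raw ISO 8859-7 (Greek) bytes into characters:

 # list the encodings this Tcl installation knows about
 puts [encoding names]
 # three raw byte values that mean "αβγ" in ISO 8859-7 (Greek)
 set bytes "\xe1\xe2\xe3"
 # convert them into a proper (Unicode) Tcl string
 puts [encoding convertfrom iso8859-7 $bytes]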
For CJK ideographs (over 6000 characters even in the basic national sets), 8 bits were insufficient anyway, so various two-byte encodings ("double-byte character sets", DBCS) are used there - with a special problem: some characters are so rare that they're not included in the standard set, but they do occur in names of persons or (small) places - so even with several thousand characters at hand, you have no way of writing these names correctly on computers. The national standards bodies in China and Japan have several times added auxiliary character sets (and the Unicode is growing from those), but chances still are that you run into an unencoded character.
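Tcl's channels can do such DBCS conversions on the fly; a hedged sketch, assuming a file sample.txt that is encoded in EUC-JP (euc-jp is a standard Tcl encoding name):

 set f [open sample.txt r]
 fconfigure $f -encoding euc-jp   ;# the bytes on disk are EUC-JP
 set text [read $f]               ;# $text is now an ordinary Tcl (Unicode) string
 close $f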
The Unicode just mentioned was an attempt to unify all modern character encodings into one consistent 16-bit representation. Consider a page with a 16x16 table. Half of that is filled with ASCII, the other half with EuroLatin-1 (ISO 8859-1). Call that "page 00" and imagine a book of 256 or more such pages (with all kinds of other characters on them, in the majority CJK), and you have a pretty clear concept of the Unicode, in which a character's code position is "U+" followed by the hex value of (page number*256 + cell number) - for instance, U+20AC is the Euro currency sign. Initiated by the computer industry, the Unicode has grown together with ISO 10646, the emerging standard providing an up-to-31-bit encoding (one bit left for parity?) with the same scope. Software must support Unicode strings to be fit for i18n. From Unicode version 3.1, the 16-bit limit was transcended for some rare writing systems, but also for the CJK Unified Ideographs Extension B - apparently, even 65536 code positions are not enough. The total count in Unicode 3.1 is 94,140 encoded characters, of which 70,207 are unified Han ideographs!
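In Tcl terms, you can move between a character and its code position with scan and format; a small sketch:

 # from character to code position ...
 scan "A" %c code
 puts [format "U+%04X" $code]     ;# prints U+0041
 # ... and back again: 0x20AC is the Euro currency sign
 puts [format %c 0x20AC]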
While 8-bit systems were easily implemented on computers, a pure 16-bit plan demands double the storage capacity. For more than 20 years CJK countries have used a mixed-length representation, where ASCII characters (00..7F) use one byte, while others, including thousands of ideographs, take two bytes with values above 0x80. This "Extended Unix Code" (EUC), or the Japanese "Shift-JIS" variety, allows up to around 16000 different code positions - enough for a single East Asian country, but still not for a world-wide encoding such as the Unicode. Based on these experiences, the UTF-8 Unicode Transformation Format was developed, which again uses one byte for ASCII codes and two or more for other characters (CJK ideographs take three, or 50% more than in national encodings). There is a strict mapping between Unicode and UTF-8, so even future extensions will fit into the UTF-8 framework. String handling routines had to be adapted for that purpose: the length of a string in characters is no longer its length in bytes; overwriting an ASCII character with a Greek one involves shifting the rest of the string, as the replacement is one byte wider than the original... But be assured, robust implementations have been done. For example, the Tcl scripting language keeps all strings in UTF-8 internally and allows conversion between encodings with little effort.
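A short sketch of the character/byte distinction in Tcl (the Euro sign takes three bytes in UTF-8):

 set s "a\u20ac"                 ;# "a" plus the Euro sign
 puts [string length $s]         ;# 2 characters
 puts [string bytelength $s]     ;# 4 bytes in the internal UTF-8 form
 # the explicit UTF-8 byte sequence, shown in hex: 61 e2 82 ac
 binary scan [encoding convertto utf-8 $s] H* hex
 puts $hex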
Input: So far, we have only dealt with the memory representation of international strings. To get them into the machine, they may at the lowest level be specified as escape sequences, e.g. "\u2345" as in Java or Tcl. But most user input will come from keyboards, for which many layouts exist in different countries. In CJK countries, there is a separate coding level between keys and characters: keystrokes, which may stand for the pronunciation or geometric components of a character, are collected in a buffer and converted into the target code when enough context is available (often supported by on-screen menus to resolve ambiguities). Finally, a "virtual keyboard" on screen (e.g. A little Unicode editor), where characters are selected by mouse click, is especially helpful for infrequent use of rarer characters, since the physical keyboard gives no hint as to which key is mapped to which other code.
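Two hedged sketches of that lowest level in Tcl/Tk: \u escapes in source code, and a binding that puts a character not found on the physical keyboard into an entry widget (the widget name .e and the chosen key are just for illustration):

 set s "\u0391\u03b2\u03b3"   ;# Greek Alpha, beta, gamma, written as \u escapes

 package require Tk
 entry .e
 pack .e
 # Control-e inserts a Euro sign at the cursor position
 bind .e <Control-e> {%W insert insert \u20ac}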
Most often one will hear techniques for handling input referred to as IME or Input Method Editor. One interesting article relating to the input of the Japanese language can be found at http://tronweb.super-nova.co.jp/jpnimintro.html . See also A tiny input manager for a pure-Tcl example of making widgets take Russian from the keyboard.
Output: Rendering international strings on displays or printers can pose the biggest problems. First, you need fonts that contain the characters in question. Fortunately, more and more fonts with international characters are available, a pioneer being Bitstream Cyberbit, which contains roughly 40000 glyphs (optical representations of characters) and was for some time offered for free download on the Web. Microsoft's Tahoma font also added support for most alphabetic scripts, including Arabic. Arial Unicode MS, delivered with Windows 2000, contains just about all the characters in the Unicode, so even humble Notepad can get truly international with that.
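In Tk you just request such a font by name; a sketch, assuming Arial Unicode MS (or any other font covering the characters) is installed:

 package require Tk
 label .intl -text "English - \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac - \u4e2d\u6587" \
     -font {{Arial Unicode MS} 16}
 pack .intl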
But having a good font is still not enough. While strings in memory are arranged in logical order, with addresses increasing from the beginning to the end of the text, they may need to be rendered in other ways: with diacritics shifted to various positions of the preceding character, or, most evidently for the languages written from right to left, Arabic and Hebrew, in reversed visual order. This has consequences for cursor movement, line justification, and line wrapping as well. Vertical lines progressing from right to left are popular in Japan and Taiwan - and mandatory if you have to render Mongolian. Indian scripts like Devanagari are alphabets with about 40 characters, but the sequence of consonants and vowels is partially reversed in rendering, and consonant clusters must be rendered as ligatures (joint glyphs - in Western typesetting a few are used for fi, ff, fl, ...) of the two or more characters involved - the pure single letters would look very ugly to an Indian reader. An Indian font for one writing system already contains several hundred glyphs. Unfortunately, Indian ligatures are not contained in the Unicode (while Arabic ones are), so various vendor standards apply for coding such ligatures.
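Where a toolkit offers no bidi support at all, a crude workaround sometimes used is to reverse the character order of a right-to-left string just before display - only a sketch, which ignores combining marks and mixed-direction text (newer Tcl versions also provide [string reverse]):

 # reverse the characters of a string
 proc srev s {
     set r ""
     foreach c [split $s ""] {set r $c$r}
     return $r
 }
 # Hebrew "shalom", stored in logical order, reversed for naive left-to-right rendering
 puts [srev "\u05e9\u05dc\u05d5\u05dd"]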
Obviously, i18n poses quite a number of problems. But software for the world cannot do without it, and developing for i18n allows you to get more than a glimpse of the cultural diversity of this planet, and in a way "internationalize" your mind. That's what I like most about i18n work.
Unicode and UTF-8 - Arts and crafts of Tcl-Tk programming
CL intends to include here references to his own "Regular Expressions" column on this subject, Unicode reference material, ...