An anomaly in case conversion

Description

Richard Suchenwirth 2001-06-11: Case conversion means changing uppercase characters to lowercase, or vice versa.

Tcl does this quite nicely with string toupper/tolower, even with characters outside the ASCII range, e.g. the Cyrillic alphabet.

There is however one case in Turkish where special rules apply. The uppercase of our well-known letter "i" must be the dotted I \u0130, while the lowercase of good old "I" shall be a dotless i, \u0131. This means the dot is treated as a diacritic, which is pretty logical, but contrary to the habit in other languages. Here's a nifty routine for use in Turkish applications that treats these special cases as well as the rest of the conversion:

proc tr:to {cmd args} {
    # join the words so multi-word input is handled as plain text, not as a Tcl list
    set str [join $args]
    switch -- $cmd {
        upper {regsub -all i $str \u0130 str}
        lower {regsub -all I $str \u0131 str}
        default {error "bad option '$cmd': must be upper or lower"}
    }
    string to$cmd $str
}
# usage examples
tr:to upper izmir         ;# produces İZMİR
tr:to lower YILDIZ        ;# produces yıldız

Notice how the minor command, after filtering for correctness, is pasted into the string to$cmd call.

See also Eurolish for easy input of Turkish diacritics, and The Lish family for the whole picture.


An even worse anomaly, which is not correctly reversible, exists in German: the lowercase Eszett/scharfes S (ß, \u00DF) corresponds to the two uppercase letters SS, but not all SS sequences may be lowercased to ß.
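
The one-way direction (ß to SS) can be sketched as follows. The proc name de:toupper is hypothetical, not part of Tcl, and the sketch deliberately offers no inverse, since lowercasing SS would need a dictionary to decide which sequences become ß:

```tcl
# de:toupper is a hypothetical helper name, not a built-in command.
# Expand every ß (\u00DF) to SS first, then uppercase the rest.
proc de:toupper {s} {
    string toupper [string map [list \u00df SS] $s]
}
# usage example
de:toupper "stra\u00dfe"    ;# produces STRASSE
```

Doing the expansion before the string toupper call keeps the result independent of how the built-in conversion itself treats ß.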


Greek Sigma: There are two different lowercase forms for the Greek letter Sigma, \u03c2 (used at the end of a word only) and \u03c3 (used in all other positions), but only one uppercase form, \u03a3. (The preceding code point, \u03a2, is unassigned, so software that wants to keep this distinction might 'abuse' it for an uppercase final Sigma.) RS
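
Lowercasing with the final-sigma rule might look like the sketch below. The proc name el:tolower is hypothetical, and the rule is simplified: a sigma counts as final when it follows a letter and is not followed by one, which ignores finer points such as apostrophes and abbreviations:

```tcl
# el:tolower is a hypothetical helper name, not a built-in command.
# Lowercase normally, then turn each word-final \u03c3 into \u03c2.
proc el:tolower {s} {
    set s [string tolower $s]
    set out ""
    set n [string length $s]
    for {set i 0} {$i < $n} {incr i} {
        set c [string index $s $i]
        if {$c eq "\u03c3"} {
            # out-of-range indices yield "", which -strict treats as non-alpha
            set prev [string index $s [expr {$i - 1}]]
            set next [string index $s [expr {$i + 1}]]
            if {[string is alpha -strict $prev] && ![string is alpha -strict $next]} {
                set c "\u03c2"
            }
        }
        append out $c
    }
    return $out
}
# usage example
el:tolower "\u039f\u0394\u039f\u03a3"    ;# ΟΔΟΣ becomes οδος with a final ς
```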


LV: Richard, has this special case been mentioned to Scriptics so that they might have the routines do the right thing without programmers having to special case things?

RS: No. The problem is that there is no general solution. Even a system localized in Turkey would be wrong in always toupper/lowering as above, if dealing with filenames - imagine how much code would break (there's files like CONFIG.SYS...). The application must have the 'conscience' that a string is Turkish, and only then apply tr:to {upper,lower} to it.
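
One way to give the application that knowledge is to make the caller state the language explicitly, so that filenames and other language-neutral strings keep the default rules. A minimal sketch, in which the proc name lang:to and the language tag "tr" are assumptions made for illustration:

```tcl
# lang:to and its "tr" tag are hypothetical, chosen for this sketch only.
# Only strings declared to be Turkish get the dotted/dotless i treatment.
proc lang:to {lang cmd s} {
    switch -- $cmd {
        upper {set map [list i \u0130]}
        lower {set map [list I \u0131]}
        default {error "bad option '$cmd': must be upper or lower"}
    }
    if {$lang eq "tr"} {
        set s [string map $map $s]
    }
    string to$cmd $s
}
# usage examples
lang:to tr upper izmir         ;# Turkish text: dotted capital I
lang:to en upper config.sys    ;# filename: produces CONFIG.SYS
```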


KBK: Case conversion is also different in Dutch, where converting 'ijssel' to titlecase results in 'IJssel' (see Things Dutch).

AM: Alas, precious little software is aware of this - one culprit being MS Word (unless you take the pains to instruct it to do the "right thing"). At the beginning of a word, any combination "ij" is to be capitalised as "IJ".
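
Following the rule as stated here, a Dutch-aware titlecase for a single word could be sketched like this. The proc name nl:totitle is hypothetical, and the sketch handles one word at a time:

```tcl
# nl:totitle is a hypothetical helper name, not a built-in command.
# A leading "ij" (in any case) is capitalised as the pair "IJ";
# every other word falls back to the built-in string totitle.
proc nl:totitle {word} {
    if {[string match -nocase ij* $word]} {
        return IJ[string tolower [string range $word 2 end]]
    }
    return [string totitle $word]
}
# usage examples
nl:totitle ijssel       ;# produces IJssel
nl:totitle amsterdam    ;# produces Amsterdam
```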