Version 1 of An anomaly in case conversion

Updated 2003-05-07 20:58:48

Richard Suchenwirth -- Case conversion means changing uppercase characters to lowercase, or vice versa. Tcl does this quite nicely with string toupper/tolower, even for characters outside the ASCII range, e.g. the Cyrillic alphabet.
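
A quick check of that claim, as a minimal sketch. The Cyrillic word is written with \u escapes so the snippet does not depend on the encoding of the source file; the variable names are only illustrative.

```tcl
# "moskva" in lowercase Cyrillic, written with \u escapes
set lower "\u043c\u043e\u0441\u043a\u0432\u0430"
set upper [string toupper $lower]
# The round trip through toupper/tolower restores the original
set back [string tolower $upper]
```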

There is however one case in Turkish where special rules apply. The uppercase of our well-known letter "i" must be the dotted I \u0130, while the lowercase of good old "I" shall be a dotless i, \u0131. This means the dot is treated as a diacritic, which is pretty logical, but contrary to the habit in other languages. Here's a nifty routine for use in Turkish applications that treats these special cases as well as the rest of the conversion:

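 proc tr:to {cmd args} {
    # args is a list: join it so list quoting never leaks into the result
    set str [join $args]
    switch -- $cmd {
         upper {regsub -all i $str \u0130 str}
         lower {regsub -all I $str \u0131 str}
         default {
                 return -code error "bad option \"$cmd\":\
                                  must be upper or lower"}
    }
    # the remaining characters follow the normal conversion rules
    string to$cmd $str
 }
 # usage examples
 tr:to upper izmir   ;# -> \u0130ZM\u0130R
 tr:to lower YILDIZ  ;# -> y\u0131ld\u0131z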

Notice how the minor command, after filtering for correctness, is pasted into the string to$cmd call.

See also Eurolish for easy input of Turkish diacritics, and The Lish family for the whole picture.


An even worse anomaly, which is not correctly reversible, exists in German: the lowercase Eszett/scharfes S (ß, \u00DF) corresponds to the two uppercase letters SS, but not every SS sequence may be lowercased to ß.
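
The one-way nature of this mapping can be sketched as follows. The helper name de:toupper is hypothetical and not part of Tcl. It expands the sharp s explicitly before uppercasing because, as far as I know, string toupper converts character by character and so cannot produce two letters from one.

```tcl
# Hypothetical helper: uppercase a German string, expanding sharp s to SS
proc de:toupper {str} {
    string toupper [string map [list \u00df SS] $str]
}
set word "stra\u00dfe"          ;# "Strasse" spelled with sharp s
set up   [de:toupper $word]     ;# STRASSE
set down [string tolower $up]   ;# strasse: the sharp s does not come back
```

Restoring the sharp s would need a dictionary or spelling rules, since plain lowercasing cannot tell which SS sequences came from it.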


Greek Sigma: There are two different lowercase forms of the Greek letter Sigma, \u03C2 (used only at the end of a word) and \u03C3 (used in all other positions), but only one uppercase form, \u03A3. The preceding code point \u03A2 is not used, so software that wants to keep this distinction might 'abuse' it for an uppercase final Sigma... RS
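
A minimal sketch of lowercasing with the final form, assuming that "end of word" can be approximated as "a sigma that follows a letter and is not followed by a letter". The helper name el:tolower is hypothetical, and the rule ignores finer points such as abbreviations.

```tcl
# Hypothetical helper: lowercase Greek text, using final sigma at word ends
proc el:tolower {str} {
    set str [string tolower $str]
    # a sigma preceded by a letter and not followed by one is word-final
    regsub -all {([[:alpha:]])\u03c3(?![[:alpha:]])} $str "\\1\u03c2" str
    return $str
}
# "ODYSSEUS" in Greek capitals: two medial sigmas and one final sigma
set name  "\u039f\u0394\u03a5\u03a3\u03a3\u0395\u03a5\u03a3"
set lower [el:tolower $name]
```

Uppercasing the result again maps both lowercase forms to \u03A3, so the distinction is lost in that direction.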


LV: Richard, has this special case been mentioned to Scriptics so that they might have the routines do the right thing without programmers having to special case things?

RS: No. The problem is that there is no general solution. Even a system localized for Turkey would be wrong to always apply toupper/tolower as above when dealing with filenames; imagine how much code would break (there are files like CONFIG.SYS...). The application must know that a string is Turkish, and only then apply tr:to {upper,lower} to it.
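
One way to make that knowledge explicit is to have the caller pass the language along with the string. The following is only a sketch: the command name case:to and the language tag "tr" are assumptions made for illustration, not an existing Tcl API.

```tcl
# Hypothetical wrapper: the caller states the language of the string
proc case:to {cmd lang str} {
    switch -- $cmd {
        upper {set map [list i \u0130]}
        lower {set map [list I \u0131]}
        default {
            return -code error "bad option \"$cmd\": must be upper or lower"
        }
    }
    # only text known to be Turkish gets the dotted/dotless treatment
    if {$lang eq "tr"} {
        set str [string map $map $str]
    }
    string to$cmd $str
}
set city [case:to upper tr izmir]        ;# Turkish text
set file [case:to lower "" CONFIG.SYS]   ;# a filename, no language given
```

With this split, filenames and other language-neutral strings keep the default behaviour even on a Turkish system.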


KBK: Case conversion is also different in Dutch, where converting 'ijssel' to titlecase results in 'IJssel'.
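
A sketch of the Dutch rule, assuming it is enough to treat a leading "ij" as a single letter. The helper name nl:totitle is hypothetical. The built-in string totitle only capitalizes the first character.

```tcl
# Hypothetical helper: titlecase a Dutch word, capitalizing a leading "ij" as a unit
proc nl:totitle {word} {
    if {[string match -nocase ij* $word]} {
        return IJ[string tolower [string range $word 2 end]]
    }
    string totitle $word
}
set river [nl:totitle ijssel]    ;# IJssel
set city  [nl:totitle utrecht]   ;# Utrecht
```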


i18n - writing for the world | Arts and crafts of Tcl-Tk programming