Version 3 of characters

Updated 2002-11-26 19:25:24

Characters are abstractions of writing elements (e.g. letters, digits, punctuation characters, Chinese ideographs, ligatures...). Since Tcl 8.1, characters are represented internally in Unicode (see Unicode and UTF-8), so each character can be seen as an unsigned integer between 0 and 65535 (recent Unicode versions have crossed that boundary, but the Tcl implementation currently uses a maximum of 16 bits). Convert between numeric Unicode values and characters with

 set char [format %c $int]
 set int  [scan $char %c]

Watch out: int values above 65535 do not fail but yield characters from lower in the range again, while a negative int even produces two bogus characters. format does not warn, so it is better to test the value before calling it.
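
A minimal sketch of such a test, assuming the 16-bit limit described above (the bound would need adjusting on a Tcl build with a wider character range). The name intToChar is hypothetical, not a Tcl built-in:

```tcl
# Convert an integer to a character, refusing values that format
# would silently turn into something unexpected.
# Assumption: valid codes are 0..0xFFFF, as stated in the text above.
proc intToChar {int} {
    if {![string is integer -strict $int] || $int < 0 || $int > 0xFFFF} {
        error "character code out of range 0..65535: $int"
    }
    return [format %c $int]
}

puts [intToChar 65]             ;# prints A
puts [catch {intToChar -1}]     ;# prints 1, the call raised an error
puts [catch {intToChar 65536}]  ;# prints 1, the call raised an error
```

Raising an error is only one possible policy; returning an empty string or a replacement character would work just as well if the caller prefers not to use catch.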

Sequences of characters are called strings. Characters are not a separate data type in Tcl; they are represented as strings of length one (everything is a string). Encoded as UTF-8, a character takes one to three bytes in memory or in a file. Find out the byte length of a character with

 string bytelength $c ;# assuming [string length $c]==1
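
As an alternative sketch that does not rely on string bytelength, the character can be converted to UTF-8 explicitly and the resulting bytes counted. The helper name utf8Length is hypothetical; the three sample characters are chosen here to show the one, two and three byte cases:

```tcl
# Byte length of a string when encoded as UTF-8.
# utf8Length is an illustrative name, not part of Tcl.
proc utf8Length {str} {
    string length [encoding convertto utf-8 $str]
}

foreach c [list A \u00E9 \u20AC] {
    puts [format "U+%04X: %d byte(s)" [scan $c %c] [utf8Length $c]]
}
# prints:
# U+0041: 1 byte(s)
# U+00E9: 2 byte(s)
# U+20AC: 3 byte(s)
```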

String routines can be applied to single characters too, e.g. [string toupper]. Find out whether a character is in a given set (a character string) with

 expr {[string first $char $set]>=0}
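
Wrapped in a proc, the membership test reads more naturally at the call site. This is only a sketch; the names inSet and isVowel and the vowel set are made up for illustration:

```tcl
# Return 1 if char occurs in the character string set, else 0.
proc inSet {char set} {
    expr {[string first $char $set] >= 0}
}

# A specialised test built on it (hypothetical example set).
proc isVowel {char} {
    inSet $char aeiouAEIOU
}

puts [isVowel e]   ;# prints 1
puts [isVowel x]   ;# prints 0
```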

Because the Unicode values of characters fall in distinct ranges, checking whether a character's code lies within a range allows a more or less rough classification of its category:

 proc inRange {from to char} {
     # generic range checker
     set int [scan $char %c]
     expr {$int>=$from && $int <= $to}
 }
 interp alias {} isGreek {}    inRange 0x0386 0x03D6
 interp alias {} isCyrillic {} inRange 0x0400 0x04F9
 interp alias {} isHangul {}   inRange 0xAC00 0xD7A3
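
The same range test can also be driven from a table, so that one call names the category instead of answering yes or no for a single script. This is a sketch only: charCategory is a hypothetical name, and the three ranges are copied from the aliases above, so they are just as rough.

```tcl
# Return the name of the first range containing char, or "other".
proc charCategory {char} {
    set int [scan $char %c]
    foreach {name from to} {
        Greek    0x0386 0x03D6
        Cyrillic 0x0400 0x04F9
        Hangul   0xAC00 0xD7A3
    } {
        if {$int >= $from && $int <= $to} {
            return $name
        }
    }
    return other
}

puts [charCategory \u03B1]   ;# prints Greek
puts [charCategory \u0416]   ;# prints Cyrillic
puts [charCategory \uD55C]   ;# prints Hangul
puts [charCategory a]        ;# prints other
```

Adding another script then means adding one line to the table rather than defining another alias.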

See also Unicoded integer sets - Characters, glyphs, code-points, and byte-sequences - Non-ASCII characters


Category Concept | Arts and crafts of Tcl-Tk programming