Characters are abstractions of writing elements (e.g. letters, digits, punctuation characters, Chinese ideographs, ligatures...). Since Tcl 8.1, characters are represented internally as Unicode (see Unicode and UTF-8), so each character can be seen as an unsigned integer between 0 and 65535 (Unicode itself has since crossed that boundary, but Tcl 8.x by default uses at most 16 bits per character). Convert between numeric Unicode values and characters with
set char [format %c $int]
set int [scan $char %c]
Watch out: int values above 65535 wrap around and yield characters from the low end of the range again, while a negative int even produces two bogus characters. format does not warn, so it is better to test the value before calling it.
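The range test suggested above could look like the following minimal sketch. The name safeChar and the default upper limit of 0xFFFF are assumptions for illustration, not part of Tcl; adjust the limit if your build handles larger code points.

```tcl
# Hypothetical helper: convert a code point to a character, but reject
# values outside an assumed valid range instead of passing them to format.
proc safeChar {int {max 0xFFFF}} {
    if {![string is integer -strict $int] || $int < 0 || $int > $max} {
        error "code point out of range 0..$max: $int"
    }
    return [format %c $int]
}

set char [safeChar 65]      ;# the letter A
set int  [scan $char %c]    ;# back to 65
```

Calling safeChar with -1 or 70000 raises an error rather than silently returning bogus characters.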
Sequences of characters are called strings. Characters are not a separate data type in Tcl; they are represented as strings of length one (everything is a string). Encoded as UTF-8, a character takes one to three bytes in memory or in a file. Find out the byte length of a character with
string bytelength $c ;# assuming [string length $c]==1
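Where string bytelength is not available, the same number can be obtained by explicitly encoding the string as UTF-8 and counting the resulting bytes. This is only a sketch; the name utf8Length is an assumption for illustration.

```tcl
# Hypothetical helper: number of bytes a string occupies when encoded as UTF-8.
proc utf8Length {s} {
    # encoding convertto returns a byte string, so string length counts bytes
    string length [encoding convertto utf-8 $s]
}

utf8Length A        ;# 1 byte  (ASCII)
utf8Length \u00e9   ;# 2 bytes (e with acute accent)
utf8Length \u20ac   ;# 3 bytes (euro sign)
```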
String routines can be applied to single characters too, e.g. [string toupper]. Find out whether a character is in a given set (a character string) with
expr {[string first $char $set]>=0}
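Wrapped in a procedure, the membership test reads as follows. The name inSet is an assumption for illustration.

```tcl
# Hypothetical helper: 1 if char occurs in the string set, else 0.
proc inSet {char set} {
    expr {[string first $char $set] >= 0}
}

inSet e aeiou   ;# 1
inSet x aeiou   ;# 0
```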
As the Unicode values of characters fall into distinct ranges, checking whether a character's code lies within a range allows a more or less rough classification of its category:
proc inRange {from to char} {
    # generic range checker
    set int [scan $char %c]
    expr {$int>=$from && $int <= $to}
}
interp alias {} isGreek    {} inRange 0x0386 0x03D6
interp alias {} isCyrillic {} inRange 0x0400 0x04F9
interp alias {} isHangul   {} inRange 0xAC00 0xD7A3
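As a usage sketch, the same range checker can drive a small table-based classifier. The name scriptOf and the fallback value other are assumptions for illustration; the ranges are the three given above, and the classification is as rough as those ranges.

```tcl
proc inRange {from to char} {
    # generic range checker, as above
    set int [scan $char %c]
    expr {$int >= $from && $int <= $to}
}

# Hypothetical classifier: return the name of the first range containing char.
proc scriptOf {char} {
    foreach {name from to} {
        Greek    0x0386 0x03D6
        Cyrillic 0x0400 0x04F9
        Hangul   0xAC00 0xD7A3
    } {
        if {[inRange $from $to $char]} {return $name}
    }
    return other
}

scriptOf \u03a9   ;# Greek    (capital omega, U+03A9)
scriptOf \u0416   ;# Cyrillic (capital zhe, U+0416)
scriptOf A        ;# other
```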
See also Unicoded integer sets - Characters, glyphs, code-points, and byte-sequences - Non-ASCII characters