Version 7 of characters

Updated 2019-08-11 17:49:36 by pooryorick

A character is an abstraction of writing elements (e.g. letters, digits, punctuation characters, Chinese ideographs, ligatures...).

See Also

Unicoded integer sets
Characters, glyphs, code-points, and byte-sequences
Non-ASCII characters
Character

Description

In Tcl version 8.1 and later a string is defined as a sequence of Unicode characters. A Unicode character in the Basic Multilingual Plane can be seen as an unsigned integer between 0 and 65535. Convert between numeric code points and characters with:

set char [format %c $int]
set int  [scan $char %c]

Beware that int values above 65535 do not produce new characters but wrap around to lower code points again, while a negative int even produces two bogus characters. format does not warn, so it is better to test the value before calling it.
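
One way to do that test is to wrap format in a small range-checking helper. The following is only a minimal sketch: the proc name safeChar and the decision to raise an error (rather than, say, return an empty string) are assumptions made for illustration and are not part of Tcl.

```tcl
# safeChar is a hypothetical helper, not a Tcl built-in.
# It accepts only integers in the BMP range 0..65535 and raises an
# error for anything else instead of silently producing odd characters.
proc safeChar {int} {
    if {![string is integer -strict $int] || $int < 0 || $int > 0xFFFF} {
        error "code point out of range 0..65535: $int"
    }
    return [format %c $int]
}

puts [safeChar 0x41]            ;# prints A
puts [catch {safeChar 70000}]   ;# prints 1: the call raised an error
puts [catch {safeChar -1}]      ;# prints 1: the call raised an error
```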

Sequences of characters are called strings. Characters are not a separate data type in Tcl; they are represented as strings of length one (everything is a string). Represented as UTF-8, a character can be one to three bytes long in memory or in a file. Find out the byte length of a character with:

string bytelength $c ;# assuming [string length $c]==1
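
The byte length can also be obtained by encoding the character explicitly and measuring the result, which may be handy if string bytelength is not available in your Tcl. This is a sketch under that assumption; the proc name utf8Length is made up for illustration.

```tcl
# utf8Length is a hypothetical helper name.
# It converts the character to its UTF-8 byte sequence and counts the bytes.
proc utf8Length {c} {
    string length [encoding convertto utf-8 $c]
}

puts [utf8Length A]        ;# 1 (ASCII letter)
puts [utf8Length \u00e9]   ;# 2 (e with acute accent)
puts [utf8Length \u20ac]   ;# 3 (euro sign)
```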

String routines can be applied to single characters too, e.g. [string toupper]. Find out whether a character is in a given set (a character string) with:

expr {[string first $char $set]>=0}
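
Wrapped in a proc, the same test reads more naturally at the call site. A minimal sketch; the names inSet and vowels are assumptions chosen for this example.

```tcl
# inSet is a hypothetical helper: returns 1 if char occurs in charset, else 0.
proc inSet {charset char} {
    expr {[string first $char $charset] >= 0}
}

set vowels aeiouAEIOU
puts [inSet $vowels e]   ;# 1
puts [inSet $vowels x]   ;# 0
```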

As the Unicode code points of characters fall in distinct ranges, checking whether a character's code lies within a range allows a more or less rough classification of its category:

proc inRange {from to char} {
    # generic range checker
    set int [scan $char %c]
    expr {$int>=$from && $int <= $to}
}
interp alias {} isGreek {}    inRange 0x0386 0x03D6
interp alias {} isCyrillic {} inRange 0x0400 0x04F9
interp alias {} isHangul {}   inRange 0xAC00 0xD7A3
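
Each alias prepends its range bounds to the call, so the resulting command takes just the character as its argument. The usage sketch below repeats the definitions so that it runs on its own; the classify proc is a hypothetical addition for illustration, not part of the original code.

```tcl
proc inRange {from to char} {
    # generic range checker
    set int [scan $char %c]
    expr {$int >= $from && $int <= $to}
}
interp alias {} isGreek {}    inRange 0x0386 0x03D6
interp alias {} isCyrillic {} inRange 0x0400 0x04F9
interp alias {} isHangul {}   inRange 0xAC00 0xD7A3

# classify is a hypothetical helper: it tries each range test in turn
# and returns the name of the first one that matches, or "other".
proc classify {char} {
    foreach {name test} {Greek isGreek Cyrillic isCyrillic Hangul isHangul} {
        if {[$test $char]} {return $name}
    }
    return other
}

puts [classify \u03a9]   ;# Greek (capital omega)
puts [classify \u0416]   ;# Cyrillic (capital zhe)
puts [classify \ud55c]   ;# Hangul
puts [classify A]        ;# other
```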

Page Authors

pyk
Richard Suchenwirth