'''string is''' ''class'' ?'''-strict'''? ?'''-failindex''' ''varname''? ''string''

Returns '''1''' if ''string'' is a valid member of the specified character class, otherwise returns '''0'''. If '''-strict''' is specified, then an empty string returns '''0''', otherwise an empty string will return '''1''' on any class. If '''-failindex''' is specified, then if the function returns '''0''', the index in the string where the class was no longer valid will be stored in the variable named ''varname''. The ''varname'' will not be set if the function returns '''1'''. The following character classes are recognized (the class name can be abbreviated):

   alnum:   Any Unicode alphabet or digit character.
   alpha:   Any Unicode alphabet character.
   ascii:   Any character with a value less than \u0080 (those that are in the 7-bit ascii range).
   boolean:   Any of the forms allowed to Tcl_GetBoolean.
   control:   Any Unicode control character.
   digit:   Any Unicode digit character. Note that this includes characters outside of the [[0-9]] range.
   double:   Any of the valid forms for a double in Tcl, with optional surrounding whitespace. In case of under/overflow in the value, 0 is returned and the varname will contain -1.
   false:   Any of the forms allowed to Tcl_GetBoolean where the value is false.
   graph:   Any Unicode printing character, except space.
   integer:   Any of the valid forms for an integer in Tcl, with optional surrounding whitespace. In case of under/overflow in the value, 0 is returned and the varname will contain -1.
   list:   Any proper list structure, with optional surrounding whitespace. In case of improper list structure, 0 is returned and the varname will contain the index of the “element” where the list parsing fails, or -1 if this cannot be determined. (See: [string is list])
   lower:   Any Unicode lower case alphabet character.
   print:   Any Unicode printing character, including space.
   punct:   Any Unicode punctuation character.
   space:   Any Unicode space character.
   true:   Any of the forms allowed to Tcl_GetBoolean where the value is true.
   upper:   Any upper case alphabet character in the Unicode character set.
   wideinteger:   Any of the valid forms for a wide integer in Tcl, with optional surrounding whitespace. In case of under/overflow in the value, 0 is returned and the varname will contain -1. (See: [string is wideinteger])
   wordchar:   Any Unicode word character. That is any alphanumeric character, and any Unicode connector punctuation characters (e.g. underscore).
   xdigit:   Any hexadecimal digit character ([[0-9A-Fa-f]]).

In the case of '''boolean''', '''true''' and '''false''', if the function will return '''0''', then the ''varname'' will always be set to '''0''', due to the varied nature of a valid boolean value.

----
**Discussion**

<>Heritage of [[string is ...]]

A comparison with the '''ctype(3)''' man page (e.g., [http://www.gsp.com/cgi-bin/man.cgi?section=3&topic=ctype]) shows much agreement with the [string is] classes. A not-too-daring guess is that [C] has also contributed this piece of [Tcl heritage].

----
<>

<>Unicode and [[string is digit ...]]

Note that ''string is digit'' recognises Unicode digits that are not valid in `expr` expressions

'''string is digit''' ''string'' will return 1 if ''string'' is composed of "Any Unicode digit character." The documentation then goes on to say that "this includes characters outside of the [[0-9]] range." What other characters, aside from [[0-9]], are members of the '''digit''' character class?
[RS] As Tcl is great for [introspection], a few lines of code give the answer:

 proc udigits max {
     set res {}
     for {set i 0} {$i<=$max} {incr i} {
         if {[string is digit [format %c $i]]} {
             append res "\\u[format %04x $i] "
         }
     }
     set res
 }
 % udigits 65535
 \u0030 \u0031 \u0032 \u0033 \u0034 \u0035 \u0036 \u0037 \u0038 \u0039 \u0660 \u0661 \u0662 \u0663 \u0664 \u0665 \u0666 \u0667 \u0668 \u0669 \u06f0 \u06f1 \u06f2 \u06f3 \u06f4 \u06f5 \u06f6 \u06f7 \u06f8 \u06f9 \u0966 \u0967 \u0968 \u0969 \u096a \u096b \u096c \u096d \u096e \u096f \u09e6 \u09e7 \u09e8 \u09e9 \u09ea \u09eb \u09ec \u09ed \u09ee \u09ef \u0a66 \u0a67 \u0a68 \u0a69 \u0a6a \u0a6b \u0a6c \u0a6d \u0a6e \u0a6f \u0ae6 \u0ae7 \u0ae8 \u0ae9 \u0aea \u0aeb \u0aec \u0aed \u0aee \u0aef \u0b66 \u0b67 \u0b68 \u0b69 \u0b6a \u0b6b \u0b6c \u0b6d \u0b6e \u0b6f \u0be7 \u0be8 \u0be9 \u0bea \u0beb \u0bec \u0bed \u0bee \u0bef \u0c66 \u0c67 \u0c68 \u0c69 \u0c6a \u0c6b \u0c6c \u0c6d \u0c6e \u0c6f \u0ce6 \u0ce7 \u0ce8 \u0ce9 \u0cea \u0ceb \u0cec \u0ced \u0cee \u0cef \u0d66 \u0d67 \u0d68 \u0d69 \u0d6a \u0d6b \u0d6c \u0d6d \u0d6e \u0d6f \u0e50 \u0e51 \u0e52 \u0e53 \u0e54 \u0e55 \u0e56 \u0e57 \u0e58 \u0e59 \u0ed0 \u0ed1 \u0ed2 \u0ed3 \u0ed4 \u0ed5 \u0ed6 \u0ed7 \u0ed8 \u0ed9 \u0f20 \u0f21 \u0f22 \u0f23 \u0f24 \u0f25 \u0f26 \u0f27 \u0f28 \u0f29 \u1040 \u1041 \u1042 \u1043 \u1044 \u1045 \u1046 \u1047 \u1048 \u1049 \u1369 \u136a \u136b \u136c \u136d \u136e \u136f \u1370 \u1371 \u17e0 \u17e1 \u17e2 \u17e3 \u17e4 \u17e5 \u17e6 \u17e7 \u17e8 \u17e9 \u1810 \u1811 \u1812 \u1813 \u1814 \u1815 \u1816 \u1817 \u1818 \u1819 \uff10 \uff11 \uff12 \uff13 \uff14 \uff15 \uff16 \uff17 \uff18 \uff19

Here they are "literally":

0 1 2 3 4 5 6 7 8 9 ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹ ० १ २ ३ ४ ५ ६ ७ ८ ९ ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯ ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯ ૦ ૧ ૨ ૩ ૪ ૫ ૬ ૭ ૮ ૯ ୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯ ௧ ௨ ௩ ௪ ௫ ௬ ௭ ௮ ௯ ౦ ౧ ౨ ౩ ౪ ౫ ౬ ౭ ౮ ౯ ೦ ೧ ೨ ೩ ೪ ೫ ೬ ೭ ೮ ೯ ൦ ൧ ൨ ൩ ൪ ൫ ൬ ൭ ൮ ൯ ๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙ ໐ ໑ ໒ ໓ ໔ ໕ ໖ ໗ ໘ ໙ ༠ ༡ ༢ ༣ ༤ ༥ ༦ ༧ ༨ ༩ ၀ ၁ ၂ ၃ ၄ ၅ ၆ ၇ ၈ ၉ ፩ ፪ ፫ ፬ ፭ ፮ ፯ ፰ ፱ ០ ១
២ ៣ ៤ ៥ ៦ ៧ ៨ ៩ ᠐ ᠑ ᠒ ᠓ ᠔ ᠕ ᠖ ᠗ ᠘ ᠙ ０１２３４５６７８９

For instance, \u0660-\u0669 are the "Arabic-Indic" digits as used in Arab countries. \uFF10-\uFF19 are "fullwidth" variants of 0-9, etc.

[Lars H]: Quite a lot, as you can see; there are plenty of digit sets in Unicode. The authoritative source on the subject should be the Unicode Character Database (see [http://www.unicode.org/Public/UNIDATA/UCD.html] for format and links), but whether Tcl really uses that is another matter. A comparison of the above with the UCD shows that [[string is digit]] returns 1 for most (but not quite all!) of the characters from the class Nd (decimal digits), so that is probably what it is supposed to test.

''[escargo] 14 Mar 2006'' - I had my own peek starting at http://www.unicode.org/charts/symbols.html where there are links to four PDF files under '''Numbers and Digits''': ASCII Digits, Fullwidth ASCII Digits, Number Forms, and Super and Subscripts. All of these might legitimately contain digits.

[NEM] 14 Mar 2006: Interesting. Note though that most of these "digits" are not valid for expressions fed to [expr]. So, if you are using [[string is digit]] to validate arguments to expressions then you probably have a bug.

[escargo] - If there are ''digits'' that are not valid when fed to [expr], is that a problem in the implementation of [expr]? It almost makes me think that there ought to be a ''string normalize'' that changes Unicode digits of different stripes down to the [[0-9]] range expected by [expr]. If a character looks like a digit to humans but is not one to [expr], then the problem is really in [expr]. (The solution ''might not'' be in [expr] though.)

[NEM] My gut reaction is the same as yours: that [expr] should be enhanced to recognize these alternative numeral characters as digits. This seems to be sensible, and would be backwards compatible as the characters are apparently not allowed at all by [expr] currently.
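
A minimal sketch of the ''string normalize'' idea raised above, assuming only three of the digit sets from the list need handling. The name `asciiDigits` and the choice of sets are illustrative assumptions; nothing like this exists in the Tcl core.

======
# asciiDigits is a hypothetical helper, not a Tcl built-in.
# It rewrites digits from a few Unicode digit sets as ASCII 0-9
# and leaves every other character untouched.
proc asciiDigits {str} {
    set map {}
    # Code point of the "zero" of each set (an assumed selection):
    # Arabic-Indic, Extended Arabic-Indic, Fullwidth.
    foreach zero {0x0660 0x06f0 0xff10} {
        for {set d 0} {$d < 10} {incr d} {
            lappend map [format %c [expr {$zero + $d}]] $d
        }
    }
    return [string map $map $str]
}

puts [asciiDigits \u0661\u0662\u0663]          ;# prints 123
puts [expr {[asciiDigits \uff14\uff12] + 1}]   ;# prints 43
======

Building the table explicitly, rather than deriving it from [[string is digit]], sidesteps the incomplete sets in the output above (for example \u0be7-\u0bef has no zero), at the cost of having to extend the list by hand.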
The expr(n) manpage could also do with clarification about what counts as an integer or float -- at present, it just says "decimal" (or octal/hexadecimal), as far as I can see.

[slebetman] 15 Mar 2006: Well, if you are not using a number larger than 18446744073709551615.99, then you can use [[string is double $number]] as the test instead of [[string is digit $number]]. Currently under [Microsoft Windows] on 32-bit Pentium [[string is double]] returns 0 for 18446744073709551616 and larger (found out by trial and error). Alternatively, if you want to check for integers and you're not using a number larger than 4294967295 then you can use [[string is integer $number]]:

 (bin) 59 % string is digit \u0669
 1
 (bin) 60 % string is double \u0669
 0

----
<>

<>Octal and [[string is integer ...]]

''string is integer'' has a complication because a leading zero implies octal, but 09 etc are not valid octal numbers

'''[NLast] - 2010-01-05'''

'''string is integer''' shows unexpected behaviour as does '''string is double''':

======
1 % string is integer 09
0
2 % string is integer 9
1
======

Why is '''09''' not an integer?

[AMG]: 'Cuz it's not, and neither is 08. The leading zero indicates that it is an octal number, but 8 and 9 are not valid octal digits. By the same token, 12ABC is not a valid integer, even though it looks like hexadecimal.

======
% expr {09 + 1}
missing operator at _@_
in expression "0_@_9 + 1";
looks like invalid octal number
======

Valid formats for integers:

======
0       - zero is always valid
56789   - lack of leading zeros - decimal
0xABCDE - leading 0x or 0X - hexadecimal
034567  - leading 0 - octal
0o34567 - leading 0o or 0O - octal (Tcl 8.5+)
0b01010 - leading 0b or 0B - binary (Tcl 8.5+)
======

If you want to force a number with leading zeros to be interpreted as decimal, use the %d [scan] format code.
======
scan 09 %d    ;# returns 9
scan 09 %i    ;# returns 0, since the 9 is ignored as invalid
scan 0123 %d  ;# returns 123
scan 0123 %i  ;# returns 83, which is 1*8² + 2*8¹ + 3*8⁰ = 64 + 16 + 3
======

'''[NLast] - 2010-01-06 15:25:38'''

Thank you [AMG] for sharing your opinion. However, it doesn't sound logical to me. The main reason is the absence of a class "octal" (or "hexadecimal"): we only have the class "integer". You've just written: ''"Valid formats for integers: ... blah-blah ... 034567 - leading 0 - octal"'' meaning 034567 '''is''' digital ('''namely''' octal). That's what you've stated, not me. And that's right. I just expected '''string is''' to test if the string is compatible with '''expr'''. And now I see I have to use '''catch''' which is just not very nice.

[slebetman]: Actually, for what you want to do (test if string is compatible with expr), [string is] IS doing what you want. You should note that in [expr] 09 is an invalid number because '''9''' is an invalid octal digit. And [string is] is telling you that.

The 0nn format for octals was inherited from C. Yes it sucks. Yes it's prone to bugs - especially when you're trying to calculate time. And yes there was a proposal to change it and yes the majority agreed that it should go away, but the Tcl Core Team highly values backwards compatibility, so for now it's staying until we're sure removing the syntax doesn't break anything.

But it is not an unexpected behavior of [[string is]]. It's the behavior of Tcl itself in how numbers are interpreted. And besides C and Tcl, 09 is also an invalid octal number in: Java, Perl, Ruby, JavaScript, Python... in fact most mainstream programming languages from the 60s until today (I blame it on C). C# stands out for not supporting the 0nn notation.

[AMG]: So... why don't we have [[string is decimal]], [[string is octal]], [[string is hexadecimal]], and [[string is binary]]? One question that comes to mind is whether or not they will all accept "0". Is "0" an octal string?
It starts with a leading zero. Is "0" a hexadecimal string? It doesn't start with "0x". Is "00" valid decimal? It has an extra leading zero, signifying octal. And so on. Probably we don't need to bloat the core with these new commands, since they can easily be implemented using [regexp], enabling the programmer to decide on a case-by-case basis whether "0" is valid octal, etc.

----
<>

<>Large numbers and [[string is integer ...]]

10000000000 is too large to be a '''string is integer''' (fixed in Tcl 8.5)

[Lars H]: What really sucks about '''string is integer''' is the following:

======
% set tcl_platform(wordSize)
4
% string is integer 10000000000
0
% string is integer 1000000000
1
======

Whether it accepts something as an "integer" depends on whether it will fit into a machine word! [expr] is, as of Tcl 8.5, relieved of that silly restriction. Sure, some other commands still rely on monoword arithmetic:

======
% lindex {} 1000000000
% lindex {} 10000000000
Error: bad index "10000000000": must be integer?[+-]integer? or end?[+-]integer?
======

but these are arguably limitations/bugs that should be fixed, rather than features to be cherished.

[AMG]: What are the real limits on [[string is integer]]? On the [timebox] page [http://wiki.tcl.tk/28265#pagetoc6c348a14], I found that it seems to accept any integer whose magnitude (ignoring sign) fits in a machine word (32-bit integer, for me). What use is that? Repeating here:

======
% string is integer 4294967295
1
% string is integer -4294967295
1
% string is integer 4294967296
0
% string is integer -4294967296
0
======

'''[DGP]''' Supporting the customary C hacker practice of using hex notation to denote their values, and not having to worry about what leading sign is there.

======
% string is integer 0xdeadbeef
1
======

----
<>

<>[EIAS] humor

[AMG]: Here's an April 1 [TIP] for you: [[string is string]].
:^) It'll return true (1) whenever given one argument, no matter what that argument is, since [everything is a string]! A -strict option would cause it to return false (0) if given an empty string.

''(update)'' On second thought, -strict would have no effect, since an empty string ''is still a string!!''

----
<>

**See also**

   * [string]

<> Command | String Processing | Tcl syntax help