string is

'''`[string] is`''' tests the validity of various interpretations of a value.



** Synopsis **

'''string is''' ''class ?'''''-strict'''''? ?'''''-failindex''' ''varname? string''
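
As a rough illustration of the options in the synopsis, the transcript below uses the `digit` class; the variable name `idx` is arbitrary. When a test fails, `-failindex` stores the index of the first offending character in the named variable:

======
% string is digit -strict 12345
1
% string is digit -strict -failindex idx 123x5
0
% set idx
3
======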



** Description **

A comparison with the [http://www.gsp.com/cgi-bin/man.cgi?section=3&topic=ctype%|%ctype] documentation shows that the [string is] classes closely follow it. A not-too-daring guess is that [C] also contributed this piece of [Tcl heritage].

Without `-strict`, every `string is` command returns `1` for the [empty
string].  This is because the original motivation for `[string is]` was to
validate incomplete user input.

[AMG]: Note that it doesn't quite meet this goal because there are cases where prefixes of (e.g.) a number are not themselves valid numbers.  The obvious example is "-" which is invalid whereas "-1" is fine.  Another example is "5.2e" which is invalid, whereas "5.2e6" is fine.  Other troublesome situations arise during replace operations which are usually implemented as deletes followed by inserts, with validation following each.  The delete may fail validation even though the insert (which isn't allowed to happen) would have fixed it.
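
The prefix cases mentioned above can be reproduced in an interactive session. This is only a sketch of the behaviour described, using `-strict` so that the empty string does not muddy the picture:

======
% string is integer -strict -
0
% string is integer -strict -1
1
% string is double -strict 5.2e
0
% string is double -strict 5.2e6
1
======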



** `string is script` **

Otherwise known as `[info complete]`.
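
A short transcript of `[info complete]`, assuming a stock `tclsh`. Note that it only checks for unmatched quotes, braces, and brackets; it does not check that the commands in the script exist (`no_such_command` is a made-up name):

======
% info complete {puts "hello"}
1
% info complete {puts "hello}
0
% info complete "if {1} \{"
0
% info complete {no_such_command a b c}
1
======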



** `string is dict` **

[PYK] 2015-12-27: The most performant way to test whether a string is a valid dictionary is probably `[dict size]`:

======
proc isdict value {
    expr {![catch {dict size $value}]}
}
======
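
A few sample calls may make the behaviour concrete. The proc is repeated so that the snippet stands alone; the sample values are made up for illustration:

======
# Repeated from above so that the snippet stands alone.
proc isdict value {
    expr {![catch {dict size $value}]}
}

isdict {a 1 b 2}   ;# 1: two key/value pairs
isdict {}          ;# 1: the empty dictionary
isdict {a 1 b}     ;# 0: odd number of elements
isdict "a \{"      ;# 0: not even a valid list
======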



** `string is decimal` and `string is numeric` **

[PYK] 2016-01-14, 2019-03-07:

The following command returns the value with surrounding whitespace removed if
it is a number in decimal representation, and the empty string otherwise:


======
proc isdecimal value {
    # recognizes all numbers with a mantissa, since only decimal representations are allowed.
    set value [string trim $value]
    if {
    [expr {
        [string is double -strict $value]
        && (
            ![string is entier $value]
            ||
            ![regexp {^\s*[+-]*?0[bBoOxX]?} $value]
        )}]
    } {
        return $value
    }

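    # account for 0-padded decimal integers; the "+" keeps at least one
    # character, so that "0" yields "0" rather than the empty string
    regsub {^([+-])*0*([^[:space:]]+)$} $value {\1\2} value
    if {[string is double -strict $value]} {
        return $value
    }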
}
======

The following command does the same thing for any valid number, but treats
numbers beginning with `0` as decimal instead of octal:

======
proc isnumeric value {
    set value [string trim $value]
    # Use [string is double] to accept Inf and NaN
    if {[string is double -strict $value]} {
        return $value
    }
    regsub {^\s*([+-])*0[BOXbox]?0*([^[:space:]]*)\s*$} $value {\1\2} value
    if {[string is double -strict $value]} {
        return $value
    }
    return {} 
}
======

These commands are also available as `[ycl%|%ycl::string::isnumeric]` and
`[ycl::string::isdecimal]`.



** Dealing with Integers (what's an entier?) **

To test if a string is an integer, one would use '''string is integer''', right?   ''Not so fast!''

A careful perusal of the manual will show that you might want one of several things:

   * '''string is integer''':  for an integer whose absolute value fits in 32 bits (not platform dependent!), written in one of the forms Tcl can parse
   * '''string is entier''':  an integer (of any size), written in one of the forms Tcl can parse
   * '''string is digit''':  a string that consists of only digits

Often, Tcl code shouldn't need to care about the size of an integer, so '''string is entier''' is probably what you want.  See [entier] for some historical etymology.  For more background on the limits of '''string is integer''' through history, see [32-bit integer overflow].

Some examples:

======
% string is integer -1234
1
% string is integer 0xdeadbeef   ;# hex is fine too
1
% string is integer 007          ;# so is octal ...
1
% string is integer 09           ;# but this is not!
0
% string is integer ""
1
% string is integer -strict ""
0
% string is integer 12345678999  ;# too big for 32 bits!
0
% string is entier 12345678999
1
======

`[string] is integer` tests whether the ''absolute value'' of a number '''fits in 32 bits''', so it accepts the full unsigned range as well as its negation:

======
% set x [expr {2**32-1}]
4294967295
% string is integer $x
1
% string is integer -$x
1
% incr x
4294967296
% string is integer $x
0
% string is integer -$x
0
======

If you find yourself tripping over decimal numbers written with leading `0`, the solution is probably to `[scan]` them as [entier%|%entier] values. The `ll` size modifier keeps `[scan]` from truncating large values:

======
% scan 012345678999 %lld x
1
% set x
12345678999
======



** [Tcl and octal numbers%|%Octal] **

`string is integer` has a complication: a leading zero implies octal, but `09` etc. are not valid octal numbers.

'''[NLast] 2010-01-05''':

`string is integer` shows unexpected behaviour, as does `string is double`:

======
% string is integer 09
0
% string is integer 9
1
======

Why is `09` not an integer?

[AMG]: 'Cuz it's not, and neither is `08`.  The leading zero indicates that it is an octal number, but `8` and `9` are not valid octal digits.  By the same token, `12ABC` is not a valid integer, even though it looks like hexadecimal.

======
% expr {09 + 1}
missing operator at _@_
in expression "0_@_9 + 1";
looks like invalid octal number
======

Valid formats for integers:

======none
      0 - zero                   - always valid
  56789 - no leading zero        - decimal
0xABCDE - leading 0x or 0X       - hexadecimal
 034567 - leading 0              - octal
0o34567 - leading 0o or 0O       - octal (Tcl 8.5+)
0b01010 - leading 0b or 0B       - binary (Tcl 8.5+)
======

If you want to force a number with leading zeros to be interpreted as decimal, use the %d [scan] format code.

======
scan 09 %d   ;# returns 9
scan 09 %i   ;# returns 0, since the 9 is ignored as invalid
scan 0123 %d ;# returns 123
scan 0123 %i ;# returns 83, which is 1*8² + 2*8¹ + 3*8⁰ = 64 + 16 + 3
======
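
Building on the `%d` idea, a validating converter could look roughly like the sketch below. The name `todecimal` is hypothetical, and the sketch assumes Tcl 8.5 or later for `lassign` and the `ll` size modifier. `%n` is used to check that the whole string was consumed:

======
# Hypothetical helper: convert a possibly zero-padded decimal integer.
proc todecimal value {
    # %lld reads a decimal integer of any size; %n reports how many
    # characters had been consumed at that point.
    lassign [scan $value %lld%n] number count
    if {$number eq {} || $count != [string length $value]} {
        return -code error "not a decimal integer: \"$value\""
    }
    return $number
}

todecimal 09      ;# 9
todecimal 0123    ;# 123
todecimal 0x10    ;# raises an error: only the leading 0 is decimal
======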

'''[NLast] 2010-01-06 15:25:38''':

Thank you [AMG] for sharing your opinion. However, it doesn't sound logical to me. The main reason is the absence of a class "octal" (or "hexadecimal"): we only have the class "integer". You've just written: ''"Valid formats for integers: ... blah-blah ... 034567 - leading 0 - octal"'', meaning 034567 '''is''' digital ('''namely''' octal). That's what you've stated, not me. And that's right.

I just expected `[string is]` to test if the string is compatible with `[expr]`. And now I see I have to use `[catch]`, which is just not very nice.

[slebetman]: Actually, for what you want to do (test if string is compatible with `[expr]`), `string is` '''is''' doing what you want. In `[expr]`, `09` is an invalid number because `9` is an invalid octal digit, and `string is` is telling you exactly that.

The `0`''nn'' format for octals was inherited from [C]. Yes it sucks. Yes it's prone to bugs - especially when you're trying to calculate time. And yes there was a proposal to change it and yes the majority agreed that it should go away but the Tcl Core Team highly values backwards compatibility so for now it's staying until we're sure removing the syntax doesn't break anything.

But it is not an unexpected behavior of `string is`. It's the behavior of Tcl itself in how numbers are interpreted. And besides C and Tcl, `09` is also an invalid octal number in Java, Perl, Ruby, Javascript, Python... in fact, in most mainstream programming languages from the 60s until today (I blame it on C). [C#] stands out for not supporting the 0nn notation.

[AMG]: So... why don't we have [[string is decimal]], [[string is octal]], [[string is hexadecimal]], and [[string is binary]]?  One question that comes to mind is whether or not they will all accept "0".  Is "0" an octal string?  It starts with a leading zero.  Is "0" a hexadecimal string?  It doesn't start with "0x".  Is "00" valid decimal?  It has an extra leading zero, signifying octal.  And so on.  Probably we don't need to bloat the core with these new commands, since they can easily be implemented using [regexp], enabling the programmer to decide on a case-by-case basis whether `0` is valid octal, etc.
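
The per-base tests suggested above could be sketched with `[regexp]` as follows. The proc names are hypothetical, and the choices baked into the patterns (a bare `0` counts only as decimal; hexadecimal and binary require their prefix; no sign is accepted) are just one possible policy:

======
# Hypothetical helpers; each returns 1 or 0.
proc isplaindecimal value {regexp {^(0|[1-9][0-9]*)$} $value}
proc isoctal        value {regexp {^0[oO]?[0-7]+$} $value}
proc ishex          value {regexp {^0[xX][0-9a-fA-F]+$} $value}
proc isbinary       value {regexp {^0[bB][01]+$} $value}

isplaindecimal 0   ;# 1
isplaindecimal 00  ;# 0: the extra leading zero is rejected
isoctal 017        ;# 1
isoctal 09         ;# 0: 9 is not an octal digit
ishex 0xFF         ;# 1
isbinary 0b101     ;# 1
======

Changing the policy, for example to let `0` count as octal too, only means editing the corresponding pattern.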



** Unicode and `string is digit` **

`string is digit` recognises Unicode digits that are not valid in `expr` expressions.  What other characters, aside from [[0-9]], are members of the '''digit''' character class? 

[RS]: As Tcl is great for [introspection], a few lines of code give the answer:

======
proc udigits max {
    set res {}
    for {set i 0} {$i<=$max} {incr i} {
        if {[string is digit [format %c $i]]} {
            append res "\\u[format %04x $i] "
        }
    }
    set res
}
======

======none
% udigits 65535
\u0030 \u0031 \u0032 \u0033 \u0034 \u0035 \u0036 \u0037 \u0038 \u0039 \u0660 \u0661 \u0662 \u0663 \u0664 \u0665 \u0666 \u0667 \u0668 \u0669 \u06f0 \u06f1 \u06f2 \u06f3 \u06f4 \u06f5 \u06f6 \u06f7 \u06f8 \u06f9 \u0966 \u0967 \u0968 \u0969 \u096a \u096b \u096c \u096d \u096e \u096f \u09e6 \u09e7 \u09e8 \u09e9 \u09ea \u09eb \u09ec \u09ed \u09ee \u09ef \u0a66 \u0a67 \u0a68 \u0a69 \u0a6a \u0a6b \u0a6c \u0a6d \u0a6e \u0a6f \u0ae6 \u0ae7 \u0ae8 \u0ae9 \u0aea \u0aeb \u0aec \u0aed \u0aee \u0aef \u0b66 \u0b67 \u0b68 \u0b69 \u0b6a \u0b6b \u0b6c \u0b6d \u0b6e \u0b6f \u0be7 \u0be8 \u0be9 \u0bea \u0beb \u0bec \u0bed \u0bee \u0bef \u0c66 \u0c67 \u0c68 \u0c69 \u0c6a \u0c6b \u0c6c \u0c6d \u0c6e \u0c6f \u0ce6 \u0ce7 \u0ce8 \u0ce9 \u0cea \u0ceb \u0cec \u0ced \u0cee \u0cef \u0d66 \u0d67 \u0d68 \u0d69 \u0d6a \u0d6b \u0d6c \u0d6d \u0d6e \u0d6f \u0e50 \u0e51 \u0e52 \u0e53 \u0e54 \u0e55 \u0e56 \u0e57 \u0e58 \u0e59 \u0ed0 \u0ed1 \u0ed2 \u0ed3 \u0ed4 \u0ed5 \u0ed6 \u0ed7 \u0ed8 \u0ed9 \u0f20 \u0f21 \u0f22 \u0f23 \u0f24 \u0f25 \u0f26 \u0f27 \u0f28 \u0f29 \u1040 \u1041 \u1042 \u1043 \u1044 \u1045 \u1046 \u1047 \u1048 \u1049 \u1369 \u136a \u136b \u136c \u136d \u136e \u136f \u1370 \u1371 \u17e0 \u17e1 \u17e2 \u17e3 \u17e4 \u17e5 \u17e6 \u17e7 \u17e8 \u17e9 \u1810 \u1811 \u1812 \u1813 \u1814 \u1815 \u1816 \u1817 \u1818 \u1819 \uff10 \uff11 \uff12 \uff13 \uff14 \uff15 \uff16 \uff17 \uff18 \uff19 
======

Here they are "literally":

======
0 1 2 3 4 5 6 7 8 9 ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹ ० १ २ ३ ४ ५ ६ ७ ८ ९ ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯ ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯ ૦ ૧ ૨ ૩ ૪ ૫ ૬ ૭ ૮ ૯ ୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯ ௧ ௨ ௩ ௪ ௫ ௬ ௭ ௮ ௯ ౦ ౧ ౨ ౩ ౪ ౫ ౬ ౭ ౮ ౯ ೦ ೧ ೨ ೩ ೪ ೫ ೬ ೭ ೮ ೯ ൦ ൧ ൨ ൩ ൪ ൫ ൬ ൭ ൮ ൯ ๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙ ໐ ໑ ໒ ໓ ໔ ໕ ໖ ໗ ໘ ໙ ༠ ༡ ༢ ༣ ༤ ༥ ༦ ༧ ༨ ༩ ၀ ၁ ၂ ၃ ၄ ၅ ၆ ၇ ၈ ၉ ፩ ፪ ፫ ፬ ፭ ፮ ፯ ፰ ፱ ០ ១ ២ ៣ ៤ ៥ ៦ ៧ ៨ ៩ ᠐ ᠑ ᠒ ᠓ ᠔ ᠕ ᠖ ᠗ ᠘ ᠙ 0 1 2 3 4 5 6 7 8 9
======

For instance, \u0660-\u0669 are the "Arabic-Indic" digits as used in Arab countries. \uFF10-\uFF19 are "fullwidth" variants of 0-9, etc.

[Lars H]: Quite a lot, as you can see; there are plenty of digit sets in Unicode. The authoritative source on the subject should be the Unicode Character Database (see [http://www.unicode.org/Public/UNIDATA/UCD.html] for format and links), but whether Tcl really uses that is another matter. A comparison of the above with the UCD shows that `string is digit` returns `1` for most (but not quite all!) of the characters from the class ''Nd'' (decimal digits), so that is probably what it is supposed to test.

[escargo] 2006-03-14: I had my own peek starting at http://www.unicode.org/charts/symbols.html where there are links to
four PDF files under '''Numbers and Digits''': ASCII Digits, Fullwidth ASCII Digits, Number Forms, and Super and Subscripts.
All of these might legitimately contain digits.

[NEM] 2006-03-14: Interesting. Note though that most of these "digits" are not valid for
expressions fed to `[expr]`. So, if you are using `string is digit` to validate arguments to
expressions, then you probably have a bug.

[escargo]: If there are ''digits'' that are not valid when fed to `[expr]`, is that a problem in
the implementation of `[expr]`?  It almost makes me think that there ought to be a ''string normalize''
that changes Unicode digits of different stripes down to the [[0-9]] range expected by `[expr]`.

It's like if the character looks to humans like a digit but it is not one to [expr] then the problem
is really in `[expr]`.  (The solution ''might not'' be in `[expr]` though.)
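
A ''string normalize'' of this kind does not exist in Tcl, but the idea could be sketched with `[string map]`. The name `normalizedigits` is hypothetical, and the sketch assumes that each listed code point is the zero of a run of ten consecutive decimal digits; it covers only the runs that appear in full in the `udigits` output above:

======
# Hypothetical helper, not part of Tcl: map some Unicode decimal digits to 0-9.
proc normalizedigits value {
    set map {}
    # Assumption: each code point below is the zero of ten consecutive digits.
    foreach zero {
        0x0660 0x06f0 0x0966 0x09e6 0x0a66 0x0ae6 0x0b66 0x0c66 0x0ce6
        0x0d66 0x0e50 0x0ed0 0x0f20 0x1040 0x17e0 0x1810 0xff10
    } {
        for {set d 0} {$d < 10} {incr d} {
            lappend map [format %c [expr {$zero + $d}]] $d
        }
    }
    string map $map $value
}

set n [normalizedigits \u0664\u0662]   ;# Arabic-Indic four and two
expr {$n + 1}                          ;# 43
======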

[NEM]: My gut reaction is the same as yours: that `[expr]` should be enhanced to recognize
these alternative numeral characters as digits. This seems to be sensible, and would be
backwards-compatible as the characters are apparently not allowed at all by `[expr]`
currently. The expr(n) manpage could also do with clarification about what counts as an
integer or float -- at present, it just says "decimal" (or octal/hexadecimal), as far
as I can see.

[slebetman] 2006-03-15: Well, if you are not using a number larger than 18446744073709551615.99, then you can use `string is double $number` as the test instead of `string is digit $number`.
Currently under [Microsoft Windows] on 32-bit Pentium, `string is double` returns `0` for 18446744073709551616
and larger (found out by trial and error). Alternatively, if you want to check for integers and you're not using a number larger than 4294967295 then you can use `string is integer $number`:

======none
% string is digit \u0669
1
% string is double \u0669
0
======



** Humour **

[AMG]: Here's an April 1 [TIP] for you: [[string is string]]. :^)  It'll return true (1) whenever given one argument, no matter what that argument is, since [everything is a string]!  A -strict option would cause it to return false (0) if given empty string.  ''(update)'' On second thought, -strict would have no effect, since an empty string ''is still a string!!''

[aspect]: While I can't invoke [http://tip.tcl.tk/323%|%TIP #323] here, I think [[string is string]] ought to return true with no argument.  Not sure about [[string is string -strict]] though!



<<categories>> Command | String Processing