English number reader

Richard Suchenwirth - A harmless question on news:comp.lang.tcl, whether Tcl has an equivalent to C's isdigit(), led to the following code, which checks whether a string is a well-formed English number.

  proc isnumber s {
        set re ^((one|two|three|four|five|six|seven|eight|nine|ten
        append re |eleven|twelve|thir|fif|teen|twen|ty|forty|fifty
        append re |eighty|hundred|thousand|and
        append re {)[ -]?)+$}
        regexp $re $s
  }
  isnumber {nineteen hundred and seventy-six}
  1
  isnumber twenty-six
  1
  isnumber twenty-something
  0

Dan Smart commented:

 Hmm,
 isnumber {teen thousand and ty nine}
 1
 isnumber teen
 1
 isnumber ty
 1
 isnumber fif
 1
 isnumber {one and thousand and two and fif and thir and hundred}
 1

Ooops - I guess it's back to the drawing board...

It was. Here's version 0.2, written at midnight:

 proc en2num s {
    array set dic {
        zero 0  one 1 two 2 three 3 four 4 five 5 six 6 seven 7 
        eight 8 nine 9 ten 10 eleven 11 twelve 12 thirteen 13
        fifteen 15 eighteen 18 twenty 20 thirty 30 forty 40 
        fifty 50 eighty 80 score 20 hundred 100 thousand 1000 
        million 1000000 millions 1000000
    }
    regsub -all " and |-" [string trim $s] " " s
    set res [list] ;# will become the translation to math
    foreach i [split $s] {
         if {[info exists dic($i)]} {
            if {$dic($i)>=1000 && [llength $res]} {
                regsub 000 $res "" res ;# will multiply by 1000 later
                set res "($res)"
            }
             if {($dic($i)>99 || $i=="score") && [llength $res]} {
                lappend res *
            } else {lappend res +}
            lappend res $dic($i)
         } elseif {[regexp {^(.+)teen$} $i -> t]&&[info exists dic($t)]} {
             ;# anchored, so that e.g. "sixteenish" is no longer accepted
             lappend res + [expr {$dic($t)+10}]
         } elseif {[regexp {^(.+)ty$} $i -> t]&&[info exists dic($t)]} {
             lappend res + [expr {$dic($t)*10}]
        } else {return -code error "$s is not a number: $i"}
    }
    expr $res
 }
 proc en:isnum s {expr ![catch {en2num $s}]}

This version is a parser: where possible, it extracts the value of an English number by building up an arithmetic expression and finally evaluating it. It correctly rejects Dan's test cases and even supports some archaic forms (backwards compatible ;-)

 en2num {four score and seven}
 87
 en2num {one and twenty}
 21

but it still accepts more than it should, so it acts like a language-driven adding machine:

 en2num {one two three}
 6
 en2num {fifty fifty}
 100
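
The adding-machine behaviour is easier to see when the intermediate expression is returned instead of evaluated. The following is only a sketch: en2expr is a hypothetical name, and its body is a lightly tidied copy of the word loop in en2num whose last step returns the expression text rather than calling expr on it.

```tcl
# Hypothetical helper (not part of the original page): same word loop
# as en2num, but the built expression is returned, not evaluated.
proc en2expr s {
    array set dic {
        zero 0 one 1 two 2 three 3 four 4 five 5 six 6 seven 7
        eight 8 nine 9 ten 10 eleven 11 twelve 12 thirteen 13
        fifteen 15 eighteen 18 twenty 20 thirty 30 forty 40
        fifty 50 eighty 80 score 20 hundred 100 thousand 1000
        million 1000000 millions 1000000
    }
    regsub -all " and |-" [string trim $s] " " s
    set res [list]
    foreach i [split $s] {
        if {[info exists dic($i)]} {
            if {$dic($i) >= 1000 && [llength $res]} {
                regsub 000 $res "" res
                set res "($res)"
            }
            # multipliers get "*", everything else is simply added
            if {($dic($i) > 99 || $i == "score") && [llength $res]} {
                lappend res *
            } else {
                lappend res +
            }
            lappend res $dic($i)
        } elseif {[regexp {^(.+)teen$} $i -> t] && [info exists dic($t)]} {
            lappend res + [expr {$dic($t) + 10}]
        } elseif {[regexp {^(.+)ty$} $i -> t] && [info exists dic($t)]} {
            lappend res + [expr {$dic($t) * 10}]
        } else {
            return -code error "$s is not a number: $i"
        }
    }
    return $res
}

puts [en2expr {one two three}]         ;# + 1 + 2 + 3
puts [en2expr {fifty fifty}]           ;# + 50 + 50
puts [en2expr {four score and seven}]  ;# + 4 * 20 + 7
```

Every word that is not a multiplier is appended with a plain +, and nothing records which decimal place it occupies, so any sequence of valid words sums up.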

To fix this, one would need slots for ones, tens, and hundreds, each of which can be filled at most once, and would shift these slots for thousands, millions, ...
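
One way the slot idea might look is sketched below. This is not a version from the original discussion: the names en2num_slots and slotval are made up for illustration, the code assumes Tcl 8.4 or later (for eq and ne), and it is deliberately stricter than en2num. It drops zero, score, "one and twenty", and the "nineteen hundred" style, and it spells out all teens and tens instead of deriving them.

```tcl
# Hypothetical helper: an empty slot counts as 0.
proc slotval v {expr {$v eq "" ? 0 : $v}}

# Sketch of a slot-based parser; raises an error for anything
# that is not a well-formed number below one thousand million.
proc en2num_slots s {
    array set ones {
        one 1 two 2 three 3 four 4 five 5 six 6 seven 7 eight 8 nine 9
    }
    array set teens {
        ten 10 eleven 11 twelve 12 thirteen 13 fourteen 14 fifteen 15
        sixteen 16 seventeen 17 eighteen 18 nineteen 19
    }
    array set tens {
        twenty 20 thirty 30 forty 40 fifty 50 sixty 60 seventy 70
        eighty 80 ninety 90
    }
    array set scale {thousand 1000 million 1000000}

    regsub -all { and |-} [string trim $s] " " s
    set words [regexp -all -inline {\S+} $s]
    if {![llength $words]} {error "empty input"}

    set total 0
    set limit 1000000000   ;# the next scale word must be smaller than this
    set h ""; set t ""; set o ""   ;# slots of the current group, "" = unfilled
    foreach w $words {
        if {[info exists ones($w)]} {
            if {$o ne ""} {error "ones slot already filled: $w"}
            set o $ones($w)
        } elseif {[info exists teens($w)] || [info exists tens($w)]} {
            if {$t ne "" || $o ne ""} {error "tens slot already filled: $w"}
            if {[info exists teens($w)]} {
                set t $teens($w); set o 0   ;# a teen fills tens and ones
            } else {
                set t $tens($w)
            }
        } elseif {$w eq "hundred"} {
            # needs a lone ones value, which moves to the hundreds slot
            if {$h ne "" || $t ne "" || $o eq ""} {error "misplaced: $w"}
            set h [expr {$o * 100}]; set o ""
        } elseif {[info exists scale($w)]} {
            # shift the finished group; scales must strictly decrease
            if {"$h$t$o" eq "" || $scale($w) >= $limit} {error "misplaced: $w"}
            set group [expr {[slotval $h] + [slotval $t] + [slotval $o]}]
            set total [expr {$total + $group * $scale($w)}]
            set limit $scale($w)
            set h ""; set t ""; set o ""
        } else {
            error "not a number word: $w"
        }
    }
    expr {$total + [slotval $h] + [slotval $t] + [slotval $o]}
}

puts [en2num_slots {one hundred and twenty-three}]   ;# 123
puts [catch {en2num_slots {one two three}}]          ;# 1 (rejected)
puts [catch {en2num_slots {fifty fifty}}]            ;# 1 (rejected)
```

Because each slot refuses a second value and scale words must appear in decreasing order, the adding-machine inputs above are rejected, at the price of losing the archaic forms that en2num accepts.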


For translating numbers to natural languages, see also the Bag of number/time spellers