Version 6 of Frequency calculation

Updated 2002-08-19 08:53:35

Richard Suchenwirth 2002-08-01 - In Solving cryptograms, amazing ways of automated decryption were shown, and CL asked for accessories e.g. for computing character and digram (2-character sequence) frequencies. I could not withstand such a little challenge, so here goes. Following KBK, a dash (-) was taken to represent all non-letter characters. In the input the user has to take care of this himself, but the first and last character of the input text are also counted with added leading resp. trailing dash.

 proc freq12 string {
    # returns a pairlist: character/bigrams and the associated frequency
        set last -
        set n 0
        foreach char [split $string- ""] {
                incr n
                inc a($char)
                inc a($last$char)
                set last $char
        }
        set res {}
        foreach i [lsort [array names a]] {
                lappend res $i [expr {$a($i)*1./$n}]
        }
        set res
 }
 proc inc {varName {amount 1}} {
    # create a variable if not exists, then increment
    upvar 1 $varName var
    if {![info exists var]} {set var 0}
    incr var $amount
 }

 % freq12 "TCL-IS-A-SCRIPTING-LANGUAGE"
 - 0.178571428571 -A 0.0357142857143 -I 0.0357142857143 -L 0.0357142857143 
 -S 0.0357142857143 -T 0.0357142857143 A 0.107142857143 A- 0.0357142857143
 AG 0.0357142857143 AN 0.0357142857143 C 0.0714285714286 CL 0.0357142857143
 CR 0.0357142857143 E 0.0357142857143 E- 0.0357142857143 G 0.107142857143 
 G- 0.0357142857143 GE 0.0357142857143 GU 0.0357142857143 I 0.107142857143
 IN 0.0357142857143 IP 0.0357142857143 IS 0.0357142857143 L 0.0714285714286
 L- 0.0357142857143 LA 0.0357142857143 N 0.0714285714286 NG 0.0714285714286
 P 0.0357142857143 PT 0.0357142857143 R 0.0357142857143 RI 0.0357142857143
 S 0.0714285714286 S- 0.0357142857143 SC 0.0357142857143 T 0.0714285714286
 TC 0.0357142857143 TI 0.0357142857143 U 0.0357142857143 UA 0.0357142857143


[We could have a category just for "character-based management of textual corpora", or, perhaps more conventionally, "statistical text classification" or "n-grams".]

Oft-cited Damashek paper from '95: [L1 ]. Also [L2 ], [L3 ], and [L4 ].


DKF - Here's a version extended to copy with longer sequences too. It is a bit slower for the digram case, but much more flexible (not just in what it can produce, but also in what it can take as input.)

 proc freq {string {maxn 2}} {
    set len [string length $string]
    set s [string repeat - $maxn]
    append s [regsub -all {\W} [string toupper $string] -]$s
    for {set i $maxn} {$i<[string length $s]-1} {incr i} {
       for {set j [expr {$i-$maxn+1}]} {$j<=$i} {incr j} {
          set key [string range $s $j $i]
          if {[catch {incr a($key)}]} {
             set a($key) 1
          }
       }
    }
    foreach i [lsort [array names a]] {
       lappend res $i [expr {$a($i)*1./$len}]
    }
    set res
 }

It doesn't seem to give quite the same answers; not sure which code has the one-off error. (Both versions should be fine for substitution code decryption uses...)


tally: a string counter gadget | Arts and crafts of Tcl-Tk programming