[Richard Suchenwirth] 2002-08-01 - In [Solving cryptograms], amazing ways of automated decryption were shown, and [CL] asked for accessories e.g. for computing character and digram (2-character sequence) frequencies. I could not withstand such a little challenge, so here goes. Following [KBK], a dash (-) was taken to represent all non-letter characters. In the input the user has to take care of this himself, but the first and last character of the input text are also counted with added leading resp. trailing dash. proc freq12 string { # returns a pairlist: character/bigrams and the associated frequency set last - set n 0 foreach char [split $string- ""] { incr n inc a($char) inc a($last$char) set last $char } set res {} foreach i [lsort [array names a]] { lappend res $i [expr {$a($i)*1./$n}] } set res } proc inc {varName {amount 1}} { # create a variable if not exists, then increment upvar 1 $varName var if {![info exists var]} {set var 0} incr var $amount } ---- % freq12 "TCL-IS-A-SCRIPTING-LANGUAGE" - 0.178571428571 -A 0.0357142857143 -I 0.0357142857143 -L 0.0357142857143 -S 0.0357142857143 -T 0.0357142857143 A 0.107142857143 A- 0.0357142857143 AG 0.0357142857143 AN 0.0357142857143 C 0.0714285714286 CL 0.0357142857143 CR 0.0357142857143 E 0.0357142857143 E- 0.0357142857143 G 0.107142857143 G- 0.0357142857143 GE 0.0357142857143 GU 0.0357142857143 I 0.107142857143 IN 0.0357142857143 IP 0.0357142857143 IS 0.0357142857143 L 0.0714285714286 L- 0.0357142857143 LA 0.0357142857143 N 0.0714285714286 NG 0.0714285714286 P 0.0357142857143 PT 0.0357142857143 R 0.0357142857143 RI 0.0357142857143 S 0.0714285714286 S- 0.0357142857143 SC 0.0357142857143 T 0.0714285714286 TC 0.0357142857143 TI 0.0357142857143 U 0.0357142857143 UA 0.0357142857143 ---- [[We could have a category just for "character-based management of textual corpora", or, perhaps more conventionally, "statistical text classification" or "n-grams".]] Oft-cited Damashek paper from '95: [http://gnowledge.sourceforge.net/damashek-ngrams.pdf]. Also [http://www.sscnet.ucla.edu/geog/gessler/167-2001/ngrams.htm], [http://www.cs.umbc.edu/www/research/projects/telltale.html], and [http://www.cs.umbc.edu/~mayfield/ngrams.html]. ---- ''[DKF] -'' Here's a version extended to copy with longer sequences too. proc freq {string {maxn 2}} { # returns a pairlist: character/n-grams and the associated frequency set last {} for {set n 1} {$n < $maxn} {incr n} { append last - append string - } set n 0 foreach char [split $string ""] { incr n append last $char for {set i 0} {$i<[string length $last]} {incr i} { set key [string range $last $i end] if {[catch {incr a($key)}]} { set a($key) 1 } } set last [string range $last 1 end] } set res {} foreach i [lsort [array names a]] { lappend res $i [expr {$a($i)*1./$n}] } set res } ---- [tally: a string counter gadget] | [Arts and crafts of Tcl-Tk programming]