Version 0 of Solving cryptograms

Updated 2002-07-31 17:40:19

# KBK - 30 July 2002

# Purpose: Demonstrate the method of simulated annealing as an approach to combinatorial optimization in Tcl. (Incidentally, show off a fun weekend project.)

# The code in Solving cryptarithms has led several people to ask me about solving cryptograms that are expressed in simple substitution ciphers. Let's see how we can do that in Tcl. In the course of doing so, we'll also develop a little framework that's applicable to a wide variety of combinatorial optimization problems.

# Let's begin by getting some preliminaries out of the way. We start with a table of the frequencies of English letters. The hyphen represents the space between words. (The frequencies were taken by counting occurrences in the full text of Dickens's David Copperfield.)

 array set F {
     - 0.197258 A 0.065493 B 0.012243 C 0.018083 D 0.037663 E 0.097011
     F 0.017431 G 0.016856 H 0.048503 I 0.058153 J 0.000786 K 0.007177
     L 0.030411 M 0.025241 N 0.054895 O 0.062158 P 0.013657 Q 0.000758
     R 0.046231 S 0.048502 T 0.071069 U 0.022746 V 0.007432 W 0.020790
     X 0.001163 Y 0.018124 Z 0.000167
 }

# We'll also want a table of the frequencies of digrams -- two-character combinations. Once again, the hyphen represents the space between words. The table is sparse; any pairs not in the table did not occur in David Copperfield.

 array set E {
     -A 0.022940 -B 0.008593 -C 0.006825 -D 0.006257 -E 0.003369 -F 0.005941
     -G 0.003215 -H 0.014539 -I 0.017355 -J 0.000515 -K 0.001265 -L 0.005247
     -M 0.013583 -N 0.004215 -O 0.011031 -P 0.004979 -Q 0.000466 -R 0.003857
     -S 0.015709 -T 0.025998 -U 0.002576 -V 0.001063 -W 0.014617 -X 0.000024
     -Y 0.003075 -Z 0.000003
     A- 0.005115 AA 0.000001 AB 0.001197 AC 0.001907 AD 0.003783 AE 0.000009
     AF 0.000604 AG 0.001207 AH 0.000174 AI 0.003552 AJ 0.000006 AK 0.001081
     AL 0.003841 AM 0.001862 AN 0.012617 AO 0.000021 AP 0.001044 AR 0.005429
     AS 0.007007 AT 0.008536 AU 0.000985 AV 0.002215 AW 0.001161 AX 0.000029
     AY 0.002053 AZ 0.000057
     B- 0.000060 BA 0.000728 BB 0.000087 BC 0.000001 BD 0.000009 BE 0.004714
     BH 0.000001 BI 0.000284 BJ 0.000134 BL 0.001347 BM 0.000010 BN 0.000002
     BO 0.001418 BR 0.000699 BS 0.000242 BT 0.000121 BU 0.001528 BV 0.000002
     BW 0.000001 BY 0.000854
     C- 0.000123 CA 0.002347 CB 0.000002 CC 0.000371 CD 0.000002 CE 0.002848
     CH 0.003212 CI 0.000620 CK 0.001114 CL 0.000637 CN 0.000001 CO 0.003868
     CP 0.000002 CQ 0.000036 CR 0.000742 CS 0.000014 CT 0.001494 CU 0.000564
     CY 0.000085
     D- 0.024222 DA 0.000943 DB 0.000003 DC 0.000003 DD 0.000528 DE 0.003774
     DF 0.000038 DG 0.000220 DH 0.000011 DI 0.002145 DJ 0.000009 DK 0.000025
     DL 0.000592 DM 0.000070 DN 0.000266 DO 0.002622 DR 0.000554 DS 0.000839
     DT 0.000003 DU 0.000297 DV 0.000078 DW 0.000031 DY 0.000392
     E- 0.034602 EA 0.005039 EB 0.000101 EC 0.001696 ED 0.007920 EE 0.003237
     EF 0.000935 EG 0.000862 EH 0.000202 EI 0.000796 EJ 0.000036 EK 0.000115
     EL 0.003764 EM 0.001941 EN 0.007652 EO 0.000138 EP 0.001059 EQ 0.000120
     ER 0.014433 ES 0.005485 ET 0.002700 EU 0.000018 EV 0.001696 EW 0.000512
     EX 0.000892 EY 0.001045 EZ 0.000015
     F- 0.006693 FA 0.001119 FD 0.000002 FE 0.001462 FF 0.000713 FI 0.001599
     FL 0.000315 FM 0.000001 FN 0.000002 FO 0.003055 FP 0.000005 FR 0.001081
     FS 0.000006 FT 0.000646 FU 0.000707 FY 0.000023
     G- 0.006674 GA 0.000925 GB 0.000001 GE 0.001734 GG 0.000569 GH 0.002312
     GI 0.000615 GL 0.000422 GM 0.000011 GN 0.000383 GO 0.001645 GP 0.000002
     GR 0.000906 GS 0.000248 GT 0.000058 GU 0.000325 GY 0.000026 GZ 0.000001
     H- 0.005420 HA 0.009784 HB 0.000028 HC 0.000001 HD 0.000019 HE 0.020033
     HF 0.000044 HG 0.000013 HH 0.000002 HI 0.006512 HL 0.000053 HM 0.000060
     HN 0.000009 HO 0.003884 HQ 0.000002 HR 0.000348 HS 0.000050 HT 0.001622
     HU 0.000373 HW 0.000008 HY 0.000239
     I- 0.007305 IA 0.000634 IB 0.000246 IC 0.002606 ID 0.003148 IE 0.002182
     IF 0.001460 IG 0.001539 IH 0.000004 II 0.000023 IK 0.000481 IL 0.002123
     IM 0.002284 IN 0.015038 IO 0.002252 IP 0.000386 IQ 0.000006 IR 0.001880
     IS 0.005908 IT 0.007584 IU 0.000035 IV 0.000893 IW 0.000002 IX 0.000086
     IZ 0.000052
     J- 0.000014 JA 0.000108 JC 0.000001 JE 0.000199 JI 0.000058 JO 0.000207
     JU 0.000200
     K- 0.002221 KA 0.000021 KB 0.000002 KC 0.000003 KD 0.000002 KE 0.002250
     KF 0.000163 KG 0.000005 KH 0.000008 KI 0.001170 KL 0.000153 KM 0.000006
     KN 0.000898 KO 0.000006 KP 0.000001 KR 0.000001 KS 0.000215 KU 0.000001
     KW 0.000011 KY 0.000041
     L- 0.004200 LA 0.002130 LB 0.000024 LC 0.000035 LD 0.003117 LE 0.005137
     LF 0.000850 LG 0.000015 LI 0.003453 LK 0.000311 LL 0.003912 LM 0.000115
     LN 0.000030 LO 0.002623 LP 0.000096 LR 0.000044 LS 0.000402 LT 0.000438
     LU 0.000361 LV 0.000115 LW 0.000194 LX 0.000003 LY 0.002807
     M- 0.003555 MA 0.002704 MB 0.000438 MC 0.000001 MD 0.000002 ME 0.006406
     MF 0.000079 MG 0.000001 MI 0.002585 ML 0.000028 MM 0.000373 MN 0.000057
     MO 0.002059 MP 0.000774 MR 0.001706 MS 0.000410 MT 0.000004 MU 0.000894
     MY 0.003167
     N- 0.014547 NA 0.000773 NB 0.000038 NC 0.001723 ND 0.010208 NE 0.004940
     NF 0.000351 NG 0.007517 NH 0.000049 NI 0.001510 NJ 0.000058 NK 0.000697
     NL 0.000453 NM 0.000021 NN 0.000556 NO 0.003858 NP 0.000018 NQ 0.000087
     NR 0.000019 NS 0.001559 NT 0.004613 NU 0.000200 NV 0.000153 NW 0.000037
     NX 0.000036 NY 0.000872 NZ 0.000001
     O- 0.009321 OA 0.000455 OB 0.000458 OC 0.000716 OD 0.001031 OE 0.000155
     OF 0.005242 OG 0.000206 OH 0.000209 OI 0.000592 OJ 0.000023 OK 0.001216
     OL 0.001511 OM 0.003392 ON 0.008403 OO 0.002924 OP 0.001198 OQ 0.000006
     OR 0.006493 OS 0.001375 OT 0.003487 OU 0.009140 OV 0.000864 OW 0.003366
     OX 0.000066 OY 0.000290 OZ 0.000017
     P- 0.001283 PA 0.001463 PB 0.000003 PE 0.002990 PF 0.000001 PH 0.000167
     PI 0.000684 PK 0.000008 PL 0.001126 PM 0.000002 PN 0.000001 PO 0.001696
     PP 0.001249 PR 0.001637 PS 0.000268 PT 0.000398 PU 0.000513 PW 0.000010
     PY 0.000160
     Q- 0.000001 QU 0.000757 QY 0.000001
     R- 0.014025 RA 0.002704 RB 0.000096 RC 0.000329 RD 0.001292 RE 0.010386
     RF 0.000696 RG 0.000258 RH 0.000114 RI 0.002774 RJ 0.000004 RK 0.000533
     RL 0.000498 RM 0.000603 RN 0.001034 RO 0.003252 RP 0.000201 RR 0.000685
     RS 0.002181 RT 0.001965 RU 0.000585 RV 0.000345 RW 0.000135 RY 0.001535
     S- 0.018877 SA 0.003138 SB 0.000036 SC 0.000510 SD 0.000021 SE 0.005426
     SF 0.000076 SG 0.000029 SH 0.003544 SI 0.002495 SK 0.000322 SL 0.000308
     SM 0.000249 SN 0.000105 SO 0.002709 SP 0.000965 SQ 0.000030 SR 0.000006
     SS 0.002322 ST 0.005491 SU 0.001528 SW 0.000192 SY 0.000125
     T- 0.021340 TA 0.002065 TB 0.000003 TC 0.000198 TE 0.005652 TF 0.000101
     TG 0.000001 TH 0.020133 TI 0.004103 TJ 0.000001 TL 0.001343 TM 0.000079
     TN 0.000085 TO 0.008161 TP 0.000005 TR 0.001880 TS 0.000933 TT 0.002187
     TU 0.001132 TW 0.000487 TX 0.000002 TY 0.001172 TZ 0.000006
     U- 0.002029 UA 0.000329 UB 0.000342 UC 0.000938 UD 0.000300 UE 0.000416
     UF 0.000086 UG 0.001192 UI 0.000644 UK 0.000002 UL 0.002572 UM 0.000528
     UN 0.002710 UO 0.000029 UP 0.001343 UQ 0.000002 UR 0.003588 US 0.002380
     UT 0.003290 UV 0.000002 UX 0.000007 UY 0.000014 UZ 0.000004
     V- 0.000014 VA 0.000327 VE 0.005770 VI 0.000872 VO 0.000321 VU 0.000004
     VY 0.000124
     W- 0.002419 WA 0.004632 WB 0.000470 WC 0.000032 WD 0.000019 WE 0.002980
     WF 0.000030 WH 0.003796 WI 0.003309 WK 0.000005 WL 0.000137 WM 0.000001
     WN 0.000830 WO 0.001691 WP 0.000001 WR 0.000229 WS 0.000163 WT 0.000018
     WU 0.000012 WY 0.000015 WZ 0.000001
     X- 0.000111 XA 0.000064 XB 0.000001 XC 0.000187 XE 0.000077 XF 0.000006
     XH 0.000015 XI 0.000117 XL 0.000005 XO 0.000004 XP 0.000318 XQ 0.000003
     XT 0.000217 XU 0.000009 XV 0.000006 XX 0.000019 XY 0.000002
     Y- 0.013085 YA 0.000106 YB 0.000074 YC 0.000002 YD 0.000003 YE 0.000777
     YF 0.000015 YG 0.000008 YH 0.000010 YI 0.000259 YK 0.000001 YL 0.000017
     YM 0.000049 YN 0.000004 YO 0.002682 YP 0.000011 YR 0.000013 YS 0.000785
     YT 0.000198 YW 0.000026 YZ 0.000001
     Z- 0.000005 ZA 0.000014 ZE 0.000089 ZI 0.000028 ZL 0.000008 ZO 0.000005
     ZU 0.000001 ZY 0.000008 ZZ 0.000010
 }

# In the method of simulated annealing, we are attempting to come up with some configuration of things that has the lowest cost (or, equivalently, the highest value). The classical example that everyone uses to illustrate the technique is the Traveling Salesman Problem, where the configuration is the order in which the salesman visits a set of cities and the cost is the distance that he travels - with the objective to minimize his distance while visiting each city exactly once.

# For solving cryptograms, we'll use as a configuration the argument that will be passed to string map to decrypt the text. For a cost function, we'll do some simple statistics on the digrams in the candidate for deciphered text. For each pair of letters, we'll calculate the absolute value of the difference between D(ab), the frequency of the digram in the decrypted text, and E(ab), the frequency of the digram in David Copperfield. The sum of these absolute differences gives our cost.

 proc cost { mapping } {
     global D E
     set x 1.0
     foreach { pair weight } [array get D] {
         set pair2 [string map $mapping $pair]
         if { [info exists E($pair2)] } {
             set x [expr { $x - $E($pair2) + abs( $D($pair) - $E($pair2) ) }]
         } else {
             set x [expr { $x + $D($pair) }]
         }
     }
     return $x
 }

# In order to work with the letter frequencies in the ciphertext, we need to count them first. The 'freq' procedure does the counting. Letter frequencies go in the array C, and digram frequencies into D. All strings of non-letters are turned into single hyphens. There are no doubt slicker ways to do this, but this works adequately well.

 proc freq { txt } {
     global C D
     foreach c { - A B C D E F G H I J K L M N O P Q R S T U V W X Y Z} {
         set C($c) 0
     }
     set l [string length $txt]
     set last -
     set n 0
     for { set i 0 } { $i < $l } { incr i } {
         set c [string index $txt $i]
         if { [string is alpha $c] } {
             incr C($c)
             set d $last$c
             if { [info exists D($d)] } {
                 incr D($d)
             } else {
                 set D($d) 1
             }
             set last $c
             incr n
         } elseif { [string is alpha $last] } {
             set c -
             incr C($c)
             set d $last$c
             if { [info exists D($d)] } {
                 incr D($d)
             } else {
                 set D($d) 1
             }
             set last $c
             incr n
         }
     }
     foreach { x count } [array get C] {
         set C($x) [expr { double($count) / $n }]
     }
     foreach { x count } [array get D] {
         set D($x) [expr { double($count) / $n }]
     }
     return $n
 }

# Simulated annealing works by doing the following:

# 1. Choose an initial configuration and compute its cost.

# 2. Repeatedly make random changes to the configuration. For each changed configuration, compute its cost, and the difference delta between its cost and the cost of the current configuration.

# 3. Choose a random number between 0 and 1 inclusive. If this random number is less than exp( delta / T ), for a parameter T, let the new configuration replace the current one.

# While all this is being done, T is reduced according to a predetermined schedule.

# The following procedure does this in a fairly generic manner. It accepts the initial configuration, a schedule that is alternating values for T and number of iterations to keep it there, and three procedures:

# costProc - Accepts a configuration and returns its cost. # changeProc - Makes a random change to a configuration and returns the next configuration # reportProc - Reports on the progress of the optimization when T is reduced.

# Note that simulated annealing is a general technique with many applications. In addition to cryptanalysis, I've used it in several scheduling problems, and as a way to arrange the nodes of complicated message sequence charts so as to minimize the line crossings. It's a good thing to have in your toolbox when you need a quick (to program - it's slow to run) solution to a hard optimization problem.

 proc anneal { configuration schedule
               costProc changeProc { reportProc concat } } {

     set currCost [uplevel 1 $costProc [list $configuration]]
     foreach { T maxIter } $schedule {
         set Nsucc 0
         for { set j 0 } { $j < $maxIter } { incr j } {
             set nextConf [uplevel 1 $changeProc [list $configuration]]
             set nextCost [uplevel 1 $costProc [list $nextConf]]
             set delta [expr { $nextCost - $currCost }]
             if { rand() < exp ( -$delta / $T ) } {
                 set configuration $nextConf
                 set currCost $nextCost
                 incr Nsucc
                 if { $Nsucc >= $maxIter / 10 } {
                     break
                 }
             }
         }
         eval $reportProc [list $T $currCost $Nsucc $j $configuration]
     }

     return $configuration

 }

# In order for this to work, we need a procedure to make random changes. Each change will simply be interchanging two letters in the permutation. (We need the lset command that first appeared in Tcl 8.4 - simulate it if it isn't there.)

 if { [string compare lset [info commands lset]] } {
     proc K { x y } { set x }
     proc lset { listVar n element } {
         upvar 1 $listVar list
         return [set list [lreplace [K $list [set list {}]] $n $n $element]]
     }
 }
 proc change { mapping } {
     set i1 [expr { 2 * int( 27. * rand() ) + 1 }]
     set i2 [expr { 2 * int( 26. * rand() ) + 1 }]
     if { $i2 >= $i1 } {
         incr i2 2
     }
     set t [lindex $mapping $i1]
     lset mapping $i1 [lindex $mapping $i2]
     lset mapping $i2 $t
     return $mapping
 }

# We also need to come up with an initial guess about the mapping. We do this by letting the commonest symbol in the ciphertext represent the space, the next the E, the next the T, and so on in decreasing in order of frequency.

 proc guess1 {} {
     global C F
     foreach { letter freq } [array get C] {
         lappend freq1 [list $letter $freq]
     }
     foreach { letter freq } [array get F] {
         lappend freq2 [list $letter $freq]
     }
     foreach \
         pair1 [lsort -real -decreasing -index 1 $freq1] \
         pair2 [lsort -real -decreasing -index 1 $freq2] {
             lappend map [lindex $pair1 0] [lindex $pair2 0]
         }
     return $map
 }

# We'll also have a procedure report intermediate results:

 proc report { T cost Nsucc j map } {
     global cyphertext
     puts " T=$T. $Nsucc configurations out of $j were accepted."
     puts " Cost of current solution is $cost:"
     puts [string map $map $cyphertext]
 }

# OK, we're ready to go. Here's a cryptogram made by enciphering a paragraph taken from a popular web page advocating Tcl:

 set cyphertext [string toupper {
  TDSJQUJOH JT B UFDIOPMPHZ XIPTF UJNF IBT DPNF. TDSJQUJOH MBOHVBHFT
  IBWF FYJTUFE GPS EFDBEFT CVU UIFJS SPMF IBT JODSFBTFE ESBNBUJDBMMZ
  PWFS UIF MBTU UFO ZFBST. UIJT QBHF FYQMBJOT XIZ TDSJQUJOH MBOHVBHFT
  TVDI BT UDM XJMM CF VTFE GPS NBOZ PG UIF NPTU JNQPSUBOU BQQMJDBUJPOT
  PG UIF OFYU DFOUVSZ BOE XIZ ZPV TIPVME DPOTJEFS VTJOH B TDSJQUJOH
  MBOHVBHF GPS ZPVS GVUVSF BQQMJDBUJPOT.
 }]

# We calculate its letter and digram frequencies:

 freq $cyphertext

# We come up with an initial guess at the key:

 set key [guess1]

puts " Initial guess at key: $key"

 puts " Cost of initial guess is [cost $key]:"
 puts [string map $key $cyphertext]

# And we turn simulated annealing loose to improve it:

 set key [anneal $key {
     0.01 2500
     0.007 2500
     0.005 2500
     0.001 2500
     0.0005 2500
 } cost change report]
 puts " Recovered key; $key"

 exit

Running this program gives the following output. The initial guess has at least correctly identified the space and the 'E'. Successive rounds get ever better as the parameter T is reduced. (Note that 100% of the plaintext was recovered, unusual for a cryptogram this short.) It's bad, however, to start with too small a value of T. Essentially, the value of T balances the program's ability to "make mistakes" early on with its propensity to get stuck in a local optimum and make no further progress.

 Initial guess at key: - - F E B T T A U O J I O N P H S S D R M D I L V M H U Q W E Y Z C G F N G X P Y B C V W K A X R J K Q L Z
 Cost of initial guess is 1.19231564611:

  ARSIWOINU IA T OERLNHDHUC PLHAE OIGE LTA RHGE. ARSIWOINU DTNUMTUEA
  LTKE EBIAOEY FHS YERTYEA VMO OLEIS SHDE LTA INRSETAEY YSTGTOIRTDDC
  HKES OLE DTAO OEN CETSA. OLIA WTUE EBWDTINA PLC ARSIWOINU DTNUMTUEA
  AMRL TA ORD PIDD VE MAEY FHS GTNC HF OLE GHAO IGWHSOTNO TWWDIRTOIHNA
  HF OLE NEBO RENOMSC TNY PLC CHM ALHMDY RHNAIYES MAINU T ARSIWOINU
  DTNUMTUE FHS CHMS FMOMSE TWWDIRTOIHNA.

 T=0.01. 250 configurations out of 1551 were accepted.
 Cost of current solution is 0.82446552815:

  SGRIQTINL IS A TEGHNOMOLD WHOSE TIPE HAS GOPE. SGRIQTINL MANLUALES
  HAVE EXISTEY FOR YEGAYES CUT THEIR ROME HAS INGREASEY YRAPATIGAMMD
  OVER THE MAST TEN DEARS. THIS QALE EXQMAINS WHD SGRIQTINL MANLUALES
  SUGH AS TGM WIMM CE USEY FOR PAND OF THE POST IPQORTANT AQQMIGATIONS
  OF THE NEXT GENTURD ANY WHD DOU SHOUMY GONSIYER USINL A SGRIQTINL
  MANLUALE FOR DOUR FUTURE AQQMIGATIONS.

 T=0.007. 224 configurations out of 2500 were accepted.
 Cost of current solution is 0.802785597855:

  SPRIVTING IS A TEPHNOLOGY MHOSE TICE HAS POCE. SPRIVTING LANGUAGES
  HAKE EQISTED WOR DEPADES BUT THEIR ROLE HAS INPREASED DRACATIPALLY
  OKER THE LAST TEN YEARS. THIS VAGE EQVLAINS MHY SPRIVTING LANGUAGES
  SUPH AS TPL MILL BE USED WOR CANY OW THE COST ICVORTANT AVVLIPATIONS
  OW THE NEQT PENTURY AND MHY YOU SHOULD PONSIDER USING A SPRIVTING
  LANGUAGE WOR YOUR WUTURE AVVLIPATIONS.

 T=0.005. 149 configurations out of 2500 were accepted.
 Cost of current solution is 0.76445566756:

  SCRIKTING IS A TECHNOLOGY WHOSE TIME HAS COME. SCRIKTING LANGUAGES
  HAPE EQISTED FOR DECADES BUT THEIR ROLE HAS INCREASED DRAMATICALLY
  OPER THE LAST TEN YEARS. THIS KAGE EQKLAINS WHY SCRIKTING LANGUAGES
  SUCH AS TCL WILL BE USED FOR MANY OF THE MOST IMKORTANT AKKLICATIONS
  OF THE NEQT CENTURY AND WHY YOU SHOULD CONSIDER USING A SCRIKTING
  LANGUAGE FOR YOUR FUTURE AKKLICATIONS.

 T=0.001. 59 configurations out of 2500 were accepted.
 Cost of current solution is 0.739569876676:

  SCRIPTING IS A TECHNOLOGY WHOSE TIME HAS COME. SCRIPTING LANGUAGES
  HAVE EXISTED FOR DECADES BUT THEIR ROLE HAS INCREASED DRAMATICALLY
  OVER THE LAST TEN YEARS. THIS PAGE EXPLAINS WHY SCRIPTING LANGUAGES
  SUCH AS TCL WILL BE USED FOR MANY OF THE MOST IMPORTANT APPLICATIONS
  OF THE NEXT CENTURY AND WHY YOU SHOULD CONSIDER USING A SCRIPTING
  LANGUAGE FOR YOUR FUTURE APPLICATIONS.

 T=0.0005. 42 configurations out of 2500 were accepted.
 Cost of current solution is 0.739569876676:

  SCRIPTING IS A TECHNOLOGY WHOSE TIME HAS COME. SCRIPTING LANGUAGES
  HAVE EXISTED FOR DECADES BUT THEIR ROLE HAS INCREASED DRAMATICALLY
  OVER THE LAST TEN YEARS. THIS PAGE EXPLAINS WHY SCRIPTING LANGUAGES
  SUCH AS TCL WILL BE USED FOR MANY OF THE MOST IMPORTANT APPLICATIONS
  OF THE NEXT CENTURY AND WHY YOU SHOULD CONSIDER USING A SCRIPTING
  LANGUAGE FOR YOUR FUTURE APPLICATIONS.

 Recovered key; - - F E B A T S U T J I O N P O S R D C M L I H V U H G Q P E D Z Y G F N M X W Y X C B W V A Z R K K J L Q

Alas, the statistical techniques presented here don't work well on newspaper cryptograms, which are too short for the frequency statistics to be meaningful.

Category Cryptography