Markov

I've written what I think is a slightly improved version of the Markov script. I've made it so that you can specify the length of the links between the segments in the output chain as well as the length of the segments themselves. This helps the script narrow down the data even more, allowing it to produce more meaningful output when it has a lot of data.

I've used lists to do most of the processing. I think this is probably faster, but it does mean that the script sometimes has problems with quotation marks in the data. I haven't come up with a good solution to this problem yet.

You can also specify the length of the output chain. The program will not be able to produce a full chain, however, if it selects a segment and then cannot find another segment to follow it.

I haven't written any file i/o code. The script reads and write to stdio. You'll need to use command line pipes to read data in from a file and output it to a file.

I've included a brief but perhaps confusing explanation of the program logic (which is basically the same as that used in David's script below) in comments in the code.

JMc

 #markov.tcl -- a script for processing text with markov chains
 # and producing statistically valid (even if nonsensical) output
 #written by James McElvenny <[email protected]> 20050807, with inspiration 
 # from the markov.tcl script written by David N. Welton <[email protected]>
 #this script reads and writes data to stdio. to read and write to files,
 #use command line pipes.
 #command line syntax: 
 #tclsh markov.tcl <chain_length> <link_length> <output_length> < <input_file> > <output_file>

 #EXPLANATION OF PROGRAM LOGIC
 #
 #The script breaks up the input string into segments. The length of the segments
 #is stored in the variable segment_length. Each segment must overlap by at least 
 #one word with its neighbouring segments. This overlapping part of the segment 
 #is the link. The link enables to script to determine which segments can be 
 #joined together. The length of links is stored in the variable link_length. For 
 #example, the input string 'In the sea. In the sky.' could be broken up into 
 #segments that are two words long with a link one word long. The result would be 
 #{In the} {the sea.} {sea. In} {In the} {the sky.} The script then counts the 
 #frequency of occurrence of each of the segments. In the example above, {In the} 
 #occurs twice and all the other segments occur only once. These frequencies are 
 #stored in a hash table, with the segments used as the index. The script then 
 #generates output by taking the first segment in the input and seeing which 
 #segments have a link to the first segment and so could possibly follow it. 
 #Using the example data above, the first segment would be {In the}, which could 
 #be followed by either {the sea.} or {the sky.}, since both of these segments 
 #have the link 'the'. The script will then choose one of these possibilities, 
 #giving preference to the one with the highest frequency. In this case both 
 #segments have the same frequency so that have an equal chance of being chosen. 
 #The script then takes the last words of the output chain as the link to find 
 #the next segment. The script continues until the output chain reaches the 
 #length specified in output_length or until it cannot find any more links. The 
 #shorter the segments and the links, the more random the output will be. You can 
 #make the output more closely resemble the input by increasing the size of the 
 #segments and links. If the values are too high, the output will be identical to 
 #the input.

 set segment_length [expr [lindex $argv 0] -1]
 set link_length [expr [lindex $argv 1] -1]
 set output_length [expr [lindex $argv 2] -1]
 if {$argc != 3 || $link_length >= $segment_length} {
         puts stderr "\n\nwrong number of command line arguments or link_length is greater than or equal to segment_length!\n\n"
         error {command line arguments invalid}        
 }

 set input_data [split [read stdin]]

 #initialise variables: counter = count number of words in current segment; current_segment = current segment; output = output to user; hash = table of segments and their frequencies
 set counter -1
 set current_segment ""
 set output ""
 array set hash {}

 #INPUT
 for {set x 0} {$x <= [llength $input_data]} {incr x} {
         set current_word [lindex $input_data $x]

         #if segment is not finished, add to it; if finished, store in hash table and make a new one
         if {$counter < $segment_length} {
                 lappend current_segment $current_word
                 incr counter
         } else {
                 #add current_segment to hash or increase frequency listed in hash
                 if {[info exists hash($current_segment)]} {
                         incr hash($current_segment)
                 } else {
                         set hash($current_segment) 1
                 }
                 #set the seed for output
                 if {$output == ""} {
                         set output [list $current_segment]
                 }

                 #start next segment, incorporating the current link
                 set current_segment [lrange $current_segment end-$link_length end]
                 lappend current_segment $current_word
                 set counter [expr $link_length +1]
         }
 }

 #OUTPUT
 set total 0
 for {set x 0} {$x <= $output_length} {incr x} {
         #get the current link from the end of the output chain
         set current_link [lrange [join $output] end-$link_length end]
         #find all segments in hash that start with link
         set possibles [array get hash "$current_link*"]
         #calculate probabilties
         foreach {segment number} $possibles {
                 incr total $number
         }
         set walk [expr {$total *rand()}]
         set total 0
         #randomly select one chain, taking into account probabilities
         foreach {segment number} $possibles {
                 incr total $number
                 if {$walk <= $total} {
                         lappend output [lrange [join $segment] [expr $link_length+1] end]
                         break
                 }
         }
 }

 puts stdout [join $output]

This is a half-baked Markov chain toy which mangles text. Usage:

 ./markov.tcl [http://some.url.com/file...] [file]

AK: Any place with more info on how you are doing the markov chains ?

 # markov.tcl -- by David N. Welton <[email protected]>

 package require http

 # getdata -- get data either from the web, or file system.

 proc getdata {} {
    global argv
    set flnm [lindex $argv 0]
    if { $flnm == "" } {
        puts stderr "Please supply a filename!"
        exit 1
    }

    # If the file starts with http://, fetch it from the web.
    if { [ string compare -nocase -length 7 http:// $flnm ] } {
        set page [::http::geturl $flnm]
        set data [http::data $page]
        regsub -all {<[^>]*>} $data "" data
        regsub -all {&[^;]*} $data "" data
    } else {
        set fl [open $flnm r]
        set data [read $fl]
        close $fl
    }
    return $data
 }

 # Markov -- run markov chains of length 'chainlength' on data.

 proc Markov {data chainlength} {
    set idx 0
    set wd ""
    incr chainlength -1
    for {set i 0} {$i < $chainlength} {incr i} {lappend prev {}}
    set save ""
    array set hash {}

    # This loop could probably be rewritten using lists somehow,
    # making it faster.

    while { $idx < [string length $data] } {
        set chr [string index $data $idx]
        if { [string is alpha $chr] } {
            lappend prev $wd
            set prev [lrange $prev 1 end]
            set ws [string wordstart $data $idx]
            set we [string wordend $data $idx]
            set wd [string tolower [string range $data $ws [expr $we - 1]]]

            set key [concat $prev $wd]
            if { [info exists hash($key)] } {
                incr hash($key)
            } else {
                set hash($key) 1
            }
            set idx $we
        } elseif { $chr == "." || $chr == "!" || $chr == "?" || $chr == "," } {
            lappend prev $wd
            set prev [lrange $prev 1 end]
            set wd $chr
            set key [concat $prev $wd]
            if { [info exists hash($key)] } {
                incr hash($key)
            } else {
                set hash($key) 1
            }
        }
        if { $save == "" } {
            for {set i 0} {$i < $chainlength} {incr i} {
                if { [lindex $prev $i] == "" } {
                    break
                } else {
                    set save $prev
                }
            }
        }

        incr idx
    }

    foreach {k v} [array get hash] {
        lappend pairs [list $k $v]
    }
    set pairs [lsort -integer -index 1 $pairs]
    set l [llength $pairs]
    set i 0

    set wd $save
    set oput "$wd "
    while { $i < $l } {
        set possibles [array get hash "$wd *"]
        set tot 0
        foreach {k v} $possibles {
            incr tot $v
        }
        set walk [expr {$tot * rand()}]
        set tot 0
        foreach {k v} $possibles {
            incr tot $v
            if { $walk <= $tot } {
                set use [string range $k [expr [string last " " $k] + 1] end]
 #              puts "possibles were '$possibles', using '$use' because of walk $walk and $tot"
                break
            }
        }
        append oput "$use "
        lappend wd $use
        set wd [lrange $wd 1 end]
        incr i
    }
    return $oput
 }

 puts [Markov [getdata] 4]