Pattern Searches on Blocks of Text

WJG (19/06/17) Recently I've been doing a lot of work in the area of NLP and corpus linguistics using Tcl. Whilst I find the Tcl string really useful there are times when I want to probe a block of 'haystack' text closely by searching for patterns within the text and extracting the position of the 'needles' in the 'haystack'. The following is what I have come up with. I expect that there are ways of optimizing or enhancing it, maybe someone has done a better job elsewhere.

#!/bin/sh
# the next line restarts using tclsh \
exec tclsh "$0" "$@"

#---------------
# Perform multiple search on text block.
#---------------
# Args:
#        needle          list of search items, wildcards permitted
#        haystack        block of text upon to search
# Options:
#        -exact          (default: 1)    match all items in the needle list        
#        -nocase         (default: 0)    ignore case
#        -token          (default: "")   split needle string by token
#        -verbose        (default: 0)    return full details of the search matches
# Returns:
#        * Various patterns based upon options settings
#        * Default settings would result in returning 1 (true) or 0 (false) 
#                whilst matching an exact needle pattern within the haystack string.
#        * Setting -verbose to 1, will result in a complete listing of the 
#                occurrences of the needle in the haystack. 
#        * Each matched item will be reported in the form of a duple, where the 
#                first element contains the character position and word number of the 
#                matched item in the haystack, followed by the matched item itself.
#                e.g. {idx pos} match
#
proc pattern_search {needles haystack args} {
        
        # set defaults and assign options values from args
        array set opts [list -nocase 0 -token "" -verbose 0 -exact 1] 
        array set opts $args
                
        # tokenize needle string                
        if { $opts(-token) != "" } { set needle [split $needles $opts(-token)] }
        
        set i 0 ;# counter for successful matches per line
        set word_number 0 ;# word number
        set needle_tally ""        ;# tally of which needle patterns have been found
        set found ""
        foreach wrd $haystack {
                foreach sub $needles {
                        if { $opts(-nocase) } {
                                set id [string match -nocase $sub $wrd ]
                        } else {
                                set id [string match $sub $wrd ]
                        }
                        if { $id == 1 } { 
                                lappend needle_tally $sub 
                                lappend found $word_number $wrd
                                incr i 
                                } 
                }
                incr word_number
        }
        
        set needle_tally [lsort -unique $needle_tally]
        set needle [lsort -unique $needles]

        # exact match for occurrences of needles in haystack
        if { $opts(-exact) } {
                if { $needle_tally == $needle } {
                        if { $opts(-verbose) } {  
                                return [pattern_search_verbose $found $haystack]
                        } 
                } else { set i 0 } 
        } 
        
        if { $opts(-verbose) } {
                return [pattern_search_verbose $found $haystack] }  

        # not an exact match
        if {$i >= 1} { return 1 }
         
        # no matches whatsoever
        return 0 
}

#---------------
# get a detailed list of the needles found in the haystack
#---------------
# Args:
#        found        list of matched needles
#        haystack     text examined
# Returns:
#        formatted list of found needles in haystack
#        {character-position word-number} needle
#
proc pattern_search_verbose { found haystack } {
        set res ""
        set idx -1
        foreach {pos match} $found {
                set idx [string first $match $haystack $idx+1]
                lappend res [list $idx $pos] $match
        }

        return $res
}

# example uses
puts "1) [pattern_search "B F" "A B C D E F G" ]" ;# -nocase 0 -token "" -verbose 0 -exact 1
puts "2) [pattern_search "B H" "A B C D E F G" -nocase 1]"         ;# -token "" -verbose 0 -exact 1 
puts "3) [pattern_search "b" "A B C D E F G" -nocase 1 -exact 0]" ;# -token "" -verbose 0
puts "4) [pattern_search "d*" "A B C DOG E F G" -exact 0 -nocase 1]" ;# -token "" -verbose 0
puts "5) [pattern_search "A+D+G+Z" "A B C D E F G" -token +]" ;# -nocase 0 -verbose 0 -exact 1
puts "6) [pattern_search "A* *b* *G" "APPLE EBB C D E F EGG bottle Grape Apple APPLE Grape Garden" -exact 1 -verbose 1]" ;# -token ""
puts "7) [pattern_search "A* *b* *G" "APPLE EBB C D E F EGG bottle Grape Apple APPLE Grape Garden" -nocase 1 -exact 0 -verbose 1]" ;# -token ""
puts "8) >[pattern_search eggs basket -verbose 1]<"

ak - 2017-06-20 19:14:47

Second idea I had upon reading and skimming was this

  1. Convert the needles from glob syntax to regex (* -> .*, ? -> .)
  2. Then put all the needles together as a big alternation (i.e. foo|bar|...). This might need parens around each needle to separate them properly.
  3. Run regexp -indices -all to find the matches.

For completion, first idea was to use/implement Aho-Corasick . While that is limited to fixed strings it could be used to find candidates based on the fixed prefix (suffix?) of each pattern and then check the small set of candidates for full match.