string mapx

Keith Vetter 2017-08-10 : On the string map page, there was a request for a version of string map that takes regular expressions instead of fixed strings. There's a quick and dirty solution on that page but it doesn't handle the tricky case of double substitutions, e.g. string map {a b b c} "a" should return just "b".

Since the request was from over 11 years ago, it's obviously not a high demand item. But I thought it would be a fun intellectual challenge. This isn't the most efficient method--it does too many regex searches and replacements, but the more efficient solution gets too complicated when you start allowing references.

You can even add it into the string ensemble via

namespace ensemble configure string -map [dict merge [namespace ensemble configure string -map] {mapx string_mapx}]

##+##########################################################################
#
# string mapx -- like string map but with regular expressions
#  see also https://wiki.tcl-lang.org/2819
# by Keith Vetter 2017-08-07
#

proc string_mapx {mapping str} {
    if {[llength $mapping] % 2 != 0} {
        return -code error "regex map list unbalanced"
    }
    # Difficulty is we must avoid double substition, meaning we cannot
    # keep matching in replaced text.  e.g. string mapx {A. BB B. CC} AA
    # should return "BB" not "CC". To do so, we maintain a list of
    # <unmatched text>, <matched text> pairs and only do matching on the
    # first part. When done we join all the pieces.

    set pieces [list $str ""]
    foreach {regex replace_with} $mapping {
        set new_pieces {}
        foreach {unmatched already_matched} $pieces {
            set mapping [_string_mapx_one $unmatched $regex $replace_with]
            lset mapping end $already_matched
            lappend new_pieces {*}$mapping
        }
        set pieces $new_pieces
    }

    set result [join $pieces ""]
    return $result
}

proc _string_mapx_one {str regex replace_with} {
    # Replaces all occurrences of regex in str with replace_with
    # The result is a list of prefix and replace_with pairs
    # e.g. _string_mapx_one "xAByACz" {A.} "M" => {x M y M z ""}
    set result {}
    set start 0
    while {1} {
        set n [regexp -start $start -indices $regex $str index]
        if {! $n} break
        lassign $index lo hi
        if {$hi < $lo} {
            error "regex $regex matches the empty string"
        }
        set prefix [string range $str $start $lo-1]
        set match [string range $str $lo $hi]

        lappend result $prefix [regsub $regex $match $replace_with]
        set start [expr {$hi + 1}]
    }
    lappend result [string range $str $start end] ""
    return $result
}
################################################################
#
# testing code
#
proc test {expected mapping str} {
    set got [string_mapx $mapping $str]
    puts "string_mapx $mapping '$str'"
    if {$got ne $expected} {
        puts "BAD: expected: '$expected'   got: '$got'"
    }
}

test xxMyyMzz {A. M} xxAByyACzz
test xxyyzz {A. ""} xxAByyABzz
test xNxMyNyMzz {A. M CC N} xCCxAByCCyABzz
test bca {a b b c c a} abc
test "NOW IS THE TIME" {{n[^ ]*} NOW {now is} BAD {i.*the} "IS THE" {t.*time} BAD time TIME} "now is the time"

test "dbca dbca" {{(a)(bc)(d)} {\3\2\1}} {abcd abcd}
test "dbcadbca"  {{(.)(..)(.)} {\3\2\1}} {abcdabcd}
test "dbca dbca" {{(a)(..)(.)} {\3\2\1}} {abcd abcd}
return

RKzn 2020-02-29: I think this has some bug. While running

test "xxAxx" {{A*} {A}} "xxAAAAxx"

On windows I got a "out of memory error", on linux (at a different machine) it ran for a long time (minutes) until I killed it. Maybe some infinite (recursive)loop...? I admit I haven't tried to read the code. Other than that, the few times I used this it worked as expected.

KPV Fixed. What was wrong is that the regex A* can match the null string. This means that the regexp call inside the loop always succeeds, matching the null string at the start. But since the ending index is -1, the start index never progresses, and we get the infinite loop.

It's unclear what the proper return value should be. You could argue that is should be an infinite string a 'A's. But I think the best response is to throw an error.

If you change the test to test "xxAxx" {{A+} {A}} xxAAAAxx, then everything works fine.

RKzn 2020-03-02 : Thank you. Good thing that my silly test caught a bug. I forgot that "A*" matches the null string, my intention was "A+", off course. Actually I have a small application at work (based on your PDF parser on the wiki, though by now almost all of your code got replaced) and I used this to build text replacement procs.