Parse Quote

The problem often comes up: "I have a string with quoted sections and I want to perform (some operation) on it. How do I do it?" A lot of time gets spent playing with regexps, which can't really ever work. I decided to try to write a canonical parser for quoted expressions. I hope others will suggest better models, and that in the end we will have an implementation worthy of going into tcllib. CMcC 26Jun2012

There's also a Parse Parenthesis equivalent.

proc quopar {str {q \"}} {
    set depth 0    ;# 1 while inside a quoted section, 0 otherwise
    set result {}
    set skip 0     ;# 1 when the previous character was an unescaped backslash
    foreach c [split $str ""] {
        if {$skip} {
            # escaped character: keep it literally, whatever it is
            append run $c
            set skip 0
            continue
        }
        if {$c eq "\\"} {
            # keep the backslash once and escape the character that follows
            append run $c
            set skip 1
            continue
        }
        if {$c eq $q} {
            # a quote ends the current run and toggles the quoted state
            if {[info exists run]} {
                lappend result $depth $run
                unset run
            }
            set depth [expr {($depth+1)%2}]
        } else {
            append run $c
        }
    }

    if {$depth > 0} {
        error "quopar dangling '$q' in '$str'"
    }
    if {[info exists run]} {
        lappend result $depth $run
    }
    return $result
}

if {[info exists argv0] && $argv0 eq [info script]} {
    package require tcltest
    namespace import ::tcltest::*
    verbose {pass fail error}
    set count 0
    foreach {str result} {
        {""} ""
        {"\""} {1 {\"}}
        {""""} ""
        {"moop"} "1 moop"
        {pebbles "fred wilma" bambam "barney betty"} "0 {pebbles } 1 {fred wilma} 0 { bambam } 1 {barney betty}"
    } {
        test quopar-[incr count] {} -body {
            quopar $str
        } -result $result
    }

    foreach {str} {
        {"}
        {"""}
    } {
        test quopar-[incr count] {} -body {
            quopar $str
        } -match glob -result * -returnCodes 1
    }
}

AMG: Disabled syntax highlighting for the above because (ironically? poetically?) it fails to parse quotes correctly! The \" on the first line screws it up.


AMG: I know the text at the top of the page asserts that regular expressions can't work, but here's a regular expression that actually has worked for me quite well. It finds strings starting and ending with double quotes, and it ignores closing double quotes preceded by an odd number of backslashes. It works by matching a sequence of atoms between the double quotes, where each atom is one of two things: (1) any single character other than a quote or backslash, or (2) a backslash followed by any character at all. Case (2) is interesting because the character following the backslash can be anything, including a quote or another backslash, neither of which fits into case (1).

regexp {\"(?:[^\"\\]|\\.)*\"} $str
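As a quick check of how the pattern behaves, here is a small sketch that pulls every quoted string out of a sample. The sample text and the variable names (quoted, sample, matches) are made up for illustration.

```tcl
# The pattern from above: a quote, any number of atoms, then a quote.
# An atom is either a non-quote, non-backslash character or a backslash
# followed by any one character.
set quoted {\"(?:[^\"\\]|\\.)*\"}

# Hypothetical sample input; the second string contains escaped quotes.
set sample {say "hello" and "a \"nested\" word" then stop}

# -all -inline returns every match as a list element. The group is
# non-capturing, so only the full matches are returned.
set matches [regexp -all -inline $quoted $sample]
foreach m $matches {
    puts $m
}
```

The matches keep their surrounding quotes and their backslashes; stripping them is left to the caller.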

You may want more features, for instance also identifying strings that aren't quoted but are terminated by whitespace, or producing a list of all "words" in a string, some of which may be quoted. I have this code in Wibble Implementation.
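To give a rough idea of the word-list variant, the pattern can be extended with a second alternative for unquoted words. This is only a sketch and is not the code from Wibble Implementation; the proc name words is hypothetical.

```tcl
# Hypothetical helper: split a string into words, where a word is either
# a double-quoted string (backslash escapes allowed) or a run of
# characters that are neither whitespace nor double quotes.
proc words {str} {
    set pattern {\"(?:[^\"\\]|\\.)*\"|[^\s\"]+}
    return [regexp -all -inline $pattern $str]
}

set parts [words {pebbles "fred wilma" bambam "barney betty"}]
foreach p $parts {
    puts $p
}
```

Quoted words come back with their quotes still attached, so the caller can tell the two kinds apart. Excluding the quote character from the unquoted alternative means a word can only start as quoted or as unquoted, never both.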

Want an example of something regular expressions really cannot handle? Matching parentheses, braces, or brackets. That can't be done because there's no way to recurse or otherwise count matching pairs. Double quotes are easy because there's no nesting.
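Counting is easy once you step outside regular expressions. The following sketch walks the string one character at a time and keeps a depth counter; the proc name balanced and its optional arguments are assumptions made for this example, and it ignores quoting and backslashes entirely.

```tcl
# Hypothetical helper: return 1 if every opener in str has a matching
# closer in the right order, 0 otherwise.
proc balanced {str {open "("} {close ")"}} {
    set depth 0
    foreach c [split $str ""] {
        if {$c eq $open} {
            incr depth
        } elseif {$c eq $close} {
            # a closer with nothing open can never be matched later
            if {[incr depth -1] < 0} {
                return 0
            }
        }
    }
    # anything still open at the end is unmatched
    return [expr {$depth == 0}]
}

puts [balanced {f(a, g(b))}]
puts [balanced {f(a, g(b)}]
puts [balanced {[a [b] c]} {[} {]}]
```

The first and third calls print 1 and the second prints 0.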

But if your quoted string can contain square brackets (e.g. it's part of a Tcl script), now you're in trouble again since you need to ignore quotes inside the outermost brackets, yet it's not possible to keep track of matching brackets!