'''[Split]ting strings with embedded strings'''

[Richard Suchenwirth] 2001-05-31: Robin Lauren <robin@lauren.net> wrote in [comp.lang.tcl]:

I want to split an argument which contains spaces within quotes into proper name=value pairs.  But I can't :)

Consider this example:

======
set tag {body type="text/plain" title="This is my body"} 
set element [lindex $tag 0]
set attributes [lrange $tag 1 end] ;# *BZZT!* Wrong answer!
======

My attributes becomes the list {type="text/plain"} {title="This} {is} {my} {body"} (perhaps even with the optional backslash before the quotes), which isn't really what i had in mind.


** Answer **


***Should you try to solve this yourself anyway?***

The problem statement doesn't specifically say that the strings to be split are SGML/HTML/XML attribute lists, but it certainly looks like it. A minimal solution splits the string provided above, but a flexible solution should be able to split strings that bend the rules in different ways, preferably allowing for all variations that are legal in the chosen syntax.

[PYK] 2014-03-03: The proper solution is to use a full [XML] parser like [tDOM], because, as the examples below illustrate, any other solution will have holes in its coverage: it will not take every aspect of SGML/HTML/XML attribute syntax into account.

[PL]: That is an important consideration when deciding how to solve just about any problem, and, yes, the full definition of attribute syntax has more rules than Tcl does, so there are certainly lots of things that can trip you up. However, 

   * Such a parser doesn't accept element fragments.
   * Most attribute definitions in the wild are pretty regular.
   * While ready-made solutions to ''most'' problems exist, it is still worthwhile to discuss the basics of how to achieve such solutions.

So,

   * if the element-attribute strings are still in place in the XML or HTML document, or
   * the strings are very irregular, or
   * what you need is a professional-level solution,

something like tDOM is likely to be the best choice. If, on the other hand,

   * the fragments are already extracted, and
   * are guaranteed to be regular (or can easily be made regular),

or you are simply curious about how to extract data from strings and feel like playing around with it some, you might as well try one of the following methods.


***Never mind, let's just look at some solutions***

(Solution work initiated by [Richard Suchenwirth] 2001-05-31, initial regular expression solution suggested by [MG]. Solutions reworked and significantly extended by [Peter Lewerin] 2014-03-09)

If there are always exactly two attribute definitions following an element name, one simple solution is to `[scan]` the string, and then enclose the name/value pairs in sublists:

======
% set parts [scan $tag {%s %[^ =] = "%[^"]" %[^ =] = "%[^"]"}]
# -> body type text/plain title {This is my body}
% set result [list]
% foreach {name value} [lrange $parts 1 end] { lappend result [list $name $value] }
% set result
# -> {type text/plain} {title {This is my body}}
======

For a more general solution, where there can be fewer or more than two definitions, a `[regexp]` match might be useful:

======
% set matches [regexp -inline -all {(\S+?)\s*=\s*"(.*?)"} $tag]
# -> type=\"text/plain\" type text/plain {title="This is my body"} title {This is my body}
% set result [list]
% foreach {- name value} $matches { lappend result [list $name $value] }
% set result
# -> {type text/plain} {title {This is my body}}
======

(Note that here, the `[foreach]` command extracts ''three'' values from the list during each iteration: the first value (stored in the variable named `-`) is just discarded.)

The problem can also be solved using list/string manipulation commands, but then we need to make sure that we see the data in the same way as Tcl does. To a human, ''$tag'' intuitively looks like a list of three items, but according to Tcl list syntax, it has 6 items, and the second item, for example, contains two literal quotes.

======
% llength $tag
# -> 6
% lmap item $tag { format "{%s}" $item }
# -> {{body}} {{type="text/plain"}} {{title="This}} {{is}} {{my}} {{body"}}
======
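
Treating arbitrary text as a list can also fail outright, rather than just give a surprising result. As a small illustration (the `bad` string is a made-up example, not part of the original question), `[string is] list` can be used to check whether a string is a well-formed list before list commands are applied to it:

======
set tag {body type="text/plain" title="This is my body"}
set bad {body title= "unterminated}   ;# hypothetical malformed input

puts [string is list $tag]   ;# -> 1
puts [string is list $bad]   ;# -> 0 (an element starts with a quote that is never closed)

# List commands raise an error on such a string.
if {[catch {llength $bad} msg]} {
    puts "not a list: $msg"
}
======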

One simple solution is to rewrite the tag string into something that is convenient for list manipulation. Be careful with the quoting in the `[string map]` invocation here: the straightforward form is `string map {=\" " \{" \" \}} $tag`, but it confuses the syntax highlighting in the wiki renderer, so the code below uses hex escapes instead, at the cost of some obfuscation. `\x22` is a double quote, `\x7b` is a left brace, and `\x7d` is a right brace. Both invocations work equally well in the Tcl interpreter.

======
% set taglist [string map [list =\x22 " \x7b" \x22 \x7d] $tag]
# -> body type {text/plain} title {This is my body}
% set result [list]
% foreach {name value} [lrange $taglist 1 end] { lappend result [list $name $value] }
% set result
# -> {type text/plain} {title {This is my body}}
======

Another solution `[split]`s the tag string into a list, not by white space but by double quotes (again, `\x22` is just a wiki-friendly way to insert a double quote character: a `\"` or `{"}` will work in the Tcl interpreter):

======
% set taglist2 [split $tag \x22]
# -> {body type=} text/plain { title=} {This is my body} {}
======

Obviously, the result needs a little more processing:

   1. the element name is joined up with the first attribute name,
   2. the equal sign stays attached to the attribute name,
   3. the second (and third, etc) attribute name is preceded by leftover whitespace, and
   4. there is an empty element which resulted from splitting at the last double quote before the end of the string.

The first three problems are easily dealt with (a string consisting of a space and some non-space characters can be split into a list with an empty first item and the non-space substring as the second element):

======
% string trimright [lindex [split {body type=}] 1] =
# -> type
% string trimright [lindex [split { title=}] 1] =
# -> title
======

and the fourth problem can be solved by `[break]`ing out of the loop if any attribute name is the empty string:

======
set result [list] ;# start from an empty list, not the previous solution's result
foreach {name value} $taglist2 {
    if {$name eq {}} { break }
    lappend result [list [string trimright [lindex [split $name] 1] =] $value]
}
======

======none
% set result
# -> {type text/plain} {title {This is my body}}
======

All of the above solutions are minimal in the sense that while they do process the provided string properly, they may not be able to process other strings that must be considered regular attribute lists, e.g.

   1. attribute lists with more (or less) than two attributes
   1. empty attribute values (written as `foo=""`; HTML syntax also allows another way to specify empty attribute values)
   1. attribute expressions with spaces between the equal sign and the attribute name and/or attribute value

The table shows which methods were able to handle which cases:

%| Case | `scan` | `regexp` | `string map` | `split` |%
&| 1.   |        | ✔        | ✔            | ✔       |&
&| 2.   |        | ✔        | ✔            | ✔       |&
&| 3.   | ✔      | ✔        |              | ✔       |&
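
The `regexp` and `string map` columns can be checked with a short script. This is only a sketch: the procedure names `byRegexp` and `byStringMap` and the three `img` test strings are made up for the illustration (one for each of the cases listed above), and a case counts as handled only if the method returns exactly the expected list of name/value pairs.

======
proc byRegexp {tag} {
    set result [list]
    foreach {- name value} [regexp -inline -all {(\S+?)\s*=\s*"(.*?)"} $tag] {
        lappend result [list $name $value]
    }
    return $result
}

proc byStringMap {tag} {
    set taglist [string map [list =\x22 " \x7b" \x22 \x7d] $tag]
    set result [list]
    foreach {name value} [lrange $taglist 1 end] {
        lappend result [list $name $value]
    }
    return $result
}

# label, input string, expected result (all hypothetical test data)
set cases {
    one    {img alt="A picture"}      {{alt {A picture}}}
    empty  {img alt="" src="a.png"}   {{alt {}} {src a.png}}
    spaces {img alt = "A picture"}    {{alt {A picture}}}
}

foreach {label tag expected} $cases {
    foreach method {byRegexp byStringMap} {
        # A method handles the case if it neither fails nor returns a wrong list.
        set ok [expr {![catch {$method $tag} res] && $res eq $expected}]
        puts "$label $method [expr {$ok ? {ok} : {FAIL}}]"
    }
}
# -> one byRegexp ok
# -> one byStringMap ok
# -> empty byRegexp ok
# -> empty byStringMap ok
# -> spaces byRegexp ok
# -> spaces byStringMap FAIL
======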

This means that the `regexp` and `split` methods are flexible enough to handle just about any case of "split[[ing]] an argument which contains spaces within quotes into proper name=value pairs". But can they handle full HTML attribute syntax? Not even close. This syntax has different rules for which characters may legally appear inside tag names, attribute names, and attribute values. Empty attributes can be written as just an attribute name, without any equal sign or (empty) value in quotes. Attribute values can be unquoted. Attribute values can be single-quoted.
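
A regular expression can be stretched to cover some of these variations. The following is only a sketch, not a complete HTML attribute parser: the procedure name `splitStartTag` is made up for this example, character references are not decoded, and nothing checks for characters that are illegal in names or values. It accepts double-quoted, single-quoted and unquoted values as well as bare attribute names, and returns the element name followed by a list of name/value pairs.

======
proc splitStartTag {tag} {
    # Separate the element name from the rest of the string.
    if {![regexp {^\s*(\S+)\s*(.*)$} $tag -> element rest]} {
        return {}
    }
    # name, optionally followed by = and a "..." value, a '...' value,
    # or an unquoted value. At most one of the three value groups matches;
    # the others come back as empty strings.
    set re {([^\s=]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))?}
    set result [list]
    foreach {- name dq sq uq} [regexp -inline -all $re $rest] {
        lappend result [list $name $dq$sq$uq]
    }
    return [list $element $result]
}

puts [splitStartTag {body type="text/plain" title='This is my body' hidden width=100}]
# -> body {{type text/plain} {title {This is my body}} {hidden {}} {width 100}}
======

An expression like this silently skips over anything it does not recognize, so it cannot tell where a malformed string goes wrong.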


***So what does it take to scan an HTML start tag?***

It takes something like this. Note: this is a hand-made scanner, not a real industrial-grade scanner, but it does handle ''most of'' HTML's attribute syntax, specifically HTML5 (the part that is missing is the "Shorttags" feature, which is a bit of a mess and not used very often). When putting together this example, I debated with myself whether to support HTML character references / entities (the `&amp;` strings). The scanner does support them, but almost half the state space is used to scan those.

This is a basic table-driven scanner. It works by looking at one character at a time, classifying it into one of the different character classes (they appear as subkeys in the table below, symbolized by a pair of lower-case letters) and then cross-referencing the current state (one of the `S00` -- `S28` major keys in the table) with that character class to get a new state (if the character class is legal at this point in the string) or `ERR` (if it is illegal). The special character class `ei` (for end of input) can lead to the state `ACC`, or accepted string.
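
To see the lookup mechanism in isolation, here is a deliberately tiny scanner built the same way. Everything in it is made up for illustration (the names `toyMatrix`, `toyClass` and `toyScan`, the two states and the three character classes). It accepts a letter followed by any number of letters or digits, and returns -1 on acceptance or the index of the offending character otherwise.

======
# state -> {character class -> next state}
set toyMatrix {
    S0 {le S1 nu ERR ei ERR}
    S1 {le S1 nu S1  ei ACC}
}

proc toyClass char {
    if {[string is alpha -strict $char]} { return le }
    if {[string is digit -strict $char]} { return nu }
    return -- ;# no such class in the table, so the lookup yields ERR
}

proc toyScan input {
    set state S0
    set ci 0
    while true {
        # Classify the current character, or signal end of input.
        if {$ci >= [string length $input]} {
            set cclass ei
        } else {
            set cclass [toyClass [string index $input $ci]]
        }
        # Cross-reference the current state with the character class.
        if {[dict exists $::toyMatrix $state $cclass]} {
            set state [dict get $::toyMatrix $state $cclass]
        } else {
            set state ERR
        }
        switch -- $state {
            ERR     { return $ci }
            ACC     { return -1 }
            default { incr ci }
        }
    }
}

puts [toyScan abc1]   ;# -> -1
puts [toyScan 1abc]   ;# -> 0
puts [toyScan ab-c]   ;# -> 2
======

The full scanner below works the same way, with a larger table and with actions attached to the states.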

The scanner command, `scanStartTag`, returns -1 if it accepts the string. If an error occurs, it returns the zero-based index of the character where the error was detected. In both cases the command sets the global variable `tagname` to the tag name, and the global variable `attributes` to a dictionary with the attribute names as keys and the attribute values as values (after an error, these hold whatever was scanned up to that point).

You are welcome to try to break it. If you find a start tag string that it should accept but doesn't, or one that it should reject but accepts, please insert it below this paragraph and I'll have a look at it. Note that it assumes that the `< >` brackets around the string are already removed. Also note that it accepts strings that are ''legal'': to be ''valid'' a start tag string has to have a tag name which is an HTML element name, and attribute names and values are also restricted. The string `{foo b&r}` is legal but invalid (while the string `{foo bar="b&z"}` is illegal as well as invalid).

======
set stateMatrix {
    S00 {ei ERR ex S01 af S01 gz S01 nu ERR sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S01 {ei ACC ex S01 af S01 gz S01 nu S01 sp S02 dq ERR sq ERR gt ERR lt ERR sl S03 eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S02 {ei ACC ex S04 af S04 gz S04 nu S04 sp S02 dq ERR sq ERR gt ERR lt ERR sl S03 eq ERR bt ERR am S04 ha S04 sc S04 pr S04}
    S03 {ei ACC ex ERR af ERR gz ERR nu ERR sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S04 {ei ACC ex S04 af S04 gz S04 nu S04 sp S05 dq ERR sq ERR gt ERR lt ERR sl ERR eq S06 bt ERR am S04 ha S04 sc S04 pr S04}
    S05 {ei ACC ex S04 af S04 gz S04 nu S04 sp S05 dq ERR sq ERR gt ERR lt ERR sl ERR eq S06 bt ERR am S04 ha S04 sc S04 pr S04}
    S06 {ei ERR ex S07 af S07 gz S07 nu S07 sp S06 dq S09 sq S12 gt ERR lt ERR sl ERR eq ERR bt ERR am S07 ha S07 sc S07 pr S07}
    S07 {ei ACC ex S07 af S07 gz S07 nu S07 sp S08 dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am S07 ha S07 sc S07 pr S07}
    S08 {ei ACC ex S04 af S04 gz S04 nu S04 sp S08 dq ERR sq ERR gt ERR lt ERR sl S03 eq ERR bt ERR am S04 ha S04 sc S04 pr S04}
    S09 {ei ERR ex S10 af S10 gz S10 nu S10 sp S10 dq S11 sq S10 gt S10 lt S10 sl S10 eq S10 bt S10 am S15 ha S10 sc S10 pr S10}
    S10 {ei ERR ex S10 af S10 gz S10 nu S10 sp S10 dq S11 sq S10 gt S10 lt S10 sl S10 eq S10 bt S10 am S15 ha S10 sc S10 pr S10}
    S11 {ei ACC ex ERR af ERR gz ERR nu ERR sp S08 dq ERR sq ERR gt ERR lt ERR sl S03 eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S12 {ei ERR ex S13 af S13 gz S13 nu S13 sp S13 dq S13 sq S14 gt S13 lt S13 sl S13 eq S13 bt S13 am S22 ha S13 sc S13 pr S13}
    S13 {ei ERR ex S13 af S13 gz S13 nu S13 sp S13 dq S13 sq S14 gt S13 lt S13 sl S13 eq S13 bt S13 am S22 ha S13 sc S13 pr S13}
    S14 {ei ACC ex ERR af ERR gz ERR nu ERR sp S08 dq ERR sq ERR gt ERR lt ERR sl S03 eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S15 {ei ERR ex S20 af S20 gz S20 nu ERR sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha S16 sc ERR pr ERR}
    S16 {ei ERR ex S18 af ERR gz ERR nu S17 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S17 {ei ERR ex ERR af ERR gz ERR nu S17 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc S21 pr ERR}
    S18 {ei ERR ex ERR af S19 gz ERR nu S19 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S19 {ei ERR ex ERR af S19 gz ERR nu S19 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc S21 pr ERR}
    S20 {ei ERR ex S20 af S20 gz S20 nu ERR sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc S21 pr ERR}
    S21 {ei ERR ex S10 af S10 gz S10 nu S10 sp S10 dq S11 sq S10 gt S10 lt S10 sl S10 eq S10 bt S10 am S15 ha ERR sc ERR pr S10}
    S22 {ei ERR ex S27 af S27 gz S27 nu ERR sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha S23 sc ERR pr ERR}
    S23 {ei ERR ex S25 af ERR gz ERR nu S24 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S24 {ei ERR ex ERR af ERR gz ERR nu S24 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc S28 pr ERR}
    S25 {ei ERR ex ERR af S26 gz ERR nu S26 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S26 {ei ERR ex ERR af S26 gz ERR nu S26 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc S28 pr ERR}
    S27 {ei ERR ex S27 af S27 gz S27 nu ERR sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc S28 pr ERR}
    S28 {ei ERR ex S13 af S13 gz S13 nu S13 sp S13 dq S13 sq S14 gt S13 lt S13 sl S13 eq S13 bt S13 am S22 ha ERR sc ERR pr S13}
}

proc getCharacterClass char {
    switch -regexp -- $char {
        {[a-fA-F]}    { return af }
        {[xX]}        { return ex }
        {[g-zG-Z]}    { return gz }
        {[0-9]}       { return nu }
        {[ \t\n\f\r]} { return sp }
        \x22          { return dq }
        '             { return sq }
        >             { return gt }
        <             { return lt }
        /             { return sl }
        =             { return eq }
        `             { return bt }
        &             { return am }
        {#}           { return ha }
        {;}           { return sc }
        {[[:print:]]} { return pr }
        default { return -- }
    }
}

proc getNextState {cclass state} {
    if {[dict exists $::stateMatrix $state $cclass]} {
        dict get $::stateMatrix $state $cclass
    } else {
        return ERR
    }
}

proc storetoken varName {
    upvar 1 token token
    if {$token ne {}} {
        if {$varName in {tagname attrname attrval}} {
            upvar 1 $varName var
        } else {
            foreach varName {tagname attrname attrval} {
                upvar 1 $varName var
                if {$var eq {}} break
                if {$varName eq "attrval"} return
            }
        }

        set var $token
        set token {}
    }
}

proc storeattributes {} {
    upvar 1 attrname attrname
    upvar 1 attrval attrval
    if {$attrname ne {}} {
        dict set ::attributes $attrname $attrval
    }
    set attrname {}
    set attrval {}
}

set tagname {}
set attributes {}

proc scanStartTag input {
    set state S00
    set c {}
    set ci 0
    set token {}
    set tagname {}
    set attrname {}
    set attrval {}

    set ::tagname {}
    set ::attributes [dict create]

    while true {
        if {$ci >= [string length $input]} {
            set cclass ei
        } else {
            set c [string index $input $ci]
            set cclass [getCharacterClass $c]
        }
        set state [getNextState $cclass $state]
        switch -- $state {
            ERR {
                storetoken ?
                storeattributes
                set ::tagname $tagname
                return $ci
            }
            ACC {
                storetoken ?
                storeattributes
                set ::tagname $tagname
                return -1
            }
            S00 {
            }
            S01 -
            S07 -
            S10 -
            S13 -
            S15 -
            S16 -
            S17 -
            S18 -
            S19 -
            S20 -
            S21 -
            S22 -
            S23 -
            S24 -
            S25 -
            S26 -
            S27 -
            S28 {
                append token $c
                incr ci
            }
            S02 {
                storetoken tagname
                incr ci
            }
            S03 -
            S09 -
            S12 {
                incr ci
            }
            S04 {
                storeattributes
                append token $c
                incr ci
            }
            S05 -
            S06 {
                storetoken attrname
                incr ci
            }
            S08 -
            S11 -
            S14 {
                storetoken attrval
                storeattributes
                incr ci
            }
            default {
                error "Unknown state $state"
            }
        }
    }
}
======
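
Assuming the definitions above have been evaluated in the current interpreter, a short script along these lines shows what the command returns and what it leaves in the two global variables. The input strings are the original example and the illegal string mentioned earlier.

======
puts [scanStartTag {body type="text/plain" title="This is my body"}]
# -> -1
puts $tagname
# -> body
puts $attributes
# -> type text/plain title {This is my body}

# An ampersand inside a quoted value must start a complete character reference,
# so the scanner stops at the closing quote, where it expected more of the reference.
puts [scanStartTag {foo bar="b&z"}]
# -> 12
======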

----
[AM] Also see: [Splitting a string on arbitrary substrings]


<<categories>> Parsing | String Processing | Arts and crafts of Tcl-Tk programming