Mismatch between regexp -indices and switch -regexp -indexvar

PYK 2016-08-15: This page is obsoleted by Tcl version 8.5.10, in which issue 3106532, "switch -regexp -indexvar gives invalid range", was fixed.

TJE This little mismatch has bitten me in code, so I thought I'd shed a little more light on the matter for those who may not have picked up on it from the documentation...

The ranges placed in a switch statement's "-indexvar" target are inclusive of the character AFTER the match. This differs from the behavior of regexp's "-indices" option, which is exclusive of the same character.

Here's a simple example:

  % set line {foo bar}
  foo bar
  % regexp -inline -indices {foo} $line
  {0 2}
  % switch -regexp -indexvar index -- $line {foo} {set index}
  {0 3}

As you can see, regexp reports the actual match (character '0' through '2' matches "foo"), whereas switch reports the match PLUS the character after (character '0' through '3' matches "foo ").

Don't get bitten!
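
To see which behavior a given interpreter has, the two commands can be compared directly. The following is only a minimal sketch; the proc name indicesAgree is made up for illustration and is not part of Tcl. It returns 1 when both commands report the same range for the whole match, and 0 when they differ as described above.

```tcl
# Hypothetical helper (not part of Tcl): returns 1 when
# [switch -regexp -indexvar] and [regexp -indices] report the same
# range for the overall match of $pattern against $str, else 0.
proc indicesAgree {pattern str} {
    # Range reported by regexp; its end index is the last matched character.
    set fromRegexp [lindex [regexp -inline -indices -- $pattern $str] 0]
    # Range reported by switch for the same pattern and string.
    set fromSwitch {}
    switch -regexp -indexvar idx -- $str [list $pattern {
        set fromSwitch [lindex $idx 0]
    }]
    return [expr {$fromRegexp eq $fromSwitch}]
}

puts [indicesAgree {foo} {foo bar}]
```

The result depends on the Tcl version in use, so no particular output is claimed here.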


Curiously, TIP#75 [1] (which seems to be the only one that specifies this feature) states (emphasis added):

the new option -indexvar will also be provided which will name a variable into which a list of match indices (each a two item list of values in the same way that [regexp -indices] computes) will be placed

This rather suggests that the stated mismatch is a bug...

DGP Please have a look at the documentation for regexp and switch [3] [2]. Appears to me the switch -indexvar option is operating exactly as it is documented to do. As a meta-comment, I think the Tracker is a much better place to resolve questions like this than the wiki.

Is it significant that tcl/tests/switch.test does, in fact, have tests for use of -indexvar (in conjunction with -matchvar) and the tests appear to be passing? Either the tests aren't testing -indexvar the way one would think, the test writer cooked the tests so they would pass even though not producing the real expected results, or the code is doing the right thing but has, perhaps, the wrong docs. See tcl/generic/tclCmdMZ.c, function Tcl_SwitchObjCmd, for the code which provides the switch functionality.

male - 2008-02-15 - For me it is not really relevant what the man page says if it refers to the behaviour of regexp, which is different! And even documented behaviour can be buggy! In my eyes both regexp-based features should behave the same, no matter what the man page says!

  set string "The quick brown fox jumped over the lazy dogs."

  set matches {}
  set indexes {}

  switch -regexp -matchvar matches -indexvar indexes -- $string {
      ^(.*)u([a-z]+)(.*)(o[a-z]+)(.*)\. {
          puts "Found"
          puts " matches = .${matches}."
          puts " indexes = .${indexes}."
      }
      default {
          puts "string = $string"
      }
  }

  Found
   matches = .{The quick brown fox jumped over the lazy dogs.} {The quick brown fox j} mped { over the lazy d} ogs {}.
   indexes = .{0 46} {0 21} {22 26} {26 42} {42 45} {45 45}.

DKF: The match ranges are the same; the ending index is not (it is off by one). File a bug.
Correction: not a bug. It's documented to be what it is. (Not saying whether it is "morally" right. Just not a bug per se.)

Lars H: What seems to be the trouble here is that the TIP and its reference implementation contradict each other. As shown above, the text of the TIP states that [switch -indexvar] computes indices in the same way as [regexp -indices]. However, the reference implementation computes them as

                    rangeObjAry[0] = Tcl_NewLongObj(info.matches[j].start);
                    rangeObjAry[1] = Tcl_NewLongObj(info.matches[j].end);

whereas regexp computes them as

                    start = offset + info.matches[i].start;
                    end = offset + info.matches[i].end;

                    /*
                     * Adjust index so it refers to the last character in the
                     * match instead of the first character after the match.
                     */

                    if (end >= offset) {
                        end--;
                    }

There are no test cases or documentation in the reference implementation, only a patch of generic/tclCmdMZ.c.

So what was it that the TCT approved? The specification found in the TIP or the reference implementation provided?
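
The adjustment that regexp makes in the C excerpt above can be mirrored at script level. This is only a sketch, assuming an offset of 0 (no -start option) and index pairs whose end is one past the match, as in the output male posted. The proc name toInclusive is invented for illustration.

```tcl
# Hypothetical helper: convert index pairs whose end is one past the match
# (the [switch -indexvar] behavior discussed on this page) into the
# inclusive form that [regexp -indices] reports. Assumes offset 0.
proc toInclusive {ranges} {
    set out {}
    foreach pair $ranges {
        lassign $pair start end
        # Same test as the C code with offset 0: {-1 -1} stays untouched.
        if {$end >= 0} {
            incr end -1
        }
        lappend out [list $start $end]
    }
    return $out
}

puts [toInclusive {{0 46} {0 21} {22 26} {26 42} {42 45} {45 45}}]
# prints: {0 45} {0 20} {22 25} {26 41} {42 44} {45 44}
```

Note that the empty final submatch becomes {45 44}, an end index one less than the start, which is how the regexp code above ends up representing an empty match.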

DKF: If I was doing it again, I'd make it match what regexp -indices reports, but formally it is a misfeature and not a bug because it is documented and tested to do exactly what it does. It's just that what it should probably have done is something else. Alas.


TJE Note that I credit the documents (if indirectly) with correctness in this matter. I don't LIKE the behavior, but it is, indeed, documented. My code is fixed with a weird-looking little '-1' appendage. Here $arg is the switched-upon value and $ipair is the extracted index pair I care about:

  set substring [string range $arg {*}$ipair-1]

Whee!
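
Given PYK's note at the top of the page, an unconditional '-1' would presumably drop the last matched character on interpreters where the issue is fixed. One possible way around that is to measure the overshoot at run time instead of hard-coding it. The sketch below does so; the names indexvarOvershoot and indexvarRange are hypothetical helpers, not Tcl commands.

```tcl
# Hypothetical helper: how far [switch -indexvar] end indices overshoot the
# last matched character in this interpreter (1 with the behavior described
# on this page, 0 when the indices agree with [regexp -indices]).
proc indexvarOvershoot {} {
    switch -regexp -indexvar probe -- foo {
        foo { set end [lindex $probe 0 1] }
    }
    # The match occupies characters 0..2, so any excess over 2 is overshoot.
    return [expr {$end - 2}]
}

# Hypothetical helper: substring of $arg named by one -indexvar pair.
proc indexvarRange {arg ipair} {
    lassign $ipair start end
    return [string range $arg $start [expr {$end - [indexvarOvershoot]}]]
}

set arg {foo bar}
switch -regexp -indexvar idx -- $arg {
    foo { puts [indexvarRange $arg [lindex $idx 0]] }
}
```

With either behavior this should print foo. The probe runs on every call here for simplicity; caching its result would be an obvious refinement.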