Version 10 of GSoC Idea: Parser For expr Command

Parser For expr Command


Areas	Language parsing, computer arithmetic
Good if student knows	variable, based on interests
Priority	Low
Difficulty	low to medium
Benefits to the student	introduction to grammar and parsing concepts
Benefits to Tcl	Leverage ease of use of existing tools
Mentor	Steve Huntley

Project Description

Perl enables arbitrary precision math by means of a pragma, which overloads the standard math operators and thus transparently accepts large integers as arguments for mathematical expressions. In this way Perl is able to do not only large integer math transparently as Tcl does, but also floating-point, rational number, vector, etc. calculations [L1] [L2].
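For comparison, here is a minimal sketch of what the built-in expr already handles and where it stops. It assumes Tcl 8.5 or later, where integers are arbitrary precision.

```tcl
# Large integers are handled transparently by the built-in expr.
set big [expr {2**100}]
puts $big

# There is no built-in rational type: two integer operands give integer division.
set third [expr {1/3}]
puts $third
```

The second result is the truncated integer quotient, not the fraction one third, which is the kind of gap a replacement parser could fill.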

This approach cannot work in Tcl, which has no math operators per se; instead, the expr command defines its own syntax for math expressions. However, it should be straightforward to overload the expr command itself with a command that parses existing valid math expressions and can be extended to accept a wider range of operations.

The goal of this project would be to write a selection of parsers/lexers for mathematical expressions, and incorporate them into a replacement for the expr command, thus allowing for transparent use of a wider range of numerical types (similar to Perl's usage), and experiments on new ways to execute expressions that are currently valid, such as:

compiling to C statements and executing via critcl
compiling to bytecode using methods outlined by KBK [L3]
automatic interfacing to graphing functions

The exact mix of features would be dependent on the skills and interests of the student.

Existing pure-Tcl parsing tools such as Yeti or taccle should be suitable for the task. Evaluation of parser features and selection of tools would be part of the project.
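To give a feel for the scale of the task, here is a minimal sketch of an expression evaluator with a pluggable numeric type. It is a hand-written recursive-descent parser, not one generated by Yeti or taccle, and it only understands integer literals, + - * /, unary minus and parentheses. The names rat, rparse and rexpr are hypothetical, and rationals are represented as {numerator denominator} lists purely for illustration. It assumes Tcl 8.5 or later.

```tcl
namespace eval rat {
    # Greatest common divisor (Euclid), used to normalise fractions.
    proc gcd {a b} {
        while {$b != 0} { lassign [list $b [expr {$a % $b}]] a b }
        return [expr {abs($a)}]
    }
    # Build a normalised rational with a positive denominator.
    proc make {n d} {
        if {$d == 0} { error "division by zero" }
        if {$d < 0} { set n [expr {-$n}]; set d [expr {-$d}] }
        set g [gcd $n $d]
        return [list [expr {$n / $g}] [expr {$d / $g}]]
    }
    # One proc per operator, named after the operator token itself.
    proc + {a b} { lassign $a an ad; lassign $b bn bd
        make [expr {$an*$bd + $bn*$ad}] [expr {$ad*$bd}] }
    proc - {a b} { lassign $a an ad; lassign $b bn bd
        make [expr {$an*$bd - $bn*$ad}] [expr {$ad*$bd}] }
    proc * {a b} { lassign $a an ad; lassign $b bn bd
        make [expr {$an*$bn}] [expr {$ad*$bd}] }
    proc / {a b} { lassign $a an ad; lassign $b bn bd
        make [expr {$an*$bd}] [expr {$ad*$bn}] }
}

namespace eval rparse {
    variable toks {} pos 0
    proc peek {} { variable toks; variable pos; lindex $toks $pos }
    proc next {} { variable pos; set t [peek]; incr pos; return $t }
    # sum := product (("+" | "-") product)*
    proc sum {} {
        set v [product]
        while {[peek] in {+ -}} { set op [next]; set v [::rat::$op $v [product]] }
        return $v
    }
    # product := unary (("*" | "/") unary)*
    proc product {} {
        set v [unary]
        while {[peek] in {* /}} { set op [next]; set v [::rat::$op $v [unary]] }
        return $v
    }
    # unary := "-" unary | atom
    proc unary {} {
        if {[peek] eq "-"} { next; return [::rat::- {0 1} [unary]] }
        return [atom]
    }
    # atom := integer | "(" sum ")"
    proc atom {} {
        set t [next]
        if {$t eq "("} {
            set v [sum]
            if {[next] ne ")"} { error "expected )" }
            return $v
        }
        if {![string is digit -strict $t]} { error "unexpected token \"$t\"" }
        # Strip leading zeros so the literal is never read as octal.
        set n [string trimleft $t 0]
        if {$n eq ""} { set n 0 }
        return [::rat::make $n 1]
    }
}

# Evaluate an expression exactly; returns "n" or "n/d".
proc rexpr {text} {
    if {![regexp {^(\s*(\d+|[-+*/()]))*\s*$} $text]} {
        error "bad character in \"$text\""
    }
    set ::rparse::toks [regexp -all -inline {\d+|[-+*/()]} $text]
    set ::rparse::pos 0
    set v [::rparse::sum]
    if {[::rparse::peek] ne ""} { error "trailing input in \"$text\"" }
    lassign $v n d
    if {$d == 1} { return $n }
    return "$n/$d"
}

puts [rexpr {1/3 + 1/6}]
puts [rexpr {2*(3-1)}]
```

Because the grammar procs only ever call ::rat::$op, swapping in another numeric type (vectors, currency, and so on) would mean providing a different set of operator procs rather than touching the parser.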

JBR - I've had very good experiences with grammar_peg which is included in tcllib. Look for my examples posted on the wiki.

Requirements:

flexible

Discussion

Larry Smith: incorporate the Tcl expr patch allowing us to eliminate the dereferencing operator and outer braces? Please? Finally? allowing [ expr x*x ] instead of [ expr {$x*$x} ]

AMG: There are a lot of syntax conflicts which require variable names to be preceded by $. As for removing the outer braces, you can do so currently, though it is unsafe and slow.

However, these two changes at the same time would avoid part of the safety problem, since [expr] (not the Tcl parser) would be performing the variable substitution. But the speed problem would remain. So long as there are any spaces in the expression, there's more than one argument to [expr], and it has to [concat] them, which makes it impossible to bytecode the expression.

Larry Smith Why should this be the case? Granted the current implementation does have this limitation, but it seems to me a SMOP.

The remaining safety issue is due to [square bracket script substitution], which would still be performed by the Tcl parser instead of internal to [expr].
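A minimal sketch of the safety problem with the current [expr], using a harmless payload. The variable names x and ::touched are made up for the illustration.

```tcl
# A value that looks like data but contains a script in square brackets.
set x {[set ::touched 1]}

# Unbraced: Tcl substitutes $x first, then expr parses the resulting text
# and runs the embedded script as a command substitution.
set unbraced [expr $x+1]
puts "unbraced: result=$unbraced, script ran=[info exists ::touched]"

# Braced: expr substitutes $x itself and treats the value as plain data,
# so the script never runs and the non-numeric operand is an error.
unset ::touched
set failed [catch {expr {$x+1}} msg]
puts "braced: failed=$failed, script ran=[info exists ::touched]"
```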

Some syntax conflicts, ambiguities, and difficulties:

Arrays and functions. Is cos(-1) calling a function "cos" with argument "-1", or is it looking for an element "-1" in an array called "cos"?

Larry Smith In the context of expr, it is calling the function "cos" with argument -1. This is an artifact of the way Tcl follows C usage even though its own syntax is very different. In order to support arrays, some new notation would be needed. Following C, this would be cos[-1]. That just exacerbates the problem.

-OR-

One merely notes that "cos" is the custodian of some data, the specific instance needed being accessed by the specifier "-1". In other words, the distinction between arrays and functions is an artificial one, and not essential to the language or to our coding patterns. If "cos" is a function, it's a function call, if it isn't and "cos" is an array, it's an array access. If "cos" is neither, it is an error (or some third case).
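For reference, a short sketch of how the current syntax keeps the two meanings apart: the "$" is the only thing that distinguishes the array from the function. The array contents are made up for the illustration.

```tcl
# An array and a math function can share the name "cos" today.
array set cos {-1 "array element"}

set fromArray    $cos(-1)            ;# "$" marks a variable reference
set fromFunction [expr {cos(-1)}]    ;# no "$": a math function call
puts "array: $fromArray, function: $fromFunction"
```

Dropping the "$" would remove exactly this distinction, which is why one of the two resolutions above would have to be chosen.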

Array elements. foo(bar): The array is named "foo", that's no problem. But is the element literally "bar", or is it the string stored in the variable "bar"?

Larry Smith This is also an artifact, this time of the nature of the specifier used to access the data and the fact that Tcl puts no real limits on it. I would suggest it would mean the string stored in "bar". If you want "bar" literally, you would need to quote it, foo("bar").
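In current Tcl the two readings are written differently, as this small sketch shows (the names foo, bar and baz are placeholders):

```tcl
set bar baz
set foo(bar) "literal key"
set foo(baz) "key taken from variable bar"

set viaLiteral  $foo(bar)     ;# element named "bar"
set viaVariable $foo($bar)    ;# element named by the value of bar, i.e. "baz"
puts "$viaLiteral / $viaVariable"
```

Under the suggestion above, the bare form would take the second meaning and the quoted form the first.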

Functions with an extra space. It's legal to put space between the function name and the argument list, e.g. cos (-1). Is this calling "cos" or looking in the variable "cos"?

Larry Smith We often run into various pathological cases. The variable {} is one, this is another. While they are permitted under current syntax they are not good programming practices (much too "tricky", and usually easily replaced with less opaque code). As such I feel no real problem exists in removing the feature. "cos (-1)" would be the value of the variable "cos" followed by a space and the result of applying the -1 specification to the object represented by the null string - the function "", or the array {}(). Yes it may break some code, but I don't think that's a bad thing, it points up a wonky piece of code that needs attention anyway.

Variable names with special characters. "a b" (included space) is a valid variable name, but a simple parser might see it as two separate variables separated by a space. That's why there's "${a b}", with braces.

Larry Smith Again, the pathological case. "a b" would refer to the values of a and b concatenated with a space. If you want an embedded space, you would need to specify "a\ b".
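A brief sketch of the current behaviour being discussed, with made-up values:

```tcl
set {a b} 5
set a 2
set b 3

set braced   [expr {${a b} * 2}]   ;# one variable whose name is "a b"
set separate [expr {$a * $b}]      ;# two variables, a and b
puts "$braced $separate"
```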

Variable names that look like numbers. "1" is a valid variable name, so are "0xa" and "5.2e-7". They're also valid numbers. How can the parser tell the difference? If it assumes the leading digit makes it a number, then there needs to be a way to force the variable interpretation, such as "${5.2e-7}". But that brings back dollar sign notation, which defeats the purpose of your proposal and raises the safety issue again.

Larry Smith Yet another pathological case, so much so I see no real problem with requiring the $ prefix to force interpretation as a variable.
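A small sketch showing that such names are legal today and that the "$" is what selects the variable reading. The values are arbitrary.

```tcl
set 1 42
set 0xa 3
set 5.2e-7 "a string"

set r1 [expr {$1 + 1}]        ;# variable "1" plus the number 1
set r2 [expr {$0xa + 0xa}]    ;# variable "0xa" plus the number 10
set r3 ${5.2e-7}              ;# braces needed: "." and "-" end a plain $name
puts "$r1 $r2 $r3"
```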

Variable names that look like operators. "+" is also a valid variable name!

Larry Smith and you would refer to it as "$+".

Empty string variable. Believe it or not, variables can be named the empty string. How would you take the value, other than "${}"? Well, there's "$::", or "::" as you would have it.

Larry Smith Or just $"". As I see it, this proposal is an opportunity to make expr more flexible and concise by making the corner cases the syntactical exception that they really are, rather than allowing them to dictate the (verbose and painful) notation needed for the normal cases.
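For completeness, a sketch of how the operator-like and empty names are reached in current Tcl, where the braced form is required. The "$+" and $"" spellings above are proposals and are not valid today.

```tcl
set + 3
set {} 7

set plus  [expr {${+} + 1}]   ;# variable named "+"
set empty [expr {${} + 1}]    ;# variable named with the empty string
puts "$plus $empty"
```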

Nevertheless, here's a simple example demonstrating an overloaded [expr] that behaves as you ask. Note that it still has the safety problem, since variable substitution is performed by Tcl before calling [expr].

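rename expr _expr
proc expr {args} {
    # Prefix every bare identifier that is not followed by "(" with "$",
    # then evaluate the rewritten expression in the caller's scope.
    uplevel 1 _expr [regsub -all -nocase {[a-z:][a-z0-9_:]*\M(?!\()} [concat $args] {$&}]
}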

This code doesn't get all cases, it doesn't support arrays, and it screws up the ?: ternary operator. Since it doesn't support arrays, which are used by the [history] mechanism, it won't work with an interactive Tcl session. For interactive use, try this:

proc expr2 {args} {
    uplevel 1 expr [regsub -all -nocase {[a-z:][a-z0-9_:]*\M(?!\()} [concat $args] {$&}]
}
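To see where the limitations come from, it helps to look at the rewritten text rather than its result. The helper below is hypothetical (the name rewrite is not part of the code above); it applies the same regsub and returns the string that would be handed to expr.

```tcl
# Hypothetical helper: same substitution as above, but return the
# rewritten expression instead of evaluating it.
proc rewrite {expression} {
    regsub -all -nocase {[a-z:][a-z0-9_:]*\M(?!\()} $expression {$&}
}

foreach e {x*x cos(x)+y a(1) x?y:z} {
    puts [format "%-10s -> %s" $e [rewrite $e]]
}
```

The array reference a(1) is left alone because it looks like a function call, and in x?y:z the colon is swallowed into the name, giving $x?$y:z, so only the ternary written with spaces around the colon survives.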

SEH Since we're considering parsing and grammars here, the grammar devised could be more C-like and include keywords and everything. Thus trying to use a variable named e.g. cos could throw an error. But I was more interested in adding functionality than providing syntactic sugar. Another possibility: a parser specialized for currency calculations, thus solving a real problem.

Larry Smith I would argue that this change does add functionality, since I find a more readable and typeable syntax a win-win situation. However, if you look into the possibility of stacking various expr parsers, these changes add a lot of new functionality. One preprocessor might translate "$1.34" as "dollars 1 34", for example, before handing it off to the more general case (likewise £¥₠ to pound sterling, yen, and Euro). Another might be able to recognize vectors or numbers in some format - something like 2 3 ρ ι6 from APL say - permitting array operations in the same concise manner. Each layer would further specialize the underlying mathematical engine of expr to handle new problem domains in a concise manner.
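A rough sketch of what such stacking could look like, assuming each layer is simply a proc that rewrites the expression text before the built-in expr sees it. All proc names are hypothetical. The currency stage here converts two-decimal dollar literals to integer cents instead of the "dollars 1 34" form mentioned above, and the second stage reuses the bare-identifier regsub from earlier on this page. It assumes Tcl 8.5 or later.

```tcl
# Hypothetical stage 1: turn literals such as $1.34 into integer cents (134).
proc currencyStage {e} {
    while {[regexp -indices {\$(\d+)\.(\d\d)} $e whole d c]} {
        # scan reads the digits as decimal, so "08" is 8 and not bad octal.
        scan [string range $e {*}$d] %d dollars
        scan [string range $e {*}$c] %d cents
        set e [string replace $e {*}$whole [expr {$dollars*100 + $cents}]]
    }
    return $e
}

# Hypothetical stage 2: prefix bare identifiers with "$".
proc barewordStage {e} {
    regsub -all -nocase {[a-z:][a-z0-9_:]*\M(?!\()} $e {$&}
}

# Run the expression through each stage in order, then hand the result
# to the built-in expr in the caller's scope.
proc stackedExpr {stages e} {
    foreach stage $stages { set e [$stage $e] }
    uplevel 1 [list expr $e]
}

set price 100
set total [stackedExpr {currencyStage barewordStage} {$1.34 + price}]
puts $total
```

The order of the stages matters: the currency stage has to run first so that the dollar literals are gone before the remaining names are turned into variable references. The expression is passed in braces so that Tcl does not try to substitute a variable named "1" before the stages run.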

GSOC