UPL: The Language Parser(s)/Interpreter(s)

Peter Newman 9 January 2005: Unified Programming Language

The language parsers are what parse and run the program scripts.

Obviously, each language (Perl, Tcl, C, etc) or variant thereof has its own parser/interpreter.

One defect of current versions of Tcl and Perl is that the parser/interpreter parses and then immediately executes the source code. I know that's not totally true; there's an intermediate byte-code step in between. But in effect that's what happens.

The problem with this is that it makes writing compilers, syntax checkers, language converters, etc. very difficult, because every such tool not only has to compile or syntax check (or whatever it does); it first has to parse the source code, interpreting it exactly as the real parser does.

This is by no means an easy thing to do.

So UPL divides the whole parsing and execution thing up. There is:-

  • The Parser - which parses the source code - and converts it into a stream of tokens - probably in tree format - that describes the commands/objects found and their arguments. That tokenised form of the program can then be passed directly to one of the tools below - or saved to disk for processing/analysis later.
  • The Byte Code Compiler - which converts the tokenised source code into byte-code form - which can be either saved to disk for later execution - or executed directly.
  • The Executioner - which executes the byte code - received either on-the-fly from the byte-code compiler - or read in from a file on disk.
  • The C Dumper - which converts the tokenised form of the program into C code - which can then be compiled.
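
As a rough illustration of how these stages might hand off to one another, here is a minimal Python sketch. Everything in it is an assumption made for illustration and not part of any UPL spec: the toy one-command-per-line syntax, the function names `parse`, `compile_tree` and `execute`, and the two opcodes `PUSH` and `CALL`.

```python
# Hypothetical sketch: parser -> token tree -> byte code -> executioner.
# The toy syntax (one command per line, whitespace-separated words) and all
# names below are illustrative assumptions.

def parse(source):
    """Parser: turn source text into a tree (a list of [command, arg, ...] lists)."""
    tree = []
    for line in source.splitlines():
        words = line.split()
        if words:                      # skip blank lines
            tree.append(words)
    return tree

def compile_tree(tree):
    """Byte Code Compiler: flatten the tree into (opcode, operand) pairs."""
    code = []
    for command in tree:
        for word in command:
            code.append(("PUSH", word))
        code.append(("CALL", len(command)))   # operand = number of words to pop
    return code

def execute(code, commands):
    """Executioner: run the byte code on a simple stack machine."""
    stack, results = [], []
    for opcode, operand in code:
        if opcode == "PUSH":
            stack.append(operand)
        elif opcode == "CALL":
            words = stack[-operand:]
            del stack[-operand:]
            name, args = words[0], words[1:]
            results.append(commands[name](*args))
        else:
            raise ValueError("unknown opcode: %r" % (opcode,))
    return results

# A tiny command table standing in for the language's built-in commands.
COMMANDS = {
    "add": lambda a, b: int(a) + int(b),
    "upper": lambda s: s.upper(),
}

tree = parse("add 2 3\nupper hello\n")
code = compile_tree(tree)
print(execute(code, COMMANDS))
```

Because `tree` and `code` are plain data, either one could be written to disk between stages, which is the kind of separation the component list describes.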

And then there are some optional things:-

  • The Tree Executioner - which executes the tokenised form of the program.
  • The Direct Executioner - which is a parser that executes the source code immediately.
  • The Reverse Parser - which converts the tokenised form of the program back into source code form.
  • The Reverse Byte Code Compiler - which converts byte code back into tokenised form.
  • The Obfuscator - which converts the variable and function names in the tokenised form of the program into meaningless trash - and then calls the Reverse Parser to convert this (hopefully now incomprehensible mess) back into source code form.
  • The Language Translators - If different languages were able to share the same tokenised and/or byte code forms - then automatically translating between them becomes possible.
  • The Syntax Checker - which analyses the tokenised form of the program.
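
To make the Obfuscator and Reverse Parser idea a little more concrete, here is a small Python sketch. The tree layout is an assumption: each word is a `(kind, text)` pair, where the hypothetical kinds are `lit` (a literal word), `name` (a variable name written bare) and `ref` (a variable reference written with `$`). A real tokenised form would need to be specified first.

```python
# Hypothetical sketch: an Obfuscator feeding a Reverse Parser.
# The (kind, text) word layout and the kind names are illustrative assumptions.

def obfuscate(tree):
    """Rename every variable consistently to a meaningless name."""
    renamed = {}
    out = []
    for command in tree:
        new_command = []
        for kind, text in command:
            if kind in ("name", "ref"):
                if text not in renamed:
                    renamed[text] = "v%d" % len(renamed)
                text = renamed[text]
            new_command.append((kind, text))
        out.append(new_command)
    return out

def reverse_parse(tree):
    """Reverse Parser: turn the tokenised form back into source text."""
    lines = []
    for command in tree:
        words = []
        for kind, text in command:
            words.append("$" + text if kind == "ref" else text)
        lines.append(" ".join(words))
    return "\n".join(lines)

tree = [
    [("lit", "set"), ("name", "total"), ("lit", "0")],
    [("lit", "puts"), ("ref", "total")],
]

print(reverse_parse(tree))
print(reverse_parse(obfuscate(tree)))
```

The point of the sketch is that both tools work purely on the tokenised form; neither needs to re-parse source text.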

PWQ 10 Jan 05 I would suggest a Tree Executioner to eliminate the need to have a bytecode compiler.

Peter Newman Done! A Direct Executioner too.


DKF: What bytecodes do you define? Should there be a mechanism for modules to define new bytecodes? If that's the case, how do you stop collisions between bytecodes defined by different modules when you transport a saved bytecode sequence from one system to another?

PWQ: DKF, clearly the issue of byte code clashes can be avoided by any number of means. Rather than say how, why don't we just say why not? It's really a non-issue as far as this discussion is concerned.
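
For what it's worth, one of those means might be to identify opcodes by module-qualified names in saved files, and to assign numbers only when the file is loaded. The Python sketch below is purely an assumption about how that could look; `OpcodeRegistry`, `save`, `load` and the `module::op` naming are all made up for illustration.

```python
# Hypothetical sketch: avoiding opcode clashes with module-qualified names.
# Saved byte code carries names; opcode numbers are local to each system.

class OpcodeRegistry:
    def __init__(self):
        self.by_name = {}     # "module::op" -> local number
        self.by_number = []   # local number -> "module::op"

    def register(self, module, op):
        name = module + "::" + op
        if name not in self.by_name:
            self.by_name[name] = len(self.by_number)
            self.by_number.append(name)
        return self.by_name[name]

def save(registry, code):
    """Replace local numbers with qualified names before writing to disk."""
    return [(registry.by_number[num], operand) for num, operand in code]

def load(registry, saved):
    """Map qualified names back to this system's numbers; fail on unknown ops."""
    out = []
    for name, operand in saved:
        if name not in registry.by_name:
            raise KeyError("opcode not available here: " + name)
        out.append((registry.by_name[name], operand))
    return out

# System A loads module "img" first; system B loads module "net" first,
# so the same opcode ends up with a different number on each system.
a = OpcodeRegistry()
a_scale = a.register("img", "scale")
a.register("net", "send")

b = OpcodeRegistry()
b.register("net", "send")
b_scale = b.register("img", "scale")

saved = save(a, [(a_scale, 2)])
print(saved)
print(load(b, saved))
```

Under this scheme a saved sequence that uses an opcode from a module that isn't installed fails at load time rather than silently running the wrong instruction.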

LV And how do you construct application safety, so that viruses don't generate dangerous bytecodes which trash a system?

PWQ LV, any time you load a binary extension into Tcl you are at the mercy of the author. If they have put malicious code in there then it can be executed. The fact that byte codes can do the same thing hardly raises any special issue or concern.

Peter Newman 11 January 2005: I don't know. Haven't thought about those issues yet. Obviously it can be done.

But with this spec. the idea was to start at the top with a very high-level overview of the main features and goals of the language. And then gradually work down to spec out and then code up the details (coding say in Tcl first, and then C, once we're satisfied with the results).

Also, all those components suggested above are the sorts of things that are found in current implementations of scripting languages like Tcl and Perl. But UPL is modular - with every component an optional part. So if someone wanted the parser to directly execute its own output, they're free to write such a beast.

Similarly, the Tree Executioner. There's nothing wrong with executing the tokenised form of the program either.

And if multiple components are used, there's nothing wrong with different sets of components using different APIs. In other words, there can be as many different structures/definitions of the tokenised and byte-coded forms of the program, as people care to implement.

Our language should be a dynamic, living beast; capable of evolving and improving all the time.

So if you want to get spec'ing away - on parsers, and byte code compilers, etc - then just go for it. IMHO it would be a good idea to start with a high-level description of the current Tcl implementations of these - noting perhaps their objectives and pros and cons. And perhaps the alternatives that might be worthwhile. I guess DKF, you know as much about this as anyone.

A high-level Tcl implementation of the current Tcl parser would be a useful starting point. AFAIK it's basically OK. But there are one or two areas that I think could be looked at:-

  1. Can't we do something about that silly mismatched-braces-in-comments thing? Surely the parser can be made to figure this out for itself.
  2. Backslash line continuations also annoy me. Can't the parser be made smart enough to do without these? The Perl solution is (IMHO) much better. The line is ended by semi-colon - and may be broken over as many preceding lines as you want (without requiring any line continuators). This at least should be an option for those of us that prefer it.
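
As a sketch of the second point: Tcl already accepts a semicolon as a command separator, so the change being asked for is really that a newline would no longer end a command. The Python below is a hypothetical illustration only (`split_statements` is a made-up name, and the quoting rules are heavily simplified); it splits on semicolons that sit outside braces and double quotes.

```python
# Hypothetical sketch: split source into statements on ';' instead of newlines,
# so a statement may span several lines without backslash continuations.
# Semicolons inside {...} or "..." are left alone. Quoting rules are simplified.

def split_statements(source):
    statements = []
    current = []
    depth = 0          # brace nesting depth
    in_quotes = False
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\\" and i + 1 < len(source):
            current.append(source[i:i + 2])   # keep an escaped character as-is
            i += 2
            continue
        if ch == '"' and depth == 0:
            in_quotes = not in_quotes
        elif ch == "{" and not in_quotes:
            depth += 1
        elif ch == "}" and not in_quotes and depth > 0:
            depth -= 1
        if ch == ";" and depth == 0 and not in_quotes:
            text = "".join(current).strip()
            if text:
                statements.append(text)
            current = []
        else:
            current.append(ch)
        i += 1
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements

source = 'puts\n  "a;b";\nset x\n  {1; 2};\n'
for statement in split_statements(source):
    print(repr(statement))
```

Each statement keeps its internal line breaks; only the terminating semicolon decides where it ends.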

There must be many languishing TIPs on these and similar issues.

DKF: Funny, there's not a single one (languishing or otherwise). :^) The braces-in-comment issue is a deep syntactic issue that stems from the fact that Tcl isn't parsed in the same way that languages based on BNF-based grammars are parsed (which has many up-sides as well; it's much easier to do a little-language in Tcl than nearly any other language). The backslash-line-continuation is something I like, but this is one of these religious issues where YMMV compared to mine. But don't be pessimistic just because I am, write some code. Prove your ideas can work by demonstration. Write a paper about it (but only after coding up a basic example, please.)
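
To illustrate why the braces-in-comment issue runs deep, here is a simplified Python sketch of Tcl-style brace matching. The function name `find_close_brace` is made up, and the real parser handles more cases; the point is only that the scan for the closing brace counts braces and nothing else. It cannot know what a comment is, because the braced text is just a string until something later evaluates it as a script.

```python
# Simplified, hypothetical sketch of brace matching: count nested braces,
# skip backslash-escaped characters, and ignore everything else
# (including '#', which only means "comment" when a script is evaluated).

def find_close_brace(text, start):
    """Return the index of the brace matching text[start], or -1 if none."""
    assert text[start] == "{"
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2          # skip the escaped character
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return -1

balanced = "{\n  # say hello\n  puts hello\n}"
unbalanced = "{\n  # an open brace { in a comment\n  puts hello\n}"

print(find_close_brace(balanced, 0) == len(balanced) - 1)
print(find_close_brace(unbalanced, 0))   # -1: the comment's brace was counted
```

Teaching this scan about comments would mean deciding, before evaluation, which braced strings are scripts, and that is exactly what the string-based design avoids.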

Peter Newman Yeah that's what I'm doing. Except that I'm starting at the top and working down. For me, the coding comes last - once we've figured out what we want to code.

DKF: In the free-and-open-source-software world, producing an early interesting prototype is advisable. Even if it can't do everything you plan, getting it so it can be at least somewhat useful helps other people to join in.

Peter Newman OK, I appreciate that. But there are still many issues where I want to examine the options and alternatives - and sketch something out on paper first. For example:-

  • The command line parameters to the UPL: The Bootstrap Interpreter (I suppose that's pretty easy; it's just the name of the UPL: The Bootstrap File).
  • I assume that the UPL: The Bootstrap Interpreter then loads any modules specified in UPL: The Bootstrap File using LoadLibrary on Windows and (I forget the name of the corresponding library on Linux).
  • How the UPL: The Bootstrap Interpreter, once it's loaded any modules required, transfers control to the first language interpreter (we need to define the API/calling conventions for this).
  • I think we can ignore how language interpreter X nests down to language interpreter Y at this stage.
  • Then there's the language parser to design - which IMHO should start off as a direct clone of the existing Tcl parser.
  • The syntax of the tokenised form of the program (produced by the parser) has to be specified.
  • We can ignore the byte-coding thing at this stage - and pilot things out with a Tree Executioner.
  • Then there's the issue of the C code in the DLL/so's - and how the (Tree) Executioner calls it.
  • We can probably implement some of the basic Tcl commands in C quite easily. The file commands and puts etc, are really just wrappers around standard C functions.
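
As one possible shape for the last few items, the Python sketch below shows a Tree Executioner that looks each command up in a table of native functions. In the design discussed here the table entries would be C functions exported by a DLL/so; plain Python callables stand in for them. The calling convention (an output stream followed by string arguments), the names `COMMAND_TABLE` and `tree_execute`, and the cut-down `puts` and `file exists` commands are all assumptions for illustration.

```python
import io
import os
import sys

# Hypothetical sketch: a Tree Executioner dispatching through a command table.
# Python callables stand in for C functions that a DLL/so would export.

def cmd_puts(out, text):
    """Toy 'puts': write one string and a newline."""
    out.write(text + "\n")
    return ""

def cmd_file(out, subcommand, path):
    """Toy 'file': only the 'exists' subcommand, wrapping os.path.exists."""
    if subcommand == "exists":
        return "1" if os.path.exists(path) else "0"
    raise ValueError("unsupported file subcommand: " + subcommand)

COMMAND_TABLE = {"puts": cmd_puts, "file": cmd_file}

def tree_execute(tree, out=sys.stdout):
    """Run each [name, arg, ...] command and return the last result."""
    result = ""
    for command in tree:
        name, args = command[0], command[1:]
        if name not in COMMAND_TABLE:
            raise KeyError("invalid command name: " + name)
        result = COMMAND_TABLE[name](out, *args)
    return result

buffer = io.StringIO()
last = tree_execute([["puts", "hello"], ["file", "exists", os.curdir]], buffer)
print(buffer.getvalue(), end="")
print(last)
```

Loading a module would then amount to adding its entries to the table, which is where the API/calling convention questions above come in.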

Once those issues have been resolved (which will presumably take a few weeks at least), then we can code up these basics in high-level Tcl (though maybe some C will be required). If I have to do it all myself, that prototype's a few months away, at least.

