Lexing C



An important part of any Scripted Compiler is the ability to actually process the system language underlying the scripting language. In the case of Tcl this is the C Language.

The first step is always to separate the input stream into tokens, each representing one semantic atom; in compiler-speak, lexing. For example, the statement "x = y + 42;" separates into the six tokens "x", "=", "y", "+", "42", and ";".

The following script lexes a string containing C source into a list of tokens. Note: The script assumes that the sources are free of preprocessor statements like "#include", "#define", etc. I am also not sure if the list of 'Def'd tokens is actually complete. This requires checking against the C specification.
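
The flavour of the technique, in miniature (a toy illustration, not the script itself): fixed tokens are marked in place with the \001 separator character mentioned in the TODO notes, and the marked string is then split apart. The real script does this in several passes, over several classes of tokens.

 # Mark two keywords with \001, then split on the marker.
 # Whitespace handling, comments, etc. are elided here.
 set code {if (x) return y;}
 regsub -all {\m(if|return)\M} $code "\001&\001" code
 puts [split $code \001]
 # => {} if { (x) } return { y;}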

The script below is only one of many variations, i.e. it can be twiddled in many ways. Examples:

  • Keep the whitespace as tokens. Might be required for a pretty-printer.
  • Treat comments as whitespace and remove them, as a true compiler would. Keeping the comments but discarding the other whitespace, as in the script below, is more something for a code analyzer looking for additional data (meta-data) in comments. See Source Navigator for a tool in this area.
  • Use different 'Def's to convert the keywords and punctuation into single-byte codes, and refrain from splitting/listifying the result. Sort of a special method for compressing C sources; a sketch of this follows the list.
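
A sketch of that last variation (the cpack name and the code assignments are made up for illustration): give every fixed token a single-byte code and map a lexed symbol list through the table. Variable-content tokens simply pass through; only the fixed vocabulary is compressed.

 # Hypothetical sketch of the compression idea: fixed tokens (keywords,
 # punctuation) get single-byte codes >= \u0080.
 namespace eval cpack {
     variable code
     array set code {}
     set byte 0x80
     foreach tok {
         if else while for return int char void
         == != <= >= && || ; ,
     } {
         set code($tok) [format %c $byte]
         incr byte
     }
 }

 proc cpack::compress {symbols} {
     variable code
     set out {}
     foreach s $symbols {
         if {[info exists code($s)]} {
             lappend out $code($s)
         } else {
             lappend out $s
         }
     }
     return $out
 }

 # e.g. cpack::compress {int ident- == number+ ;}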

The next step will be parsing, i.e. adding structure to the token stream under control of a grammar. An existing tool for that is Yeti. See the C Language for grammar references.

I believe that the method used below can lex any system language currently in use: Pascal, Modula, FORTRAN, C++, ... Again, this is something of interest to Source Navigator.

Notes

The new version 2 does not reinsert extracted tokens into the source string. It also avoids copying the tail of the string down, which can lead to quadratic behaviour; a sketch of that difference follows the timings below. It is still not optimal, but fairly OK in my book so far. Example result:

 tclIO.c:
        242918 characters
        Lexing in 16498027 microseconds
               =  16.498027 seconds
               =  67.916033394 usec/char

Not bad for a lexer written in a scripting language, IMHO. Of course, this is input dependent. tclDecls.h, on the other hand, is smaller but takes longer:

 tclDecls.h:
        129917 characters
        Lexing in 18277958 microseconds
               =  18.277958 seconds
               =  140.689501759 usec/char
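
To illustrate the note above about copying the tail down, here are two toy procedures (hypothetical names) which merely count whitespace-separated words. The first chops the consumed text off the front and therefore re-copies the remainder on every step; that is the quadratic pattern version 2 avoids. The second only advances an index into the untouched string.

 proc consume_copy {data} {
     set n 0
     while {[regexp {\A\s*\S+\s*} $data tok]} {
         # Chopping the front off copies the whole remaining tail,
         # on every single iteration.
         set data [string range $data [string length $tok] end]
         incr n
     }
     return $n
 }

 proc consume_index {data} {
     set n 0
     set pos 0
     while {[regexp -start $pos {\A\s*\S+\s*} $data tok]} {
         # Advance an index instead; the string is never copied.
         incr pos [string length $tok]
         incr n
     }
     return $n
 }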

TODO

  • The recognition of floats as part of the vari-sized stuff slows things down considerably. Removing these definitions, we get down to 56.3452070246 usec/char = 13.687265 seconds for the whole of tclIO.c. Another regex stage before the punctuation processing might be better; a candidate pattern follows this list.
  • Invert the lexer, process string fragments immediately. This gets rid of the need for a marker character (\001 here) and all that entails.
  • Read up on C syntax. I believe that I currently do not recognize all possible types of numbers.
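
On the last two points, one possible shape for such a dedicated number stage. The pattern is an assumption and still does not cover every form of C constant:

 # Candidate pattern for a separate number pass: hex, octal and decimal
 # integers with u/l suffixes, and floats with optional exponent and
 # suffix.
 set number [join {
     {0[xX][0-9a-fA-F]+[uUlL]*}
     {[0-9]*\.[0-9]+(?:[eE][-+]?[0-9]+)?[fFlL]?}
     {[0-9]+\.(?:[eE][-+]?[0-9]+)?[fFlL]?}
     {[0-9]+[eE][-+]?[0-9]+[fFlL]?}
     {0[0-7]*[uUlL]*}
     {[1-9][0-9]*[uUlL]*}
 } |]

 # e.g. listing all numeric literals in a source string:
 # set literals [regexp -all -inline "(?:$number)" $code]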

Here be dragons

  • Tcl 8.4 alpha 3 has a bug in the encoding/utf/string handling subsystem which causes the code here to lose characters. Do not use this alpha version; upgrade to 8.4.

Example of the problem, provided by PT:

 % clex::lex {int main (int argc, char** argv) { return 0; }}
 int in {} t gc {} ar {} {} gv {} {} return {} {} {} {}

clex.tcl (The code, finally :)

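
The lexer proper works in staged regsub passes over the whole string, with \001 as the internal marker (see the TODO notes above). As a compact stand-in, the sketch below takes the 'inverted' approach from the TODO list instead: scan with an index and classify each fragment immediately. Only the external interface is fixed by the driver below: clex::lex returns a three-element list of a symbol stream, a string pool, and an attribute pool, the pools being {index-list value-list} pairs with duplicate values stored once; a symbol ending in - takes its text from the string pool, one ending in + from the attribute pool. The token names, keyword list, and master pattern here are illustrative assumptions, not the original code.

 # A sketch standing in for clex.tcl; interface per the driver below.

 namespace eval clex {
     # C89 keywords; these lex as themselves.
     variable keywords {
         auto break case char const continue default do double else
         enum extern float for goto if int long register return short
         signed sizeof static struct switch typedef union unsigned
         void volatile while
     }
     # Names for the operators that would otherwise end in + or -,
     # which the driver reads as 'pooled token' markers.
     variable opname
     array set opname {+ PLUS ++ INCR - MINUS -- DECR}
 }

 # Intern value into the pool kept in the caller's idxvar/valvar pair;
 # each value is stored once, each occurrence records its slot.
 proc clex::Pool {idxvar valvar value} {
     upvar 1 $idxvar idx $valvar val
     set at [lsearch -exact $val $value]
     if {$at < 0} {
         set at [llength $val]
         lappend val $value
     }
     lappend idx $at
     return
 }

 proc clex::lex {code} {
     variable keywords
     variable opname
     set sym {} ; set sidx {} ; set sval {} ; set aidx {} ; set aval {}

     # Sub-patterns, one capturing group per token class.
     set c {(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)}   ;# comment
     set s {("(?:[^"\\]|\\.)*")}                ;# string literal
     set h {('(?:[^'\\]|\\.)+')}                ;# char literal
     set n {([0-9]+(?:\.[0-9]+)?)}              ;# number (incomplete, see TODO)
     set i {([A-Za-z_][A-Za-z_0-9]*)}           ;# identifier
     set p {(<<=|>>=|\.\.\.|\+\+|--|->|<<|>>|<=|>=|==|!=|&&|\|\||[-+*/%=<>!&|^~?:;,.()\[\]{}])}
     set w {(\s+)}                              ;# whitespace
     # \A anchors the match at the offset given to -start.
     set pat "\\A(?:$c|$s|$h|$n|$i|$p|$w)"

     set pos 0
     set end [string length $code]
     while {$pos < $end} {
         if {![regexp -start $pos -- $pat $code \
                 full cmt str chr num id punct ws]} {
             return -code error "clex: cannot lex at offset $pos"
         }
         if {$ws ne ""} {
             # Plain whitespace produces no token at all.
         } elseif {$cmt ne ""} {
             lappend sym comment- ; Pool sidx sval $cmt
         } elseif {$str ne ""} {
             lappend sym string-  ; Pool sidx sval $str
         } elseif {$chr ne ""} {
             lappend sym char-    ; Pool sidx sval $chr
         } elseif {$num ne ""} {
             lappend sym number+  ; Pool aidx aval $num
         } elseif {$id ne ""} {
             if {[lsearch -exact $keywords $id] >= 0} {
                 lappend sym $id
             } else {
                 lappend sym ident- ; Pool sidx sval $id
             }
         } elseif {[info exists opname($punct)]} {
             lappend sym $opname($punct)
         } else {
             lappend sym $punct
         }
         incr pos [string length $full]
     }
     return [list $sym [list $sidx $sval] [list $aidx $aval]]
 }

Saved as clex.tcl, this makes the driver below run unchanged; the trailing - or + on a symbol tells it which pool holds the token's text.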

driver

 #!/bin/sh
 # -*- tcl -*- \
 exec tclsh "$0" ${1+"$@"}

 source clex.tcl

 # Read file, lex it, time the execution to measure performance

 set data [read [set fh [open [set fname [lindex $argv 0]]]]][close $fh]
 set len  [string length $data]
 set usec [lindex [time {set data [clex::lex $data]}] 0]

 # Write performance statistics.

 puts "$fname:"
 puts "\t$len characters"
 puts "\tLexing in $usec microseconds"
 puts "\t       =  [expr {double($usec)/1000000}] seconds"
 puts "\t       =  [expr {double ($usec) / double ($len)}] usec/char"

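 # Note: nothing below this exit runs; remove it to also produce the
 # token listing generated by the rest of the script.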
 exit

 # Generate tokenized listing of the input, using the lexing results as input.

 puts __________________________________________________
 foreach {sym str attr} $data break
 foreach {aidx aval} $attr break
 foreach {sidx sval} $str break

 set sv 0
 set av 0
 foreach s $sym {
     switch -glob -- $s {
         *+ {puts "$s <<[lindex $aval [lindex $aidx $av]]>>" ; incr av}
         *- {puts "$s <<[lindex $sval [lindex $sidx $sv]]>>" ; incr sv}
         *  {puts "$s"}
     }
 }

 puts __________________________________________________
 exit
