[ Scripted Compiler ] --> [ Parsing C ]
An important part of any Scripted Compiler is the ability to actually process the system language underlying the scripting language. In the case of Tcl this is the C language.
The first step is always to separate the input stream into tokens, each representing one semantic atom. In compiler speak this is called lexing.
The following script lexes a string containing C source into a list of tokens. Note: The script assumes that the sources are free of preprocessor statements like "#include", "#define", etc. I am also not sure if the list of 'Def'd tokens is actually complete. This requires checking against the C specification.
The script below is only one of many variations, i.e. it can be twiddled in many ways. Examples:
The next step will be parsing, i.e. adding structure to the token stream under control of a grammar. An existing tool for that is Yeti. See the C Language for grammar references.
I believe that the method I have used below can lex any system language in use today: Pascal, Modula, FORTRAN, C++, ... Again, this is something of interest to Source Navigator.
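To make the lexing step concrete, here is a minimal sketch of a regexp-driven lexer. It is not the clex.tcl code: the proc name mini_lex, the token classes (ident, number, punct) and the patterns are assumptions chosen for illustration, and it ignores comments, string and character literals, and floating-point numbers.

```tcl
# Hypothetical sketch, not the clex.tcl code.
# Splits C-like source into {class text} pairs. Keywords are reported as
# "ident"; comments, string literals and floats are deliberately not handled.
proc mini_lex {src} {
    # Ordered {class pattern} pairs. \A anchors each match at the scan offset.
    set rules {
        space  {\A\s+}
        ident  {\A[A-Za-z_][A-Za-z0-9_]*}
        number {\A[0-9]+}
        punct  {\A(\+\+|--|->|==|!=|<=|>=|&&|\|\||\[|\]|[-+*/%=<>!&|^~?:;,.(){}])}
    }
    set tokens {}
    set pos 0
    set len [string length $src]
    while {$pos < $len} {
        set matched 0
        foreach {class pattern} $rules {
            if {[regexp -start $pos -indices -- $pattern $src range]} {
                foreach {first last} $range break
                if {$class ne "space"} {
                    lappend tokens [list $class [string range $src $first $last]]
                }
                set pos [expr {$last + 1}]
                set matched 1
                break
            }
        }
        if {!$matched} {
            error "cannot lex input at offset $pos"
        }
    }
    return $tokens
}

# Usage: print one token per line.
foreach tok [mini_lex {int main (int argc, char** argv) { return 0; }}] {
    puts $tok
}
```

Trying the rules in a fixed order keeps the sketch short. A lexer meant for real C would need more token classes and a check against the C specification.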
Notes
The new version 2 does not reinsert extracted tokens into the source string. It also avoids copying the tail of the string down, which can lead to quadratic behaviour. It is still not optimal, but fairly ok in my book so far. Example result:
    tclIO.c:
        242918 characters
        Lexing in 16498027 microseconds
         = 16.498027 seconds
         = 67.916033394 usec/char
Not bad for a lexer written in a scripting language IMHO. Of course, this is input dependent. tclDecls.h on the other hand is smaller, but takes longer:
    tclDecls.h:
        129917 characters
        Lexing in 18277958 microseconds
         = 18.277958 seconds
         = 140.689501759 usec/char
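The note above about copying the tail of the string can be illustrated with two small scanners. This is only a sketch of the general idea, not the code of either clex.tcl version: both procs and their names are hypothetical, and they split on whitespace rather than lexing C. The first chops the consumed prefix off after every token, which copies the remaining tail each time. The second leaves the string untouched and advances an offset passed to regexp -start.

```tcl
# Hypothetical sketch, not taken from clex.tcl.

# Style A: chop off the consumed prefix. Each [string range ... end]
# copies the rest of the input, so long inputs risk quadratic run time.
proc words_by_chopping {s} {
    set out {}
    while {[regexp -indices -- {\A\s*(\S+)} $s all word]} {
        lappend out [string range $s [lindex $word 0] [lindex $word 1]]
        set s [string range $s [expr {[lindex $all 1] + 1}] end]
    }
    return $out
}

# Style B: keep the input intact and only move the scan offset.
proc words_by_offset {s} {
    set out {}
    set pos 0
    while {[regexp -start $pos -indices -- {\A\s*(\S+)} $s all word]} {
        lappend out [string range $s [lindex $word 0] [lindex $word 1]]
        set pos [expr {[lindex $all 1] + 1}]
    }
    return $out
}

puts [words_by_chopping " a  bc def "]
puts [words_by_offset   " a  bc def "]
```

Both procs return the same list. Only the amount of string copying differs, and that can be measured with time on a large input.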
TODO
Here be dragons
1 Tcl 8.4 alpha 3 has a bug in the encoding/utf/string handling subsystem which causes the code here to lose characters. Do not use this alpha version. Upgrade to 8.4.
Example of the problem, provided by PT:
    % clex::lex {int main (int argc, char** argv) { return 0; }}
    int in {} t gc {} ar {} {} gv {} {} return {} {} {} {}
clex.tcl (The code, finally :)
driver
    #!/bin/sh
    # -*- tcl -*- \
    exec tclsh "$0" ${1+"$@"}

    source clex.tcl

    # Read file, lex it, time the execution to measure performance
    set data [read [set fh [open [set fname [lindex $argv 0]]]]][close $fh]
    set len  [string length $data]
    set usec [lindex [time {set data [clex::lex $data]}] 0]

    # Write performance statistics.
    puts "$fname:"
    puts "\t$len characters"
    puts "\tLexing in $usec microseconds"
    puts "\t = [expr {double($usec)/1000000}] seconds"
    puts "\t = [expr {double ($usec) / double ($len)}] usec/char"
    exit

    # Generate tokenized listing of the input, using the lexing results as input.
    puts __________________________________________________
    foreach {sym str attr} $data break
    foreach {aidx aval} $attr break
    foreach {sidx sval} $str break
    set sv 0
    set av 0
    foreach s $sym {
        switch -glob -- $s {
            *+ {puts "$s <<[lindex $aval [lindex $aidx $av]]>>" ; incr av}
            *- {puts "$s <<[lindex $sval [lindex $sidx $sv]]>>" ; incr sv}
            *  {puts "$s"}
        }
    }
    puts __________________________________________________
    exit
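The listing loop in the driver implies a particular shape for the lexer result: a three-element list {sym str attr}, where str and attr are each an {index-list value-list} pair, symbols ending in + take their text from attr, and symbols ending in - take it from str. The sketch below feeds hand-made data of that shape through the same loop. The symbol names (IDENT+, STRING-, LPAREN, ...) and the sample values are invented here, and the idea that the index lists let repeated values be stored once is an assumption read off the driver, not something clex.tcl is known to guarantee.

```tcl
# Hypothetical data in the shape the driver's listing loop reads.
# Symbol names and values are invented; they stand for: puts("hi"); puts(x);
set sym  {IDENT+ LPAREN STRING- RPAREN SEMI IDENT+ LPAREN IDENT+ RPAREN SEMI}
set attr [list {0 0 1} {puts x}]
set str  [list {0} [list {"hi"}]]
set data [list $sym $str $attr]

# Same logic as the driver loop, but collecting lines instead of printing.
proc listing {data} {
    foreach {sym str attr} $data break
    foreach {aidx aval} $attr break
    foreach {sidx sval} $str break
    set sv 0
    set av 0
    set lines {}
    foreach s $sym {
        switch -glob -- $s {
            *+ {lappend lines "$s <<[lindex $aval [lindex $aidx $av]]>>" ; incr av}
            *- {lappend lines "$s <<[lindex $sval [lindex $sidx $sv]]>>" ; incr sv}
            default {lappend lines $s}
        }
    }
    return $lines
}

set lines [listing $data]
foreach line $lines {puts $line}
```

In this sample the identifier puts occurs twice in the symbol stream but only once in the value list, and both occurrences point at index 0.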