[[ [Scripted Compiler] ]] --> [[ [Parsing C] ]]
----
An important part of any [Scripted Compiler] is the ability to actually process the system language underlying the scripting language. In the case of [Tcl] this is the '''[C Language]'''.

The first step is always to separate the input stream into tokens, each representing one semantic atom. In compiler speak, ''lexing''. The following script lexes a string containing C source into a list of tokens.

'''Note:''' The script assumes that the sources are free of preprocessor statements like "#include", "#define", etc. I am also not sure whether the list of 'Def'd tokens is actually complete. This requires checking against the C specification.

The script below is only one of many possible variations, i.e. it can be tweaked in many ways. Examples:

* Keep the whitespace as tokens. This might be required for a pretty-printer.
* Treat comments as whitespace and remove them, as a true compiler would. Keeping the comments, but not the other whitespace, as the script below does, is more suited to a code analyzer looking for additional data (meta-data) in comments. See [Source Navigator] for a tool in this area.
* Use different 'Def's to convert the keywords and punctuation into single byte codes, and refrain from splitting/listifying the result. Sort of a special method for compressing C sources.

The next step will be parsing, i.e. adding structure to the token stream under the control of a grammar. An existing tool for that is [Yeti]. See the [C Language] page for grammar references.

I believe that the method used below can lex any system language currently in use today: Pascal, Modula, FORTRAN, C++, ... Again, this is of interest to [Source Navigator].

'''Notes:''' The new version 2 does not reinsert extracted tokens into the source string. It also avoids copying the tail of the string down, which can lead to quadratic behaviour. It is still not optimal, but fairly ok in my book so far.
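To illustrate the idea, here is a minimal sketch of regexp-based lexing in Tcl. This is '''not''' the clex.tcl code: the proc name ''lex_simple'' and its token classes are invented for illustration, it returns a flat list of token strings rather than clex's separate symbol/string/attribute streams, and it ignores preprocessor lines and many C token forms (suffixed and hex numbers, most multi-character operators).

 # Minimal single-regexp C tokenizer -- an illustrative sketch only,
 # not the clex.tcl lexer.  One alternation in expanded ((?x)) syntax;
 # longer forms come before shorter ones, and the comment/string
 # branches are written so they cannot overshoot their closing
 # delimiter.
 proc lex_simple {code} {
     set pat {(?x)
         /\*[^*]*\*+(?:[^*/][^*]*\*+)*/      # C comment, exactly one
       | "(?:[^"\\]|\\.)*"                   # string literal
       | '(?:[^'\\]|\\.)'                    # character literal
       | [A-Za-z_][A-Za-z0-9_]*              # identifier or keyword
       | \d+\.\d*(?:[eE][-+]?\d+)?           # float
       | \d+                                 # integer
       | \+\+|--|<<|>>|<=|>=|==|!=|&&|\|\|   # some two-char punctuation
       | \S                                  # any other single character
     }
     set tokens {}
     set i 0
     set n [string length $code]
     # Scan left to right; whitespace between tokens matches no branch
     # and is skipped implicitly by regexp -start.
     while {$i < $n && [regexp -start $i -indices -- $pat $code m]} {
         foreach {from to} $m break
         lappend tokens [string range $code $from $to]
         set i [expr {$to + 1}]
     }
     return $tokens
 }

For example, [lex_simple {return 42; /* done */}] yields the four tokens ''return'', ''42'', '';'' and ''/* done */''. A real lexer needs the complete operator set and number grammar; the point here is only the technique of one ordered alternation driving the scan.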
Example result:

 tclIO.c: 242918 characters
 Lexing in 16498027 microseconds
  = 16.498027 seconds
  = 67.916033394 usec/char

Not bad for a lexer written in a scripting language [IMHO]. Of course, this is input dependent. tclDecls.h, on the other hand, is smaller, but takes longer:

 tclDecls.h: 129917 characters
 Lexing in 18277958 microseconds
  = 18.277958 seconds
  = 140.689501759 usec/char

'''TODO'''

* The recognition of floats as part of the vari-sized stuff slows things down considerably. Removing these definitions we get down to 56.3452070246 usec/char = 13.687265 seconds for the whole ''tclIO.c''. Another regex stage before the punctuation processing might be better.
* Invert the lexer, i.e. process string fragments immediately. This gets rid of the need for a marker character (\001 here) and all that it entails.
* Read up on C syntax. I believe that I currently do not recognize all possible types of numbers.

----
'''Here be dragons'''

'''1''' Tcl 8.4 alpha 3 has a bug in the encoding/utf/string handling subsystem which causes the code here to lose characters. Do not use this alpha version; upgrade to 8.4. Example of the problem, provided by [PT]:

 % clex::lex {int main (int argc, char** argv) { return 0; }}
 int in {} t gc {} ar {} {} gv {} {} return {} {} {} {}

----
'''clex.tcl''' (The code, finally :)

 #!/bin/sh
 # -*- tcl -*- \
 exec tclsh "$0" ${1+"$@"}

 source clex.tcl

 # Read the file, lex it, and time the execution to measure performance.
 set data [read [set fh [open [set fname [lindex $argv 0]]]]][close $fh]
 set len  [string length $data]
 set usec [lindex [time {set data [clex::lex $data]}] 0]

 # Write performance statistics.
 puts "$fname:"
 puts "\t$len characters"
 puts "\tLexing in $usec microseconds"
 puts "\t = [expr {double($usec)/1000000}] seconds"
 puts "\t = [expr {double($usec) / double($len)}] usec/char"
 exit

 # Generate a tokenized listing of the input, using the lexing results as input.
 puts __________________________________________________
 foreach {sym str attr} $data break
 foreach {aidx aval} $attr break
 foreach {sidx sval} $str  break

 set sv 0
 set av 0
 foreach s $sym {
     switch -glob -- $s {
         *+ {puts "$s <<[lindex $aval [lindex $aidx $av]]>>" ; incr av}
         *- {puts "$s <<[lindex $sval [lindex $sidx $sv]]>>" ; incr sv}
         *  {puts "$s"}
     }
 }
 puts __________________________________________________
 exit

----
[[ [Scripted Compiler] ]] --> [[ [Parsing C] ]]