[[ [Scripted Compiler] ]] --> [[ [Parsing C] ]]
----
An important part of any [Scripted Compiler] is the ability to actually process the system language underlying the scripting language. In the case of [Tcl] this is the '''[C Language]'''.

The first step is always to separate the input stream into tokens, each representing one semantic atom. In compiler speak, ''lexing''. The following script lexes a string containing C source into a list of tokens.

'''Note:''' The script assumes that the sources are free of preprocessor statements like "#include", "#define", etc. I am also not sure whether the list of 'Def'd tokens is actually complete. This requires checking against the C specification.

The script below is only one of many possible variations, i.e. it can be tweaked in many ways. Examples:

* Keep the whitespace as tokens. This might be required for a pretty-printer.
* Treat comments as whitespace and remove them, as a true compiler would. Keeping the comments, but not the other whitespace, as the script below does, is more suited to a code analyzer looking for additional data (meta-data) in comments. See [Source Navigator] for a tool in this area.
* Use different 'Def's to convert the keywords and punctuation into single byte codes, and refrain from splitting/listifying the result. Sort of a special method for compressing C sources.

The next step will be parsing, i.e. adding structure to the token stream under the control of a grammar. An existing tool for that is [Yeti]. See the [C Language] page for grammar references.

I believe that the method used below can lex any system language currently in use today: Pascal, Modula, FORTRAN, C++, ... Again, this is of interest to [Source Navigator].

'''Notes:''' The new version 2 does not reinsert extracted tokens into the source string. It also avoids copying the tail of the string down, which can lead to quadratic behaviour. It is still not optimal, but fairly ok in my book so far.
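To illustrate the idea, here is a minimal sketch of regexp-based lexing in Tcl. This is '''not''' the clex.tcl code: the proc name ''lex_simple'' and its token classes are invented for illustration, it returns a flat list of token strings rather than clex's separate symbol/string/attribute streams, and it ignores preprocessor lines and many C token forms (suffixed and hex numbers, most multi-character operators).

 # Minimal single-regexp C tokenizer -- an illustrative sketch only,
 # not the clex.tcl lexer.  One alternation in expanded ((?x)) syntax;
 # longer forms come before shorter ones, and the comment/string
 # branches are written so they cannot overshoot their closing
 # delimiter.
 proc lex_simple {code} {
     set pat {(?x)
         /\*[^*]*\*+(?:[^*/][^*]*\*+)*/      # C comment, exactly one
       | "(?:[^"\\]|\\.)*"                   # string literal
       | '(?:[^'\\]|\\.)'                    # character literal
       | [A-Za-z_][A-Za-z0-9_]*              # identifier or keyword
       | \d+\.\d*(?:[eE][-+]?\d+)?           # float
       | \d+                                 # integer
       | \+\+|--|<<|>>|<=|>=|==|!=|&&|\|\|   # some two-char punctuation
       | \S                                  # any other single character
     }
     set tokens {}
     set i 0
     set n [string length $code]
     # Scan left to right; whitespace between tokens matches no branch
     # and is skipped implicitly by regexp -start.
     while {$i < $n && [regexp -start $i -indices -- $pat $code m]} {
         foreach {from to} $m break
         lappend tokens [string range $code $from $to]
         set i [expr {$to + 1}]
     }
     return $tokens
 }

For example, [lex_simple {return 42; /* done */}] yields the four tokens ''return'', ''42'', '';'' and ''/* done */''. A real lexer needs the complete operator set and number grammar; the point here is only the technique of one ordered alternation driving the scan.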
Example result:

 tclIO.c: 242918 characters
 Lexing in 16498027 microseconds
  = 16.498027 seconds
  = 67.916033394 usec/char

Not bad for a lexer written in a scripting language [IMHO]. Of course, this is input dependent. tclDecls.h, on the other hand, is smaller, but takes longer:

 tclDecls.h: 129917 characters
 Lexing in 18277958 microseconds
  = 18.277958 seconds
  = 140.689501759 usec/char

'''TODO'''

* The recognition of floats as part of the vari-sized stuff slows things down considerably. Removing these definitions we get down to 56.3452070246 usec/char = 13.687265 seconds for the whole ''tclIO.c''. Another regex stage before the punctuation processing might be better.
* Invert the lexer, i.e. process string fragments immediately. This gets rid of the need for a marker character (\001 here) and all that it entails.
* Read up on C syntax. I believe that I currently do not recognize all possible types of numbers.

----
'''Here be dragons'''

'''1''' Tcl 8.4 alpha 3 has a bug in the encoding/utf/string handling subsystem which causes the code here to lose characters. Do not use this alpha version; upgrade to 8.4. Example of the problem, provided by [PT]:

 % clex::lex {int main (int argc, char** argv) { return 0; }}
 int in {} t gc {} ar {} {} gv {} {} return {} {} {} {}

----
'''clex.tcl''' (The code, finally :)

 #!/bin/sh
 # -*- tcl -*- \
 exec tclsh "$0" ${1+"$@"}

 source clex.tcl

 # Read the file, lex it, and time the execution to measure performance.
 set data [read [set fh [open [set fname [lindex $argv 0]]]]][close $fh]
 set len  [string length $data]
 set usec [lindex [time {set data [clex::lex $data]}] 0]

 # Write performance statistics.
 puts "$fname:"
 puts "\t$len characters"
 puts "\tLexing in $usec microseconds"
 puts "\t = [expr {double($usec)/1000000}] seconds"
 puts "\t = [expr {double($usec) / double($len)}] usec/char"
 exit

 # Generate a tokenized listing of the input, using the lexing results as input.
 puts __________________________________________________
 foreach {sym str attr} $data break
 foreach {aidx aval} $attr break
 foreach {sidx sval} $str  break

 set sv 0
 set av 0
 foreach s $sym {
     switch -glob -- $s {
         *+ {puts "$s <<[lindex $aval [lindex $aidx $av]]>>" ; incr av}
         *- {puts "$s <<[lindex $sval [lindex $sidx $sv]]>>" ; incr sv}
         *  {puts "$s"}
     }
 }
 puts __________________________________________________
 exit

----
[[ [Scripted Compiler] ]] --> [[ [Parsing C] ]]