[[ [Scripted Compiler] ]] --> [[ [Parsing C] ]] ---- An important part of any [Scripted Compiler] is the ability actually process the system language underlying the scripting language. In the case of [Tcl] this is the '''[C Language]'''. The first step is always to separate the input stream into tokens, each representing one semantic atom. In compiler speak, ''lexing''. The following script lexes a string containing C source into a list of tokens. It assumes that the sources are free of preprocessor statements like "#include", "#define", etc. Also note that the script is built upon the base package provided in [Scripted Lexing]. The script below is only one of many variations, i.e. it can be twiddled in many ways. Examples: * Keep the whitespace as tokens. Might be required for a pretty-printer. * Treat comments as whitespace and remove them. True compiler. Keeping the comments, but not other whitespace as in the script below is more something for a code analyzer looking for additional data (meta-data) in comments. See [Source Navigator] for a tool in this area. * Use different 'Def's to convert the keywords and punctuation into single byte codes, and refrain from splitting/listifying the result. Sort of a special method for compressing C sources. The next step will be parsing, i.e. adding structure to the token stream under control of a grammar. An existing tool for that is [Yeti]. See the [C Language] for grammar references. I believe that the method I have used below can be used to lex any system language currently in use today, Pascal, Modula, FORTRAN, C++, ... Again this is something of interest to [Source Navigator]. '''Notes''' The new version 2 does not reinsert extracted tokens into the source string. It also avoids copying the tail of the string down, which can lead to quadratic behaviour. It is still not optimal, but fairly ok in my book so far. Example result: tclIO.c: 242918 characters Lexing in 16498027 microseconds = 16.498027 seconds = 67.916033394 usec/char Not bad for a lexer written in a scripting language [IMHO]. '''TODO''' * The recognition of floats as part of the vari-sized stuff slows things down considerably. Removing these definitions we get down to 56.3452070246 usec/char = 13.687265 seconds for the whole ''tclIO.c''. Another regex stage before the punctuation processing might be better. * Invert the lexer, process string fragments immediately. This gets rid of the need for a marker character (\001 here) and all that entails. * Read up on C syntax. I believe that I currently do not recognize all possible types of numbers. ---- '''Here be dragons''' '''1''' Tcl 8.4 alpha 3 has a bug in the encoding/utf/string handling subsystem which causes the code here to loose characters. Do not use this alpha version. Upgrade to 8.4. Example of the problem, provided by [PT]: % clex::lex {int main (int argc, char** argv) { return 0; }} int in {} t gc {} ar {} {} gv {} {} return {} {} {} {} ---- '''clex.tcl''' (The code, finally :) # -*- tcl -*- # Lexing C package provide clex 2.0 namespace eval clex {} # Three stage lexer for C sources. # 1. Recognize complex structures, i.e. "strings", 'c'haracters and # /* comments */ # 2. Recognize punctuation. # 3. Recognize keywords. # 4. Recognize other constructs via regexes # 5. The remainder is seen as identifiers. # # Punctuation is defined in a general data structure mapping them to # symbols. Ditto for keywords. The complex structures are defined via # match commands. proc clex::DefStart {} { variable punctuation [list] variable keywords [list] variable matchers [list] variable idsymbol "" variable ws "" variable rxmatchers [list] return } proc clex::DefWS {wspattern} { variable ws $wspattern return } proc clex::DefI {replacement} { # I = Identifier. # Define the symbol to use for non-keyword identifiers. variable idsymbol $replacement return } proc clex::DefP {string {replacement {}}} { # P = Punctuation if {$replacement == {}} { set replacement $string } variable punctuation lappend punctuation $string \000\001$replacement\000 return } proc clex::DefK {string {replacement {}}} { # K = Keyword if {$replacement == {}} { set replacement [string toupper $string] } variable keywords lappend keywords $string $replacement return } proc clex::DefM {cmd {symbol {}}} { # M = Matcher if {$symbol == {}} { set symbol $cmd } variable matchers lappend matchers $cmd $symbol return } proc clex::DefRxM {pattern {symbol {}}} { # RxM = Regex Matcher if {$symbol == {}} { set symbol $pattern } variable rxmatchers lappend rxmatchers $pattern $symbol return } proc clex::DefEnd {} { variable punctuation variable keywords variable matchsym variable matchers # Sort punctuation alphabetically, incidentially by length, place # the longest first. That way the lexer will correctly distinguish # between sequences like '>' and '>='. array set tmp $punctuation set res [list] foreach key [lsort -decreasing [array names tmp]] { lappend res $key $tmp($key) } set punctuation $res # The keywords we place into an array to allow an easy checking of # identifiers for keywordiness. unset tmp ; set tmp $keywords unset keywords array set keywords $tmp # Matchers are executed in the same order # they were defined in. We do hash to their # symbols. set res [list] foreach {cmd sym} $matchers { lappend res $cmd set matchsym($cmd) $sym } set matchers $res return } proc ::clex::Match {string} { # Executes the matchers repeatedly until no match is found anymore. # # Returns a 3-element list. First element is a list of string # fragments and symbols. Symbols have \001 as leading characters # to distinguish them from the unprocessed string fragments. They # also have a '-' character as suffix. See the documentation of # the main lexer command for an explanation. The second element is # a list of 2 elements, again. The second element is a list of all # unique matching entities. The first element is a list of indices # into the the second element. In this way multiple occurences of # the same entities are collapsed. variable matchers variable matchsym set res [list] set values [list] set validx [list] array set ent {} set start 0 set end [string length $string] # The call to matchers are optimized. Initially each matcher gives # a guess where the next occurence of a pattern owned by him # is. In each iteration the nearest pattern is selected, completed # and processed. Afterward each matchers whose guess was before # the end of the current pattern has to reguess based upon the new # starting location. Why so "complicated" ? Performance. Example: # Let us assume that we have 10 "strings" at the beginning, some # substantial uncommented code, followed by a /* comment */ at the # end. # A simple implementation querying each matcher in each iteration # will not only find the string at the very beginning but will # also every time search through the whole string to find the # comment at the end. This amounts to overall quadratic behaviour. # The chosen implementation on the other hand remembers the old # guesses and invalidates them only if the last best guess went # beyond that, indicating an overlap. In the example above this # means that the string guess is invalidated and renewed # everytime, but the comment guess is done once and nver touched # until the strings are exhausted. # Processing tcl/generic/tclIO.c (263252 characters) takes only # 10 seconds, 7 less than the 17 seconds the simple algorithm # used to do its work. array set loc {} foreach cmd $matchers { set loc($cmd) [$cmd find $string $start] #puts "$cmd = $loc($cmd)" } while {1} { # Run through each matcher and ask them for the nearest # occurence of a string handled by them. We abort the # search early if we find a match directly at the beginning # of the string. We stop the whole processing if no match # found. #puts "$start | $end" set nearest $end set mcmd {} foreach cmd $matchers { set begin $loc($cmd) #puts "\t$cmd = $begin" if {$begin >= 0 && $begin < $nearest} { set nearest $begin set mcmd $cmd } if {$nearest == $start} {break} } #puts "... $nearest $mcmd" if {$nearest == $end} { #puts "___ done" incr nearest -1 if {$start < $nearest} { lappend res [string range $string $start end] } break } # A match was found. We now ask the matcher to deliver the end # of the match. This is then used to slice the input into the # relevant pieces. set stop [$mcmd complete $string $nearest] # Note that the code below causes conversion to UTF16 for easy # indexing into the string. It also copies only the relevant # portions of the string, and avoids to copy down the # remainder of the string, something which would incur a heavy # performance penalty (quadratic behaviour). if {$nearest > 0} { incr nearest -1 lappend res [string range $string $start $nearest] incr nearest } lappend res \001$matchsym($mcmd)- set value [string range $string $nearest $stop] if {[info exists ent($value)]} { set idx $ent($value) } else { set idx [llength $values] lappend values $value set ent($value) $idx } lappend validx $idx set start $stop incr start #puts "... $nearest ... $stop ($value) $mcmd $matchsym($mcmd)" # And go looking for more matches. # Update invalidated guesses. # Performance fix: We do not have to renew matchers which already # told us that the remainder of the input does not contain anything # they can match. foreach cmd $matchers { set begin $loc($cmd) if {($begin >= 0) && ($begin < $start)} { set loc($cmd) [$cmd find $string $start] #puts "$cmd = $loc($cmd) renew" } } } return [list $res [list $validx $values]] } # The command will return a list containing 3 elements, each of them # lists in their own right. # # 1. Symbols. A suffix of '+' means that the symbol has an attribute # value. The suffix '-' has the same meaning, except that the value # has to be taken from the list of strings. # 2. A list of strings and other data, as recognized in phase one. # Symbols with a string value are suffixed by '-', s.a. # 3. A list of symbol attribute values. Symbols with such a value # are suffixed by '+', s.a. # proc clex::lex {code} { variable punctuation variable keywords variable idsymbol variable ws variable rxmatchers #puts stderr "----$ws" # Extract complex vari-sized entities, like comments, strings, and # character data. set res [list] set attr [list] set attridx [list] array set ident {} foreach {fragments strings} [Match $code] break foreach string $fragments { if {[string equal \001 [string index $string 0]]} { # Symbols are passing through, just strip off their marker # character lappend res [string range $string 1 end] continue } # String fragment, have to find punctuation, keywords, and # identifiers. Punctuation is [string map]-separated from the # rest. Then whitespace is normalized. At last we split and # iterate over the result, detecting keywords as we go. foreach item [split [string map $punctuation $string] \000] { # Ignore empty stuff, artefact of the punct. and whitespace # handling if {[string length $item] == 0} {continue} #puts stderr "__ $item __" # Pass Punctuation symbols if {[string equal \001 [string index $item 0]]} { lappend res [string range $item 1 end] continue } regsub -all -- $ws $item \000 tmp #puts stderr "________ $tmp __" foreach phrase [split $tmp \000] { if {[string length $phrase] == 0} {continue} # Recognize keywords if {[info exists keywords($phrase)]} { lappend res $keywords($phrase) continue } # Go through additional regexes to see if there are # special symbols which are neither keywords, nor # identifiers. Like numeric constants. Whatever is # matched, the phrase is added to the phrase pool. set found 0 foreach {p sym} $rxmatchers { if {[regexp -- $p $phrase]} { set found 1 lappend res ${sym}+ break } } if {!$found} { # Identifier. Handled like vari-sized entities in that # multiple occurences are collapsed. lappend res ${idsymbol}+ } if {[info exists ident($phrase)]} { set idx $ident($phrase) } else { set idx [llength $attr] lappend attr $phrase set ident($phrase) $idx } lappend attridx $idx } } } return [list $res $strings [list $attridx $attr]] } namespace eval clex { # Define the lexer symbols for the language 'C', as an example. DefStart DefP ( LPAREN ; DefP ) RPAREN ; DefP -> DEREF DefP < LT ; DefP <= LE ; DefP == EQ DefP > GT ; DefP >= GE ; DefP != NE DefP \[ LBRACKET ; DefP \] RBRACKET ; DefP = ASSIGN DefP \{ LBRACE ; DefP \} RBRACE ; DefP *= MUL_ASSIGN DefP . DOT ; DefP , COMMA ; DefP /= DIV_ASSIGN DefP ++ INCR_OP ; DefP -- DECR_OP ; DefP %= REM_ASSIGN DefP & ADDR_BITAND ; DefP * MULT_STAR ; DefP += PLUS_ASSIGN DefP + PLUS ; DefP - MINUS ; DefP -= MINUS_ASSIGN DefP ~ BITNOT ; DefP ! LOGNOT ; DefP <<= LSHIFT_ASSIGN DefP / DIV ; DefP % REM ; DefP >>= RSHIFT_ASSIGN DefP << LSHIFT ; DefP >> RSHIFT ; DefP &= BITAND_ASSIGN DefP ^ BITEOR ; DefP && LOGAND ; DefP ^= BITEOR_ASSIGN DefP | BITOR ; DefP || LOGOR ; DefP |= BITOR_ASSIGN DefP ? QUERY ; DefP : COLON ; DefP \; SEMICOLON DefP ... ELLIPSIS ; DefP ~= BITNOT_ASSIGN DefK typedef ; DefK extern ; DefK static ; DefK auto ; DefK register DefK void ; DefK char ; DefK short ; DefK int ; DefK long DefK float ; DefK double ; DefK signed ; DefK unsigned DefK goto ; DefK continue ; DefK break ; DefK return DefK case ; DefK default ; DefK switch DefK struct ; DefK union ; DefK enum DefK while ; DefK do ; DefK for DefK const ; DefK volatile DefK if ; DefK else DefK sizeof DefM ::clex::C_comment COM DefM ::clex::C_string STRING_LIT DefM ::clex::C_char CHAR_LIT # Floats containing '.'s have to be matched early because the '.' # is later seen as punctuation. DefM ::clex::C_floatA CONSTANT DefM ::clex::C_floatB CONSTANT DefI IDENT DefWS "\[\000-\040\]+" DefRxM "^0x\[0-9a-fA-F\]+" CONSTANT DefRxM "^\[0-9\]+" CONSTANT DefEnd } proc ::clex::C_comment {subcmd string start} { switch -exact -- $subcmd { find { return [string first "/*" $string $start] } complete { incr start 2 ; # Skip behind /* set stop [string first "*/" $string $start] incr stop 1 ; # Skip to / return $stop } } } proc ::clex::C_string {subcmd string start} { switch -exact -- $subcmd { find { return [string first "\"" $string $start] } complete { # The next vari-sized thing is a "-quoted string. # Finding its end is bit more difficult, because we have # to accept \" as one character inside of the string. " set from $start while 1 { incr from set stop [string first "\"" $string $from] set stopb [string first "\\\"" $string $from] incr stopb if {$stop == $stopb} { set from $stopb ; incr from ; continue } break } return $stop } } } proc ::clex::C_char {subcmd string start} { switch -exact -- $subcmd { find { return [string first "'" $string $start] } complete { # The next vari-sized thing is a '-quoted string. # Finding its end is bit more difficult, because we have # to accept \' as one character inside of the string. " set from $start while 1 { incr from set stop [string first "'" $string $from] set stopb [string first "\\'" $string $from] incr stopb if {$stop == $stopb} { set from $stopb ; incr from ; continue } break } return $stop } } } proc ::clex::C_floatA {subcmd string start} { set pattern {[0-9]*\.[0-9]+([eEdD][+-]?[0-9]+)?} switch -exact -- $subcmd { find { if {[regexp -indices -start $start $pattern $string match]} { #puts ==[string range $string [lindex $match 0] [lindex $match 1]] return [lindex $match 0] } return -1 } complete { if {[regexp -indices -start $start $pattern $string match]} { return [lindex $match 1] } return -1 } } } proc ::clex::C_floatB {subcmd string start} { set pattern {[0-9]+\.[0-9]*([eEdD][+-]?[0-9]+)?} switch -exact -- $subcmd { find { if {[regexp -indices -start $start $pattern $string match]} { #puts ==[string range $string [lindex $match 0] [lindex $match 1]] return [lindex $match 0] } return -1 } complete { if {[regexp -indices -start $start $pattern $string match]} { return [lindex $match 1] } return -1 } } } ---- '''driver''' #!/bin/sh # -*- tcl -*- \ exec tclsh "$0" ${1+"$@"} source clex.tcl # Read file, lex it, time the execution to measure performance set data [read [set fh [open [set fname [lindex $argv 0]]]]][close $fh] set len [string length $data] set usec [lindex [time {set data [clex::lex $data]}] 0] # Write performance statistics. puts "$fname:" puts "\t$len characters" puts "\tLexing in $usec microseconds" puts "\t = [expr {double($usec)/1000000}] seconds" puts "\t = [expr {double ($usec) / double ($len)}] usec/char" exit # Generate tokenized listing of the input, using the lexing results as input. puts __________________________________________________ foreach {sym str attr} $data break foreach {aidx aval} $attr break foreach {sidx sval} $str break set sv 0 set av 0 foreach s $sym { switch -glob -- $s { *+ {puts "$s <<[lindex $aval [lindex $aidx $av]]>>" ; incr av} *- {puts "$s <<[lindex $sval [lindex $sidx $sv]]>>" ; incr sv} * {puts "$s"} } } puts __________________________________________________ exit ---- [[ [Scripted Compiler] ]] --> [[ [Parsing C] ]]