[[ [Scripted Compiler] ]] --> [[ [Parsing C] ]]

----

An important part of any [Scripted Compiler] is the ability to actually process the system language underlying the scripting language. In the case of [Tcl] this is the '''[C Language]'''.

The first step is always to separate the input stream into tokens, each representing one semantic atom. In compiler speak, ''lexing''. The following script lexes a string containing C source into a list of tokens.

'''Note:''' The script assumes that the sources are free of preprocessor statements like "#include", "#define", etc. I am also not sure if the list of 'Def'd tokens is actually complete. This requires checking against the C specification.

The script below is only one of many variations, i.e. it can be twiddled in many ways. Examples:

   * Keep the whitespace as tokens. Might be required for a pretty-printer.
   * Treat comments as whitespace and remove them. True compiler. Keeping the comments, but not other whitespace, as in the script below, is more something for a code analyzer looking for additional data (meta-data) in comments. See [Source Navigator] for a tool in this area.
   * Use different 'Def's to convert the keywords and punctuation into single byte codes, and refrain from splitting/listifying the result. Sort of a special method for compressing C sources.

The next step will be parsing, i.e. adding structure to the token stream under the control of a grammar. An existing tool for that is [Yeti]. See the [C Language] page for grammar references.

I believe that the method I have used below can be used to lex any system language currently in use today: Pascal, Modula, FORTRAN, C++, ... Again this is something of interest to [Source Navigator].

'''Notes'''

Reinserting comments and strings into the source before the listification, as the code currently does, is sub-optimal. We should keep the ''short'' ids and insert them in a postprocessing pass over the list.
That way we keep the constructed objects and do not shift long character sequences around in memory. Or keep the comments and strings separate, leave only a placeholder in the token list, and let the user of the data structure pick the value from the appropriate place.

----

'''Here be dragons'''

'''1''' Tcl 8.4 alpha 3 has a bug in the encoding/utf/string handling subsystem which causes the code here to lose characters. Do not use this alpha version. Upgrade to 8.4. Example of the problem, provided by [PT]:

 % clex::lex {int main (int argc, char** argv) { return 0; }}
 int in {} t gc {} ar {} {} gv {} {} return {} {} {} {}

'''2''' The code as is is buggy. Run it using the file ''tclDecls.h'' (from the sources of the Tcl core) as input and you will find that the word ''Tcl_NotifierProcs'' comes out as "ierProcs". The reason is that the code recognizes keywords, like [if], even inside of a word, cutting the actual identifier apart. Thanks to [GPS] for finding this, and [RS] for the explanation.

The main reason for the problem is that I try to recognize keywords like punctuation. To fix this, [[string map]] has to be changed to regexes, as they are able to recognize word boundaries. Or keep [[string map]] for the pure punctuation and recognize keywords later on, after splitting the inter-punctuation strings along whitespace.

A general lexer will then have three essential stages:

   1. Recognize complex, potentially nested and overlapping character sequences, like strings, character data, and comments.
   1. Recognize punctuation. These are character sequences which always interrupt other sequences, regardless of where they are. This should make clear that a keyword like '''if''' is not punctuation, as its presence inside of a word must not interrupt that word.
   1. Recognize keywords.

----

'''clex.tcl''' (The code, finally :)

 # -*- tcl -*-
 # Lexing C

 package provide clex 1.0

 namespace eval clex {}

 # Most tokens can be recognized and separated from the others via
 # [string map].
 # The remainder are identifiers. Except for strings,
 # and comments. These may contain any other token, and we may not act
 # upon them. And both may contain sequences reminiscent of the
 # other. Because of that a first step is to identify and parse out
 # comments and strings, where 'parsing out' means that these tokens
 # are replaced with special char sequences which refer the lexer to a
 # small database. In the final output they are placed back.

 proc clex::SCextract {code mapvar} {
     # Extract strings and comments from the code and place them in mapvar.
     # Replace them with ids

     upvar $mapvar map

     set tmp ""
     set count 0

     while {1} {
         set comment [string first "/*" $code]
         set string  [string first "\"" $code]

         if {($comment < 0) && ($string < 0)} {
             # No comments nor strings to extract anymore.
             break
         }

         # Debug out
         #puts "$string | $comment | [string length $code]"

         if {
             (($string >= 0) && ($comment >= 0) && ($string < $comment)) ||
             (($string >= 0) && ($comment < 0))
         } {
             # The next vari-sized thing is a "-quoted string.
             # Finding its end is a bit more difficult, because we have
             # to accept \" as one character inside of the string.

             set from $string
             while 1 {
                 incr from
                 set stop  [string first "\"" $code $from]
                 set stopb [string first "\\\"" $code $from]
                 incr stopb
                 if {$stop == $stopb} {set from $stopb ; incr from ; continue}
                 break
             }

             set id \000@$count\000
             incr count
             lappend map $id [set sub [string range $code $string $stop]]
             incr stop ; incr string -1
             append tmp [string range $code 0 $string] $id
             set code [string range $code $stop end]

             # Debug out
             #puts "\tSTR $id <$sub>"
             continue
         }
         if {
             (($string >= 0) && ($comment >= 0) && ($comment < $string)) ||
             (($comment >= 0) && ($string < 0))
         } {
             # The next vari-sized thing is a comment.
             set stop [string first "*/" $code $comment]
             incr stop 1

             set id \000@$count\000
             incr count
             lappend map $id [set sub [string range $code $comment $stop]]
             incr stop ; incr comment -1
             append tmp [string range $code 0 $comment] $id
             set code [string range $code $stop end]

             # Debug out
             #puts "\tCMT $id <$sub>"
             continue
         }

         return -code error "Panic, string and comment at same location"
     }
     append tmp $code
     return $tmp
 }

 proc clex::DefStart {} {
     variable tokens [list]
     #puts "== <$tokens>"
     return
 }

 proc clex::Def {string {replacement {}}} {
     variable tokens
     if {$replacement == {}} {set replacement \000$string\000}
     lappend tokens $string $replacement
     #puts "== <$tokens>"
     return
 }

 proc clex::DefEnd {} {
     variable tokens
     array set tmp $tokens
     set res [list]
     foreach key [lsort -decreasing [array names tmp]] {
         lappend res $key $tmp($key)
     }
     set tokens $res
     #puts "== <$tokens>"
     return
 }

 proc clex::lex {code} {
     variable tokens

     # Phase I   ... Extract strings and comments so that they don't
     #               interfere with the remaining phases.
     # Phase II  ... Separate all constant-sized tokens (keywords and
     #               punctuation) from each other.
     # Phase III ... Separate whitespace from the useful text.
     #               Actually converts whitespace into separator characters.
     # Phase IV  ... Reinsert extracted tokens and clean up multi-separator
     #               sequences.

     set scmap [list]

     if 0 {
         # Minimal number of commands for all phases
         regsub -all -- "\[\t\n \]+" [string map $tokens \
                 [SCextract $code scmap]] \
                 \000 tmp
         return [split \
                 [string map "\000\000 \000" \
                 [string map $scmap \
                 $tmp]] \000]
     }
     if 1 {
         # Each phase spelled out explicitly ...
         set code [SCextract $code scmap]                              ; # I
         set code [string map $tokens $code]                           ; # II
         regsub -all -- "\[\t\n \]+" $code \000 code                   ; # III
         set code [string map $scmap $code]                            ; # IV/a
         set code [string map "\000\000\000 \000 \000\000 \000" $code] ; # IV/b
         set code [split $code \000]
         return $code
     }
 }

 namespace eval clex {
     DefStart
     Def if     ; Def typedef
     Def else   ; Def struct
     Def while  ; Def union
     Def for    ; Def break
     Def do     ; Def continue
     Def switch ; Def case
     Def static ; Def goto
     Def extern ; Def return
     Def *  ; Def /  ; Def +  ; Def -  ; Def %
     Def &  ; Def |  ; Def ^  ; Def << ; Def >>
     Def && ; Def || ; Def ++ ; Def --
     Def == ; Def != ; Def <  ; Def >  ; Def <= ; Def >=
     Def =  ; Def += ; Def -= ; Def <<= ; Def >>=
     Def |= ; Def &= ; Def ||= ; Def &&= ; Def *=
     Def /= ; Def %= ; Def ^=
     Def .  ; Def \; ; Def ,  ; Def -> ; Def :
     Def \[ ; Def \]
     Def \{ ; Def \}
     Def (  ; Def )
     DefEnd
 }

----

'''driver'''

 #!/bin/sh
 # -*- tcl -*- \
 exec tclsh "$0" ${1+"$@"}

 source clex.tcl

 set data [read [set fh [open [lindex $argv 0]]]][close $fh]
 set data [clex::lex $data]

 puts __________________________________________________
 #puts $data
 puts [join $data \n]
 puts __________________________________________________
 exit

----

[[ [Scripted Compiler] ]] --> [[ [Parsing C] ]]
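----

The keyword bug described under ''Here be dragons'' comes from treating keywords like punctuation. As a language-neutral illustration of the proposed fix, here is a small Python sketch (the names `naive_mark` and `boundary_mark` are made up for this example; this is not part of clex.tcl) contrasting plain substring replacement, the analogue of [[string map]], with regexes that respect word boundaries:

```python
import re

# A small subset of the C keywords Def'd in clex.tcl.
KEYWORDS = ["if", "else", "while", "for", "return", "static", "extern"]

def naive_mark(source):
    # Plain substring replacement, the analogue of [string map]:
    # it happily matches a keyword embedded inside an identifier.
    for kw in KEYWORDS:
        source = source.replace(kw, "\0" + kw + "\0")
    return source

def boundary_mark(source):
    # Regex with \b word boundaries: a keyword only matches when it
    # is a whole word, not part of a larger identifier.
    pattern = r"\b(" + "|".join(KEYWORDS) + r")\b"
    return re.sub(pattern, "\0\\1\0", source)

src = "Tcl_NotifierProcs notifier; if (x) return;"
# naive_mark cuts the identifier apart around the embedded "if",
# which is exactly the bug observed with tclDecls.h as input.
# boundary_mark leaves "Tcl_NotifierProcs" intact and still marks
# the real "if" and "return" tokens.
```

In Tcl itself the same effect can be had by switching phase II from [[string map]] to [[regsub]], using the word-boundary escapes \m, \M, or \y of Tcl's regular expressions for the keyword patterns while keeping [[string map]] for the pure punctuation.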