August 14, 2002 -- [AK]

----

An important part of any [Scripted Compiler] is the ability to actually process the system language underlying the scripting language. In the case of [Tcl] this is '''C'''.

The first step is always to separate the input stream into tokens, each representing one semantic atom. In compiler speak, ''lexing''. The following script lexes a string containing C source into a list of tokens.

'''Note:''' The script assumes that the sources are free of preprocessor statements like "#include", "#define", etc. I am also not sure if the list of 'Def'd tokens is actually complete. This requires checking against the C specification.

The script below is only one of many variations, i.e. it can be twiddled in many ways. Examples:

   * Keep the whitespace as tokens. Might be required for a pretty-printer.
   * Treat comments as whitespace and remove them. True compiler. Keeping the comments, but not the other whitespace, as in the script below, is more something for a code analyzer looking for additional data (meta-data) in comments.
   * Use different 'Def's to convert the keywords and punctuation into single byte codes, and refrain from splitting/listifying the result. Sort of a special method for compressing C sources.

The next step will be parsing, i.e. adding structure to the token stream under control of a grammar. An existing tool for that is [Yeti].

----

'''clex.tcl'''

 # -*- tcl -*-
 # Lexing C

 package provide clex 1.0

 namespace eval clex {}

 # Most tokens can be recognized and separated from the others via
 # [string map]. The remainder are identifiers. Except for strings
 # and comments. These may contain any other token, and we may not act
 # upon them. And both may contain sequences reminiscent of the
 # other. Because of that a first step is to identify and parse out
 # comments and strings, where 'parsing out' means that these tokens
 # are replaced with special char sequences which refer the lexer to a
 # small database. In the final output they are placed back.
 proc clex::SCextract {code mapvar} {
     # Extract strings and comments from the code and place them in
     # mapvar. Replace them in the code with ids.

     upvar $mapvar map
     set map [list] ; # Initialize; the code may contain neither strings nor comments.

     set tmp ""
     set count 0
     while {1} {
         set comment [string first "/*" $code]
         set string  [string first "\"" $code]

         if {($comment < 0) && ($string < 0)} {
             # No comments nor strings to extract anymore.
             break
         }
         # Debug out
         #puts "$string | $comment | [string length $code]"

         if {(($string >= 0) && ($comment >= 0) && ($string < $comment)) ||
             (($string >= 0) && ($comment < 0))} {
             # The next vari-sized thing is a "-quoted string.
             # Finding its end is a bit more difficult, because we have
             # to accept \" as one character inside of the string.

             set from $string
             while 1 {
                 incr from
                 set stop  [string first "\""   $code $from]
                 set stopb [string first "\\\"" $code $from]
                 incr stopb
                 if {$stop == $stopb} {
                     # The quote we found is escaped. Resume the scan
                     # directly behind it.
                     set from $stopb
                     continue
                 }
                 break
             }
             set id \000@$count\000
             incr count
             lappend map $id [set sub [string range $code $string $stop]]
             incr stop ; incr string -1
             append tmp [string range $code 0 $string] $id
             set code [string range $code $stop end]

             # Debug out
             #puts "\tSTR $id <$sub>"
             continue
         }
         if {(($string >= 0) && ($comment >= 0) && ($comment < $string)) ||
             (($comment >= 0) && ($string < 0))} {
             # The next vari-sized thing is a comment.
             set stop [string first "*/" $code $comment]
             incr stop 1

             set id \000@$count\000
             incr count
             lappend map $id [set sub [string range $code $comment $stop]]
             incr stop ; incr comment -1
             append tmp [string range $code 0 $comment] $id
             set code [string range $code $stop end]

             # Debug out
             #puts "\tCMT $id <$sub>"
             continue
         }
         return -code error "Panic, string and comment at the same location"
     }
     append tmp $code
     return $tmp
 }

 proc clex::DefStart {} {
     variable tokens [list]
     #puts "== <$tokens>"
     return
 }

 proc clex::Def {string {replacement {}}} {
     variable tokens
     if {$replacement == {}} {set replacement \000$string\000}
     lappend tokens $string $replacement
     #puts "== <$tokens>"
     return
 }

 proc clex::DefEnd {} {
     variable tokens
     # Sort the keys in decreasing order. If one token is a prefix of
     # another (like "<" and "<<=") this places the longer one first,
     # so that [string map] prefers it.
     array set tmp $tokens
     set res [list]
     foreach key [lsort -decreasing [array names tmp]] {
         lappend res $key $tmp($key)
     }
     set tokens $res
     #puts "== <$tokens>"
     return
 }

 proc clex::lex {code} {
     variable tokens

     # Phase I   ... Extract strings and comments so that they don't
     #               interfere with the remaining phases.
     # Phase II  ... Separate all constant-sized tokens (keywords and
     #               punctuation) from each other.
     # Phase III ... Separate whitespace from the useful text.
     #               Actually converts whitespace into separator
     #               characters.
     # Phase IV  ... Reinsert the extracted tokens and clean up
     #               multi-separator sequences.

     if 0 {
         # Minimal number of commands for all phases
         regsub -all -- "\[\t\n \]+" [string map $tokens \
                 [SCextract $code scmap]] \
                 \000 tmp
         return [split \
                 [string map "\000\000 \000" \
                 [string map $scmap \
                 $tmp]] \000]
     }
     if 1 {
         # Each phase spelled out explicitly ...
         set code [SCextract $code scmap]                              ; # I
         set code [string map $tokens $code]                           ; # II
         regsub -all -- "\[\t\n \]+" $code \000 code                   ; # III
         set code [string map $scmap $code]                            ; # IV/a
         set code [string map "\000\000\000 \000 \000\000 \000" $code] ; # IV/b
         set code [split $code \000]
         return $code
     }
 }

 namespace eval clex {
     DefStart

     Def if     ; Def typedef
     Def else   ; Def struct
     Def while  ; Def union
     Def for    ; Def break
     Def do     ; Def continue
     Def switch ; Def case
     Def static ; Def goto
     Def extern ; Def return

     Def *  ; Def /  ; Def +  ; Def -   ; Def %  ; Def & ; Def | ; Def ^ ; Def << ; Def >>
     Def && ; Def || ; Def ++ ; Def --
     Def == ; Def != ; Def <  ; Def >   ; Def <= ; Def >=
     Def =  ; Def += ; Def -= ; Def <<= ; Def >>=
     Def |= ; Def &= ; Def ||= ; Def &&= ; Def *=
     Def /= ; Def %= ; Def ^=
     Def .  ; Def \; ; Def ,  ; Def ->
     Def \[ ; Def \]
     Def \{ ; Def \}
     Def (  ; Def )

     DefEnd
 }

----

'''driver'''

 #!/bin/sh
 # -*- tcl -*- \
 exec tclsh "$0" ${1+"$@"}

 source clex.tcl

 set data [read [set fh [open [lindex $argv 0]]]][close $fh]
 set data [clex::lex $data]

 puts __________________________________________________
 #puts $data
 puts [join $data \n]
 puts __________________________________________________
 exit

----
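Why the strings have to be extracted before the keyword map runs can be demonstrated in two lines. This is a standalone illustration, not part of clex.tcl, and it uses a visible marker instead of \000:

 # A naive [string map] of the keywords rewrites them even inside
 # string literals - here the "do" in the printf argument:
 set code {printf("do or die");}
 puts [string map {do <do>} $code]
 # => printf("<do> or die");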
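The trickiest spot in SCextract is finding the closing quote of a string literal while honoring the \" escape sequence. The inner loop can be tried on its own; the following is a standalone sketch under the same assumptions as the original (in particular it, too, does not treat an escaped backslash before a quote specially):

 # Scan for the closing quote, treating \" as an ordinary character
 # inside the literal.
 set code {x = "a \"quoted\" word"; y = 0;}
 set start [string first "\"" $code]             ;# opening quote
 set from  $start
 while 1 {
     incr from
     set stop  [string first "\""   $code $from] ;# candidate closing quote
     set stopb [string first "\\\"" $code $from] ;# next escaped quote
     incr stopb                                  ;# index of its quote character
     if {$stop == $stopb} {set from $stopb ; continue}
     break
 }
 puts [string range $code $start $stop]
 # => "a \"quoted\" word"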
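The decreasing sort in DefEnd is not cosmetic: [string map] checks its key/value pairs in order at every position of the input, so a short operator listed before a longer one that it prefixes would win every time. A standalone demonstration, with made-up marker names instead of the \000-wrapped replacements:

 set expr {a <<= 2}
 # Longer before shorter, as DefEnd arranges it:
 puts [string map {<<= SHIFT_ASSIGN << SHIFT < LESS} $expr]
 # => a SHIFT_ASSIGN 2
 # Shorter before longer: "<<=" is torn into "<", "<", "=":
 puts [string map {< LESS << SHIFT <<= SHIFT_ASSIGN} $expr]
 # => a LESSLESS= 2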
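For a quick check without the driver, the lexer can also be exercised interactively, assuming clex.tcl as above is in the current directory. The empty elements at either end of the result come from the leading and trailing separator characters:

 source clex.tcl
 puts [join [clex::lex {if (a<b) return a;}] \n]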