Lexing C

Updated 2002-10-18 20:07:31

[ Scripted Compiler ] --> [ Parsing C ]


An important part of any Scripted Compiler is the ability to actually process the system language underlying the scripting language. In the case of Tcl this is the C Language.

The first step is always to separate the input stream into tokens, each representing one semantic atom. In compiler speak, this is lexing.

The following script lexes a string containing C source into a list of tokens. Note: The script assumes that the sources are free of preprocessor statements like "#include", "#define", etc. I am also not sure whether the list of 'Def'd tokens is actually complete; this requires checking against the C specification.

The script below is only one of many variations, i.e. it can be twiddled in many ways. Examples:

  • Keep the whitespace as tokens. Might be required for a pretty-printer.
  • Treat comments as whitespace and remove them, as a true compiler would. Keeping the comments but discarding other whitespace, as the script below does, is more suited to a code analyzer looking for additional data (meta-data) in comments. See Source Navigator for a tool in this area.
  • Use different 'Def's to convert the keywords and punctuation into single byte codes, and refrain from splitting/listifying the result. Sort of a special method for compressing C sources.

The next step will be parsing, i.e. adding structure to the token stream under control of a grammar. An existing tool for that is Yeti. See the C Language for grammar references.

I believe that the method I have used below can be used to lex any system language currently in use today: Pascal, Modula, FORTRAN, C++, ... Again, this is something of interest to Source Navigator.

Notes

The new version 2 does not reinsert extracted tokens into the source string. It also avoids copying the tail of the string down, which would lead to quadratic behaviour (see the sketch after the timings below). It is still not optimal, but fairly ok in my book so far. Example result:

 tclIO.c:
        242918 characters
        Lexing in 16498027 microseconds
               =  16.498027 seconds
               =  67.916033394 usec/char

Not bad for a lexer written in a scripting language, IMHO. Of course, this is input-dependent. tclDecls.h, on the other hand, is smaller but takes longer:

 tclDecls.h:
        129917 characters
        Lexing in 18277958 microseconds
               =  18.277958 seconds
               =  140.689501759 usec/char
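
The "copying the tail" trap mentioned in the notes is easy to fall into. Here is a minimal sketch of the difference; lexByCopy and lexByIndex are hypothetical procs which only illustrate the two loop shapes:

 # Quadratic: chopping consumed text off the front copies the whole
 # remainder of the string on every iteration.
 proc lexByCopy {s} {
     while {[string length $s] > 0} {
         # ... consume one token at the front of $s ...
         set s [string range $s 1 end]
     }
 }

 # Linear in the amount of copying: leave the string untouched and
 # advance an index instead, as clex::Match below does with $start.
 proc lexByIndex {s} {
     set pos 0
     set end [string length $s]
     while {$pos < $end} {
         # ... consume one token at index $pos ...
         incr pos
     }
 }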

TODO

  • The recognition of floats as part of the vari-sized stuff slows things down considerably. Removing these definitions gets us down to 56.3452070246 usec/char = 13.687265 seconds for the whole tclIO.c. Another regex stage before the punctuation processing might be better; see the sketch after this list.
  • Invert the lexer, process string fragments immediately. This gets rid of the need for a marker character (\001 here) and all that entails.
  • Read up on C syntax. I believe that I currently do not recognize all possible types of numbers.
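
For the first of these items, a rough sketch of what such a regex pre-pass could look like. clex::NumberPrepass is a hypothetical helper, not wired into the lexer, and its pattern is deliberately simplified (no word boundaries, no hex, no suffixes):

 proc clex::NumberPrepass {string} {
     # Replace every numeric constant with a marked CONSTANT token in
     # a single sweep, collecting the original texts on the side. The
     # \000/\001 markers follow the conventions used by clex::lex.
     set pattern {[0-9]+\.[0-9]*(?:[eE][+-]?[0-9]+)?|\.[0-9]+(?:[eE][+-]?[0-9]+)?|[0-9]+}
     set values  [regexp -all -inline -- $pattern $string]
     regsub -all -- $pattern $string "\000\001CONSTANT\000" string
     return [list $string $values]
 }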

Here be dragons

Tcl 8.4 alpha 3 has a bug in the encoding/utf/string handling subsystem which causes the code here to lose characters. Do not use this alpha version; upgrade to 8.4.

Example of the problem, provided by PT:

 % clex::lex {int main (int argc, char** argv) { return 0; }}
 int in {} t gc {} ar {} {} gv {} {} return {} {} {} {}
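
A minimal guard for the top of clex.tcl, assuming the broken release reports its patchlevel as "8.4a3":

 if {[string equal [info patchlevel] "8.4a3"]} {
     error "clex: this Tcl release loses characters during lexing, please upgrade"
 }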

clex.tcl (The code, finally :)

 # -*- tcl -*-
 # Lexing C

 package provide clex 2.0
 namespace eval  clex {}

 # Five-stage lexer for C sources.
 # 1. Recognize complex structures, i.e. "strings", 'c'haracters and
 #    /* comments */.
 # 2. Recognize punctuation.
 # 3. Recognize keywords.
 # 4. Recognize other constructs via regexes.
 # 5. Treat the remainder as identifiers.
 #
 # Punctuation is defined in a general data structure mapping them to
 # symbols. Ditto for keywords. The complex structures are defined via
 # match commands.
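 #
 # Each match command is a proc taking the arguments (subcmd, string,
 # start) and supporting two subcommands (see ::clex::C_comment at
 # the bottom for an example):
 #
 #   find     - return the index of the next occurrence of the
 #              pattern at or after index $start, or -1 if there is
 #              no further occurrence.
 #   complete - given via $start the index where a match begins,
 #              return the index of its last character.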

 proc clex::DefStart {} {
     variable punctuation [list]
     variable keywords    [list]
     variable matchers    [list]
     variable idsymbol    ""
     variable ws          ""
     variable rxmatchers  [list]
     return
 }

 proc clex::DefWS {wspattern} {
     variable ws $wspattern
     return
 }

 proc clex::DefI {replacement} {
     # I = Identifier.
     # Define the symbol to use for non-keyword identifiers.
     variable idsymbol $replacement
     return
 }

 proc clex::DefP {string {replacement {}}} {
     # P = Punctuation

     if {$replacement == {}} {
         set replacement $string
     }

     variable punctuation
     lappend  punctuation $string \000\001$replacement\000
     return
 }

 proc clex::DefK {string {replacement {}}} {
     # K = Keyword

     if {$replacement == {}} {
         set replacement [string toupper $string]
     }

     variable keywords
     lappend  keywords $string $replacement
     return
 }

 proc clex::DefM {cmd {symbol {}}} {
     # M = Matcher

     if {$symbol == {}} {
         set symbol $cmd
     }

     variable matchers
     lappend  matchers $cmd $symbol
     return
 }

 proc clex::DefRxM {pattern {symbol {}}} {
     # RxM = Regex Matcher

     if {$symbol == {}} {
         set symbol $pattern
     }

     variable rxmatchers
     lappend  rxmatchers $pattern $symbol
     return
 }


 proc clex::DefEnd {} {
     variable punctuation
     variable keywords
     variable matchsym
     variable matchers

     # Sort punctuation in decreasing alphabetical order, which
     # incidentally places longer sequences before their prefixes.
     # That way the lexer will correctly distinguish between
     # sequences like '>' and '>='.

     array set tmp $punctuation
     set res [list]
     foreach key [lsort -decreasing [array names tmp]] {
         lappend res $key $tmp($key)
     }
     set punctuation $res

     # The keywords we place into an array to allow an easy checking of
     # identifiers for keywordiness.

     unset tmp ; set tmp $keywords
     unset keywords
     array set keywords $tmp

     # Matchers are executed in the same order
     # they were defined in. We also hash from
     # matcher command to symbol.

     set res [list]
     foreach {cmd sym} $matchers {
         lappend res $cmd
         set matchsym($cmd) $sym
     }
     set matchers $res
     return
 }


 proc ::clex::Match {string} {
     # Executes the matchers repeatedly until no match is found anymore.
     #
     # Returns a 2-element list. The first element is a list of string
     # fragments and symbols. Symbols have \001 as leading character
     # to distinguish them from the unprocessed string fragments, and
     # a '-' character as suffix. See the documentation of the main
     # lexer command for an explanation. The second element is again a
     # 2-element list: its second element holds all unique matched
     # entities, its first a list of indices into that list. In this
     # way multiple occurrences of the same entity are collapsed.
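     #
     # For example (writing the \001 marker explicitly):
     #
     #   Match {a /*x*/ b /*x*/}
     #   -> {{a } \001COM- { b } \001COM-} {{0 0} {/*x*/}}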

     variable matchers
     variable matchsym

     set res    [list]
     set values [list]
     set validx [list]
     array set ent {}

     set start 0
     set end [string length $string]

     # The calls to the matchers are optimized. Initially each matcher
     # gives a guess where the next occurrence of a pattern owned by
     # it is. In each iteration the nearest pattern is selected,
     # completed and processed. Afterwards each matcher whose guess
     # was before the end of the current match has to re-guess based
     # upon the new starting location. Why so "complicated"?
     # Performance. Example:

     # Let us assume that we have 10 "strings" at the beginning, some
     # substantial uncommented code, followed by a /* comment */ at the
     # end.

     # A simple implementation querying each matcher in each iteration
     # will not only find the strings at the very beginning, but will
     # also search through the whole string every time to find the
     # comment at the end. This amounts to overall quadratic behaviour.

     # The chosen implementation on the other hand remembers the old
     # guesses and invalidates them only if the last best match went
     # beyond them, indicating an overlap. In the example above this
     # means that the string guess is invalidated and renewed every
     # time, but the comment guess is done once and never touched
     # until the strings are exhausted.

     # Processing tcl/generic/tclIO.c (263252 characters) takes only
     # 10 seconds, 7 less than the 17 seconds the simple algorithm
     # used to do its work.


     array set loc {}
     foreach cmd $matchers {
         set loc($cmd) [$cmd find $string $start]
         #puts "$cmd = $loc($cmd)"
     }

     while {1} {
         # Run through each matcher and ask it for the nearest
         # occurrence of a string handled by it. We abort the
         # search early if we find a match directly at the beginning
         # of the string. We stop the whole processing if no match
         # is found at all.

         #puts "$start | $end"

         set nearest $end
         set mcmd {}
         foreach cmd $matchers {
             set begin $loc($cmd)

             #puts "\t$cmd = $begin"

             if {$begin >= 0 && $begin < $nearest} {
                 set nearest $begin
                 set mcmd $cmd
             }
             if {$nearest == $start} {break}
         }

         #puts "... $nearest $mcmd"

         if {$nearest == $end} {
             #puts "___ done"
             # No matcher found anything. Keep the remaining text,
             # if any, as the final fragment.
             if {$start < $end} {
                 lappend res [string range $string $start end]
             }
             break
         }

         # A match was found. We now ask the matcher to deliver the end
         # of the match. This is then used to slice the input into the
         # relevant pieces.

         set stop [$mcmd complete $string $nearest]

         # Note that the code below causes conversion to UTF-16 for
         # easy indexing into the string. It also copies only the
         # relevant portions of the string, and avoids copying down
         # the remainder of the string, something which would incur a
         # heavy performance penalty (quadratic behaviour).

         if {$nearest > 0} {
             incr nearest -1
             lappend res [string range $string $start $nearest]
             incr nearest
         }

         lappend res \001$matchsym($mcmd)-

         set value [string range $string $nearest $stop]
         if {[info exists ent($value)]} {
             set idx $ent($value)
         } else {
             set idx [llength $values]
             lappend values $value
             set ent($value) $idx
         }
         lappend validx $idx
         set  start $stop
         incr start

         #puts "... $nearest ... $stop ($value) $mcmd $matchsym($mcmd)"

         # And go looking for more matches.
         # Update invalidated guesses.
          # Performance fix: We do not have to renew matchers which
          # already told us that the remainder of the input does not
          # contain anything they can match.

         foreach cmd $matchers {
              set begin $loc($cmd)
             if {($begin >= 0) && ($begin < $start)} {
                 set loc($cmd) [$cmd find $string $start]
                 #puts "$cmd = $loc($cmd) renew"
             }
         }
     }
     return [list $res [list $validx $values]]
 }


 # The command will return a list containing 3 elements, each of them
 # a list in its own right.
 #
 # 1. Symbols. A suffix of '+' means that the symbol has an attribute
 #    value. The suffix '-' has the same meaning, except that the value
 #    has to be taken from the list of strings.
 # 2. The strings and other vari-sized data recognized in phase one,
 #    as a pair of index list and value list (see Match above).
 #    Symbols with a string value are suffixed by '-', s.a.
 # 3. The symbol attribute values, again as a pair of index list and
 #    value list. Symbols with such a value are suffixed by '+', s.a.
 #

 proc clex::lex {code} {
     variable punctuation
     variable keywords
     variable idsymbol
     variable ws
     variable rxmatchers

     #puts stderr "----$ws"

     # Extract complex vari-sized entities, like comments, strings, and
     # character data.

     set res  [list]
     set attr [list]
     set attridx [list]
     array set ident {}

     foreach {fragments strings} [Match $code] break

     foreach string $fragments {
         if {[string equal \001 [string index $string 0]]} {
             # Symbols are passing through, just strip off their marker
             # character
             lappend res [string range $string 1 end]
             continue
         }

         # String fragment, have to find punctuation, keywords, and
         # identifiers. Punctuation is [string map]-separated from the
         # rest. Then whitespace is normalized. At last we split and
         # iterate over the result, detecting keywords as we go.

         foreach item [split [string map $punctuation $string] \000] {
             # Ignore empty stuff, artefact of the punct. and whitespace
             # handling
             if {[string length $item] == 0} {continue}

             #puts stderr "__ $item __"

             # Pass Punctuation symbols
             if {[string equal \001 [string index $item 0]]} {
                 lappend res [string range $item 1 end]
                 continue
             }

             regsub -all -- $ws $item \000 tmp

             #puts stderr "________ $tmp __"

             foreach phrase [split $tmp \000] {
                 if {[string length $phrase] == 0} {continue}

                 # Recognize keywords
                 if {[info exists keywords($phrase)]} {
                     lappend res $keywords($phrase)
                     continue
                 }

                 # Go through additional regexes to see if there are
                 # special symbols which are neither keywords, nor
                 # identifiers. Like numeric constants. Whatever is
                 # matched, the phrase is added to the phrase pool.

                 set found 0
                 foreach {p sym} $rxmatchers {
                     if {[regexp -- $p $phrase]} {
                         set found 1
                         lappend res ${sym}+
                         break
                     }
                 }
                 if {!$found} {
                      # Identifier. Handled like vari-sized entities in
                      # that multiple occurrences are collapsed.

                     lappend res ${idsymbol}+
                 }

                 if {[info exists ident($phrase)]} {
                     set idx $ident($phrase)
                 } else {
                     set idx [llength $attr]
                     lappend attr $phrase
                     set ident($phrase) $idx
                 }
                 lappend attridx $idx
             }
         }
     }

     return [list $res $strings [list $attridx $attr]]
 }

 namespace eval clex {
     # Define the lexer symbols for the language 'C', as an example.

     DefStart

     DefP (   LPAREN      ; DefP )  RPAREN    ; DefP ->  DEREF
     DefP <   LT          ; DefP <= LE        ; DefP ==  EQ
     DefP >   GT          ; DefP >= GE        ; DefP !=  NE
     DefP \[  LBRACKET    ; DefP \] RBRACKET  ; DefP =   ASSIGN
     DefP \{  LBRACE      ; DefP \} RBRACE    ; DefP *=  MUL_ASSIGN
     DefP .   DOT         ; DefP ,  COMMA     ; DefP /=  DIV_ASSIGN
     DefP ++  INCR_OP     ; DefP -- DECR_OP   ; DefP %=  REM_ASSIGN
     DefP &   ADDR_BITAND ; DefP *  MULT_STAR ; DefP +=  PLUS_ASSIGN
     DefP +   PLUS        ; DefP -  MINUS     ; DefP -=  MINUS_ASSIGN
     DefP ~   BITNOT      ; DefP !  LOGNOT    ; DefP <<= LSHIFT_ASSIGN
     DefP /   DIV         ; DefP %  REM       ; DefP >>= RSHIFT_ASSIGN
     DefP <<  LSHIFT      ; DefP >> RSHIFT    ; DefP &=  BITAND_ASSIGN
     DefP ^   BITEOR      ; DefP && LOGAND    ; DefP ^=  BITEOR_ASSIGN
     DefP |   BITOR       ; DefP || LOGOR     ; DefP |=  BITOR_ASSIGN
     DefP ?   QUERY       ; DefP :  COLON     ; DefP \;  SEMICOLON
     DefP ... ELLIPSIS    ; DefP ~= BITNOT_ASSIGN

     DefK typedef ; DefK extern   ; DefK static ; DefK auto ; DefK register
     DefK void    ; DefK char     ; DefK short  ; DefK int  ; DefK long
     DefK float   ; DefK double   ; DefK signed ; DefK unsigned
     DefK goto    ; DefK continue ; DefK break  ; DefK return
     DefK case    ; DefK default  ; DefK switch
     DefK struct  ; DefK union    ; DefK enum
     DefK while   ; DefK do       ; DefK for
     DefK const   ; DefK volatile
     DefK if      ; DefK else
     DefK sizeof

     DefM ::clex::C_comment COM
     DefM ::clex::C_string  STRING_LIT
     DefM ::clex::C_char    CHAR_LIT

     # Floats containing '.'s have to be matched early because the '.'
     # is later seen as punctuation.

     DefM ::clex::C_floatA   CONSTANT
     DefM ::clex::C_floatB   CONSTANT

     DefI IDENT
     DefWS "\[\000-\040\]+"

     DefRxM "^0x\[0-9a-fA-F\]+" CONSTANT
     DefRxM "^\[0-9\]+"         CONSTANT

     DefEnd
 }
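
 # With a working Tcl, lexing the example from the "Here be dragons"
 # section above should yield a symbol stream along these lines
 # (attribute values shown in brackets):
 #
 #  % clex::lex {int main (int argc, char** argv) { return 0; }}
 #  INT IDENT+[main] LPAREN INT IDENT+[argc] COMMA CHAR MULT_STAR
 #  MULT_STAR IDENT+[argv] RPAREN LBRACE RETURN CONSTANT+[0]
 #  SEMICOLON RBRACE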

 proc ::clex::C_comment {subcmd string start} {
     switch -exact -- $subcmd {
         find {
             return [string first "/*" $string $start]
         }
         complete {
             incr start 2 ; # Skip behind /*
             set  stop [string first "*/" $string $start]
             incr stop 1 ; # Skip to /
             return $stop
         }
     }
 }

 proc ::clex::C_string {subcmd string start} {
     switch -exact -- $subcmd {
         find {
             return [string first "\"" $string $start]
         }
         complete {
             # The next vari-sized thing is a "-quoted string.
              # Finding its end is a bit more difficult, because we have
             # to accept \" as one character inside of the string. "
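              # (Known limitation: an escaped backslash right before
              # the closing quote, as in "x\\", fools this test; the
              # real closing quote is then taken for an escaped one.)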

             set from $start
             while 1 {
                 incr from
                 set stop  [string first "\""   $string $from]
                 set stopb [string first "\\\"" $string $from]
                 incr stopb
                 if {$stop == $stopb} {
                     set from $stopb ; incr from ; continue
                 }
                 break
             }
             return $stop
         }
     }
 }

 proc ::clex::C_char {subcmd string start} {
     switch -exact -- $subcmd {
         find {
             return [string first "'" $string $start]
         }
         complete {
              # The next vari-sized thing is a '-quoted character.
              # Finding its end is a bit more difficult, because we have
              # to accept \' as one character inside of it.

             set from $start
             while 1 {
                 incr from
                 set stop  [string first "'"   $string $from]
                 set stopb [string first "\\'" $string $from]
                 incr stopb
                 if {$stop == $stopb} {
                     set from $stopb ; incr from ; continue
                 }
                 break
             }
             return $stop
         }
     }
 }

 proc ::clex::C_floatA {subcmd string start} {
     set pattern {[0-9]*\.[0-9]+([eEdD][+-]?[0-9]+)?}
     switch -exact -- $subcmd {
         find {
             if {[regexp -indices -start $start $pattern $string match]} {
                 #puts ==[string range $string [lindex $match 0] [lindex $match 1]]
                 return [lindex $match 0]
             }
             return -1
         }
         complete {
             if {[regexp -indices -start $start $pattern $string match]} {
                 return [lindex $match 1]
             }
             return -1
         }
     }
 }

 proc ::clex::C_floatB {subcmd string start} {
     set pattern {[0-9]+\.[0-9]*([eEdD][+-]?[0-9]+)?}
     switch -exact -- $subcmd {
         find {
             if {[regexp -indices -start $start $pattern $string match]} {
                 #puts ==[string range $string [lindex $match 0] [lindex $match 1]]
                 return [lindex $match 0]
             }
             return -1
         }
         complete {
             if {[regexp -indices -start $start $pattern $string match]} {
                 return [lindex $match 1]
             }
             return -1
         }
     }
 }

driver

 #!/bin/sh
 # -*- tcl -*- \
 exec tclsh "$0" ${1+"$@"}

 source clex.tcl

 # Read file, lex it, time the execution to measure performance

 set data [read [set fh [open [set fname [lindex $argv 0]]]]][close $fh]
 set len  [string length $data]
 set usec [lindex [time {set data [clex::lex $data]}] 0]

 # Write performance statistics.

 puts "$fname:"
 puts "\t$len characters"
 puts "\tLexing in $usec microseconds"
 puts "\t       =  [expr {double($usec)/1000000}] seconds"
 puts "\t       =  [expr {double ($usec) / double ($len)}] usec/char"

 exit
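
 # Everything below this point is disabled by the 'exit' above;
 # remove that exit to also generate the token listing.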

 # Generate tokenized listing of the input, using the lexing results as input.

 puts __________________________________________________
 foreach {sym str attr} $data break
 foreach {aidx aval} $attr break
 foreach {sidx sval} $str break

 set sv 0
 set av 0
 foreach s $sym {
     switch -glob -- $s {
         *+ {puts "$s <<[lindex $aval [lindex $aidx $av]]>>" ; incr av}
         *- {puts "$s <<[lindex $sval [lindex $sidx $sv]]>>" ; incr sv}
         *  {puts "$s"}
     }
 }

 puts __________________________________________________
 exit
