Lexing C

August 14, 2002 -- AK


An important part of any Scripted Compiler is the ability to actually process the system language underlying the scripting language. In the case of Tcl this is the C Language.

The first step is always to separate the input stream into tokens, each representing one semantic atom. In compiler speak, this is lexing.

The following script lexes a string containing C source into a list of tokens. Note: the script assumes that the sources are free of preprocessor statements like "#include", "#define", etc. I am also not sure whether the list of 'Def'd tokens is actually complete; this needs checking against the C specification.

The script below is only one of many variations, i.e. it can be twiddled in many ways. Examples:

  • Keep the whitespace as tokens. This might be required for a pretty-printer.
  • Treat comments as whitespace and remove them, as a true compiler would; a sketch of this variation follows the list. Keeping the comments, but not the other whitespace, as the script below does, is more something for a code analyzer looking for additional data (meta-data) in comments. See Source Navigator for a tool in this area.
  • Use different 'Def's to convert the keywords and punctuation into single-byte codes, and refrain from splitting/listifying the result. Sort of a special method for compressing C sources.
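
Here is a minimal sketch of the comment-removal variation, written as a filter over the token list produced by clex::lex below. The proc name is illustrative; the filter relies only on the fact that comment tokens are reinserted verbatim and therefore start with "/*":

 proc clex::stripComments {tokens} {
     set res [list]
     foreach tok $tokens {
         # Drop tokens which are reinserted C comments.
         if {![string match {/\**} $tok]} {
             lappend res $tok
         }
     }
     return $res
 }

Usage would be [clex::stripComments [clex::lex $code]].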

The next step will be parsing, i.e. adding structure to the token stream under control of a grammar. An existing tool for that is Yeti. See the C Language for grammar references.

I believe that the method I use below can lex any system language currently in use today: Pascal, Modula, FORTRAN, C++, ... Again, this is something of interest to Source Navigator.


PT writes: This is very interesting stuff. I've tried this lexer on some test C using tcl 8.4a4 under Windows 2K, and it doesn't work. First, there is a missing 'scmap' declaration in the lex procedure, but more significantly, using \000 as the separator causes it to lose characters. For example:

 % clex::lex {int main (int argc, char** argv) { return 0; }}
 int in {} t gc {} ar {} {} gv {} {} return {} {} {} {}

I think it could be misreading some of these as unicode characters and thus losing bits. Changing the separator character to \uffee (a unicode character only used for byte ordering) fixes the problem. There might be better separator characters than this, but \uffee is certain not to be in your text: it is only allowed in the first 2 bytes of a file.

Now, I wonder if we can make this lex tcl..... :)

AK: At first I thought that the missing declaration of scmap did not matter, as the variable is created by the 'upvar' in the SCextract command, and thus filled before use, but now I see it: if neither strings nor comments are found, the variable is never initialized. Right. Thanks. I have fixed it now.

Regarding the missing characters: using tcl 8.3.4 I get

 % clex::lex {int main (int argc, char** argv) { return 0; }}
 int main ( int argc , char * * argv ) \{ return 0 {;} \} {}

Could we have found a problem in tcl 8.4? Can you try using the CVS head?

Note that legal C source code cannot contain \000 as a literal character. It shouldn't be a UTF problem either, because the 7-bit ASCII characters are a subset of UTF-8.
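
The separator behavior itself is easy to sanity-check in an interactive shell; on a working Tcl the NUL byte splits cleanly:

 % split "a\000b\000c" \000
 a b c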

AK - Aug 23, 2002: The reported problem was definitely only in alpha 3. Neither beta 2 nor the CVS head exhibits the problem. Whew.

More notes: the reinsertion of comments and strings into the source before the listification is sub-optimal. We should keep the short ids and reinsert the original text in a postprocessing pass over the token list. That way we keep the constructed objects and do not shift long character sequences around in memory.
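
A rough sketch of that idea, untested and with an illustrative proc name: run phases I-III unchanged, skip the string-level reinsertion, and map the ids back while walking the split list. The bare "@N" form is unambiguous here because a stray @ cannot occur outside of strings and comments in legal C.

 proc clex::lexPost {code} {
     variable tokens
     set scmap [list]
     set code [SCextract $code scmap]               ; # I
     set code [string map $tokens $code]            ; # II
     regsub -all -- "\[\t\n \]+" $code \000 code    ; # III
     # Collapse separator runs, then split; the ids survive as "@N".
     set code [string map "\000\000\000 \000 \000\000 \000" $code]
     array set m $scmap
     set res [list]
     foreach tok [split $code \000] {
         set key \000$tok\000
         if {[info exists m($key)]} {
             # Reinsert the extracted string or comment verbatim.
             lappend res $m($key)
         } else {
             lappend res $tok
         }
     }
     return $res
 }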


clex.tcl

 # -*- tcl -*-
 # Lexing C

 package provide clex 1.0
 namespace eval  clex {}

 # Most tokens can be recognized and separated from the others via
 # [string map].  The remainder are identifiers.  The exceptions are
 # strings and comments: they may contain any other token, we may not
 # act upon tokens embedded in them, and each may contain sequences
 # reminiscent of the other.  Because of that the first step is to
 # identify and parse out comments and strings, where 'parsing out'
 # means that these tokens are replaced with special char sequences
 # which refer the lexer to a small database.  In the final output
 # they are placed back.

 proc clex::SCextract {code mapvar} {
    # Extract strings and comments from the code and place them in mapvar.
    # Replace them with ids

    upvar $mapvar map

    set tmp   ""
    set count 0

    while {1} {
        set comment [string first "/*" $code]
        set string  [string first "\"" $code]

        if {($comment < 0) && ($string < 0)} {
            # No comments nor strings to extract anymore.
            break
        }

        # Debug out
        #puts "$string | $comment | [string length $code]"

        if {
            (($string >= 0) && ($comment >= 0) && ($string < $comment)) ||
            (($string >= 0) && ($comment < 0))
        } {
            # The next vari-sized thing is a "-quoted string.
            # Finding its end is a bit more difficult, because we have
            # to accept \" as one character inside of the string.

            set from $string
            while 1 {
                incr from
                set stop  [string first "\""   $code $from]
                set stopb [string first "\\\"" $code $from]
                incr stopb
                if {$stop == $stopb} {set from $stopb ; incr from ; continue}
                break
            }

            set id \000@$count\000
            incr count
            lappend map $id [set sub [string range $code $string $stop]]

            incr stop ; incr string -1
            append tmp [string range $code 0 $string] $id
            set  code  [string range $code $stop end]

            # Debug out
            #puts "\tSTR $id <$sub>"
            continue
        }

        if {
            (($string >= 0) && ($comment >= 0) && ($comment < $string)) ||
            (($comment >= 0) && ($string < 0))
        } {
            # The next vari-sized thing is a comment.

            set  stop [string first "*/" $code $comment]
            incr stop 1

            set id \000@$count\000
            incr count
            lappend map $id [set sub [string range $code $comment $stop]]

            incr stop ; incr comment -1
            append tmp [string range $code 0 $comment] $id
            set  code  [string range $code $stop  end]

            # Debug out
            #puts "\tCMT $id <$sub>"
            continue
        }
        return -code error "Panic, string and comment at the same location"
    }
    append  tmp $code
    return $tmp
 }
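
 # An illustrative run of SCextract; the \000 bytes are rendered
 # as <0> for readability:
 #
 #   % set scmap {}
 #   % clex::SCextract {a = "x"; /* note */} scmap
 #   a = <0>@0<0>; <0>@1<0>
 #   % set scmap
 #   <0>@0<0> {"x"} <0>@1<0> {/* note */}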

 proc clex::DefStart {} {
    variable tokens [list]
    #puts "== <$tokens>"
    return
 }

 proc clex::Def {string {replacement {}}} {
    variable tokens
    if {$replacement == {}} {set replacement \000$string\000}
    lappend tokens $string $replacement
    #puts "== <$tokens>"
    return
 }

 proc clex::DefEnd {} {
    variable       tokens
    array set tmp $tokens
    set res [list]
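    # The decreasing sort puts longer tokens, like ">>=", before
    # their prefixes (">>", ">").  [string map] tries the keys in
    # list order, so this makes it match the longest token first.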
    foreach key [lsort -decreasing [array names tmp]] {
        lappend res $key $tmp($key)
    }
    set tokens $res
    #puts "== <$tokens>"
    return
 }

 proc clex::lex {code} {
    variable tokens

    # Phase I ... Extract strings and comments so that they don't interfere
    #             with the remaining phases.

    # Phase II ... Separate all constant-sized tokens (keywords and
    #              punctuation) from each other.

    # Phase III ... Separate whitespace from the useful text.
    #               Actually converts whitespace into separator characters.

    # Phase IV ... Reinsert extracted tokens and cleanup multi-separator sequences

    set scmap [list]
    if 0 {
        # Minimal number of commands for all phases

        regsub -all -- "\[\t\n \]+" [string map $tokens \
                [SCextract $code scmap]] \
                \000 tmp

        return [split \
                [string map "\000\000 \000" \
                [string map $scmap \
                $tmp]] \000]
    }
    if 1 {
        # Each phase spelled out explicitly ...

        set code [SCextract          $code scmap]    ; # I
        set code [string map $tokens $code]          ; # II
        regsub -all -- "\[\t\n \]+"  $code \000 code ; # III
        set code [string map $scmap  $code]          ; # IV/a
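        # Collapse runs of two or three separators into a single one,
        # so that [split] produces fewer empty list elements.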
        set code [string map "\000\000\000 \000 \000\000 \000" $code]  ; # IV/b
        set code [split $code \000]

        return $code
    }
 }

 namespace eval clex {
    DefStart

    Def if        ; Def typedef
    Def else      ; Def struct
    Def while     ; Def union
    Def for       ; Def break
    Def do        ; Def continue
    Def switch    ; Def case
    Def static    ; Def goto
    Def extern    ; Def return

    Def *  ; Def /  ; Def +  ; Def -  ; Def % ;
    Def &  ; Def |  ; Def ^  ; Def << ; Def >>
    Def && ; Def || ; Def ++ ; Def --
    Def == ; Def != ; Def < ; Def > ; Def <= ; Def >=

    Def =  ; Def += ; Def -=  ; Def <<= ; Def >>=
    Def |= ; Def &= ; Def ||= ; Def &&= ; Def *=
    Def /= ; Def %= ; Def ^=

    Def . ; Def \; ; Def , ; Def -> ; Def :

    Def \[ ; Def \]
    Def \{ ; Def \}
    Def (  ; Def )

    DefEnd
 }
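
For reference, the interactive session quoted in the discussion above, run against this script:

 % source clex.tcl
 % clex::lex {int main (int argc, char** argv) { return 0; }}
 int main ( int argc , char * * argv ) \{ return 0 {;} \} {}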

driver

 #!/bin/sh
 # -*- tcl -*- \
 exec tclsh "$0" ${1+"$@"}

 source clex.tcl

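 # Read the whole file in one go; [close] returns the empty string,
 # so appending its result to the data is a harmless idiom.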
 set data [read [set fh [open [lindex $argv 0]]]][close $fh]
 set data [clex::lex $data]

 puts __________________________________________________
 #puts $data
 puts [join $data \n]
 puts __________________________________________________
 exit

GPS: It seems that 1.0 has a bug. When I run the driver above on tclDecls.h (from 8.3.4) with tclsh8.3 (version 8.3.4) or tclsh8.4 (from CVS a few weeks ago), I find that Tcl_NotifierProcs comes out as "ierProcs". I can't seem to figure out what is going wrong. Do you have any ideas, Andreas?

AK: A few weeks ago = tcl 8.4a3? See above: that alpha had a unicode problem. Does it happen with the current CVS too?

RS: "ierProcs" seems to be cut off precisely after if. Are keywords extracted from the middle of command names?
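
That is exactly what happens: [string map] knows nothing about word boundaries, so a keyword is replaced even in the middle of an identifier. A minimal check, splitting on the separator to make the effect visible:

 % split [string map {if \000if\000} Tcl_NotifierProcs] \000
 Tcl_Not if ierProcs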

