Lexing C

August 14, 2002 -- AK


An important part of any Scripted Compiler is the ability to actually process the system language underlying the scripting language. In the case of Tcl this is the C Language.

The first step is always to separate the input stream into tokens, each representing one semantic atom. In compiler speak, lexing.

The following script lexes a string containing C source into a list of tokens. Note: The script assumes that the sources are free of preprocessor statements like "#include", "#define", etc. I am also not sure whether the list of 'Def'd tokens is actually complete; this requires checking against the C specification. Be aware, too, that a plain [string map] happily matches a keyword inside a longer identifier ('for' inside 'format', for instance), so word-boundary checks would be needed before using this in earnest.
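
For example (illustrative; it assumes the script below has been saved as clex.tcl and sourced):

 source clex.tcl

 # Whitespace disappears, while the comment survives as a single token.
 puts [clex::lex {int x = 10; /* init */}]
 # prints: int x = 10 {;} {/* init */}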

The script below is only one of many variations, i.e. it can be twiddled in many ways. Examples:

  • Keep the whitespace as tokens. Might be required for a pretty-printer.
  • Treat comments as whitespace and remove them, as a true compiler would. Keeping the comments, but not the other whitespace, as the script below does, is more something for a code analyzer looking for additional data (meta-data) in comments. See Source Navigator for a tool in this area.
  • Use different 'Def's to convert the keywords and punctuation into single-byte codes, and refrain from splitting/listifying the result: a sort of special method for compressing C sources (see the sketch after this list).
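
A sketch of the last variation (hypothetical; it assumes that the chosen byte codes \1 ... \4 never occur in the input):

 # Compress keywords to single bytes; keep the result as one string
 # instead of splitting it into a list.
 clex::DefStart
 clex::Def while  \1 ; clex::Def return   \2
 clex::Def switch \3 ; clex::Def continue \4
 clex::DefEnd
 # [string map $clex::tokens $code] compresses; mapping with the
 # pairs swapped restores the original text.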

The next step will be parsing, i.e. adding structure to the token stream under control of a grammar. An existing tool for that is Yeti. See the C Language for grammar references.

I believe that the method I have used below can be used to lex any system language currently in use today: Pascal, Modula, FORTRAN, C++, ... (a sketch for Pascal follows). Again, this is something of interest to Source Navigator.
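
For Pascal, for example, only the token table would have to change, plus the string/comment syntax in SCextract; a fragment, reusing the same Def machinery (illustrative):

 clex::DefStart
 clex::Def begin ; clex::Def end ; clex::Def procedure
 clex::Def :=    ; clex::Def <>  ; clex::Def ..
 clex::DefEnd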


clex.tcl

 # -*- tcl -*-
 # Lexing C

 package provide clex 1.0
 namespace eval  clex {}

 # Most tokens can be recognized and separated from the others via
 # [string map].  The remainder are identifiers.  The exceptions are
 # strings and comments: they may contain any other token, and we must
 # not act upon their contents.  Both may also contain sequences
 # reminiscent of the other.  Because of that, the first step is to
 # identify and parse out comments and strings, where 'parsing out'
 # means that these tokens are replaced with special char sequences
 # which refer the lexer to a small database.  In the final output
 # they are put back.

 proc clex::SCextract {code mapvar} {
    # Extract strings and comments from the code and place them in mapvar.
    # Replace them with ids.  Assumes that every string and comment is
    # properly terminated; unterminated input is not handled.

    upvar $mapvar map

    set tmp   ""
    set count 0

    while {1} {
        set comment [string first "/*" $code]
        set string  [string first "\"" $code]

        if {($comment < 0) && ($string < 0)} {
            # No comments nor strings to extract anymore.
            break
        }

        # Debug out
        #puts "$string | $comment | [string length $code]"

        if {
            (($string >= 0) && ($comment >= 0) && ($string < $comment)) ||
            (($string >= 0) && ($comment < 0))
        } {
            # The next vari-sized thing is a "-quoted string.
            # Finding its end is bit more difficult, because we have
            # to accept \" as one character inside of the string.

            set from $string
            while 1 {
                incr from
                # Find the next quote, and the next escaped quote
                # (\").  If both point at the same character the
                # quote is escaped; skip it and search on.
                set stop  [string first "\""   $code $from]
                set stopb [string first "\\\"" $code $from]
                incr stopb
                if {$stop == $stopb} {set from $stopb ; incr from ; continue}
                # Note: a backslash escaping itself just before the
                # closing quote (\\") will confuse this scan.
                break
            }

            set id \000@$count\000
            incr count
            lappend map $id [set sub [string range $code $string $stop]]

            incr stop ; incr string -1
            append tmp [string range $code 0 $string] $id
            set  code  [string range $code $stop end]

            # Debug out
            #puts "\tSTR $id <$sub>"
            continue
        }

        if {
            (($string >= 0) && ($comment >= 0) && ($comment < $string)) ||
            (($comment >= 0) && ($string < 0))
        } {
            # The next vari-sized thing is a comment.

            set  stop [string first "*/" $code $comment]
            incr stop 1

            set id \000@$count\000
            incr count
            lappend map $id [set sub [string range $code $comment $stop]]

            incr stop ; incr comment -1
            append tmp [string range $code 0 $comment] $id
            set  code  [string range $code $stop  end]

            # Debug out
            #puts "\tCMT $id <$sub>"
            continue
        }
        return -code error "Panic, string and comment at the same location"
    }
    append  tmp $code
    return $tmp
 }
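
 # A quick check of the extractor (illustrative; the \000 bytes are
 # written out as \000 here).  Wrapped in [if 0] so that sourcing the
 # file stays silent:
 if 0 {
     set demo {printf("a /*x*/ b"); /* c */}
     puts [clex::SCextract $demo m]
     # prints: printf(\000@0\000); \000@1\000
     # with m = {\000@0\000 {"a /*x*/ b"} \000@1\000 {/* c */}}
 }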

 proc clex::DefStart {} {
    variable tokens [list]
    #puts "== <$tokens>"
    return
 }

 proc clex::Def {string {replacement {}}} {
    variable tokens
    if {$replacement == {}} {set replacement \000$string\000}
    lappend tokens $string $replacement
    #puts "== <$tokens>"
    return
 }

 proc clex::DefEnd {} {
    variable       tokens
    array set tmp $tokens
    set res [list]
    # Sort decreasing so that a longer token precedes its own
    # prefixes in the map (e.g. <<= before << before <).
    foreach key [lsort -decreasing [array names tmp]] {
        lappend res $key $tmp($key)
    }
    set tokens $res
    #puts "== <$tokens>"
    return
 }
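
 # Why DefEnd sorts decreasing (illustrative): [string map] tries its
 # pairs in list order at every position, so a longer operator has to
 # come before its own prefixes.
 if 0 {
     puts [string map {<<= OP3 << OP2 < OP1} {a <<= b}]  ;# a OP3 b
     puts [string map {< OP1 << OP2 <<= OP3} {a <<= b}]  ;# a OP1OP1= b (wrong)
 }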

 proc clex::lex {code} {
    variable tokens

    # Phase I ... Extract strings and comments so that they don't interfere
    #             with the remaining phases.

    # Phase II ... Separate all constant-sized tokens (keywords and
    #              punctuation) from each other.

    # Phase III ... Separate whitespace from the useful text.
    #               Actually converts whitespace into separator characters.

    # Phase IV ... Reinsert the extracted tokens and clean up
    #              multi-separator sequences.

    if 0 {
        # Minimal number of commands for all phases

        regsub -all -- "\[\t\n \]+" [string map $tokens \
                [SCextract $code scmap]] \
                \000 tmp

        return [split \
                [string map "\000\000 \000" \
                [string map $scmap \
                $tmp]] \000]
    }
    if 1 {
        # Each phase spelled out explicitly ...

        set code [SCextract          $code scmap]    ; # I
        set code [string map $tokens $code]          ; # II
        regsub -all -- "\[\t\n \]+"  $code \000 code ; # III
        set code [string map $scmap  $code]          ; # IV/a
        set code [string map "\000\000\000 \000 \000\000 \000" $code]  ; # IV/b
        set code [split $code \000]

        return $code
    }
 }
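
 # Illustrative trace of [clex::lex {x = y; /* note */}], writing the
 # \000 separator byte as '.':
 #
 #   I    : x = y; .@0.              comment extracted into scmap
 #   II   : x .=. y.;. .@0.          fixed tokens marked up
 #   III  : x..=..y.;...@0.          whitespace -> separators
 #   IV/a : x..=..y.;../* note */    saved text put back
 #   IV/b : x.=.y.;./* note */       separator runs collapsed
 #   split: x = y {;} {/* note */}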

 namespace eval clex {
    DefStart

    Def if        ; Def typedef
    Def else      ; Def struct
    Def while     ; Def union
    Def for       ; Def break
    Def do        ; Def continue
    Def switch    ; Def case
    Def static    ; Def goto
    Def extern    ; Def return

    Def *  ; Def /  ; Def +  ; Def -  ; Def %
    Def &  ; Def |  ; Def ^  ; Def << ; Def >>
    Def && ; Def || ; Def ++ ; Def --
    Def !  ; Def ~  ; Def ?
    Def == ; Def != ; Def <  ; Def >  ; Def <= ; Def >=

    Def =  ; Def += ; Def -=  ; Def *= ; Def /=
    Def %= ; Def <<= ; Def >>=
    Def |= ; Def &=  ; Def ^=
    # Note: C has no ||= or &&= operators.

    Def . ; Def \; ; Def , ; Def -> ; Def :

    Def \[ ; Def \]
    Def \{ ; Def \}
    Def (  ; Def )

    DefEnd
 }
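
The keyword table above covers only a subset of the C keywords; the missing C89 keywords can be Def'd inside the same block, in exactly the same way (illustrative):

 Def int   ; Def char   ; Def void  ; Def long
 Def short ; Def signed ; Def unsigned
 Def float ; Def double ; Def const ; Def sizeof ; Def enum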

driver

 #!/bin/sh
 # -*- tcl -*- \
 exec tclsh "$0" ${1+"$@"}

 source clex.tcl

 # Slurp the whole file; [close] returns an empty string, so appending
 # its result changes nothing and saves a line.
 set data [read [set fh [open [lindex $argv 0]]]][close $fh]
 set data [clex::lex $data]

 puts __________________________________________________
 #puts $data
 puts [join $data \n]
 puts __________________________________________________
 exit
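
Assuming a file sample.c containing just "x = y + 1;", a run of the driver should print each token on a line of its own, roughly:

 __________________________________________________
 x
 =
 y
 +
 1
 ;

 __________________________________________________

(The empty line before the closing rule comes from the file's trailing newline: the final separator leaves an empty element behind in the split. A real tool would filter such elements out.)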

Project: [ Scripted Compiler ]

Previous: ---

Next: [ Parsing C ]