Parsing C

Aug 17, 2002 -- AK


Based upon Lexing C, this page introduces the main components of a parser for the C language. It is part of the Scripted Compiler project.

The lexer shown here is a variant of the original lexer defined in Lexing C. The changes are:

  • It ignores C comments.
  • Keywords and punctuation are converted into true lexer symbols.
  • All internal lexer symbols can be distinguished by their first character. Keywords and punctuation, for example, are introduced by \001.
  • The list generated by the older lexer is not used directly as the result, but processed once more into a list of pairs, each pair containing a lexer symbol and its associated data (the value of a literal or constant, the name of an identifier, ...). A small example follows this list.
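
For example, feeding a tiny declaration through the finished lexer (defined below in clex.tcl) should produce a list of pairs along these lines (a sketch derived from reading the code, not captured output):

 % clex::lex {int x = 5;}
 {INT {}} {IDENTIFIER x} {ASSIGN {}} {CONSTANT 5} {SEMICOLON {}}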

Beyond the lexer you will find a grammar definition for C usable by Yeti. This definition is based upon [1]. Running it through Yeti will generate an LALR(1) shift/reduce parser for C.

Not shown on this page:

  • The glue between the token list and the Yeti parser. This scanner has to perform one additional transformation: the conversion of identifiers naming types into the symbol TYPE_NAME. Why this has to be done is explained in the section defining the grammar. A minimal sketch of such glue follows this list.
  • The code to execute when rules are reduced by the parser. This code has to maintain at least a symbol table, to support the lexer in its task (see above). Beyond that, nearly anything goes: a direct translation to machine code, the generation of an exact parse tree, the generation of an abstract syntax tree, or whatever your imagination comes up with.
  • The generated parser. It comes in at 105 kilobytes of Tcl code and is thus much too large for this page. It can be regenerated via Yeti anyway, so there is really no reason to place it here.
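
The following is only a minimal sketch of such scanner glue, not the code actually used by the project; the names cscan and typenames are assumptions. The idea is simply to post-process the pairs produced by clex::lex, consulting a symbol table that the parser actions fill whenever a typedef is reduced:

 # -*- tcl -*-
 # Hypothetical scanner glue (sketch only).

 namespace eval cscan {
     # Names introduced by typedef; to be maintained by the parser
     # actions whenever a typedef declaration is reduced.
     variable typenames
     array set typenames {}
 }

 proc cscan::tokens {pairs} {
     # Rewrite IDENTIFIER into TYPE_NAME whenever the symbol table
     # says that the name refers to a type. Everything else passes
     # through unchanged.
     variable typenames
     set out [list]
     foreach pair $pairs {
         foreach {sym data} $pair break
         if {[string equal $sym IDENTIFIER] && [info exists typenames($data)]} {
             set sym TYPE_NAME
         }
         lappend out [list $sym $data]
     }
     return $out
 }

The real glue also has to feed these pairs one by one into the Yeti-generated parser; that part is not sketched here.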

AK - Aug 23, 2002: More notes: The reinsertion of strings into the source before the listification is sub-optimal. We should keep the short ids and reinsert the strings in a post-processing pass over the list. That way we keep the constructed objects and do not shift long character sequences around in memory, ... Guess I'll have to time things.
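
A quick way to compare such variants, once both exist, would be Tcl's built-in time command. A sketch (sample.c and the variable src are assumptions):

 set f   [open sample.c]
 set src [read $f]
 close $f

 # Average over 10 runs of the current lexer.
 puts [time {clex::lex $src} 10]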


clex.tcl

 # -*- tcl -*-
 # Lexing C

 package provide clex 1.0
 namespace eval  clex {}

 # Most tokens can be recognized and separated from each other via
 # [string map].  The remainder are identifiers.  The exceptions are
 # strings and comments: they may contain any other token, and we must
 # not act upon their contents. Both may also contain sequences
 # reminiscent of the other. Because of that the first step is to
 # identify and extract comments and strings, where 'extracting' means
 # that these tokens are replaced with special char sequences which
 # refer the lexer to a small database. In the final output they are
 # placed back.
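 #
 # Example: given the input
 #
 #     a = "hi"; /* c */
 #
 # SCextract returns   a = \000@0\000;   and pairs the id \000@0\000
 # in the map with the original string "hi", quotes included. The
 # comment is dropped entirely.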

 proc clex::SCextract {code mapvar} {
     # Extract strings and comments from the code and place them in mapvar.
     # Replace them with ids

     upvar $mapvar map

     set tmp   ""
     set count 0

     while {1} {
         set comment [string first "/*" $code]
         set string  [string first "\"" $code]

         if {($comment < 0) && ($string < 0)} {
             # No comments nor strings to extract anymore.
             break
         }

         # Debug out
         #puts "$string | $comment | [string length $code]"

        if {
            (($string >= 0) && ($comment >= 0) && ($string < $comment)) ||
            (($string >= 0) && ($comment < 0))
        } {
            # The next vari-sized thing is a "-quoted string.
            # Finding its end is a bit more difficult, because we have
            # to accept \" as one character inside of the string.

            set from $string
            while 1 {
                incr from
                set stop  [string first "\""   $code $from]
                set stopb [string first "\\\"" $code $from]
                incr stopb
                if {$stop == $stopb} {set from $stopb ; incr from ; continue}
                break
            }

            set id \000@$count\000
            incr count
            lappend map $id [set sub [string range $code $string $stop]]

            incr stop ; incr string -1
            append tmp [string range $code 0 $string] $id
            set  code  [string range $code $stop end]

            # Debug out
            #puts "\tSTR $id <$sub>"
            continue
        }

        if {
            (($string >= 0) && ($comment >= 0) && ($comment < $string)) ||
            (($comment >= 0) && ($string < 0))
        } {
            # The next vari-sized thing is a comment.
            # We ignore comments.

            set  stop [string first "*/" $code $comment]
            incr stop 2
            incr comment -1
            append tmp [string range $code 0 $comment]
            set  code  [string range $code $stop  end]
            continue
        }
        return -code error "Panic, string and comment at the same location"
    }
    append  tmp $code
    return $tmp
 }

 proc clex::DefStart {} {
    variable tokens [list]
    #puts "== <$tokens>"
    return
 }

 proc clex::Key {string {replacement {}}} {
    variable tokens
    if {$replacement == {}} {
        set replacement \000\001[string toupper $string]\000
    } else {
        set replacement \000\001$replacement\000
    }
    lappend tokens $string $replacement
    #puts "== <$tokens>"
    return
 }

 proc clex::DefEnd {} {
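    # Convert the collected token list into a map whose keys are
    # sorted in decreasing order. [string map] uses the first key
    # that matches at a given position, so longer operators such as
    # "<<=" have to come before their prefixes "<<" and "<".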
    variable       tokens
    array set tmp $tokens
    set res [list]
    foreach key [lsort -decreasing [array names tmp]] {
        lappend res $key $tmp($key)
    }
    set tokens $res
    #puts "== <$tokens>"
    return
 }

 proc clex::lex {code} {
    variable tokens

    # Phase I ... Extract strings and comments so that they don't interfere
    #             with the remaining phases.

    # Phase II ... Separate all constant-sized tokens (keywords and
    #              punctuation) from each other.

    # Phase III ... Separate whitespace from the useful text.
    #               Actually converts whitespace into separator characters.

    # Phase IV ... Reinsert extracted tokens and clean up multi-separator sequences
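    #
    # Example for the input {int x = 5;}:
    #   after II : \000\001INT\000 x \000\001ASSIGN\000 5\000\001SEMICOLON\000
    #   after III: \000\001INT\000\000x\000\000\001ASSIGN\000\0005\000\001SEMICOLON\000
    #   after IV : the 5-element list {\001INT x \001ASSIGN 5 \001SEMICOLON}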

    set scmap [list]
    if 0 {
        # Minimal number of commands for all phases

        regsub -all -- "\[\t\n \]+" [string map $tokens \
                [SCextract $code scmap]] \
                \000 tmp

        set code [split \
                [string trim \
                [string map "\000\000\000 \000 \000\000 \000" \
                [string map $scmap \
                $tmp]] \000] \000]
    }
    if 1 {
        # Each phase spelled out explicitly ...

        set code [SCextract          $code scmap]    ; # I
        set code [string map $tokens $code]          ; # II
        regsub -all -- "\[\t\n \]+"  $code \000 code ; # III
        set code [string map $scmap  $code]          ; # IV/a
        set code [string map "\000\000\000 \000 \000\000 \000" $code]  ; # IV/b
        set code [string trim $code \000]
        set code [split $code \000]
    }

    # Run through the list and create something usable by the parser.
    #
    # A list of pairs (pairs being lists of 2 elements), where each
    # pair contains the symbol to give to the parser, and associated
    # data, if any.

    set tmp [list]
    foreach lex $code {
        switch -glob -- [string index $lex 0] {
            \001 {
                # Keyword, no data.
                lappend tmp [list [string range $lex 1 end] {}]
            }
            ' - [0-9] {
                # Character or numeric constant.
                lappend tmp [list CONSTANT $lex]
            }
            \" {
                # String literal. Strip the double-quotes.
                lappend tmp [list STRING_LITERAL [string range $lex 1 end-1]]
            }
            default {
                # Identifier. This code does not distinguish
                # identifiers and type-names yet. This is deferred to
                # the 'scanner', i.e. the glue code feeding the lexer
                # symbols into the parser.

                lappend tmp [list IDENTIFIER $lex]
            }
        }
    }
    set code $tmp

    return $code
 }

 namespace eval clex {
    DefStart

    Key (   LPAREN      ; Key )  RPAREN    ; Key ->  DEREF
    Key <   LT          ; Key <= LE        ; Key ==  EQ
    Key >   GT          ; Key >= GE        ; Key !=  NE
    Key \[  LBRACKET    ; Key \] RBRACKET  ; Key =   ASSIGN
    Key \{  LBRACE      ; Key \} RBRACE    ; Key *=  MUL_ASSIGN
    Key .   DOT         ; Key ,  COMMA     ; Key /=  DIV_ASSIGN
    Key ++  INCR_OP     ; Key -- DECR_OP   ; Key %=  REM_ASSIGN
    Key &   ADDR_BITAND ; Key *  MULT_STAR ; Key +=  PLUS_ASSIGN
    Key +   PLUS        ; Key -  MINUS     ; Key -=  MINUS_ASSIGN
    Key ~   BITNOT      ; Key !  LOGNOT    ; Key <<= LSHIFT_ASSIGN
    Key /   DIV         ; Key %  REM       ; Key >>= RSHIFT_ASSIGN
    Key <<  LSHIFT      ; Key >> RSHIFT    ; Key &=  BITAND_ASSIGN
    Key ^   BITEOR      ; Key && LOGAND    ; Key ^=  BITEOR_ASSIGN
    Key |   BITOR       ; Key || LOGOR     ; Key |=  BITOR_ASSIGN  
    Key ?   QUERY       ; Key :  COLON     ; Key \;  SEMICOLON
    Key ... ELLIPSIS

    Key typedef ; Key extern   ; Key static ; Key auto ; Key register
    Key void    ; Key char     ; Key short  ; Key int  ; Key long
    Key float   ; Key double   ; Key signed ; Key unsigned
    Key goto    ; Key continue ; Key break  ; Key return
    Key case    ; Key default  ; Key switch
    Key struct  ; Key union    ; Key enum
    Key while   ; Key do       ; Key for
    Key const   ; Key volatile
    Key if      ; Key else
    Key sizeof

    DefEnd
 }

cc.tcl

 # -*- tcl -*-
 # C parser.

 lappend auto_path [file join [file dirname [info script]]]

 package require yeti

 set pg [yeti::yeti \#auto -name CParser]

 ## Notes
 #
 ## The lexeme TYPE_NAME is, from the pure lexer's point of view, the
 ## same as IDENTIFIER, i.e. both have the same lexer definition. The
 ## information needed to distinguish between the two comes from the
 ## parser stage, or rather from the symbol table it manages. This
 ## means that the lexer will look up any IDENTIFIER in the symbol
 ## table, and if that identifier is associated with a type it will
 ## return the symbol TYPE_NAME instead of IDENTIFIER.
 #
 ## Why this distinction?
 #
 ## Because the C grammar has an inherent ambiguity in several places
 ## where a shift/reduce parser will be unable to select the correct
 ## reduction if no distinction is made between identifiers
 ## representing type names and all other identifiers.
 #
 ## The relevant pairs of LR(0) items are listed below. No distinction
 ## is possible if only IDENTIFIER is used in the grammar.
 #
 ##    type_specifier     --> TYPE_NAME  .
 ##    primary_expression --> IDENTIFIER .
 #
 ##    type_specifier     --> TYPE_NAME  .
 ##    identifier_list    --> IDENTIFIER .
 #
 ##    type_specifier     --> TYPE_NAME  .
 ##    direct_declarator  --> IDENTIFIER .
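 #
 ## A concrete example: after "typedef int T;" the fragment "T * x;"
 ## has to be parsed as a declaration (T is a type name, x a pointer
 ## to it), whereas with a plain identifier "t" the fragment "t * x;"
 ## is an expression statement (a multiplication). Only the
 ## TYPE_NAME / IDENTIFIER distinction allows the generated parser to
 ## choose the correct reduction.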

 $pg configure -start translation_unit -verbose 2
 $pg add {
    primary_expression {IDENTIFIER}               {}
    primary_expression {CONSTANT}                 {}
    primary_expression {STRING_LITERAL}           {}
    primary_expression {LPAREN expression RPAREN} {}

    postfix_expression {primary_expression} {}
    postfix_expression {postfix_expression LBRACKET expression RBRACKET} {}
    postfix_expression {postfix_expression LPAREN RPAREN} {}
    postfix_expression {postfix_expression LPAREN argument_expression_list RPAREN} {}
    postfix_expression {postfix_expression DOT IDENTIFIER} {}
    postfix_expression {postfix_expression DEREF IDENTIFIER} {}
    postfix_expression {postfix_expression INCR_OP} {}
    postfix_expression {postfix_expression DECR_OP} {}

    argument_expression_list {assignment_expression}                                {}
    argument_expression_list {argument_expression_list COMMA assignment_expression} {}

    unary_expression {postfix_expression}             {}
    unary_expression {INCR_OP unary_expression}       {}
    unary_expression {DECR_OP unary_expression}       {}
    unary_expression {unary_operator cast_expression} {}
    unary_expression {SIZEOF unary_expression}        {}
    unary_expression {SIZEOF LPAREN type_name RPAREN} {}

    unary_operator {ADDR_BITAND} {}
    unary_operator {MULT_STAR} {}
    unary_operator {PLUS} {}
    unary_operator {MINUS} {}
    unary_operator {BITNOT} {}
    unary_operator {LOGNOT} {}

    cast_expression {unary_expression} {}
    cast_expression {LPAREN type_name RPAREN cast_expression} {}

    multiplicative_expression {cast_expression} {}
    multiplicative_expression {multiplicative_expression MULT_STAR cast_expression} {}
    multiplicative_expression {multiplicative_expression DIV cast_expression} {}
    multiplicative_expression {multiplicative_expression REM cast_expression} {}

    additive_expression {multiplicative_expression} {}
    additive_expression {additive_expression PLUS multiplicative_expression} {}
    additive_expression {additive_expression MINUS multiplicative_expression} {}

    shift_expression {additive_expression} {}
    shift_expression {shift_expression LSHIFT additive_expression} {}
    shift_expression {shift_expression RSHIFT additive_expression} {}

    relational_expression {shift_expression} {}
    relational_expression {relational_expression LT shift_expression} {}
    relational_expression {relational_expression GT shift_expression} {}
    relational_expression {relational_expression LE shift_expression} {}
    relational_expression {relational_expression GE shift_expression} {}

    equality_expression {relational_expression} {}
    equality_expression {equality_expression EQ relational_expression} {}
    equality_expression {equality_expression NE relational_expression} {}

    and_expression {equality_expression} {}
    and_expression {and_expression ADDR_BITAND equality_expression} {}

    exclusive_or_expression {and_expression} {}
    exclusive_or_expression {exclusive_or_expression BITEOR and_expression} {}

    inclusive_or_expression {exclusive_or_expression} {}
    inclusive_or_expression {inclusive_or_expression BITOR exclusive_or_expression} {}

    logical_and_expression {inclusive_or_expression} {}
    logical_and_expression {logical_and_expression LOGAND inclusive_or_expression} {}

    logical_or_expression {logical_and_expression} {}
    logical_or_expression {logical_or_expression LOGOR logical_and_expression} {}

    conditional_expression {logical_or_expression} {}
    conditional_expression {logical_or_expression QUERY expression COLON conditional_expression} {}

    assignment_expression {conditional_expression} {}
    assignment_expression {unary_expression assignment_operator assignment_expression} {}

    assignment_operator {ASSIGN} {}
    assignment_operator {MUL_ASSIGN} {}
    assignment_operator {DIV_ASSIGN} {}
    assignment_operator {REM_ASSIGN} {}
    assignment_operator {PLUS_ASSIGN} {}
    assignment_operator {MINUS_ASSIGN} {}
    assignment_operator {LSHIFT_ASSIGN} {}
    assignment_operator {RSHIFT_ASSIGN} {}
    assignment_operator {BITAND_ASSIGN} {}
    assignment_operator {BITEOR_ASSIGN} {}
    assignment_operator {BITOR_ASSIGN} {}

    expression {assignment_expression} {}
    expression {expression COMMA assignment_expression} {}

    constant_expression {conditional_expression} {}

    declaration {declaration_specifiers SEMICOLON} {}
    declaration {declaration_specifiers init_declarator_list SEMICOLON} {}

    declaration_specifiers {storage_class_specifier} {}
    declaration_specifiers {storage_class_specifier declaration_specifiers} {}
    declaration_specifiers {type_specifier} {}
    declaration_specifiers {type_specifier declaration_specifiers} {}
    declaration_specifiers {type_qualifier} {}
    declaration_specifiers {type_qualifier declaration_specifiers} {}

    init_declarator_list {init_declarator} {}
    init_declarator_list {init_declarator_list COMMA init_declarator} {}

    init_declarator {declarator} {}
    init_declarator {declarator ASSIGN initializer} {}

    storage_class_specifier {TYPEDEF} {}
    storage_class_specifier {EXTERN} {}
    storage_class_specifier {STATIC} {}
    storage_class_specifier {AUTO} {}
    storage_class_specifier {REGISTER} {}

    type_specifier {VOID} {}
    type_specifier {CHAR} {}
    type_specifier {SHORT} {}
    type_specifier {INT} {}
    type_specifier {LONG} {}
    type_specifier {FLOAT} {}
    type_specifier {DOUBLE} {}
    type_specifier {SIGNED} {}
    type_specifier {UNSIGNED} {}
    type_specifier {struct_or_union_specifier} {}
    type_specifier {enum_specifier} {}
    type_specifier {TYPE_NAME} {}

    struct_or_union_specifier {struct_or_union IDENTIFIER LBRACE struct_declaration_list RBRACE} {}
    struct_or_union_specifier {struct_or_union LBRACE struct_declaration_list RBRACE} {}
    struct_or_union_specifier {struct_or_union IDENTIFIER} {}

    struct_or_union {STRUCT} {}
    struct_or_union {UNION} {}

    struct_declaration_list {struct_declaration} {}
    struct_declaration_list {struct_declaration_list struct_declaration} {}

    struct_declaration {specifier_qualifier_list struct_declarator_list SEMICOLON} {}

    specifier_qualifier_list {type_specifier specifier_qualifier_list} {}
    specifier_qualifier_list {type_specifier} {}
    specifier_qualifier_list {type_qualifier specifier_qualifier_list} {}
    specifier_qualifier_list {type_qualifier} {}

    struct_declarator_list {struct_declarator} {}
    struct_declarator_list {struct_declarator_list COMMA struct_declarator} {}

    struct_declarator {declarator} {}
    struct_declarator {COLON constant_expression} {}
    struct_declarator {declarator COLON constant_expression} {}

    enum_specifier {ENUM LBRACE enumerator_list RBRACE} {}
    enum_specifier {ENUM IDENTIFIER LBRACE enumerator_list RBRACE} {}
    enum_specifier {ENUM IDENTIFIER} {}

    enumerator_list {enumerator} {}
    enumerator_list {enumerator_list COMMA enumerator} {}

    enumerator {IDENTIFIER} {}
    enumerator {IDENTIFIER ASSIGN constant_expression} {}

    type_qualifier {CONST} {}
    type_qualifier {VOLATILE} {}

    declarator {pointer direct_declarator} {}
    declarator {direct_declarator} {}

    direct_declarator {IDENTIFIER} {}
    direct_declarator {LPAREN declarator RPAREN} {}
    direct_declarator {direct_declarator LBRACKET constant_expression RBRACKET} {}
    direct_declarator {direct_declarator LBRACKET RBRACKET} {}
    direct_declarator {direct_declarator LPAREN parameter_type_list RPAREN} {}
    direct_declarator {direct_declarator LPAREN identifier_list RPAREN} {}
    direct_declarator {direct_declarator LPAREN RPAREN} {}

    pointer {MULT_STAR} {}
    pointer {MULT_STAR type_qualifier_list} {}
    pointer {MULT_STAR pointer} {}
    pointer {MULT_STAR type_qualifier_list pointer} {}

    type_qualifier_list {type_qualifier} {}
    type_qualifier_list {type_qualifier_list type_qualifier} {}

    parameter_type_list {parameter_list} {}
    parameter_type_list {parameter_list COMMA ELLIPSIS} {}

    parameter_list {parameter_declaration} {}
    parameter_list {parameter_list COMMA parameter_declaration} {}

    parameter_declaration {declaration_specifiers declarator} {}
    parameter_declaration {declaration_specifiers abstract_declarator} {}
    parameter_declaration {declaration_specifiers} {}

    identifier_list {IDENTIFIER} {}
    identifier_list {identifier_list COMMA IDENTIFIER} {}

    type_name {specifier_qualifier_list} {}
    type_name {specifier_qualifier_list abstract_declarator} {}

    abstract_declarator {pointer} {}
    abstract_declarator {direct_abstract_declarator} {}
    abstract_declarator {pointer direct_abstract_declarator} {}

    direct_abstract_declarator {LPAREN abstract_declarator RPAREN} {}
    direct_abstract_declarator {LBRACKET RBRACKET} {}
    direct_abstract_declarator {LBRACKET constant_expression RBRACKET} {}
    direct_abstract_declarator {direct_abstract_declarator LBRACKET RBRACKET} {}
    direct_abstract_declarator {direct_abstract_declarator LBRACKET constant_expression RBRACKET} {}
    direct_abstract_declarator {LPAREN RPAREN} {}
    direct_abstract_declarator {LPAREN parameter_type_list RPAREN} {}
    direct_abstract_declarator {direct_abstract_declarator LPAREN RPAREN} {}
    direct_abstract_declarator {direct_abstract_declarator LPAREN parameter_type_list RPAREN} {}

    initializer {assignment_expression} {}
    initializer {LBRACE initializer_list RBRACE} {}
    initializer {LBRACE initializer_list COMMA RBRACE} {}

    initializer_list {initializer} {}
    initializer_list {initializer_list COMMA initializer} {}

    statement {labeled_statement} {}
    statement {compound_statement} {}
    statement {expression_statement} {}
    statement {selection_statement} {}
    statement {iteration_statement} {}
    statement {jump_statement} {}

    labeled_statement {IDENTIFIER COLON statement} {}
    labeled_statement {CASE constant_expression COLON statement} {}
    labeled_statement {DEFAULT COLON statement} {}

    compound_statement {LBRACE RBRACE} {}
    compound_statement {LBRACE statement_list RBRACE} {}
    compound_statement {LBRACE declaration_list RBRACE} {}
    compound_statement {LBRACE declaration_list statement_list RBRACE} {}

    declaration_list {declaration} {}
    declaration_list {declaration_list declaration} {}

    statement_list {statement} {}
    statement_list {statement_list statement} {}

    expression_statement {SEMICOLON} {}
    expression_statement {expression SEMICOLON} {}

    selection_statement {IF LPAREN expression RPAREN statement} {}
    selection_statement {IF LPAREN expression RPAREN statement ELSE statement} {}
    selection_statement {SWITCH LPAREN expression RPAREN statement} {}

    iteration_statement {WHILE LPAREN expression RPAREN statement} {}
    iteration_statement {DO statement WHILE LPAREN expression RPAREN SEMICOLON} {}
    iteration_statement {FOR LPAREN expression_statement expression_statement RPAREN statement} {}
    iteration_statement {FOR LPAREN expression_statement expression_statement expression RPAREN statement} {}

    jump_statement {GOTO IDENTIFIER SEMICOLON} {}
    jump_statement {CONTINUE SEMICOLON} {}
    jump_statement {BREAK SEMICOLON} {}
    jump_statement {RETURN SEMICOLON} {}
    jump_statement {RETURN expression SEMICOLON} {}

    translation_unit {external_declaration} {}
    translation_unit {translation_unit external_declaration} {}

    external_declaration {function_definition} {}
    external_declaration {declaration} {}

    function_definition {declaration_specifiers declarator declaration_list compound_statement} {}
    function_definition {declaration_specifiers declarator compound_statement} {}
    function_definition {declarator declaration_list compound_statement} {}
    function_definition {declarator compound_statement} {}
 }

 # Dump the generated parser.

 set    f [open cparser.tcl w]
 puts  $f [$pg dump]
 close $f
 delete object $pg

 # Done.
 exit



GPS: This is an interesting project. What's next? Is a project page available for this somewhere (SF?)?

Currently I am toying a bit with Yeti, to make it more flexible and to give me more control over the generated code. From there I would add code here to generate the parse tree, and, of course, the management of the symbol table, so that lexer and parser actually work together (see the notes above).

Beyond that? I don't know yet. Things like conversion of the parse tree into a more abstract tree, type checking, ...

Way later: code generation.

SF - No, there is no project on SF about this. Well, the nearest is David Cuthbert's kt2c, which translates Tcl procedures into equivalent C.