[ Scripted Compiler :: Parsing C :: --> Lexing C ]
Based upon Scripted Parsing and Lexing C this page introduces the main components of a parser for the C language. This page is part of the Scripted Compiler project.
Beyond the lexer you will find a grammar definition for C usable by Yeti. This definition is based upon [L1 ]. Running it through Yeti will generate a LALR(1) shift/reduce parser for C.
Not shown on this page:
cc.tcl
#!/usr/bin/env tclsh # -*- tcl -*- # C parser. lappend auto_path [file join [file dirname [info script]] ..] package require yeti set pg [yeti::yeti \#auto -name CParser] ## Notes # ## The lexeme TYPE_NAME is from the pure lexers point of view the same ## as IDENTIFIER, i.e. both have the same lexer definition. The ## information to distinguish between both of them comes from the ## parser stage, or rather from the symbol table it manages. This ## means that the lexer will lookup any IDENTIFIER in the symbol table ## and if that identifier is associated with a type it will return the ## symbol TYPE_NAME instead of IDENTIFIER. # ## Why this distinction ? # ## Because the C grammar has an inherent ambiguity in several places ## where a shift/reduce parser will be unable to select the correct ## reduction if no distinction is made between identifiers ## representing type names and all other identifiers. # ## The relevant pairs of LR(0) items are listed below. No distinction ## is possible if only IDENTIFIER is used in the grammar. # ## type_specifier --> TYPE_NAME . ## primary_expression --> IDENTIFIER . # ## type_specifier --> TYPE_NAME . ## identifier_list --> IDENTIFIER . # ## type_specifier --> TYPE_NAME . ## direct_declarator --> IDENTIFIER . $pg configure -start translation_unit -verbose 2 $pg add { primary_expression {IDENTIFIER} - | {CONSTANT} - | {STRING_LITERAL} - | {LPAREN expression RPAREN} - postfix_expression {primary_expression} - | {postfix_expression LBRACKET expression RBRACKET} - | {postfix_expression LPAREN RPAREN} - | {postfix_expression LPAREN argument_expression_list RPAREN} - | {postfix_expression DOT IDENTIFIER} - | {postfix_expression DEREF IDENTIFIER} - | {postfix_expression INCR_OP} - | {postfix_expression DECR_OP} - argument_expression_list {assignment_expression} - | {argument_expression_list COMMA assignment_expression} - unary_expression {postfix_expression} - | {INCR_OP unary_expression} - | {DECR_OP unary_expression} - | {unary_operator cast_expression} - | {SIZEOF unary_expression} - | {SIZEOF LPAREN type_name RPAREN} - unary_operator {ADDR_BITAND} - | {MULT_STAR} - | {PLUS} - | {MINUS} - | {BITNOT} - | {LOGNOT} - cast_expression {unary_expression} - | {LPAREN type_name RPAREN cast_expression} - multiplicative_expression {cast_expression} - | {multiplicative_expression MULT_STAR cast_expression} - | {multiplicative_expression DIV cast_expression} - | {multiplicative_expression REM cast_expression} - additive_expression {multiplicative_expression} - | {additive_expression PLUS multiplicative_expression} - | {additive_expression MINUS multiplicative_expression} - shift_expression {additive_expression} - | {shift_expression LSHIFT additive_expression} - | {shift_expression RSHIFT additive_expression} - relational_expression {shift_expression} - | {relational_expression LT shift_expression} - | {relational_expression GT shift_expression} - | {relational_expression LE shift_expression} - | {relational_expression GE shift_expression} - equality_expression {relational_expression} - | {equality_expression EQ relational_expression} - | {equality_expression NE relational_expression} - and_expression {equality_expression} - | {and_expression ADDR_BITAND equality_expression} - exclusive_or_expression {and_expression} - | {exclusive_or_expression BITEOR and_expression} - inclusive_or_expression {exclusive_or_expression} - | {inclusive_or_expression BITOR exclusive_or_expression} - logical_and_expression {inclusive_or_expression} - | {logical_and_expression LOGAND inclusive_or_expression} - logical_or_expression {logical_and_expression} - | {logical_or_expression LOGOR logical_and_expression} - conditional_expression {logical_or_expression} - | {logical_or_expression QUERY expression COLON conditional_expression} - assignment_expression {conditional_expression} - | {unary_expression assignment_operator assignment_expression} - assignment_operator {ASSIGN} - | {MUL_ASSIGN} - | {DIV_ASSIGN} - | {REM_ASSIGN} - | {PLUS_ASSIGN} - | {MINUS_ASSIGN} - | {LSHIFT_ASSIGN} - | {RSHIFT_ASSIGN} - | {BITAND_ASSIGN} - | {BITEOR_ASSIGN} - | {BITOR_ASSIGN} - expression {assignment_expression} - | {expression COMMA assignment_expression} - constant_expression {conditional_expression} - declaration {declaration_specifiers SEMICOLON} - | {declaration_specifiers init_declarator_list SEMICOLON} - declaration_specifiers {storage_class_specifier} - | {storage_class_specifier declaration_specifiers} - | {type_specifier} - | {type_specifier declaration_specifiers} - | {type_qualifier} - | {type_qualifier declaration_specifiers} - init_declarator_list {init_declarator} - | {init_declarator_list COMMA init_declarator} - init_declarator {declarator} - | {declarator ASSIGN initializer} - storage_class_specifier {TYPEDEF} - | {EXTERN} - | {STATIC} - | {AUTO} - | {REGISTER} - type_specifier {VOID} - | {CHAR} - | {SHORT} - | {INT} - | {LONG} - | {FLOAT} - | {DOUBLE} - | {SIGNED} - | {UNSIGNED} - | {struct_or_union_specifier} - | {enum_specifier} - | {TYPE_NAME} - struct_or_union_specifier {struct_or_union IDENTIFIER LBRACE struct_declaration_list RBRACE} - | {struct_or_union LBRACE struct_declaration_list RBRACE} - | {struct_or_union IDENTIFIER} - struct_or_union {STRUCT} - | {UNION} - struct_declaration_list {struct_declaration} - | {struct_declaration_list struct_declaration} - struct_declaration {specifier_qualifier_list struct_declarator_list SEMICOLON} - specifier_qualifier_list {type_specifier specifier_qualifier_list} - | {type_specifier} - | {type_qualifier specifier_qualifier_list} - | {type_qualifier} - struct_declarator_list {struct_declarator} - | {struct_declarator_list COMMA struct_declarator} - struct_declarator {declarator} - | {COLON constant_expression} - | {declarator COLON constant_expression} - enum_specifier {ENUM LBRACE enumerator_list RBRACE} - | {ENUM LBRACE enumerator_list COMMA RBRACE} - | {ENUM IDENTIFIER LBRACE enumerator_list RBRACE} - | {ENUM IDENTIFIER LBRACE enumerator_list COMMA RBRACE} - | {ENUM IDENTIFIER} - enumerator_list {enumerator} - | {enumerator_list COMMA enumerator} - enumerator {IDENTIFIER} - | {IDENTIFIER ASSIGN constant_expression} - type_qualifier {CONST} - | {VOLATILE} - declarator {pointer direct_declarator} - | {direct_declarator} - direct_declarator {IDENTIFIER} - | {LPAREN declarator RPAREN} - | {direct_declarator LBRACKET constant_expression RBRACKET} - | {direct_declarator LBRACKET RBRACKET} - | {direct_declarator LPAREN parameter_type_list RPAREN} - | {direct_declarator LPAREN identifier_list RPAREN} - | {direct_declarator LPAREN RPAREN} - pointer {MULT_STAR} - | {MULT_STAR type_qualifier_list} - | {MULT_STAR pointer} - | {MULT_STAR type_qualifier_list pointer} - type_qualifier_list {type_qualifier} - | {type_qualifier_list type_qualifier} - parameter_type_list {parameter_list} - | {parameter_list COMMA ELLIPSIS} - parameter_list {parameter_declaration} - | {parameter_list COMMA parameter_declaration} - parameter_declaration {declaration_specifiers declarator} - | {declaration_specifiers abstract_declarator} - | {declaration_specifiers} - identifier_list {IDENTIFIER} - | {identifier_list COMMA IDENTIFIER} - type_name {specifier_qualifier_list} - | {specifier_qualifier_list abstract_declarator} - abstract_declarator {pointer} - | {direct_abstract_declarator} - | {pointer direct_abstract_declarator} - direct_abstract_declarator {LPAREN abstract_declarator RPAREN} - | {LBRACKET RBRACKET} - | {LBRACKET constant_expression RBRACKET} - | {direct_abstract_declarator LBRACKET RBRACKET} - | {direct_abstract_declarator LBRACKET constant_expression RBRACKET} - | {LPAREN RPAREN} - | {LPAREN parameter_type_list RPAREN} - | {direct_abstract_declarator LPAREN RPAREN} - | {direct_abstract_declarator LPAREN parameter_type_list RPAREN} - initializer {assignment_expression} - | {LBRACE initializer_list RBRACE} - | {LBRACE initializer_list COMMA RBRACE} - initializer_list {initializer} - | {initializer_list COMMA initializer} - statement {labeled_statement} - | {compound_statement} - | {expression_statement} - | {selection_statement} - | {iteration_statement} - | {jump_statement} - labeled_statement {IDENTIFIER COLON statement} - | {CASE constant_expression COLON statement} - | {DEFAULT COLON statement} - compound_statement {LBRACE RBRACE} - | {LBRACE statement_list RBRACE} - | {LBRACE declaration_list RBRACE} - | {LBRACE declaration_list statement_list RBRACE} - declaration_list {declaration} - | {declaration_list declaration} - statement_list {statement} - | {statement_list statement} - expression_statement {SEMICOLON} - | {expression SEMICOLON} - selection_statement {IF LPAREN expression RPAREN statement} - | {IF LPAREN expression RPAREN statement ELSE statement} - | {SWITCH LPAREN expression RPAREN statement} - iteration_statement {WHILE LPAREN expression RPAREN statement} - | {DO statement WHILE LPAREN expression RPAREN SEMICOLON} - | {FOR LPAREN expression_statement expression_statement RPAREN statement} - | {FOR LPAREN expression_statement expression_statement expression RPAREN statement} - jump_statement {GOTO IDENTIFIER SEMICOLON} - | {CONTINUE SEMICOLON} - | {BREAK SEMICOLON} - | {RETURN SEMICOLON} - | {RETURN expression SEMICOLON} - translation_unit {external_declaration} - | {translation_unit external_declaration} - external_declaration {function_definition} - | {declaration} - function_definition {declaration_specifiers declarator declaration_list compound_statement} - | {declaration_specifiers declarator compound_statement} - | {declarator declaration_list compound_statement} - | {declarator compound_statement} - } # Dump the generated parser. set f [open cparser.tcl w] puts $f [$pg dump] close $f itcl::delete object $pg
Change Log
GPS: This is an interesting project. What's next? Is a project page available for this somewhere(SF?)?
Currently I am toying a bit with Yeti, to make it more flexible and give me more control over the generated code. From there I would add code here to generate the parse tree of the code, and of course, management of the symbol table so that lexer and parser actually work together (see notes).
Beyond that ? I don't know yet. Things like conversion of the parse tree to a more abstract tree, type checking, ...
Way later: code generation.
SF - No, there is no project on SF about this. Well, the nearest is David Cuthbert's kt2c. This translates tcl procedures into equivalent C.
(Helmut Giese, 2004-06-30) There seemed to be a gap between the lexer from Lexing C and the parser above: As they stand these two don't fit to each other. Below is a small adapter class, which makes them compatible and a driver to combine all the pieces.
Please note, that the current version of Yeti (0.4) requires that at least one rule of the grammar has a non-empty script associated with it (but a comment suffices).
File lexadapter.tcl
# # An adapter to pass the output generated by clex.tcl to the (yeti based) # parser generated by cc.tcl from above. # package require Itcl foreach classDef {LexAdapter} { if { [info command $classDef] != "" } { catch {itcl::delete class $classDef} } } itcl::class LexAdapter { public variable _currIdx 0 public variable _sym {} ;# list of symbols public variable _aIdxLst {} ;# list of attribute indices public variable _aValLst {} ;# list of corresponding attribute values public variable _aIdx 0 constructor {data} { foreach {_sym attr} $data {break} foreach {_aIdxLst _aValLst} $attr {break} } # # Return the next symbol. If the symbol has a value (an attribute in the # notation of clex) return it, too. This case is signaled by clex by # appending a '-' to the symbol's name. # Note: The scanner recognises comments, but we don't want to pass those # to the parser. Hence 'next' is structured as a 'while' loop, which will # 'continue' on a comment and 'break' in all other cases. # method next {} { while 1 { if { $_currIdx >= [llength $_sym] } { set res [list] } else { set res [lindex $_sym $_currIdx] incr _currIdx if { [string index $res end] eq "-" } { # need to handle an "attribute" set attrIdx [lindex $_aIdxLst $_aIdx] incr _aIdx if { $res eq "COMMENT-" } { continue ;# parser doesn't want comments } if { $res eq "IDENT-" } { set res IDENTIFIER- ;# parser uses different name } # remove '-' set res [string range $res 0 end-1] lappend res [lindex $_aValLst $attrIdx] } } break } return $res } }
The driver. For the moment it is probably easiest to have all needed files in one directory (clex.tcl from Lexing C, lexbase.tcl from Scripted Lexing plus the files from this page).
#!/usr/bin/env tclsh # # First steps parsing C # # This version uses the clex lexer and the adapter from lexadapter.tcl # package require Itcl ;# on behalf of cparser.tcl # don't have a pkgIndex.tcl for this one, but clex wants to [package require] it source lexbase.tcl # the lexer and parser used here source clex.tcl source cparser.tcl # the adapter: 'lexer output -> yeti's expectations' source lexadapter.tcl # what to parse ? if {![llength $argv]} { set fName mini.c } else { set fName [lindex $argv 0] } puts "parsing $fName" set input [read [set fh [open $fName]]][close $fh] # Strip preprocessor regsub -all {(?w)^[ \t\v\f\r]*#(?:\\.|[^\n\\])*$} $input {} input set code [lexbase::lex $input] # # Instantiate scanner and parser # set scanner [LexAdapter #auto $code] #$scanner configure -verbose 1 set cp [CParser #auto -scanner $scanner] #$cp configure -verbose 1 $cp reset # all this effort to be able to issue _this_ command $cp parse
Note that in order to simply drop the current yytext, typically for comments, so you do not need to take them in account COMMENT tokens in the grammar, one can simply call the next method.
$sg add -nocase { {\{} { return LACCO } {\}} { return RACCO } {;} { return SCOL } [0-9]+ { return [list INT $yytext] } [0-9]+{\.}[0-9]* { return [list FLOAT $yytext] } {//.*?\n} { return [next] } }
In this example, all the characters after // are simply skipped
AMG: Here's a parser that goes with the alternative, ylex-based scanner found on Lexing C.
It's just a skeleton, but it does attempt to handle typedefs. typedef is the one thing preventing C from being context-free; therefore even the simplest C parser needs to implement it so it can tell the difference between an IDENTIFIER and a TYPE_NAME.
This basic implementation doesn't properly limit the scope of typedefs, so a typedef in one function will be valid in the next. Getting this right requires more smarts and may be difficult to implement in isolation. In other words, much of the necessary machinery would also be needed by a proper compiler implementation.
# cparser.tcl package require yeti # Create the object used to assemble the parser. yeti::yeti CParserFactory -name CParser -start translation_unit # Define internal variables and methods. CParserFactory code private { variable typedef 0 ;# Nonzero while constructing a typedef. variable identifier {} ;# Current identifier name. # handleInitDeclarator -- # Handler for init_declarator. Registers typedefs. method handleInitDeclarator {} { if {$typedef} { $scanner addTypeName $identifier } set identifier {} } } # On error, print the filename, line number, and column number. CParserFactory code error { if {[set file [$scanner cget -file]] ne {}} { puts -nonewline $verbout $file: } puts $verbout "[$scanner cget -line]:[$scanner cget -column]: $yyerrmsg" } # Define the grammar. CParserFactory add { optional_comma {} - | {,} - primary_expression {IDENTIFIER} - | {CONSTANT} - | {STRING_LITERAL} - | {( expression )} - postfix_expression {primary_expression} - | {postfix_expression [ expression ]} - | {postfix_expression ( )} - | {postfix_expression ( argument_expression_list )} - | {postfix_expression . IDENTIFIER} - | {postfix_expression -> IDENTIFIER} - | {postfix_expression ++} - | {postfix_expression --} - argument_expression_list {assignment_expression} - | {argument_expression_list , assignment_expression} - unary_expression {postfix_expression} - | {++ unary_expression} - | {-- unary_expression} - | {unary_operator cast_expression} - | {SIZEOF unary_expression} - | {SIZEOF ( type_name )} - unary_operator {&} - | {*} - | {+} - | {-} - | {!} - | {~} - cast_expression {unary_expression} - | {( type_name ) cast_expression} - multiplicative_expression {cast_expression} - | {multiplicative_expression * cast_expression} - | {multiplicative_expression / cast_expression} - | {multiplicative_expression % cast_expression} - additive_expression {multiplicative_expression} - | {additive_expression + multiplicative_expression} - | {additive_expression - multiplicative_expression} - shift_expression {additive_expression} - | {shift_expression << additive_expression} - | {shift_expression >> additive_expression} - relational_expression {shift_expression} - | {relational_expression < shift_expression} - | {relational_expression > shift_expression} - | {relational_expression <= shift_expression} - | {relational_expression >= shift_expression} - equality_expression {relational_expression} - | {equality_expression == relational_expression} - | {equality_expression != relational_expression} - and_expression {equality_expression} - | {and_expression & equality_expression} - exclusive_or_expression {and_expression} - | {exclusive_or_expression ^ and_expression} - inclusive_or_expression {exclusive_or_expression} - | {inclusive_or_expression | exclusive_or_expression} - logical_and_expression {inclusive_or_expression} - | {logical_and_expression && inclusive_or_expression} - logical_or_expression {logical_and_expression} - | {logical_or_expression || logical_and_expression} - conditional_expression {logical_or_expression} - | {logical_or_expression ? expression : conditional_expression} - assignment_expression {conditional_expression} - | {unary_expression assignment_operator assignment_expression} - assignment_operator {=} - | {*=} - | {/=} - | {%=} - | {+=} - | {-=} - | {<<=} - | {>>=} - | {&=} - | {^=} - | {|=} - expression {assignment_expression} - | {expression , assignment_expression} - constant_expression {conditional_expression} - declaration {declaration_specifiers ;} - | {declaration_specifiers init_declarator_list ;} - declaration_specifiers {storage_class_specifier} - | {storage_class_specifier declaration_specifiers} - | {type_specifier} - | {type_specifier declaration_specifiers} - | {type_qualifier} - | {type_qualifier declaration_specifiers} - init_declarator_list {init_declarator} - | {init_declarator_list , init_declarator} - init_declarator {declarator} {handleInitDeclarator} | {declarator = initializer} {handleInitDeclarator} storage_class_specifier {TYPEDEF} {set typedef 1} | {EXTERN} - | {STATIC} - | {AUTO} - | {REGISTER} - type_specifier {VOID} - | {CHAR} - | {SHORT} - | {INT} - | {LONG} - | {FLOAT} - | {DOUBLE} - | {SIGNED} - | {UNSIGNED} - | {struct_or_union_specifier} - | {enum_specifier} - | {TYPE_NAME} - struct_or_union_specifier {struct_or_union IDENTIFIER \{ struct_declaration_list \}} - | {struct_or_union \{ struct_declaration_list \}} - | {struct_or_union IDENTIFIER} - struct_or_union {STRUCT} - | {UNION} - struct_declaration_list {struct_declaration} - | {struct_declaration_list struct_declaration} - struct_declaration {specifier_qualifier_list struct_declarator_list ;} - specifier_qualifier_list {type_specifier specifier_qualifier_list} - | {type_specifier} - | {type_qualifier specifier_qualifier_list} - | {type_qualifier} - struct_declarator_list {struct_declarator} - | {struct_declarator_list , struct_declarator} - struct_declarator {declarator} - | {: constant_expression} - | {declarator : constant_expression} - enum_specifier {ENUM \{ enumerator_list optional_comma \}} - | {ENUM IDENTIFIER \{ enumerator_list optional_comma \}} - | {ENUM IDENTIFIER} - enumerator_list {enumerator} - | {enumerator_list , enumerator} - enumerator {IDENTIFIER} - | {IDENTIFIER = constant_expression} - type_qualifier {CONST} - | {VOLATILE} - declarator {pointer direct_declarator} - | {direct_declarator} - direct_declarator {IDENTIFIER} {set identifier $1} | {( declarator )} - | {direct_declarator [ constant_expression ]} - | {direct_declarator [ ]} - | {direct_declarator ( parameter_type_list )} - | {direct_declarator ( identifier_list )} - | {direct_declarator ( )} - pointer {*} - | {* type_qualifier_list} - | {* pointer} - | {* type_qualifier_list pointer} - type_qualifier_list {type_qualifier} - | {type_qualifier_list type_qualifier} - parameter_type_list {parameter_list} - | {parameter_list , ...} - parameter_list {parameter_declaration} - | {parameter_list , parameter_declaration} - parameter_declaration {declaration_specifiers declarator} - | {declaration_specifiers abstract_declarator} - | {declaration_specifiers} - identifier_list {IDENTIFIER} - | {identifier_list , IDENTIFIER} - type_name {specifier_qualifier_list} - | {specifier_qualifier_list abstract_declarator} - abstract_declarator {pointer} - | {direct_abstract_declarator} - | {pointer direct_abstract_declarator} - direct_abstract_declarator {( abstract_declarator )} - | {[ ]} - | {[ constant_expression ]} - | {direct_abstract_declarator [ ]} - | {direct_abstract_declarator [ constant_expression ]} - | {( )} - | {( parameter_type_list )} - | {direct_abstract_declarator ( )} - | {direct_abstract_declarator ( parameter_type_list )} - initializer {assignment_expression} - | {\{ initializer_list optional_comma \}} - initializer_list {initializer} - | {initializer_list , initializer} - statement {labeled_statement} - | {compound_statement} - | {expression_statement} - | {selection_statement} - | {iteration_statement} - | {jump_statement} - labeled_statement {IDENTIFIER : statement} - | {CASE constant_expression : statement} - | {DEFAULT : statement} - compound_statement {\{ \}} - | {\{ statement_list \}} - | {\{ declaration_list \}} - | {\{ declaration_list statement_list \}} - declaration_list {declaration} - | {declaration_list declaration} - statement_list {statement} - | {statement_list statement} - expression_statement {;} - | {expression ;} - selection_statement {IF ( expression ) statement} - | {IF ( expression ) statement ELSE statement} - | {SWITCH ( expression ) statement} - iteration_statement {WHILE ( expression ) statement} - | {DO statement WHILE ( expression ) ;} - | {FOR ( expression_statement expression_statement ) statement} - | {FOR ( expression_statement expression_statement expression ) statement} - jump_statement {GOTO IDENTIFIER ;} - | {CONTINUE ;} - | {BREAK ;} - | {RETURN ;} - | {RETURN expression ;} - translation_unit {external_declaration} {set typedef 0} | {translation_unit external_declaration} - external_declaration {function_definition} - | {declaration} - function_definition {declaration_specifiers declarator declaration_list compound_statement} - | {declaration_specifiers declarator compound_statement} - | {declarator declaration_list compound_statement} - | {declarator compound_statement} - } # Create the CParser class. #eval [CParserFactory dump] # Cache the generated script to avoid dependency on yeti and for speed. set chan [open cparser.out.tcl w] puts $chan [CParserFactory dump] close $chan itcl::delete object CParserFactory
And here's the driver program. You will need yeti, ylex, cscanner.tcl (from Lexing C), and cparser.tcl (above). If you have trouble with mismatched braces or such, it means you need to get the very latest version of yeti which allows braces in terminal names.
#!/usr/bin/env tclsh # Load scanner and parser scripts. package require Itcl ;# on behalf of cparser.out.tcl source cscanner.tcl if {![file exists cparser.out.tcl] || [file mtime cparser.tcl] > [file mtime cparser.out.tcl]} { source cparser.tcl } source cparser.out.tcl # Read input file and strip preprocessor. set chan [open [lindex $argv 0]] regsub -all {(?w)^[ \t\v\f\r]*#(?:\\.|[^\n\\])*$} [read $chan] {} input close $chan # Create scanner and give it the input. CScanner scanner scanner start $input # Create and run scanner. CParser parser -scanner scanner parser parse # vim: set sts=4 sw=4 tw=80 et ft=tcl:
AMG: The above code is more fully developed on the Parsing C Types page.
[ Scripted Compiler :: Parsing C :: --> Lexing C ]