[[ [Scripted Compiler] :: '''Parsing C''' :: --> [Lexing C] ]] ---- Based upon [Scripted Parsing] and [Lexing C] this page introduces the main components of a parser for the [C language]. This page is part of the [Scripted Compiler] project. Beyond the lexer you will find a grammar definition for C useable by [Yeti]. This definition is based upon [http://www.lysator.liu.se/c/ANSI-C-grammar-y.html]. Running it through Yeti will generate a LALR(1) shift/reduce parser for C. Not shown on this page: * A lexer for C. Go to [Lexing C] to get one. * The glue between the token list generated by the lexer, and the yeti-parser. This scanner has to perform one additional transformation, the conversion of identifiers for type names into the symbol TYPE_NAME. Why this has to be done is explained in the section defining the grammar. * The code to execute when rules are reduced by the parser. This code has to maintain at least a symbol-table, to support the lexer in his task, see above. Beyond that nearly anything will go. This could be a direct translation to machine code, the generation of an exact parse tree, generation of an abstract syntax tree, or whatever your imagination will come up with. * The generated parser. It comes in at 105 Kilobyte of Tcl code and is thus much too large for this page. And it can be generated anyways via [Yeti], so there is really no reason to place it here. ---- '''cc.tcl''' # -*- tcl -*- # C parser. lappend auto_path [file join [file dirname [info script]]] package require yeti set pg [yeti::yeti \#auto -name CParser] ## Notes # ## The lexem TYPE_NAME is from the pure lexers point of view the same ## as IDENTIFIER, i.e. both have the same lexer definition. The ## information to distinguish between both of them comes from the ## parser stage, or rather from the symbol table it manages. This ## means that the lexer will lookup any IDENTIFIER in the symbol table ## and if that identifier is associated with a type it will return the ## symbol TYPE_NAME instead of IDENTIFIER. # ## Why this distinction ? # ## Because the C grammar has an inherent ambiguity in several places ## where a shift/reduce parser will be unable to select the correct ## reduction if no distinction is made between identifiers ## representing type names and all other identifiers. # ## The relevant pairs of LR(0) items are listed below. No distinction ## is possible if only IDENTIFIER is used in the grammar. # ## type_specifier --> TYPE_NAME . ## primary_expression --> IDENTIFIER . # ## type_specifier --> TYPE_NAME . ## identifier_list --> IDENTIFIER . # ## type_specifier --> TYPE_NAME . ## direct_declarator --> IDENTIFIER . $pg configure -start translation_unit -verbose 2 $pg add { primary_expression {IDENTIFIER} {} primary_expression {CONSTANT} {} primary_expression {STRING_LITERAL} {} primary_expression {LPAREN expression RPAREN} {} postfix_expression {primary_expression} {} postfix_expression {postfix_expression LBRACKET expression RBRACKET} {} postfix_expression {postfix_expression LPAREN RPAREN} {} postfix_expression {postfix_expression LPAREN argument_expression_list RPAREN} {} postfix_expression {postfix_expression DOT IDENTIFIER} {} postfix_expression {postfix_expression DEREF IDENTIFIER} {} postfix_expression {postfix_expression INCR_OP} {} postfix_expression {postfix_expression DECR_OP} {} argument_expression_list {assignment_expression} {} argument_expression_list {argument_expression_list COMMA assignment_expression} {} unary_expression {postfix_expression} {} unary_expression {INCR_OP unary_expression} {} unary_expression {DECR_OP unary_expression} {} unary_expression {unary_operator cast_expression} {} unary_expression {SIZEOF unary_expression} {} unary_expression {SIZEOF LPAREN type_name RPAREN} {} unary_operator {ADDR_BITAND} {} unary_operator {STAR} {} unary_operator {PLUS} {} unary_operator {MINUS} {} unary_operator {BITNOT} {} unary_operator {LOGNOT} {} cast_expression {unary_expression} {} cast_expression {LPAREN type_name RPAREN cast_expression} {} multiplicative_expression {cast_expression} {} multiplicative_expression {multiplicative_expression MULT_STAR cast_expression} {} multiplicative_expression {multiplicative_expression DIV cast_expression} {} multiplicative_expression {multiplicative_expression REM cast_expression} {} additive_expression {multiplicative_expression} {} additive_expression {additive_expression PLUS multiplicative_expression} {} additive_expression {additive_expression MINUS multiplicative_expression} {} shift_expression {additive_expression} {} shift_expression {shift_expression LSHIFT additive_expression} {} shift_expression {shift_expression RSHIFT additive_expression} {} relational_expression {shift_expression} {} relational_expression {relational_expression LT shift_expression} {} relational_expression {relational_expression GT shift_expression} {} relational_expression {relational_expression LE shift_expression} {} relational_expression {relational_expression GE shift_expression} {} equality_expression {relational_expression} {} equality_expression {equality_expression EQ relational_expression} {} equality_expression {equality_expression NE relational_expression} {} and_expression {equality_expression} {} and_expression {and_expression ADDR_BITAND equality_expression} {} exclusive_or_expression {and_expression} {} exclusive_or_expression {exclusive_or_expression BITEOR and_expression} {} inclusive_or_expression {exclusive_or_expression} {} inclusive_or_expression {inclusive_or_expression BITOR exclusive_or_expression} {} logical_and_expression {inclusive_or_expression} {} logical_and_expression {logical_and_expression LOGAND inclusive_or_expression} {} logical_or_expression {logical_and_expression} {} logical_or_expression {logical_or_expression LOGOR logical_and_expression} {} conditional_expression {logical_or_expression} {} conditional_expression {logical_or_expression QUERY expression COLON conditional_expression} {} assignment_expression {conditional_expression} {} assignment_expression {unary_expression assignment_operator assignment_expression} {} assignment_operator {ASSIGN} {} assignment_operator {MUL_ASSIGN} {} assignment_operator {DIV_ASSIGN} {} assignment_operator {REM_ASSIGN} {} assignment_operator {PLUS_ASSIGN} {} assignment_operator {MINUS_ASSIGN} {} assignment_operator {LSHIFT_ASSIGN} {} assignment_operator {RSHIFT_ASSIGN} {} assignment_operator {BITAND_ASSIGN} {} assignment_operator {BITEOR_ASSIGN} {} assignment_operator {BITOR_ASSIGN} {} expression {assignment_expression} {} expression {expression COMMA assignment_expression} {} constant_expression {conditional_expression} {} declaration {declaration_specifiers SEMICOLON} {} declaration {declaration_specifiers init_declarator_list SEMICOLON} {} declaration_specifiers {storage_class_specifier} {} declaration_specifiers {storage_class_specifier declaration_specifiers} {} declaration_specifiers {type_specifier} {} declaration_specifiers {type_specifier declaration_specifiers} {} declaration_specifiers {type_qualifier} {} declaration_specifiers {type_qualifier declaration_specifiers} {} init_declarator_list {init_declarator} {} init_declarator_list {init_declarator_list COMMA init_declarator} {} init_declarator {declarator} {} init_declarator {declarator ASSIGN initializer} {} storage_class_specifier {TYPEDEF} {} storage_class_specifier {EXTERN} {} storage_class_specifier {STATIC} {} storage_class_specifier {AUTO} {} storage_class_specifier {REGISTER} {} type_specifier {VOID} {} type_specifier {CHAR} {} type_specifier {SHORT} {} type_specifier {INT} {} type_specifier {LONG} {} type_specifier {FLOAT} {} type_specifier {DOUBLE} {} type_specifier {SIGNED} {} type_specifier {UNSIGNED} {} type_specifier {struct_or_union_specifier} {} type_specifier {enum_specifier} {} type_specifier {TYPE_NAME} {} struct_or_union_specifier {struct_or_union IDENTIFIER LBRACE struct_declaration_list RBRACE} {} struct_or_union_specifier {struct_or_union LBRACE struct_declaration_list RBRACE} {} struct_or_union_specifier {struct_or_union IDENTIFIER} {} struct_or_union {STRUCT} {} struct_or_union {UNION} {} struct_declaration_list {struct_declaration} {} struct_declaration_list {struct_declaration_list struct_declaration} {} struct_declaration {specifier_qualifier_list struct_declarator_list SEMICOLON} {} specifier_qualifier_list {type_specifier specifier_qualifier_list} {} specifier_qualifier_list {type_specifier} {} specifier_qualifier_list {type_qualifier specifier_qualifier_list} {} specifier_qualifier_list {type_qualifier} {} struct_declarator_list {struct_declarator} {} struct_declarator_list {struct_declarator_list COMMA struct_declarator} {} struct_declarator {declarator} {} struct_declarator {COLON constant_expression} {} struct_declarator {declarator COLON constant_expression} {} enum_specifier {ENUM LBRACE enumerator_list RBRACE} {} enum_specifier {ENUM IDENTIFIER LBRACE enumerator_list RBRACE} {} enum_specifier {ENUM IDENTIFIER} {} enumerator_list {enumerator} {} enumerator_list {enumerator_list COMMA enumerator} {} enumerator {IDENTIFIER} {} enumerator {IDENTIFIER ASSIGN constant_expression} {} type_qualifier {CONST} {} type_qualifier {VOLATILE} {} declarator {pointer direct_declarator} {} declarator {direct_declarator} {} direct_declarator {IDENTIFIER} {} direct_declarator {LPAREN declarator RPAREN} {} direct_declarator {direct_declarator LBRACKET constant_expression RBRACKET} {} direct_declarator {direct_declarator LBRACKET RBRACKET} {} direct_declarator {direct_declarator LPAREN parameter_type_list RPAREN} {} direct_declarator {direct_declarator LPAREN identifier_list RPAREN} {} direct_declarator {direct_declarator LPAREN RPAREN} {} pointer {MULT_STAR} {} pointer {MULT_STAR type_qualifier_list} {} pointer {MULT_STAR pointer} {} pointer {MULT_STAR type_qualifier_list pointer} {} type_qualifier_list {type_qualifier} {} type_qualifier_list {type_qualifier_list type_qualifier} {} parameter_type_list {parameter_list} {} parameter_type_list {parameter_list COMMA ELLIPSIS} {} parameter_list {parameter_declaration} {} parameter_list {parameter_list COMMA parameter_declaration} {} parameter_declaration {declaration_specifiers declarator} {} parameter_declaration {declaration_specifiers abstract_declarator} {} parameter_declaration {declaration_specifiers} {} identifier_list {IDENTIFIER} {} identifier_list {identifier_list COMMA IDENTIFIER} {} type_name {specifier_qualifier_list} {} type_name {specifier_qualifier_list abstract_declarator} {} abstract_declarator {pointer} {} abstract_declarator {direct_abstract_declarator} {} abstract_declarator {pointer direct_abstract_declarator} {} direct_abstract_declarator {LPAREN abstract_declarator RPAREN} {} direct_abstract_declarator {LBRACKET RBRACKET} {} direct_abstract_declarator {LBRACKET constant_expression RBRACKET} {} direct_abstract_declarator {direct_abstract_declarator LBRACKET RBRACKET} {} direct_abstract_declarator {direct_abstract_declarator LBRACKET constant_expression RBRACKET} {} direct_abstract_declarator {LPAREN RPAREN} {} direct_abstract_declarator {LPAREN parameter_type_list RPAREN} {} direct_abstract_declarator {direct_abstract_declarator LPAREN RPAREN} {} direct_abstract_declarator {direct_abstract_declarator LPAREN parameter_type_list RPAREN} {} initializer {assignment_expression} {} initializer {LBRACE initializer_list RBRACE} {} initializer {LBRACE initializer_list COMMA RBRACE} {} initializer_list {initializer} {} initializer_list {initializer_list COMMA initializer} {} statement {labeled_statement} {} statement {compound_statement} {} statement {expression_statement} {} statement {selection_statement} {} statement {iteration_statement} {} statement {jump_statement} {} labeled_statement {IDENTIFIER COLON statement} {} labeled_statement {CASE constant_expression COLON statement} {} labeled_statement {DEFAULT COLON statement} {} compound_statement {LBRACE RBRACE} {} compound_statement {LBRACE statement_list RBRACE} {} compound_statement {LBRACE declaration_list RBRACE} {} compound_statement {LBRACE declaration_list statement_list RBRACE} {} declaration_list {declaration} {} declaration_list {declaration_list declaration} {} statement_list {statement} {} statement_list {statement_list statement} {} expression_statement {SEMICOLON} {} expression_statement {expression SEMICOLON} {} selection_statement {IF LPAREN expression RPAREN statement} {} selection_statement {IF LPAREN expression RPAREN statement ELSE statement} {} selection_statement {SWITCH LPAREN expression RPAREN statement} {} iteration_statement {WHILE LPAREN expression RPAREN statement} {} iteration_statement {DO statement WHILE LPAREN expression RPAREN SEMICOLON} {} iteration_statement {FOR LPAREN expression_statement expression_statement RPAREN statement} {} iteration_statement {FOR LPAREN expression_statement expression_statement expression RPAREN statement} {} jump_statement {GOTO IDENTIFIER SEMICOLON} {} jump_statement {CONTINUE SEMICOLON} {} jump_statement {BREAK SEMICOLON} {} jump_statement {RETURN SEMICOLON} {} jump_statement {RETURN expression SEMICOLON} {} translation_unit {external_declaration} {} translation_unit {translation_unit external_declaration} {} external_declaration {function_definition} {} external_declaration {declaration} {} function_definition {declaration_specifiers declarator declaration_list compound_statement} {} function_definition {declaration_specifiers declarator compound_statement} {} function_definition {declarator declaration_list compound_statement} {} function_definition {declarator compound_statement} {#ignore} } # Note the comment in the last definition: The current version of Yeti (0.4) # needs at least one rule with a non-empty script # Dump the generated parser. set f [open cparser.tcl w] puts $f [$pg dump] close $f delete object $pg # Done. exit ---- [GPS]: This is an interesting project. What's next? Is a project page available for this somewhere(SF?)? Currently I am toying a bit with [Yeti], to make it more flexible and give me more control over the generated code. From there I would add code here to generate the parse tree of the code, and of course, management of the symbol table so that lexer and parser actually work together (see notes). Beyond that ? I don't know yet. Things like conversion of the parse tree to a more abstract tree, type checking, ... Way later: code generation. SF - No, there is no project on SF about this. Well, the nearest is [David Cuthbert]'s '''kt2c'''. This translates tcl procedures into equivalent C. ---- (H. Giese, 2004-06-30) There seemed to be a gap between the lexer from [Lexing C] and the parser above: As they stand these two don't fit to each other. Below is a small adapter class, which makes them compatible and a driver to combine all the pieces. Please note, that the current version of Yeti (0.4) requires that at least one rule of the grammar has a non-empty script associated with it (but a comment suffices). '''File lexadapter.tcl''' # # An adapter to pass the output generated by clex.tcl to the (yeti based) # parser generated by cc.tcl from above. # package require Itcl foreach classDef {LexAdapter} { if { [info command $classDef] != "" } { catch {itcl::delete class $classDef} } } itcl::class LexAdapter { public variable _currIdx 0 public variable _sym {} ;# list of symbols public variable _aIdxLst {} ;# list of attribute indices public variable _aValLst {} ;# list of corresponding attribute values public variable _aIdx 0 constructor {data} { foreach {_sym attr} $data {break} foreach {_aIdxLst _aValLst} $attr {break} } # # Return the next symbol. If the symbol has a value (an attribute in the # notation of clex) return it, too. This case is signaled by clex by # appending a '-' to the symbol's name. # Note: The scanner recognises comments, but we don't want to pass those # to the parser. Hence 'next' is structured as a 'while' loop, which will # 'continue' on a comment and 'break' in all other cases. # method next {} { while 1 { if { $_currIdx >= [llength $_sym] } { set res [list] } else { set res [lindex $_sym $_currIdx] incr _currIdx if { [string index $res end] eq "-" } { # need to handle an "attribute" set attrIdx [lindex $_aIdxLst $_aIdx] incr _aIdx if { $res eq "COMMENT-" } { continue ;# parser doesn't want comments } if { $res eq "IDENT-" } { set res IDENTIFIER- ;# parser uses different name } # remove '-' set res [string range $res 0 end-1] lappend res [lindex $_aValLst $attrIdx] } } break } return $res } } The driver. For the moment it is probably easiest to have all needed files in one directory (clex.tcl from [Lexing C], lexbase.tcl from [Scripted Lexing] plus the files from this page). # # First steps parsing C # # This version uses the clex lexer and the adapter from lexadapter.tcl # package require Itcl ;# on behalf of cparser.tcl # don't have a pkgIndex.tcl for this one, but clex wants to [package require] it source lexbase.tcl # the lexer and parser used here source clex.tcl source cparser.tcl # the adapter: 'lexer output -> yeti's expectations' source lexadapter.tcl # what to parse ? if { $argv == [list] } { set fName mini.c } else { set fName [lindex $argv 0] } puts "parsing $fName" set input [read [set fh [open $fName]]][close $fh] set code [lexbase::lex $input] # # Instantiate scanner and parser # set scanner [LexAdapter #auto $code] #$scanner configure -verbose 1 set cp [CParser #auto -scanner $scanner] #$cp configure -verbose 1 $cp reset # all this effort to be able to issue _this_ command $cp parse ---- [[ [Scripted Compiler] :: '''Parsing C''' :: --> [Lexing C] ]]