Aug 17, 2002 -- AK | [ Lexing C ] <-- [ Scripted Compiler ]
Based upon Lexing C this page introduces the main components of a parser for the C language. This page is part of the Scripted Compiler project.
The lexer shown here is a variant of the original lexer defined in Lexing C. The changes are:
Beyond the lexer you will find a grammar definition for C useable by Yeti. This definition is based upon [L1 ]. Running it through Yeti will generate a LALR(1) shift/reduce parser for C.
Not shown on this page:
AK - Aug 23, 2002: More notes: The reinsertion of strings into the source, before the listification, is sub-optimal. We should keep the short id;s and insert them in a postprocessing pass of the list. Thay we keep the constructed objects, don't shift long character sequences around the memory, ... Guess I'll have to time things.
clex.tcl
# -*- tcl -*- # Lexing C package provide clex 1.0 namespace eval clex {} # Most tokens can be recognized and separated from the others via # [string map]. The remainder are identifiers. Except for strings, # and comments. These may contain any other token, and we may not act # upon them. And both may contain sequences reminiscent of the # other. Because of that a first step is to identify and parse out # comments and strings, where 'parsing out' means that these tokens # are replaced with special char sequences which refer the lexer to a # small database. In the final output they are placed back. proc clex::SCextract {code mapvar} { # Extract strings and comments from the code and place them in mapvar. # Replace them with ids upvar $mapvar map set tmp "" set count 0 while {1} { set comment [string first "/*" $code] set string [string first "\"" $code] if {($comment < 0) && ($string < 0)} { # No comments nor strings to extract anymore. break } # Debug out #puts "$string | $comment | [string length $code]" if { (($string >= 0) && ($comment >= 0) && ($string < $comment)) || (($string >= 0) && ($comment < 0)) } { # The next vari-sized thing is a "-quoted string. # Finding its end is bit more difficult, because we have # to accept \" as one character inside of the string. set from $string while 1 { incr from set stop [string first "\"" $code $from] set stopb [string first "\\\"" $code $from] incr stopb if {$stop == $stopb} {set from $stopb ; incr from ; continue} break } set id \000@$count\000 incr count lappend map $id [set sub [string range $code $string $stop]] incr stop ; incr string -1 append tmp [string range $code 0 $string] $id set code [string range $code $stop end] # Debug out #puts "\tSTR $id <$sub>" continue } if { (($string >= 0) && ($comment >= 0) && ($comment < $string)) || (($comment >= 0) && ($string < 0)) } { # The next vari-sized thing is a comment. # We ignore comments. set stop [string first "*/" $code $comment] incr stop 2 incr comment -1 append tmp [string range $code 0 $comment] set code [string range $code $stop end] continue } return -code error "Panic, string and comment at some location" } append tmp $code return $tmp } proc clex::DefStart {} { variable tokens [list] #puts "== <$tokens>" return } proc clex::Key {string {replacement {}}} { variable tokens if {$replacement == {}} { set replacement \000\001[string toupper $string]\000 } else { set replacement \000\001$replacement\000 } lappend tokens $string $replacement #puts "== <$tokens>" return } proc clex::DefEnd {} { variable tokens array set tmp $tokens set res [list] foreach key [lsort -decreasing [array names tmp]] { lappend res $key $tmp($key) } set tokens $res #puts "== <$tokens>" return } proc clex::lex {code} { variable tokens # Phase I ... Extract strings and comments so that they don't interfere # with the remaining phases. # Phase II ... Separate all constant-sized tokens (keywords and # punctuation) from each other. # Phase III ... Separate whitespace from the useful text. # Actually converts whitespace into separator characters. # Phase IV ... Reinsert extracted tokens and cleanup multi-separator sequences set scmap [list] if 0 { # Minimal number of commands for all phases regsub -all -- "\[\t\n \]+" [string map $tokens \ [SCextract $code scmap]] \ \000 tmp set code [split \ [string trim \ [string map "\000\000\000 \000 \000\000 \000" \ [string map $scmap \ $tmp]] \000] \000] } if 1 { # Each phase spelled out explicitly ... set code [SCextract $code scmap] ; # I set code [string map $tokens $code] ; # II regsub -all -- "\[\t\n \]+" $code \000 code ; # III set code [string map $scmap $code] ; # IV/a set code [string map "\000\000\000 \000 \000\000 \000" $code] ; # IV/b set code [string trim $code \000] set code [split $code \000] } # Run through the list and create something useable by the parser. # # A list of pairs (pairs being lists of 2 elements), where each # pair contains the symbol to give to the parser, and associated # data, if any. set tmp [list] foreach lex $code { switch -glob -- [string index $lex 0] { \001 { # Keyword, no data. lappend tmp [list [string range $lex 1 end] {}] } ' - [0-9] { # Character or numeric constant. lappend tmp [list CONSTANT $lex] } \" { # String literal. Strip the double-quotes. lappend tmp [list STRING_LITERAL [string range $lex 1 end-1]] } default { # Identifier. This code does not distinguish # identifiers and type-names yet. This is defered to # the 'scanner', i.e. the glue code feeding the lexer # symbols into the parser. lappend tmp [list IDENTIFIER $lex] } } } set code $tmp return $code } namespace eval clex { DefStart Key ( LPAREN ; Key ) RPAREN ; Key -> DEREF Key < LT ; Key <= LE ; Key == EQ Key > GT ; Key >= GE ; Key != NE Key \[ LBRACKET ; Key \] RBRACKET ; Key = ASSIGN Key \{ LBRACE ; Key \} RBRACE ; Key *= MUL_ASSIGN Key . DOT ; Key , COMMA ; Key /= DIV_ASSIGN Key ++ INCR_OP ; Key -- DECR_OP ; Key %= REM_ASSIGN Key & ADDR_BITAND ; Key * MULT_STAR ; Key += PLUS_ASSIGN Key + PLUS ; Key - MINUS ; Key -= MINUS_ASSIGN Key ~ BITNOT ; Key ! LOGNOT ; Key <<= LSHIFT_ASSIGN Key / DIV ; Key % REM ; Key >>= RSHIFT_ASSIGN Key << LSHIFT ; Key >> RSHIFT ; Key &= BITAND_ASSIGN Key ^ BITEOR ; Key && LOGAND ; Key ^= BITEOR_ASSIGN Key | BITOR ; Key || LOGOR ; Key |= BITOR_ASSIGN Key ? QUERY ; Key : COLON ; Key \; SEMICOLON Key ... ELLIPSIS Key typedef ; Key extern ; Key static ; Key auto ; Key register Key void ; Key char ; Key short ; Key int ; Key long Key float ; Key double ; Key signed ; Key unsigned Key goto ; Key continue ; Key break ; Key return Key case ; Key default ; Key switch Key struct ; Key union ; Key enum Key while ; Key do ; Key for Key const ; Key volatile Key if ; Key else Key sizeof DefEnd }
cc.tcl
# -*- tcl -*- # C parser. lappend auto_path [file join [file dirname [info script]]] package require yeti set pg [yeti::yeti \#auto -name CParser] ## Notes # ## The lexem TYPE_NAME is from the pure lexers point of view the same ## as IDENTIFIER, i.e. both have the same lexer definition. The ## information to distinguish between both of them comes from the ## parser stage, or rather from the symbol table it manages. This ## means that the lexer will lookup any IDENTIFIER in the symbol table ## and if that identifier is associated with a type it will return the ## symbol TYPE_NAME instead of IDENTIFIER. # ## Why this distinction ? # ## Because the C grammar has an inherent ambiguity in several places ## where a shift/reduce parser will be unable to select the correct ## reduction if no distinction is made between identifiers ## representing type names and all other identifiers. # ## The relevant pairs of LR(0) items are listed below. No distinction ## is possible if only IDENTIFIER is used in the grammar. # ## type_specifier --> TYPE_NAME . ## primary_expression --> IDENTIFIER . # ## type_specifier --> TYPE_NAME . ## identifier_list --> IDENTIFIER . # ## type_specifier --> TYPE_NAME . ## direct_declarator --> IDENTIFIER . $pg configure -start translation_unit -verbose 2 $pg add { primary_expression {IDENTIFIER} {} primary_expression {CONSTANT} {} primary_expression {STRING_LITERAL} {} primary_expression {LPAREN expression RPAREN} {} postfix_expression {primary_expression} {} postfix_expression {postfix_expression LBRACKET expression RBRACKET} {} postfix_expression {postfix_expression LPAREN RPAREN} {} postfix_expression {postfix_expression LPAREN argument_expression_list RPAREN} {} postfix_expression {postfix_expression DOT IDENTIFIER} {} postfix_expression {postfix_expression DEREF IDENTIFIER} {} postfix_expression {postfix_expression INCR_OP} {} postfix_expression {postfix_expression DECR_OP} {} argument_expression_list {assignment_expression} {} argument_expression_list {argument_expression_list COMMA assignment_expression} {} unary_expression {postfix_expression} {} unary_expression {INCR_OP unary_expression} {} unary_expression {DECR_OP unary_expression} {} unary_expression {unary_operator cast_expression} {} unary_expression {SIZEOF unary_expression} {} unary_expression {SIZEOF LPAREN type_name RPAREN} {} unary_operator {ADDR_BITAND} {} unary_operator {STAR} {} unary_operator {PLUS} {} unary_operator {MINUS} {} unary_operator {BITNOT} {} unary_operator {LOGNOT} {} cast_expression {unary_expression} {} cast_expression {LPAREN type_name RPAREN cast_expression} {} multiplicative_expression {cast_expression} {} multiplicative_expression {multiplicative_expression MULT_STAR cast_expression} {} multiplicative_expression {multiplicative_expression DIV cast_expression} {} multiplicative_expression {multiplicative_expression REM cast_expression} {} additive_expression {multiplicative_expression} {} additive_expression {additive_expression PLUS multiplicative_expression} {} additive_expression {additive_expression MINUS multiplicative_expression} {} shift_expression {additive_expression} {} shift_expression {shift_expression LSHIFT additive_expression} {} shift_expression {shift_expression RSHIFT additive_expression} {} relational_expression {shift_expression} {} relational_expression {relational_expression LT shift_expression} {} relational_expression {relational_expression GT shift_expression} {} relational_expression {relational_expression LE shift_expression} {} relational_expression {relational_expression GE shift_expression} {} equality_expression {relational_expression} {} equality_expression {equality_expression EQ relational_expression} {} equality_expression {equality_expression NE relational_expression} {} and_expression {equality_expression} {} and_expression {and_expression ADDR_BITAND equality_expression} {} exclusive_or_expression {and_expression} {} exclusive_or_expression {exclusive_or_expression BITEOR and_expression} {} inclusive_or_expression {exclusive_or_expression} {} inclusive_or_expression {inclusive_or_expression BITOR exclusive_or_expression} {} logical_and_expression {inclusive_or_expression} {} logical_and_expression {logical_and_expression LOGAND inclusive_or_expression} {} logical_or_expression {logical_and_expression} {} logical_or_expression {logical_or_expression LOGOR logical_and_expression} {} conditional_expression {logical_or_expression} {} conditional_expression {logical_or_expression QUERY expression COLON conditional_expression} {} assignment_expression {conditional_expression} {} assignment_expression {unary_expression assignment_operator assignment_expression} {} assignment_operator {ASSIGN} {} assignment_operator {MUL_ASSIGN} {} assignment_operator {DIV_ASSIGN} {} assignment_operator {REM_ASSIGN} {} assignment_operator {PLUS_ASSIGN} {} assignment_operator {MINUS_ASSIGN} {} assignment_operator {LSHIFT_ASSIGN} {} assignment_operator {RSHIFT_ASSIGN} {} assignment_operator {BITAND_ASSIGN} {} assignment_operator {BITEOR_ASSIGN} {} assignment_operator {BITOR_ASSIGN} {} expression {assignment_expression} {} expression {expression COMMA assignment_expression} {} constant_expression {conditional_expression} {} declaration {declaration_specifiers SEMICOLON} {} declaration {declaration_specifiers init_declarator_list SEMICOLON} {} declaration_specifiers {storage_class_specifier} {} declaration_specifiers {storage_class_specifier declaration_specifiers} {} declaration_specifiers {type_specifier} {} declaration_specifiers {type_specifier declaration_specifiers} {} declaration_specifiers {type_qualifier} {} declaration_specifiers {type_qualifier declaration_specifiers} {} init_declarator_list {init_declarator} {} init_declarator_list {init_declarator_list COMMA init_declarator} {} init_declarator {declarator} {} init_declarator {declarator ASSIGN initializer} {} storage_class_specifier {TYPEDEF} {} storage_class_specifier {EXTERN} {} storage_class_specifier {STATIC} {} storage_class_specifier {AUTO} {} storage_class_specifier {REGISTER} {} type_specifier {VOID} {} type_specifier {CHAR} {} type_specifier {SHORT} {} type_specifier {INT} {} type_specifier {LONG} {} type_specifier {FLOAT} {} type_specifier {DOUBLE} {} type_specifier {SIGNED} {} type_specifier {UNSIGNED} {} type_specifier {struct_or_union_specifier} {} type_specifier {enum_specifier} {} type_specifier {TYPE_NAME} {} struct_or_union_specifier {struct_or_union IDENTIFIER LBRACE struct_declaration_list RBRACE} {} struct_or_union_specifier {struct_or_union LBRACE struct_declaration_list RBRACE} {} struct_or_union_specifier {struct_or_union IDENTIFIER} {} struct_or_union {STRUCT} {} struct_or_union {UNION} {} struct_declaration_list {struct_declaration} {} struct_declaration_list {struct_declaration_list struct_declaration} {} struct_declaration {specifier_qualifier_list struct_declarator_list SEMICOLON} {} specifier_qualifier_list {type_specifier specifier_qualifier_list} {} specifier_qualifier_list {type_specifier} {} specifier_qualifier_list {type_qualifier specifier_qualifier_list} {} specifier_qualifier_list {type_qualifier} {} struct_declarator_list {struct_declarator} {} struct_declarator_list {struct_declarator_list COMMA struct_declarator} {} struct_declarator {declarator} {} struct_declarator {COLON constant_expression} {} struct_declarator {declarator COLON constant_expression} {} enum_specifier {ENUM LBRACE enumerator_list RBRACE} {} enum_specifier {ENUM IDENTIFIER LBRACE enumerator_list RBRACE} {} enum_specifier {ENUM IDENTIFIER} {} enumerator_list {enumerator} {} enumerator_list {enumerator_list COMMA enumerator} {} enumerator {IDENTIFIER} {} enumerator {IDENTIFIER ASSIGN constant_expression} {} type_qualifier {CONST} {} type_qualifier {VOLATILE} {} declarator {pointer direct_declarator} {} declarator {direct_declarator} {} direct_declarator {IDENTIFIER} {} direct_declarator {LPAREN declarator RPAREN} {} direct_declarator {direct_declarator LBRACKET constant_expression RBRACKET} {} direct_declarator {direct_declarator LBRACKET RBRACKET} {} direct_declarator {direct_declarator LPAREN parameter_type_list RPAREN} {} direct_declarator {direct_declarator LPAREN identifier_list RPAREN} {} direct_declarator {direct_declarator LPAREN RPAREN} {} pointer {MULT_STAR} {} pointer {MULT_STAR type_qualifier_list} {} pointer {MULT_STAR pointer} {} pointer {MULT_STAR type_qualifier_list pointer} {} type_qualifier_list {type_qualifier} {} type_qualifier_list {type_qualifier_list type_qualifier} {} parameter_type_list {parameter_list} {} parameter_type_list {parameter_list COMMA ELLIPSIS} {} parameter_list {parameter_declaration} {} parameter_list {parameter_list COMMA parameter_declaration} {} parameter_declaration {declaration_specifiers declarator} {} parameter_declaration {declaration_specifiers abstract_declarator} {} parameter_declaration {declaration_specifiers} {} identifier_list {IDENTIFIER} {} identifier_list {identifier_list COMMA IDENTIFIER} {} type_name {specifier_qualifier_list} {} type_name {specifier_qualifier_list abstract_declarator} {} abstract_declarator {pointer} {} abstract_declarator {direct_abstract_declarator} {} abstract_declarator {pointer direct_abstract_declarator} {} direct_abstract_declarator {LPAREN abstract_declarator RPAREN} {} direct_abstract_declarator {LBRACKET RBRACKET} {} direct_abstract_declarator {LBRACKET constant_expression RBRACKET} {} direct_abstract_declarator {direct_abstract_declarator LBRACKET RBRACKET} {} direct_abstract_declarator {direct_abstract_declarator LBRACKET constant_expression RBRACKET} {} direct_abstract_declarator {LPAREN RPAREN} {} direct_abstract_declarator {LPAREN parameter_type_list RPAREN} {} direct_abstract_declarator {direct_abstract_declarator LPAREN RPAREN} {} direct_abstract_declarator {direct_abstract_declarator LPAREN parameter_type_list RPAREN} {} initializer {assignment_expression} {} initializer {LBRACE initializer_list RBRACE} {} initializer {LBRACE initializer_list COMMA RBRACE} {} initializer_list {initializer} {} initializer_list {initializer_list COMMA initializer} {} statement {labeled_statement} {} statement {compound_statement} {} statement {expression_statement} {} statement {selection_statement} {} statement {iteration_statement} {} statement {jump_statement} {} labeled_statement {IDENTIFIER COLON statement} {} labeled_statement {CASE constant_expression COLON statement} {} labeled_statement {DEFAULT COLON statement} {} compound_statement {LBRACE RBRACE} {} compound_statement {LBRACE statement_list RBRACE} {} compound_statement {LBRACE declaration_list RBRACE} {} compound_statement {LBRACE declaration_list statement_list RBRACE} {} declaration_list {declaration} {} declaration_list {declaration_list declaration} {} statement_list {statement} {} statement_list {statement_list statement} {} expression_statement {SEMICOLON} {} expression_statement {expression SEMICOLON} {} selection_statement {IF LPAREN expression RPAREN statement} {} selection_statement {IF LPAREN expression RPAREN statement ELSE statement} {} selection_statement {SWITCH LPAREN expression RPAREN statement} {} iteration_statement {WHILE LPAREN expression RPAREN statement} {} iteration_statement {DO statement WHILE LPAREN expression RPAREN SEMICOLON} {} iteration_statement {FOR LPAREN expression_statement expression_statement RPAREN statement} {} iteration_statement {FOR LPAREN expression_statement expression_statement expression RPAREN statement} {} jump_statement {GOTO IDENTIFIER SEMICOLON} {} jump_statement {CONTINUE SEMICOLON} {} jump_statement {BREAK SEMICOLON} {} jump_statement {RETURN SEMICOLON} {} jump_statement {RETURN expression SEMICOLON} {} translation_unit {external_declaration} {} translation_unit {translation_unit external_declaration} {} external_declaration {function_definition} {} external_declaration {declaration} {} function_definition {declaration_specifiers declarator declaration_list compound_statement} {} function_definition {declaration_specifiers declarator compound_statement} {} function_definition {declarator declaration_list compound_statement} {} function_definition {declarator compound_statement} {} } # Dump the generated parser. set f [open cparser.tcl w] puts $f [$pg dump] close $f delete object $pg # Done. exit
[ Lexing C ] <-- [ Scripted Compiler ]
GPS: This is an interesting project. What's next? Is a project page available for this somewhere(SF?)?
Currently I am toying a bit with Yeti, to make it more flexible and give me more control over the generated code. From there I would add code here to generate the parse tree of the code, and of course, management of the symbol table so that lexer and parser actually work together (see notes).
Beyond that ? I don't know yet. Things like conversion of the parse tree to a more abstract tree, type checking, ...
Way later: code generation.
SF - No, there is no project on SF about this. Well, the nearest is David Cuthbert's kt2c. This translates tcl procedures into equivalent C.