Aug 17, 2002 -- [AK] | [[ [Lexing C] ]] <-- [[ [Scripted Compiler] ]]

----

Based upon [Lexing C], this page introduces the main components of a parser for the [C language]. This page is part of the [Scripted Compiler] project.

The lexer shown here is a variant of the original lexer defined in [Lexing C]. The changes are:

   * It ignores C comments.
   * Keywords and punctuation are converted into true lexer symbols.
   * All internal lexer symbols can be distinguished by their first character. Keywords and punctuation, for example, are introduced by \001.
   * The list generated by the older lexer is not used directly as the result, but processed again into a list of pairs of lexer symbols and their associated lexer data (the value of literals and constants, the name of an identifier, ...).

Beyond the lexer you will find a grammar definition for C usable by [Yeti]. This definition is based upon [http://www.vendian.org/mncharity/ccode/grammar/local/degener_symb.txt]. Running it through Yeti will generate a LALR(1) shift/reduce parser for C.

Not shown on this page:

   * The glue between the token list and the yeti-parser. This scanner has to perform one additional transformation, the conversion of identifiers for type names into the symbol TYPE_NAME. Why this has to be done is explained in the section defining the grammar. A small sketch of such a scanner accompanies the notes in '''cc.tcl''' below.
   * The code to execute when rules are reduced by the parser. This code has to maintain at least a symbol table, to support the lexer in its task, see above. Beyond that nearly anything goes: a direct translation to machine code, the generation of an exact parse tree, the generation of an abstract syntax tree, or whatever your imagination comes up with.
   * The generated parser. It comes in at 105 kilobytes of Tcl code and is thus much too large for this page. It can be generated anyway via [Yeti], so there is really no reason to place it here.

[AK] - Aug 23, 2002: More notes: The reinsertion of strings into the source, before the listification, is sub-optimal. We should keep the ''short'' ids and insert them in a postprocessing pass over the list. That way we keep the constructed objects and don't shift long character sequences around in memory, ... Guess I'll have to [time] things.

----

'''clex.tcl'''

 # -*- tcl -*-
 # Lexing C

 package provide clex 1.0

 namespace eval clex {}

 # Most tokens can be recognized and separated from the others via
 # [string map]. The remainder are identifiers. Except for strings
 # and comments. These may contain any other token, and we must not
 # act upon them. And both may contain sequences reminiscent of the
 # other. Because of that a first step is to identify and parse out
 # comments and strings, where 'parsing out' means that these tokens
 # are replaced with special char sequences which refer the lexer to a
 # small database. In the final output they are placed back.

 proc clex::SCextract {code mapvar} {
     # Extract strings and comments from the code and place them in mapvar.
     # Replace them with ids.

     upvar $mapvar map

     set tmp   ""
     set count 0

     while {1} {
         set comment [string first "/*" $code]
         set string  [string first "\"" $code]

         if {($comment < 0) && ($string < 0)} {
             # No comments or strings to extract anymore.
             break
         }

         # Debug out
         #puts "$string | $comment | [string length $code]"

         if {
             (($string >= 0) && ($comment >= 0) && ($string < $comment)) ||
             (($string >= 0) && ($comment < 0))
         } {
             # The next vari-sized thing is a "-quoted string.
             # Finding its end is a bit more difficult, because we have
             # to accept \" as one character inside of the string.
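             # Example: in {printf("a\"b");} the first quote found after
             # the opening one is the escaped \", and it is recognized as
             # such because searching for \" lands on the very same
             # character; the scan then continues and stops only on the
             # unescaped quote closing the literal.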
             set from $string
             while 1 {
                 incr from
                 set stop  [string first "\""   $code $from]
                 set stopb [string first "\\\"" $code $from]
                 incr stopb
                 if {$stop == $stopb} {set from $stopb ; incr from ; continue}
                 break
             }

             set id \000@$count\000
             incr count

             lappend map $id [set sub [string range $code $string $stop]]

             incr stop ; incr string -1

             append tmp [string range $code 0 $string] $id
             set code [string range $code $stop end]

             # Debug out
             #puts "\tSTR $id <$sub>"
             continue
         }

         if {
             (($string >= 0) && ($comment >= 0) && ($comment < $string)) ||
             (($comment >= 0) && ($string < 0))
         } {
             # The next vari-sized thing is a comment.
             # We ignore comments.

             set stop [string first "*/" $code $comment]
             incr stop 2
             incr comment -1

             append tmp [string range $code 0 $comment]
             set code [string range $code $stop end]
             continue
         }

         return -code error "Panic, string and comment at some location"
     }

     append tmp $code
     return $tmp
 }

 proc clex::DefStart {} {
     variable tokens [list]
     #puts "== <$tokens>"
     return
 }

 proc clex::Key {string {replacement {}}} {
     variable tokens

     if {$replacement == {}} {
         set replacement \000\001[string toupper $string]\000
     } else {
         set replacement \000\001$replacement\000
     }

     lappend tokens $string $replacement
     #puts "== <$tokens>"
     return
 }

 proc clex::DefEnd {} {
     variable tokens

     array set tmp $tokens
     set res [list]
     foreach key [lsort -decreasing [array names tmp]] {
         lappend res $key $tmp($key)
     }
     set tokens $res
     #puts "== <$tokens>"
     return
 }

 proc clex::lex {code} {
     variable tokens

     # Phase I   ... Extract strings and comments so that they don't
     #               interfere with the remaining phases.
     # Phase II  ... Separate all constant-sized tokens (keywords and
     #               punctuation) from each other.
     # Phase III ... Separate whitespace from the useful text.
     #               Actually converts whitespace into separator characters.
     # Phase IV  ... Reinsert extracted tokens and clean up multi-separator
     #               sequences.

     set scmap [list]

     if 0 {
         # Minimal number of commands for all phases

         regsub -all -- "\[\t\n \]+" [string map $tokens \
                 [SCextract $code scmap]] \
                 \000 tmp

         set code [split \
                 [string trim \
                 [string map "\000\000\000 \000 \000\000 \000" \
                 [string map $scmap \
                 $tmp]] \000] \000]
     }
     if 1 {
         # Each phase spelled out explicitly ...

         set code [SCextract $code scmap]                              ; # I
         set code [string map $tokens $code]                           ; # II
         regsub -all -- "\[\t\n \]+" $code \000 code                   ; # III
         set code [string map $scmap $code]                            ; # IV/a
         set code [string map "\000\000\000 \000 \000\000 \000" $code] ; # IV/b
         set code [string trim $code \000]
         set code [split $code \000]
     }

     # Run through the list and create something usable by the parser.
     #
     # A list of pairs (pairs being lists of 2 elements), where each
     # pair contains the symbol to give to the parser, and associated
     # data, if any.

     set tmp [list]
     foreach lex $code {
         switch -glob -- [string index $lex 0] {
             \001 {
                 # Keyword, no data.
                 lappend tmp [list [string range $lex 1 end] {}]
             }
             ' - [0-9] {
                 # Character or numeric constant.
                 lappend tmp [list CONSTANT $lex]
             }
             \" {
                 # String literal. Strip the double-quotes.
                 lappend tmp [list STRING_LITERAL [string range $lex 1 end-1]]
             }
             default {
                 # Identifier. This code does not distinguish
                 # identifiers and type-names yet. This is deferred to
                 # the 'scanner', i.e. the glue code feeding the lexer
                 # symbols into the parser.
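                 # An identifier like 'main', for example, simply
                 # becomes the pair {IDENTIFIER main}.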
                 lappend tmp [list IDENTIFIER $lex]
             }
         }
     }

     set code $tmp
     return $code
 }

 namespace eval clex {
     DefStart
     Key (  LPAREN      ; Key )  RPAREN    ; Key ->  DEREF
     Key <  LT          ; Key <= LE        ; Key ==  EQ
     Key >  GT          ; Key >= GE        ; Key !=  NE
     Key \[ LBRACKET    ; Key \] RBRACKET  ; Key =   ASSIGN
     Key \{ LBRACE      ; Key \} RBRACE    ; Key *=  MUL_ASSIGN
     Key .  DOT         ; Key ,  COMMA     ; Key /=  DIV_ASSIGN
     Key ++ INCR_OP     ; Key -- DECR_OP   ; Key %=  REM_ASSIGN
     Key &  ADDR_BITAND ; Key *  MULT_STAR ; Key +=  PLUS_ASSIGN
     Key +  PLUS        ; Key -  MINUS     ; Key -=  MINUS_ASSIGN
     Key ~  BITNOT      ; Key !  LOGNOT    ; Key <<= LSHIFT_ASSIGN
     Key /  DIV         ; Key %  REM       ; Key >>= RSHIFT_ASSIGN
     Key << LSHIFT      ; Key >> RSHIFT    ; Key &=  BITAND_ASSIGN
     Key ^  BITEOR      ; Key && LOGAND    ; Key ^=  BITEOR_ASSIGN
     Key |  BITOR       ; Key || LOGOR     ; Key |=  BITOR_ASSIGN
     Key ?  QUERY       ; Key :  COLON     ; Key \;  SEMICOLON
     Key ... ELLIPSIS

     Key typedef ; Key extern   ; Key static ; Key auto ; Key register
     Key void    ; Key char     ; Key short  ; Key int  ; Key long
     Key float   ; Key double   ; Key signed ; Key unsigned
     Key goto    ; Key continue ; Key break  ; Key return
     Key case    ; Key default  ; Key switch
     Key struct  ; Key union    ; Key enum
     Key while   ; Key do       ; Key for
     Key const   ; Key volatile
     Key if      ; Key else
     Key sizeof
     DefEnd
 }

----

'''cc.tcl'''

 # -*- tcl -*-
 # C parser.

 lappend auto_path [file join [file dirname [info script]]]

 package require yeti

 set pg [yeti::yeti \#auto -name CParser]

 ## Notes
 #
 ## The lexeme TYPE_NAME is, from the pure lexer's point of view, the
 ## same as IDENTIFIER, i.e. both have the same lexer definition. The
 ## information to distinguish between the two comes from the parser
 ## stage, or rather from the symbol table it manages. This means that
 ## the lexer will look up any IDENTIFIER in the symbol table and if
 ## that identifier is associated with a type it will return the
 ## symbol TYPE_NAME instead of IDENTIFIER.
 #
 ## Why this distinction?
 #
 ## Because the C grammar has an inherent ambiguity in several places
 ## where a shift/reduce parser will be unable to select the correct
 ## reduction if no distinction is made between identifiers
 ## representing type names and all other identifiers.
 #
 ## The relevant pairs of LR(0) items are listed below. No distinction
 ## is possible if only IDENTIFIER is used in the grammar.
 #
 ## type_specifier     --> TYPE_NAME .
 ## primary_expression --> IDENTIFIER .
 #
 ## type_specifier  --> TYPE_NAME .
 ## identifier_list --> IDENTIFIER .
 #
 ## type_specifier    --> TYPE_NAME .
 ## direct_declarator --> IDENTIFIER .
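 #
 ## The scanner glue itself is not shown on this page. The proc below
 ## is only a possible sketch of that glue (an assumption, not the
 ## original code): it presumes that the reduction actions maintain a
 ## global array ::ctypes whose keys are the names introduced via
 ## typedef, and that the tokens are the symbol/data pairs produced
 ## by clex::lex.

 proc scan-token {pair} {
     # Hypothetical helper: rewrite IDENTIFIER tokens that name a known
     # type into TYPE_NAME tokens; everything else passes through.
     foreach {sym data} $pair break
     if {[string equal $sym IDENTIFIER] && [info exists ::ctypes($data)]} {
         return [list TYPE_NAME $data]
     }
     return $pair
 }

 ## A driver would then do something like
 ##     foreach pair [clex::lex $source] { feed [scan-token $pair] }
 ## where 'feed' is a placeholder for handing the symbol to the
 ## generated parser.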
 $pg configure -start translation_unit -verbose 2

 $pg add {
     primary_expression {IDENTIFIER} {}
     primary_expression {CONSTANT} {}
     primary_expression {STRING_LITERAL} {}
     primary_expression {LPAREN expression RPAREN} {}
     postfix_expression {primary_expression} {}
     postfix_expression {postfix_expression LBRACKET expression RBRACKET} {}
     postfix_expression {postfix_expression LPAREN RPAREN} {}
     postfix_expression {postfix_expression LPAREN argument_expression_list RPAREN} {}
     postfix_expression {postfix_expression DOT IDENTIFIER} {}
     postfix_expression {postfix_expression DEREF IDENTIFIER} {}
     postfix_expression {postfix_expression INCR_OP} {}
     postfix_expression {postfix_expression DECR_OP} {}
     argument_expression_list {assignment_expression} {}
     argument_expression_list {argument_expression_list COMMA assignment_expression} {}
     unary_expression {postfix_expression} {}
     unary_expression {INCR_OP unary_expression} {}
     unary_expression {DECR_OP unary_expression} {}
     unary_expression {unary_operator cast_expression} {}
     unary_expression {SIZEOF unary_expression} {}
     unary_expression {SIZEOF LPAREN type_name RPAREN} {}
     unary_operator {ADDR_BITAND} {}
     unary_operator {MULT_STAR} {}
     unary_operator {PLUS} {}
     unary_operator {MINUS} {}
     unary_operator {BITNOT} {}
     unary_operator {LOGNOT} {}
     cast_expression {unary_expression} {}
     cast_expression {LPAREN type_name RPAREN cast_expression} {}
     multiplicative_expression {cast_expression} {}
     multiplicative_expression {multiplicative_expression MULT_STAR cast_expression} {}
     multiplicative_expression {multiplicative_expression DIV cast_expression} {}
     multiplicative_expression {multiplicative_expression REM cast_expression} {}
     additive_expression {multiplicative_expression} {}
     additive_expression {additive_expression PLUS multiplicative_expression} {}
     additive_expression {additive_expression MINUS multiplicative_expression} {}
     shift_expression {additive_expression} {}
     shift_expression {shift_expression LSHIFT additive_expression} {}
     shift_expression {shift_expression RSHIFT additive_expression} {}
     relational_expression {shift_expression} {}
     relational_expression {relational_expression LT shift_expression} {}
     relational_expression {relational_expression GT shift_expression} {}
     relational_expression {relational_expression LE shift_expression} {}
     relational_expression {relational_expression GE shift_expression} {}
     equality_expression {relational_expression} {}
     equality_expression {equality_expression EQ relational_expression} {}
     equality_expression {equality_expression NE relational_expression} {}
     and_expression {equality_expression} {}
     and_expression {and_expression ADDR_BITAND equality_expression} {}
     exclusive_or_expression {and_expression} {}
     exclusive_or_expression {exclusive_or_expression BITEOR and_expression} {}
     inclusive_or_expression {exclusive_or_expression} {}
     inclusive_or_expression {inclusive_or_expression BITOR exclusive_or_expression} {}
     logical_and_expression {inclusive_or_expression} {}
     logical_and_expression {logical_and_expression LOGAND inclusive_or_expression} {}
     logical_or_expression {logical_and_expression} {}
     logical_or_expression {logical_or_expression LOGOR logical_and_expression} {}
     conditional_expression {logical_or_expression} {}
     conditional_expression {logical_or_expression QUERY expression COLON conditional_expression} {}
     assignment_expression {conditional_expression} {}
     assignment_expression {unary_expression assignment_operator assignment_expression} {}
     assignment_operator {ASSIGN} {}
     assignment_operator {MUL_ASSIGN} {}
     assignment_operator {DIV_ASSIGN} {}
     assignment_operator {REM_ASSIGN} {}
     assignment_operator {PLUS_ASSIGN} {}
     assignment_operator {MINUS_ASSIGN} {}
     assignment_operator {LSHIFT_ASSIGN} {}
     assignment_operator {RSHIFT_ASSIGN} {}
     assignment_operator {BITAND_ASSIGN} {}
     assignment_operator {BITEOR_ASSIGN} {}
     assignment_operator {BITOR_ASSIGN} {}
     expression {assignment_expression} {}
     expression {expression COMMA assignment_expression} {}
     constant_expression {conditional_expression} {}
     declaration {declaration_specifiers SEMICOLON} {}
     declaration {declaration_specifiers init_declarator_list SEMICOLON} {}
     declaration_specifiers {storage_class_specifier} {}
     declaration_specifiers {storage_class_specifier declaration_specifiers} {}
     declaration_specifiers {type_specifier} {}
     declaration_specifiers {type_specifier declaration_specifiers} {}
     declaration_specifiers {type_qualifier} {}
     declaration_specifiers {type_qualifier declaration_specifiers} {}
     init_declarator_list {init_declarator} {}
     init_declarator_list {init_declarator_list COMMA init_declarator} {}
     init_declarator {declarator} {}
     init_declarator {declarator ASSIGN initializer} {}
     storage_class_specifier {TYPEDEF} {}
     storage_class_specifier {EXTERN} {}
     storage_class_specifier {STATIC} {}
     storage_class_specifier {AUTO} {}
     storage_class_specifier {REGISTER} {}
     type_specifier {VOID} {}
     type_specifier {CHAR} {}
     type_specifier {SHORT} {}
     type_specifier {INT} {}
     type_specifier {LONG} {}
     type_specifier {FLOAT} {}
     type_specifier {DOUBLE} {}
     type_specifier {SIGNED} {}
     type_specifier {UNSIGNED} {}
     type_specifier {struct_or_union_specifier} {}
     type_specifier {enum_specifier} {}
     type_specifier {TYPE_NAME} {}
     struct_or_union_specifier {struct_or_union IDENTIFIER LBRACE struct_declaration_list RBRACE} {}
     struct_or_union_specifier {struct_or_union LBRACE struct_declaration_list RBRACE} {}
     struct_or_union_specifier {struct_or_union IDENTIFIER} {}
     struct_or_union {STRUCT} {}
     struct_or_union {UNION} {}
     struct_declaration_list {struct_declaration} {}
     struct_declaration_list {struct_declaration_list struct_declaration} {}
     struct_declaration {specifier_qualifier_list struct_declarator_list SEMICOLON} {}
     specifier_qualifier_list {type_specifier specifier_qualifier_list} {}
     specifier_qualifier_list {type_specifier} {}
     specifier_qualifier_list {type_qualifier specifier_qualifier_list} {}
     specifier_qualifier_list {type_qualifier} {}
     struct_declarator_list {struct_declarator} {}
     struct_declarator_list {struct_declarator_list COMMA struct_declarator} {}
     struct_declarator {declarator} {}
     struct_declarator {COLON constant_expression} {}
     struct_declarator {declarator COLON constant_expression} {}
     enum_specifier {ENUM LBRACE enumerator_list RBRACE} {}
     enum_specifier {ENUM IDENTIFIER LBRACE enumerator_list RBRACE} {}
     enum_specifier {ENUM IDENTIFIER} {}
     enumerator_list {enumerator} {}
     enumerator_list {enumerator_list COMMA enumerator} {}
     enumerator {IDENTIFIER} {}
     enumerator {IDENTIFIER ASSIGN constant_expression} {}
     type_qualifier {CONST} {}
     type_qualifier {VOLATILE} {}
     declarator {pointer direct_declarator} {}
     declarator {direct_declarator} {}
     direct_declarator {IDENTIFIER} {}
     direct_declarator {LPAREN declarator RPAREN} {}
     direct_declarator {direct_declarator LBRACKET constant_expression RBRACKET} {}
     direct_declarator {direct_declarator LBRACKET RBRACKET} {}
     direct_declarator {direct_declarator LPAREN parameter_type_list RPAREN} {}
     direct_declarator {direct_declarator LPAREN identifier_list RPAREN} {}
     direct_declarator {direct_declarator LPAREN RPAREN} {}
     pointer {MULT_STAR} {}
     pointer {MULT_STAR type_qualifier_list} {}
     pointer {MULT_STAR pointer} {}
     pointer {MULT_STAR type_qualifier_list pointer} {}
     type_qualifier_list {type_qualifier} {}
     type_qualifier_list {type_qualifier_list type_qualifier} {}
     parameter_type_list {parameter_list} {}
     parameter_type_list {parameter_list COMMA ELLIPSIS} {}
     parameter_list {parameter_declaration} {}
     parameter_list {parameter_list COMMA parameter_declaration} {}
     parameter_declaration {declaration_specifiers declarator} {}
     parameter_declaration {declaration_specifiers abstract_declarator} {}
     parameter_declaration {declaration_specifiers} {}
     identifier_list {IDENTIFIER} {}
     identifier_list {identifier_list COMMA IDENTIFIER} {}
     type_name {specifier_qualifier_list} {}
     type_name {specifier_qualifier_list abstract_declarator} {}
     abstract_declarator {pointer} {}
     abstract_declarator {direct_abstract_declarator} {}
     abstract_declarator {pointer direct_abstract_declarator} {}
     direct_abstract_declarator {LPAREN abstract_declarator RPAREN} {}
     direct_abstract_declarator {LBRACKET RBRACKET} {}
     direct_abstract_declarator {LBRACKET constant_expression RBRACKET} {}
     direct_abstract_declarator {direct_abstract_declarator LBRACKET RBRACKET} {}
     direct_abstract_declarator {direct_abstract_declarator LBRACKET constant_expression RBRACKET} {}
     direct_abstract_declarator {LPAREN RPAREN} {}
     direct_abstract_declarator {LPAREN parameter_type_list RPAREN} {}
     direct_abstract_declarator {direct_abstract_declarator LPAREN RPAREN} {}
     direct_abstract_declarator {direct_abstract_declarator LPAREN parameter_type_list RPAREN} {}
     initializer {assignment_expression} {}
     initializer {LBRACE initializer_list RBRACE} {}
     initializer {LBRACE initializer_list COMMA RBRACE} {}
     initializer_list {initializer} {}
     initializer_list {initializer_list COMMA initializer} {}
     statement {labeled_statement} {}
     statement {compound_statement} {}
     statement {expression_statement} {}
     statement {selection_statement} {}
     statement {iteration_statement} {}
     statement {jump_statement} {}
     labeled_statement {IDENTIFIER COLON statement} {}
     labeled_statement {CASE constant_expression COLON statement} {}
     labeled_statement {DEFAULT COLON statement} {}
     compound_statement {LBRACE RBRACE} {}
     compound_statement {LBRACE statement_list RBRACE} {}
     compound_statement {LBRACE declaration_list RBRACE} {}
     compound_statement {LBRACE declaration_list statement_list RBRACE} {}
     declaration_list {declaration} {}
     declaration_list {declaration_list declaration} {}
     statement_list {statement} {}
     statement_list {statement_list statement} {}
     expression_statement {SEMICOLON} {}
     expression_statement {expression SEMICOLON} {}
     selection_statement {IF LPAREN expression RPAREN statement} {}
     selection_statement {IF LPAREN expression RPAREN statement ELSE statement} {}
     selection_statement {SWITCH LPAREN expression RPAREN statement} {}
     iteration_statement {WHILE LPAREN expression RPAREN statement} {}
     iteration_statement {DO statement WHILE LPAREN expression RPAREN SEMICOLON} {}
     iteration_statement {FOR LPAREN expression_statement expression_statement RPAREN statement} {}
     iteration_statement {FOR LPAREN expression_statement expression_statement expression RPAREN statement} {}
     jump_statement {GOTO IDENTIFIER SEMICOLON} {}
     jump_statement {CONTINUE SEMICOLON} {}
     jump_statement {BREAK SEMICOLON} {}
     jump_statement {RETURN SEMICOLON} {}
     jump_statement {RETURN expression SEMICOLON} {}
     translation_unit {external_declaration} {}
     translation_unit {translation_unit external_declaration} {}
     external_declaration {function_definition} {}
     external_declaration {declaration} {}
     function_definition {declaration_specifiers declarator declaration_list compound_statement} {}
     function_definition {declaration_specifiers declarator compound_statement} {}
     function_definition {declarator declaration_list compound_statement} {}
     function_definition {declarator compound_statement} {}
 }

 # Dump the generated parser.

 set f [open cparser.tcl w]
 puts $f [$pg dump]
 close $f

 delete object $pg

 # Done.
 exit

----

[[ [Lexing C] ]] <-- [[ [Scripted Compiler] ]]

----

[GPS]: This is an interesting project. What's next? Is a project page available for this somewhere (SF?)?