Aug 17, 2002 -- [AK] | [[ [Lexing C] ]] <-- [[ [Scripted Compiler] ]]

----

Based upon [Lexing C], this page introduces the main components of a parser for the [C language]. This page is part of the [Scripted Compiler] project.

The lexer shown here is a variant of the original lexer defined in [Lexing C]. The changes are:

   * It ignores C comments.
   * Keywords and punctuation are converted into true lexer symbols.
   * All internal lexer symbols can be distinguished by their first character. Keywords and punctuation, for example, are introduced by \001.
   * The list generated by the older lexer is not used directly as the result, but processed again into a list of pairs of lexer symbols and their associated lexer data (the value of literals and constants, the name of an identifier, ...).

Beyond the lexer you will find a grammar definition for C usable by [Yeti]. This definition is based upon [http://www.vendian.org/mncharity/ccode/grammar/local/degener_symb.txt]. Running it through Yeti will generate a LALR(1) shift/reduce parser for C.

Not shown on this page:

   * The glue between the token list and the yeti-parser. This scanner has to perform one additional transformation, the conversion of identifiers for type names into the symbol TYPE_NAME. Why this has to be done is explained in the section defining the grammar. A small sketch of such a scanner accompanies the notes in '''cc.tcl''' below.
   * The code to execute when rules are reduced by the parser. This code has to maintain at least a symbol table, to support the lexer in its task, see above. Beyond that nearly anything goes: a direct translation to machine code, the generation of an exact parse tree, the generation of an abstract syntax tree, or whatever your imagination comes up with.
   * The generated parser. It comes in at 105 kilobytes of Tcl code and is thus much too large for this page. It can be generated anyway via [Yeti], so there is really no reason to place it here.

[AK] - Aug 23, 2002: More notes: The reinsertion of strings into the source, before the listification, is sub-optimal. We should keep the ''short'' ids and insert them in a postprocessing pass over the list. That way we keep the constructed objects and don't shift long character sequences around in memory, ... Guess I'll have to [time] things.

----

'''clex.tcl'''

 # -*- tcl -*-
 # Lexing C

 package provide clex 1.0

 namespace eval clex {}

 # Most tokens can be recognized and separated from the others via
 # [string map]. The remainder are identifiers. Except for strings
 # and comments. These may contain any other token, and we must not
 # act upon them. And both may contain sequences reminiscent of the
 # other. Because of that a first step is to identify and parse out
 # comments and strings, where 'parsing out' means that these tokens
 # are replaced with special char sequences which refer the lexer to a
 # small database. In the final output they are placed back.

 proc clex::SCextract {code mapvar} {
     # Extract strings and comments from the code and place them in mapvar.
     # Replace them with ids.

     upvar $mapvar map

     set tmp   ""
     set count 0

     while {1} {
         set comment [string first "/*" $code]
         set string  [string first "\"" $code]

         if {($comment < 0) && ($string < 0)} {
             # No comments or strings to extract anymore.
             break
         }

         # Debug out
         #puts "$string | $comment | [string length $code]"

         if {
             (($string >= 0) && ($comment >= 0) && ($string < $comment)) ||
             (($string >= 0) && ($comment < 0))
         } {
             # The next vari-sized thing is a "-quoted string.
             # Finding its end is a bit more difficult, because we have
             # to accept \" as one character inside of the string.
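             # Example: in {printf("a\"b");} the first quote found after
             # the opening one is the escaped \", and it is recognized as
             # such because searching for \" lands on the very same
             # character; the scan then continues and stops only on the
             # unescaped quote closing the literal.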
             set from $string
             while 1 {
                 incr from
                 set stop  [string first "\""   $code $from]
                 set stopb [string first "\\\"" $code $from]
                 incr stopb
                 if {$stop == $stopb} {set from $stopb ; incr from ; continue}
                 break
             }

             set id \000@$count\000
             incr count

             lappend map $id [set sub [string range $code $string $stop]]

             incr stop ; incr string -1

             append tmp [string range $code 0 $string] $id
             set code [string range $code $stop end]

             # Debug out
             #puts "\tSTR $id <$sub>"
             continue
         }

         if {
             (($string >= 0) && ($comment >= 0) && ($comment < $string)) ||
             (($comment >= 0) && ($string < 0))
         } {
             # The next vari-sized thing is a comment.
             # We ignore comments.

             set stop [string first "*/" $code $comment]
             incr stop 2
             incr comment -1

             append tmp [string range $code 0 $comment]
             set code [string range $code $stop end]
             continue
         }

         return -code error "Panic, string and comment at some location"
     }

     append tmp $code
     return $tmp
 }

 proc clex::DefStart {} {
     variable tokens [list]
     #puts "== <$tokens>"
     return
 }

 proc clex::Key {string {replacement {}}} {
     variable tokens

     if {$replacement == {}} {
         set replacement \000\001[string toupper $string]\000
     } else {
         set replacement \000\001$replacement\000
     }

     lappend tokens $string $replacement
     #puts "== <$tokens>"
     return
 }

 proc clex::DefEnd {} {
     variable tokens

     array set tmp $tokens
     set res [list]
     foreach key [lsort -decreasing [array names tmp]] {
         lappend res $key $tmp($key)
     }
     set tokens $res
     #puts "== <$tokens>"
     return
 }

 proc clex::lex {code} {
     variable tokens

     # Phase I   ... Extract strings and comments so that they don't
     #               interfere with the remaining phases.
     # Phase II  ... Separate all constant-sized tokens (keywords and
     #               punctuation) from each other.
     # Phase III ... Separate whitespace from the useful text.
     #               Actually converts whitespace into separator characters.
     # Phase IV  ... Reinsert extracted tokens and clean up multi-separator
     #               sequences.

     set scmap [list]

     if 0 {
         # Minimal number of commands for all phases

         regsub -all -- "\[\t\n \]+" [string map $tokens \
                 [SCextract $code scmap]] \
                 \000 tmp

         set code [split \
                 [string trim \
                 [string map "\000\000\000 \000 \000\000 \000" \
                 [string map $scmap \
                 $tmp]] \000] \000]
     }
     if 1 {
         # Each phase spelled out explicitly ...

         set code [SCextract $code scmap]                              ; # I
         set code [string map $tokens $code]                           ; # II
         regsub -all -- "\[\t\n \]+" $code \000 code                   ; # III
         set code [string map $scmap $code]                            ; # IV/a
         set code [string map "\000\000\000 \000 \000\000 \000" $code] ; # IV/b
         set code [string trim $code \000]
         set code [split $code \000]
     }

     # Run through the list and create something usable by the parser.
     #
     # A list of pairs (pairs being lists of 2 elements), where each
     # pair contains the symbol to give to the parser, and associated
     # data, if any.

     set tmp [list]
     foreach lex $code {
         switch -glob -- [string index $lex 0] {
             \001 {
                 # Keyword, no data.
                 lappend tmp [list [string range $lex 1 end] {}]
             }
             ' - [0-9] {
                 # Character or numeric constant.
                 lappend tmp [list CONSTANT $lex]
             }
             \" {
                 # String literal. Strip the double-quotes.
                 lappend tmp [list STRING_LITERAL [string range $lex 1 end-1]]
             }
             default {
                 # Identifier. This code does not distinguish
                 # identifiers and type-names yet. This is deferred to
                 # the 'scanner', i.e. the glue code feeding the lexer
                 # symbols into the parser.
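                 # An identifier like 'main', for example, simply
                 # becomes the pair {IDENTIFIER main}.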
                 lappend tmp [list IDENTIFIER $lex]
             }
         }
     }

     set code $tmp
     return $code
 }

 namespace eval clex {
     DefStart
     Key (  LPAREN      ; Key )  RPAREN    ; Key ->  DEREF
     Key <  LT          ; Key <= LE        ; Key ==  EQ
     Key >  GT          ; Key >= GE        ; Key !=  NE
     Key \[ LBRACKET    ; Key \] RBRACKET  ; Key =   ASSIGN
     Key \{ LBRACE      ; Key \} RBRACE    ; Key *=  MUL_ASSIGN
     Key .  DOT         ; Key ,  COMMA     ; Key /=  DIV_ASSIGN
     Key ++ INCR_OP     ; Key -- DECR_OP   ; Key %=  REM_ASSIGN
     Key &  ADDR_BITAND ; Key *  MULT_STAR ; Key +=  PLUS_ASSIGN
     Key +  PLUS        ; Key -  MINUS     ; Key -=  MINUS_ASSIGN
     Key ~  BITNOT      ; Key !  LOGNOT    ; Key <<= LSHIFT_ASSIGN
     Key /  DIV         ; Key %  REM       ; Key >>= RSHIFT_ASSIGN
     Key << LSHIFT      ; Key >> RSHIFT    ; Key &=  BITAND_ASSIGN
     Key ^  BITEOR      ; Key && LOGAND    ; Key ^=  BITEOR_ASSIGN
     Key |  BITOR       ; Key || LOGOR     ; Key |=  BITOR_ASSIGN
     Key ?  QUERY       ; Key :  COLON     ; Key \;  SEMICOLON
     Key ... ELLIPSIS

     Key typedef ; Key extern   ; Key static ; Key auto ; Key register
     Key void    ; Key char     ; Key short  ; Key int  ; Key long
     Key float   ; Key double   ; Key signed ; Key unsigned
     Key goto    ; Key continue ; Key break  ; Key return
     Key case    ; Key default  ; Key switch
     Key struct  ; Key union    ; Key enum
     Key while   ; Key do       ; Key for
     Key const   ; Key volatile
     Key if      ; Key else
     Key sizeof
     DefEnd
 }

----

'''cc.tcl'''

 # -*- tcl -*-
 # C parser.

 lappend auto_path [file join [file dirname [info script]]]

 package require yeti

 set pg [yeti::yeti \#auto -name CParser]

 ## Notes
 #
 ## The lexeme TYPE_NAME is, from the pure lexer's point of view, the
 ## same as IDENTIFIER, i.e. both have the same lexer definition. The
 ## information to distinguish between the two comes from the parser
 ## stage, or rather from the symbol table it manages. This means that
 ## the lexer will look up any IDENTIFIER in the symbol table and if
 ## that identifier is associated with a type it will return the
 ## symbol TYPE_NAME instead of IDENTIFIER.
 #
 ## Why this distinction?
 #
 ## Because the C grammar has an inherent ambiguity in several places
 ## where a shift/reduce parser will be unable to select the correct
 ## reduction if no distinction is made between identifiers
 ## representing type names and all other identifiers.
 #
 ## The relevant pairs of LR(0) items are listed below. No distinction
 ## is possible if only IDENTIFIER is used in the grammar.
 #
 ## type_specifier     --> TYPE_NAME .
 ## primary_expression --> IDENTIFIER .
 #
 ## type_specifier  --> TYPE_NAME .
 ## identifier_list --> IDENTIFIER .
 #
 ## type_specifier    --> TYPE_NAME .
 ## direct_declarator --> IDENTIFIER .
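 #
 ## The scanner glue itself is not shown on this page. The proc below
 ## is only a possible sketch of that glue (an assumption, not the
 ## original code): it presumes that the reduction actions maintain a
 ## global array ::ctypes whose keys are the names introduced via
 ## typedef, and that the tokens are the symbol/data pairs produced
 ## by clex::lex.

 proc scan-token {pair} {
     # Hypothetical helper: rewrite IDENTIFIER tokens that name a known
     # type into TYPE_NAME tokens; everything else passes through.
     foreach {sym data} $pair break
     if {[string equal $sym IDENTIFIER] && [info exists ::ctypes($data)]} {
         return [list TYPE_NAME $data]
     }
     return $pair
 }

 ## A driver would then do something like
 ##     foreach pair [clex::lex $source] { feed [scan-token $pair] }
 ## where 'feed' is a placeholder for handing the symbol to the
 ## generated parser.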
 $pg configure -start translation_unit -verbose 2

 $pg add {
     primary_expression {IDENTIFIER} {}
     primary_expression {CONSTANT} {}
     primary_expression {STRING_LITERAL} {}
     primary_expression {LPAREN expression RPAREN} {}
     postfix_expression {primary_expression} {}
     postfix_expression {postfix_expression LBRACKET expression RBRACKET} {}
     postfix_expression {postfix_expression LPAREN RPAREN} {}
     postfix_expression {postfix_expression LPAREN argument_expression_list RPAREN} {}
     postfix_expression {postfix_expression DOT IDENTIFIER} {}
     postfix_expression {postfix_expression DEREF IDENTIFIER} {}
     postfix_expression {postfix_expression INCR_OP} {}
     postfix_expression {postfix_expression DECR_OP} {}
     argument_expression_list {assignment_expression} {}
     argument_expression_list {argument_expression_list COMMA assignment_expression} {}
     unary_expression {postfix_expression} {}
     unary_expression {INCR_OP unary_expression} {}
     unary_expression {DECR_OP unary_expression} {}
     unary_expression {unary_operator cast_expression} {}
     unary_expression {SIZEOF unary_expression} {}
     unary_expression {SIZEOF LPAREN type_name RPAREN} {}
     unary_operator {ADDR_BITAND} {}
     unary_operator {MULT_STAR} {}
     unary_operator {PLUS} {}
     unary_operator {MINUS} {}
     unary_operator {BITNOT} {}
     unary_operator {LOGNOT} {}
     cast_expression {unary_expression} {}
     cast_expression {LPAREN type_name RPAREN cast_expression} {}
     multiplicative_expression {cast_expression} {}
     multiplicative_expression {multiplicative_expression MULT_STAR cast_expression} {}
     multiplicative_expression {multiplicative_expression DIV cast_expression} {}
     multiplicative_expression {multiplicative_expression REM cast_expression} {}
     additive_expression {multiplicative_expression} {}
     additive_expression {additive_expression PLUS multiplicative_expression} {}
     additive_expression {additive_expression MINUS multiplicative_expression} {}
     shift_expression {additive_expression} {}
     shift_expression {shift_expression LSHIFT additive_expression} {}
     shift_expression {shift_expression RSHIFT additive_expression} {}
     relational_expression {shift_expression} {}
     relational_expression {relational_expression LT shift_expression} {}
     relational_expression {relational_expression GT shift_expression} {}
     relational_expression {relational_expression LE shift_expression} {}
     relational_expression {relational_expression GE shift_expression} {}
     equality_expression {relational_expression} {}
     equality_expression {equality_expression EQ relational_expression} {}
     equality_expression {equality_expression NE relational_expression} {}
     and_expression {equality_expression} {}
     and_expression {and_expression ADDR_BITAND equality_expression} {}
     exclusive_or_expression {and_expression} {}
     exclusive_or_expression {exclusive_or_expression BITEOR and_expression} {}
     inclusive_or_expression {exclusive_or_expression} {}
     inclusive_or_expression {inclusive_or_expression BITOR exclusive_or_expression} {}
     logical_and_expression {inclusive_or_expression} {}
     logical_and_expression {logical_and_expression LOGAND inclusive_or_expression} {}
     logical_or_expression {logical_and_expression} {}
     logical_or_expression {logical_or_expression LOGOR logical_and_expression} {}
     conditional_expression {logical_or_expression} {}
     conditional_expression {logical_or_expression QUERY expression COLON conditional_expression} {}
     assignment_expression {conditional_expression} {}
     assignment_expression {unary_expression assignment_operator assignment_expression} {}
     assignment_operator {ASSIGN} {}
     assignment_operator {MUL_ASSIGN} {}
     assignment_operator {DIV_ASSIGN} {}
     assignment_operator {REM_ASSIGN} {}
     assignment_operator {PLUS_ASSIGN} {}
     assignment_operator {MINUS_ASSIGN} {}
     assignment_operator {LSHIFT_ASSIGN} {}
     assignment_operator {RSHIFT_ASSIGN} {}
     assignment_operator {BITAND_ASSIGN} {}
     assignment_operator {BITEOR_ASSIGN} {}
     assignment_operator {BITOR_ASSIGN} {}
     expression {assignment_expression} {}
     expression {expression COMMA assignment_expression} {}
     constant_expression {conditional_expression} {}
     declaration {declaration_specifiers SEMICOLON} {}
     declaration {declaration_specifiers init_declarator_list SEMICOLON} {}
     declaration_specifiers {storage_class_specifier} {}
     declaration_specifiers {storage_class_specifier declaration_specifiers} {}
     declaration_specifiers {type_specifier} {}
     declaration_specifiers {type_specifier declaration_specifiers} {}
     declaration_specifiers {type_qualifier} {}
     declaration_specifiers {type_qualifier declaration_specifiers} {}
     init_declarator_list {init_declarator} {}
     init_declarator_list {init_declarator_list COMMA init_declarator} {}
     init_declarator {declarator} {}
     init_declarator {declarator ASSIGN initializer} {}
     storage_class_specifier {TYPEDEF} {}
     storage_class_specifier {EXTERN} {}
     storage_class_specifier {STATIC} {}
     storage_class_specifier {AUTO} {}
     storage_class_specifier {REGISTER} {}
     type_specifier {VOID} {}
     type_specifier {CHAR} {}
     type_specifier {SHORT} {}
     type_specifier {INT} {}
     type_specifier {LONG} {}
     type_specifier {FLOAT} {}
     type_specifier {DOUBLE} {}
     type_specifier {SIGNED} {}
     type_specifier {UNSIGNED} {}
     type_specifier {struct_or_union_specifier} {}
     type_specifier {enum_specifier} {}
     type_specifier {TYPE_NAME} {}
     struct_or_union_specifier {struct_or_union IDENTIFIER LBRACE struct_declaration_list RBRACE} {}
     struct_or_union_specifier {struct_or_union LBRACE struct_declaration_list RBRACE} {}
     struct_or_union_specifier {struct_or_union IDENTIFIER} {}
     struct_or_union {STRUCT} {}
     struct_or_union {UNION} {}
     struct_declaration_list {struct_declaration} {}
     struct_declaration_list {struct_declaration_list struct_declaration} {}
     struct_declaration {specifier_qualifier_list struct_declarator_list SEMICOLON} {}
     specifier_qualifier_list {type_specifier specifier_qualifier_list} {}
     specifier_qualifier_list {type_specifier} {}
     specifier_qualifier_list {type_qualifier specifier_qualifier_list} {}
     specifier_qualifier_list {type_qualifier} {}
     struct_declarator_list {struct_declarator} {}
     struct_declarator_list {struct_declarator_list COMMA struct_declarator} {}
     struct_declarator {declarator} {}
     struct_declarator {COLON constant_expression} {}
     struct_declarator {declarator COLON constant_expression} {}
     enum_specifier {ENUM LBRACE enumerator_list RBRACE} {}
     enum_specifier {ENUM IDENTIFIER LBRACE enumerator_list RBRACE} {}
     enum_specifier {ENUM IDENTIFIER} {}
     enumerator_list {enumerator} {}
     enumerator_list {enumerator_list COMMA enumerator} {}
     enumerator {IDENTIFIER} {}
     enumerator {IDENTIFIER ASSIGN constant_expression} {}
     type_qualifier {CONST} {}
     type_qualifier {VOLATILE} {}
     declarator {pointer direct_declarator} {}
     declarator {direct_declarator} {}
     direct_declarator {IDENTIFIER} {}
     direct_declarator {LPAREN declarator RPAREN} {}
     direct_declarator {direct_declarator LBRACKET constant_expression RBRACKET} {}
     direct_declarator {direct_declarator LBRACKET RBRACKET} {}
     direct_declarator {direct_declarator LPAREN parameter_type_list RPAREN} {}
     direct_declarator {direct_declarator LPAREN identifier_list RPAREN} {}
     direct_declarator {direct_declarator LPAREN RPAREN} {}
     pointer {MULT_STAR} {}
     pointer {MULT_STAR type_qualifier_list} {}
     pointer {MULT_STAR pointer} {}
     pointer {MULT_STAR type_qualifier_list pointer} {}
     type_qualifier_list {type_qualifier} {}
     type_qualifier_list {type_qualifier_list type_qualifier} {}
     parameter_type_list {parameter_list} {}
     parameter_type_list {parameter_list COMMA ELLIPSIS} {}
     parameter_list {parameter_declaration} {}
     parameter_list {parameter_list COMMA parameter_declaration} {}
     parameter_declaration {declaration_specifiers declarator} {}
     parameter_declaration {declaration_specifiers abstract_declarator} {}
     parameter_declaration {declaration_specifiers} {}
     identifier_list {IDENTIFIER} {}
     identifier_list {identifier_list COMMA IDENTIFIER} {}
     type_name {specifier_qualifier_list} {}
     type_name {specifier_qualifier_list abstract_declarator} {}
     abstract_declarator {pointer} {}
     abstract_declarator {direct_abstract_declarator} {}
     abstract_declarator {pointer direct_abstract_declarator} {}
     direct_abstract_declarator {LPAREN abstract_declarator RPAREN} {}
     direct_abstract_declarator {LBRACKET RBRACKET} {}
     direct_abstract_declarator {LBRACKET constant_expression RBRACKET} {}
     direct_abstract_declarator {direct_abstract_declarator LBRACKET RBRACKET} {}
     direct_abstract_declarator {direct_abstract_declarator LBRACKET constant_expression RBRACKET} {}
     direct_abstract_declarator {LPAREN RPAREN} {}
     direct_abstract_declarator {LPAREN parameter_type_list RPAREN} {}
     direct_abstract_declarator {direct_abstract_declarator LPAREN RPAREN} {}
     direct_abstract_declarator {direct_abstract_declarator LPAREN parameter_type_list RPAREN} {}
     initializer {assignment_expression} {}
     initializer {LBRACE initializer_list RBRACE} {}
     initializer {LBRACE initializer_list COMMA RBRACE} {}
     initializer_list {initializer} {}
     initializer_list {initializer_list COMMA initializer} {}
     statement {labeled_statement} {}
     statement {compound_statement} {}
     statement {expression_statement} {}
     statement {selection_statement} {}
     statement {iteration_statement} {}
     statement {jump_statement} {}
     labeled_statement {IDENTIFIER COLON statement} {}
     labeled_statement {CASE constant_expression COLON statement} {}
     labeled_statement {DEFAULT COLON statement} {}
     compound_statement {LBRACE RBRACE} {}
     compound_statement {LBRACE statement_list RBRACE} {}
     compound_statement {LBRACE declaration_list RBRACE} {}
     compound_statement {LBRACE declaration_list statement_list RBRACE} {}
     declaration_list {declaration} {}
     declaration_list {declaration_list declaration} {}
     statement_list {statement} {}
     statement_list {statement_list statement} {}
     expression_statement {SEMICOLON} {}
     expression_statement {expression SEMICOLON} {}
     selection_statement {IF LPAREN expression RPAREN statement} {}
     selection_statement {IF LPAREN expression RPAREN statement ELSE statement} {}
     selection_statement {SWITCH LPAREN expression RPAREN statement} {}
     iteration_statement {WHILE LPAREN expression RPAREN statement} {}
     iteration_statement {DO statement WHILE LPAREN expression RPAREN SEMICOLON} {}
     iteration_statement {FOR LPAREN expression_statement expression_statement RPAREN statement} {}
     iteration_statement {FOR LPAREN expression_statement expression_statement expression RPAREN statement} {}
     jump_statement {GOTO IDENTIFIER SEMICOLON} {}
     jump_statement {CONTINUE SEMICOLON} {}
     jump_statement {BREAK SEMICOLON} {}
     jump_statement {RETURN SEMICOLON} {}
     jump_statement {RETURN expression SEMICOLON} {}
     translation_unit {external_declaration} {}
     translation_unit {translation_unit external_declaration} {}
     external_declaration {function_definition} {}
     external_declaration {declaration} {}
     function_definition {declaration_specifiers declarator declaration_list compound_statement} {}
     function_definition {declaration_specifiers declarator compound_statement} {}
     function_definition {declarator declaration_list compound_statement} {}
     function_definition {declarator compound_statement} {}
 }

 # Dump the generated parser.

 set f [open cparser.tcl w]
 puts $f [$pg dump]
 close $f

 delete object $pg

 # Done.
 exit

----

[[ [Lexing C] ]] <-- [[ [Scripted Compiler] ]]

----

[GPS]: This is an interesting project. What's next? Is a project page available for this somewhere (SF?)?