Version 1 of Parsing Excel XLSX

Updated 2011-02-15 02:49:55 by jbr

jbr - 2011/02/14 (Happy Valentines Day)

Here is some code I've written to parse and evaluate formula in an Excel 2008 xml format spreadsheet. I generate some very complex reports in Excel format using perl (XLSX:Excel:Writer Using Perl to get Excel), so my xlsx output files do not have the values precomputed for each cell the way excel writes its output by default. To allow testing and verification of values that I'm pushing out to the client I need to be able to evaluate the formula that I've written. This is a rather long story but it starts with an example of using the Tcl Parser Tools from tcllib:

xls-parser.peg:


 PEG xlsexpr (Formula)
        Formula    <- Expr EOF ;

        MulOp   <- '<' / '>' / '=' / '*' / '/'          ;
        Expo    <- Value ('^' Value)?                   ;
        Prod    <- Expo WS (MulOp WS Expo)*                     ;
        AddOp   <- '^' / '+' / '-'                      ;
        Sum     <- Prod WS (AddOp WS Prod)*                     ;
        Expr    <- Sum                                  ;

 void:   WS      <- <space>*                             ;

        Value       <- UnOp? ( '(' Expr ')' / String / Func / Range / Cell / Number )   ;

        UnOp        <- '-' / '+'                        ;


        Func    <- FunName '(' WS FunArgs WS ')'                ;
        FunName <- <alpha><alnum>*                      ;
        FunArgs <- Expr WS (',' WS Expr)*                       ;

        Range      <- Sheet? RowCol ':' RowCol          ;
        Cell       <- Sheet? Cell_                      ;
        Sheet      <- "'"? Sheet_ "'"? '!'              ;
        Sheet_     <- [-A-Za-z0-9& ]+                   ;
        Cell_      <- <alpha>+<ddigit>+                 ;
        RowCol     <- Col_ Row_                         ;
        Col_       <- <alpha>+                          ;
        Row_       <- <ddigit>+                         ;

        String     <- '"' [A-Za-z0-9]* '"'                      ;

 leaf:   Number  <- Sign? ( <ddigit>+ Frac? ) / Frac?    ;
        Sign    <- '-' / '+'                            ;
        Frac    <- '.' <ddigit>*                        ;

        EOF        <- !.                                ;
 END;

Here is a Makefile to generate a parser for Excel expressions. I wanted parser tool to generate a parser for TclOO but the evaluator for oo:: failed so I ended up using snit. The sed command fixes up a little bug in the parser tool output:

Makefile


 PT=pt

 parser : xls-parser.peg
        $(PT) generate snit -class xls-parser  -name xls-parser xls-parser.tcl peg xls-parser.peg
        sed -e s/PACKAGE/xls-parser/ < xls-parser.tcl > tmp
        mv tmp xls-parser.tcl

This is the top level driver program. The interesting command here is "compare". It loads the same spreadsheet twice, clearing the default cell values from one copy and then comparing the values (left from excel) with the tcl evaluated values of the cells in the second (cleared) copy.

xlsx


 #!/bin/env tclkit8.6
 #

 source  xml.tcl   ; # this is tax
 source xlsx.tcl

 package require vfs
 package require vfs::zip

 set argv [lassign $argv op]

 switch $op {
 form {
    lassign $argv file sheet cell

    workbook create wb $file
    puts [[wb name2obj $sheet] form $cell]
 }
 cell {
    lassign $argv file sheet cell

    workbook create wb $file
    puts [[wb name2obj $sheet] cell $cell]
 }
 cell+ {
    lassign $argv file sheet cell

    workbook create wb $file
    wb clear
    puts [[wb name2obj $sheet] cell? $cell]
 }
 cells {        workbook create wb [lindex $argv 0]
                foreach name [wb sheets] {
                    set sheet [wb name2obj $name]
                    puts [list $sheet [$sheet cells]]
                }
 }

 compare {
    lassign $argv file1 file2

    workbook create wb1 $file1
    workbook create wb2 $file2

    wb2 clear

    foreach name1 [wb1 sheets] name2 [wb2 sheets] {
        if { $name1 ne $name2 } {
            puts "Sheet names don't match $name1, $name2"
        }

        set sh1 [wb1 name2obj $name1]
        set sh2 [wb2 name2obj $name2]

        foreach cell [$sh1 cells] {
            #puts "$name1 $cell"
            if { [set v1 [$sh1 cell $cell]] != [set v2 [$sh2 cell $cell]] } {
                puts "$name1 : $cell $v1 != $v2"
            }
        }
    }
 }
 }

Here is the meat of the code. Included are methods to open the xlsx file and parse the cell values, formula and formats from the xml into instance variable arrays in the xlsx object. The "=" method evaluates the value of the cell including following a formula right through all the references in the spreadsheet. Each formula value is cached and evaluated only once. Formula are evaluated by parsing them into AST format with parser tools and then executing the AST as a script. The result of the script is an expression suitable for expr, which is then called to obtain the cell's value. Just enough stuff is implemented here to support the syntax and functions in my spreadsheets, but extending this should be straight forward. I have evaluated workbooks with multiple worksheets, complex formula across thousands of cells with perfect agreement to the values that excel computes itself.

The parser is fed using tcl::chan::string which needs to be patched. The Allowance method is broken so I just added a "return" at the start of virtchannel_core/events.tcl:Allowance to disable any checking. This checking seemed like overkill to me anyway.

Its almost 500 lines so I'll just reference it here: http://rkroll.com/tclwiki/xlsx.tcl

Enjoy