Parsing Excel XLSX

jbr - 2011/02/14 (Happy Valentines Day)

Here is some code I've written to parse and evaluate formula in an Excel 2008 xml format spreadsheet. I generate some very complex reports in Excel format using perl (XLSX:Excel:Writer Using Perl to get Excel), so my xlsx output files do not have the values precomputed for each cell the way excel writes its output by default. To allow testing and verification of values that I'm pushing out to the client I need to be able to evaluate the formula that I've written. This is a rather long story but it starts with an example of using the Tcl Parser Tools from tcllib 1.13:

xls-parser.peg:


 PEG xlsexpr (Formula)
        Formula    <- Expr EOF ;

        MulOp   <- '<' / '>' / '=' / '*' / '/'          ;
        Expo    <- Value ('^' Value)?                   ;
        Prod    <- Expo WS (MulOp WS Expo)*                     ;
        AddOp   <- '^' / '+' / '-'                      ;
        Sum     <- Prod WS (AddOp WS Prod)*                     ;
        Expr    <- Sum                                  ;

 void:   WS      <- <space>*                             ;

        Value       <- UnOp? ( '(' Expr ')' / String / Func / Range / Cell / Number )   ;

        UnOp        <- '-' / '+'                        ;


        Func    <- FunName '(' WS FunArgs WS ')'                ;
        FunName <- <alpha><alnum>*                      ;
        FunArgs <- Expr WS (',' WS Expr)*                       ;

        Range      <- Sheet? RowCol ':' RowCol          ;
        Cell       <- Sheet? Cell_                      ;
        Sheet      <- "'"? Sheet_ "'"? '!'              ;
        Sheet_     <- [-A-Za-z0-9& ]+                   ;
        Cell_      <- <alpha>+<ddigit>+                 ;
        RowCol     <- Col_ Row_                         ;
        Col_       <- <alpha>+                          ;
        Row_       <- <ddigit>+                         ;

        String     <- '"' [A-Za-z0-9]* '"'                      ;

 leaf:   Number  <- Sign? ( <ddigit>+ Frac? ) / Frac?    ;
        Sign    <- '-' / '+'                            ;
        Frac    <- '.' <ddigit>*                        ;

        EOF        <- !.                                ;
 END;

Here is a Makefile to generate a parser for Excel expressions. I wanted parser tool to generate a parser for TclOO but the evaluator for oo:: failed so I ended up using snit. The sed command fixes up a little bug in the parser tool output:

Makefile


 PT=pt

 parser : xls-parser.peg
        $(PT) generate snit -class xls-parser  -name xls-parser xls-parser.tcl peg xls-parser.peg
        sed -e s/PACKAGE/xls-parser/ < xls-parser.tcl > tmp
        mv tmp xls-parser.tcl

This is the top level driver program. The interesting command here is "compare". It loads the same spreadsheet twice, clearing the excel provided values from any cell which contains a formula in one copy and then comparing the values with the tcl evaluated values of the cells.

xlsx


 #!/bin/env tclkit8.6
 #

 source  xml.tcl   ; # this is tax
 source xlsx.tcl

 package require vfs
 package require vfs::zip

 set argv [lassign $argv op]

 switch $op {
 form {
    lassign $argv file sheet cell

    workbook create wb $file
    puts [[wb name2obj $sheet] form $cell]
 }
 cell {
    lassign $argv file sheet cell

    workbook create wb $file
    puts [[wb name2obj $sheet] cell $cell]
 }
 cell+ {
    lassign $argv file sheet cell

    workbook create wb $file
    wb clear
    puts [[wb name2obj $sheet] cell? $cell]
 }
 cells {        workbook create wb [lindex $argv 0]
                foreach name [wb sheets] {
                    set sheet [wb name2obj $name]
                    puts [list $sheet [$sheet cells]]
                }
 }

 compare {
    lassign $argv file1 file2

    workbook create wb1 $file1
    workbook create wb2 $file2

    wb2 clear

    foreach name1 [wb1 sheets] name2 [wb2 sheets] {
        if { $name1 ne $name2 } {
            puts "Sheet names don't match $name1, $name2"
        }

        set sh1 [wb1 name2obj $name1]
        set sh2 [wb2 name2obj $name2]

        foreach cell [$sh1 cells] {
            #puts "$name1 $cell"
            if { [set v1 [$sh1 cell $cell]] != [set v2 [$sh2 cell $cell]] } {
                puts "$name1 : $cell $v1 != $v2"
            }
        }
    }
 }
 }

Down here in the guts of the thing are two Tcl gems.

These two features make taking apart an MS Excel xlsx file pretty simple.

Here is the meat of the code. Its almost 500 lines so I'll just reference it here: http://rkroll.com/tclwiki/xlsx.tcl

Included are methods to open the xlsx file and parse the cell values, formula and formats from the xml into instance variable arrays in the xlsx object. The "=" method evaluates the value of the cell including following a formula right through all the references in the spreadsheet. Each formula value is cached and evaluated only once. Formula are evaluated by parsing them into AST format with parser tools and then executing the AST as a script. The result of the script is an expression suitable for expr, which is then called to obtain the cell's value. Just enough stuff is implemented here to support the syntax and functions in my spreadsheets, but extending this should be straight forward. I have evaluated workbooks with multiple worksheets, complex formula across thousands of cells with perfect agreement to the values that excel computes itself.

These Excel functions are currently supported:

  • MAX
  • MIN
  • COUNT
  • SUM
  • AND
  • DAYS360
  • VLOOKUP
  • IF - This is handled specially in the AST expansion to short circuit.
  • YEAR
  • ROUND

The parser is fed using tcl::chan::string which needs a small patch. The Allowance method is broken so I just added a "return" at the start of virtchannel_core/events.tcl:Allowance to disable any checking. This checking seemed like overkill to me anyway.

AK: The checking is necessary. My 2009 Conference paper on Reflecting and Transforming Channels (at the TCA Conference Page ) explains these internals. The actual issue was a missing constructor chaining preventing the proper initialization of the event management core. Fixed in the source repository head.

Enjoy