Parsing csv file

gwlester 15-Mar-2011

The CSV package in TclLib does handle multiline quoted fields:

% package require csv
0.7.2
% set line  {a,"b1
b2",c}
a,"b1
b2",c
% set list [::csv::split $line]
a {b1
b2} c
% 

pi31415 18-Mar-2011

Some tcllib examples don't handle multiline quoted fields:

 $ cd examples/csv
 $ cat foo.csv
 a,"b1
 b2",c
 $ ./csv2html foo.csv
 <html><head><title>Title</title></head><body>
 <h1>Title</h1>
 <p><table border=1>
 <tr><td>a</td> <td>b1</td></tr>
 </table></p></body></html>

Line csv parser can't do proper handling of multiline quoted field.

Michael Heca's solution:

 proc csvParseFileCmd { file cmd {sep {,}} {quot {'}} } {

    # prepare regular expresions
    set rQuotedText [subst -nobackslashes -nocommands {(?:[^${quot}]|${quot}${quot})*}]
    set rQ  [subst -nobackslashes -nocommands {^[ \t]*${quot}}]
    set rQN [subst -nobackslashes -nocommands {^[ \t]*${quot}($rQuotedText)${quot}[ \t]*${sep}(.*)$}]
    set rQE [subst -nobackslashes -nocommands {^[ \t]*${quot}($rQuotedText)${quot}[ \t]*$}]
    set rQO [subst -nobackslashes -nocommands {^[ \t]*${quot}($rQuotedText)$}]
    set rON [subst -nobackslashes -nocommands {^($rQuotedText)${quot}[ \t]*${sep}(.*)$}]
    set rOE [subst -nobackslashes -nocommands {^($rQuotedText)${quot}[ \t]*$}]
    set rOX [subst -nobackslashes -nocommands {^$rQuotedText${quot}[^${quot}]}]
    set rUN [subst -nobackslashes -nocommands {^([^${sep}]*)${sep}(.*)$}]
    set qMap [list "${quot}${quot}" $quot]

    # parse whole file
    gets $file line
    while { $line != {} } {
        set list {}
        # parse one record
        while { $line != {} } {
            if { [regexp $rQ $line] } {
                # handle quoted fields
                if { [regexp $rQN $line match field rest] } {
                    # quoted with next
                    set field [string map $qMap $field]
                    set line $rest
                } elseif { [regexp $rQE $line match field rest] } {
                    # last quoted 
                    set field [string map $qMap $field]
                    set line {}
                } elseif { [regexp $rQO $line match field] } {
                    # open quoted
                    set field [string map $qMap $field]
                    gets $file line
                    while { 1 } {
                        if { [string first $quot $line] == -1 } {
                            # speed next line
                            append field $line
                            gets $file line
                        } elseif { [regexp $rON $line match next rest] } {
                            # end with next fields
                            append field "\n[string map $qMap $next]"
                            set line $rest
                            break
                        } elseif { [regexp $rOE $line match next] } {
                            # end
                            append field "\n[string map $qMap $next]"
                            set line {}
                            break
                        } elseif { [regexp $rOX $line] } {
                            error "Invalid CSV syntax.\nline=$line\nfield=$field"
                        } else {
                            # next line
                            append field "\n[string map $qMap $next]"
                            gets $file line
                        }
                    }
                } else {
                    error "Invalid CSV syntax.\nline=$line"
                }
            } else {
                # handle unquoted field
                if { [regexp $rUN $line match field rest] } {
                    # unquoted with next
                    set line $rest
                } else {
                    # unquoted last
                    set field $line
                    set line {}
                }
                # tream leading and trailing spaces
                set field [string trim $field]
            }
            # append parset field to record list
            lappend list $field
        }
        # call command
        eval $cmd [list $list]
        # get next record line
        gets $file line
    }
 }

File is read by line. For each csv record is called proc cmd with list of parsed fields as params.

Test example:

 proc csvDump { list } {
    global out
    puts $out [string map { \n "#" } [join $list |]]
 }

 proc test {} {
    global out
    set out [open test.out w]
    fconfigure $out -encoding iso8859-2
    set fd [open test.csv]
    fconfigure $fd -encoding cp1250
    csvParseFileCmd $fd csvDump , '
    close $fd
    close $out
 }

 test

Tested with 118MB csv file. On Athlon XP 1600/512MB RAM, Debian Woody/Tcl 8.4 take about 3 min 20 sec.


Maybe this is too simple but this has worked for me. If the line ends in a quoted field there will be an odd number of " 's. I'm deliberately not keeping the embedded line ends so it may not suit others' needs. The returned line can be parsed with tcllib's csv parser - VPT

 proc buildLine {ch} {
 
      set cnt 0
      set line {}
      while {![eof $ch] } {
          gets $ch linePart
          set cnt [expr $cnt + [regexp -all {"} $linePart]]
          append line $linePart
          if {$cnt % 2 == 0} break
      }
      return $line
 }

pi31415

Here is a brute force emitter and parser based on a script I wrote for a recent hobby project. csv::formatCsv expects a list of lists and produces CSV. csv::parseText expects CSV and produces a list of lists.

 # Code to emit or parse csv
 # See also: http://en.wikipedia.org/wiki/CSV_application_support
 # Uses Unicode.  For ASCII: http://billposer.org/Software/uni2ascii.html

 namespace eval csv {
     # leave eol \n and use translation to write other formats
     set equo {""}
     set eol "\n"
     set quo {"}
     set sep ","
 }
 
 proc csv::encodeLine {fields} {
     set field undefined
     set parts [list]
     set retval undefined
     foreach {field} $fields {
         set field [regsub -all $csv::quo $field $csv::equo]
         if {[regexp "\[$csv::sep\\s\]" $field]} {
             set field "$csv::quo$field$csv::quo"
         }
         lappend parts $field
     }
     set retval [join [list [join $parts $csv::sep] $csv::eol] ""]
     return $retval
 }
 
 proc csv::formatCsv {data} {
     set line undefined
     set lines [list]
     set retval undefined
     foreach {line} $data {
         lappend lines [encodeLine $line]
     }
     set retval [join $lines ""]
     return $retval
 }
 
 proc csv::parseText {text} {
     set c undefined
     set chars [list]
     set fields [list]
     set i undefined
     set parts [list]
     set quoted false
     set retval [list]
     set text [string map [list "\r" ""] $text]
     if {![regexp "$csv::eol\$" $text]} {
         append text $csv::eol
     }
 
     # Parse the content character by character
     for {set i 0} {$i < [string length $text]} {incr i} {
         set c [string index $text $i]
         if {$c eq $csv::quo} {
             set quoted [expr {!$quoted}]
         }
 
         if {$c eq $csv::eol && !$quoted} {
             # End of line
             lappend parts $chars
 
             foreach {chars} $parts {
                 if {[llength $chars] > 0} {
                     if {[lindex $chars 0] eq $csv::quo} {
                         # Remove enclosure delimiters
                         set chars [lrange $chars 1 "end-1"]
                     } elseif {[join [lrange $chars 0 1] ""] eq "=$csv::quo"} {
                         # Remove formula sigil for CSV_application_support
                         set chars [lrange $chars 2 "end-1"]
                     }
                 }
 
                 # dequote quotes
                 lappend fields [regsub -all \
                     $csv::equo              \
                     [join $chars ""]        \
                     $csv::quo               \
                 ]
             }
 
             lappend retval $fields
             set chars [list]
             set fields [list]
             set parts [list]
         } elseif {$c eq $csv::sep && !$quoted} {
             # End of field
             lappend parts $chars
             set chars [list]
         } else {
             # Inside the field
             lappend chars $c
         }
     }
     return $retval
 }

 set data [list                     \
     [list row row row "your boat"] \
     [list gently down the stream]  \
 ]
 puts "original data:\n$data\n"
 
 set text [csv::formatCsv $data]
 puts "text:\n$text"
 
 set newdata [csv::parseText $text]
 puts "newdata:\n$newdata"

I believe the following code parses csv correctly. Uses ReadChar.

 # Here paste ReadChar code by pmarin https://wiki.tcl-lang.org/25551

 proc readCSV {fileName matrixName {sepchar \t}} {
    upvar $matrixName matrix
    set h [open $fileName r]
    ReadChar create csv $h
    set cell {}
    set rowL [list]
    set rowNum 0
    set numCols 0
    set quoting 0
    set doublequotes 0
    while {![csv eof]} {
        set ch [csv getc]
        if {$ch eq "\r"} {
            if {!$quoting} {
                continue   ;# next char \n on Windows
            } else {
                append cell $ch
            }
        } elseif {$ch eq "\n"} {
            if {!$quoting} {  ;# regular line end: flush
                if {![llength $rowL] && ![string length [string trim $cell]]} {
                    # Spurious last line
                    continue
                }
                lappend rowL $cell
                if {[llength $rowL] != $numCols} {
                    if {$numCols == 0} {
                        set numCols [llength $rowL]
                    } else {
                        error "Record #$rowNum: length is [llength $rowL], should be $numCols"
                    }
                }
                lappend matrix $rowL
                set cell {}
                set rowL [list]
                incr rowNum
            } else {
                append cell $ch
            }
        } elseif {$ch eq $sepchar} {
            if {!$quoting} {
                lappend rowL $cell
                set cell {}
            } else {
                append cell $ch
            }
        } elseif {$ch eq "\""} {
            if {!$quoting} {
                set quoting 1
            } else {
                if {$doublequotes} {
                    append cell $ch
                    set doublequotes 0
                } elseif {[csv peek] eq "\""} {
                    set doublequotes 1
                } else {   ;# closing quotes
                    set quoting 0
                }
            }
        } else {
            append cell $ch
        }
    }
    if {$quoting} {
        error "Unexpected EOF"
    }
    csv destroy
 }

 # Example
 set sheet [list]
 readCSV "testglm.csv" sheet ","
 puts "[llength $sheet] rows x [llength [lindex $sheet 0]] columns"

See also csv

See also LogParser