Version 5 of Parsing csv strings

Updated 2003-03-20 16:33:49

Note: This functionality was added to tcllib.


From a comp.lang.tcl discussion with Paulo da Silva and Ulrich Schoebel, brought here for persistence by RS:

1. A record of a csv file (comma-separated values, as exported e.g. by Excel) is a set of ascii values separated by "," (for other languages it may be ";" however, although this is not important for this case).

2. If a value contains itself the separator ",", then it (the value) is put between "".

3. If a value contains ", it is replaced by "".

Ex. The following record is parsed as follows:

     123,"123,521.2","Mary says ""Hello, I am Mary"""
         a) 123
         b) 123,521.2
         c) Mary says "Hello, I am Mary"

Ulrich Schoebel's solution:

 set orig_string {123,"123,521.2","Mary says ""Hello, I am Mary"""}

 proc csv:parse {line {sepa ,}} {
     set lst [split $line $sepa]
     set nlst {}
     set l [llength $lst]
     for {set i 0} {$i < $l} {incr i} {
         if {[string index [lindex $lst $i] 0] == "\""} {
            # start of a stringhttp://purl.org/thecliff/tcl/wiki/721.html
            if {[string index [lindex $lst $i] end] == "\""} {
               # check for completeness, on our way we repair double double quotes
               set c1 [string range [lindex $lst $i] 1 end]
               set n1 [regsub -all {""} $c1 {"} c2]
               set n2 [regsub -all {"} $c2 {"} c3]
               if {$n1 == $n2} {
                  # string extents to next list element
                  set new_el [join [lrange $lst $i [expr {$i + 1}]] $sepa]
                  set lst [lreplace $lst $i [expr {$i + 1}] $new_el]
                  incr i -1
                  incr l -1
                  continue
                  } else {
                  # we are done with this element
                  lappend nlst [string range $c2 0 [expr {[string length $c2] - 2}]]
                  continue
                  }
               } else {
               # string extents to next list element
               set new_el [join [lrange $lst $i [expr {$i + 1}]] $sepa]
               set lst [lreplace $lst $i [expr {$i + 1}] $new_el]
               incr i -1
               incr l -1
               continue
               }
            } else {
            # the most simple case
            lappend nlst [lindex $lst $i]
            continue
            }
         }
     return $nlst
 }
 # now testing...
 puts [csv:parse $orig_string]

RS: just introduced configurable separator (defaults to comma, as before)


Jeffrey Hobbs' solution (with sample cases):

 set str1 {"123","""a""",,hello}
 set str2 {1," o, ""a"" ,b ", 3}
 set str3 {"1"," o, "","" ,b ", 3}
 set str4 {1," foo,bar,baz", 3}
 set str5 {1,"""""a""""",b}

 proc csv2list {str {sepChar ,}} {
    regsub -all {(\A\"|\"\Z)} $str \0 str
    set str [string map [list $sepChar\"\"\" $sepChar\0\" \
                              \"\"\"$sepChar \"\0$sepChar \
                              \"\" \" \" \0 ] $str]
    set end 0
    while {[regexp -indices -start $end {(\0)[^\0]*(\0)} $str \
            -> start end]} {
        set start [lindex $start 0]
        set end   [lindex $end 0]
        set range [string range $str $start $end]
        set first [string first $sepChar $range]
        if {$first >= 0} {
            set str [string replace $str $start $end \
                [string map [list $sepChar \1] $range]]
        }
        incr end
    }
    set str [string map [list $sepChar \0 \1 $sepChar \0 {} ] $str]
    return [split $str \0]
 }

It makes use of 8.1.1+ string commands and is faster. Oh, and don't you find it ever more readable?? BTW, you can't skip the other direction:

 proc list2csv {list {sepChar ,}} {
    set out ""
    foreach l $list {
        set sep {}
        foreach val $l {
            if {[string match "*\[\"$sepChar\]*" $val]} {
                append out $sep\"[string map [list \" \"\"] $val]\"
            } else {
                append out $sep$val
            }
            set sep $sepChar
        }
        append out \n
    }
    return $out
 }

So to get an equivalent identity (remember, you can skip the quotes in some cases):

 list2csv [list [csv2list $str1] [csv2list $str2] [csv2list $str3] ...]

Laurent Duperval points out that the new tcllib contains a module for this purpose.

Larry Virden asks, "Where? What is its name? I have tcllib 0.8 and do not see it..."

LV: Andreas Kupries has added a csv package to the latest tcllib 0.9.


How are RETURN characters handled within Double Quotes?

How is abcd,abc",abc handled?

That looks ill-formed to me, so who cares how it is handled! :^)

How is abcd,abc"",abc handled?

By splitting into abcd abc\" abc of course.

LV: It is my opinion that whom ever left these questions was concerned about data that might be in the file with odd characteristics - a csv handling function needs to either return correct results or return some sort of error (preferably the first, not the second). Recommending that applications write thousands of lines of special handling to accomodate a general purpose function's lack of functionality is likely to result in the user moving on to some other language or package...


Larry Virden asks if anyone has compared the above in relationship to functionality? I also wonder about the questions right before this - seems like the first question is addressed by the first code; I don't know about the last two. Anyone know of a good set of test cases for cvs data?


Two csv-related applications Larry Virden would like to find (or break down and write if necesssary) are:

  • an application that would compare two csv files, with the ability to specify the numbers of the fields to compare
  • an application that would work like cut, only use csv field numbers (with the user definable csv character capability)

Thanks to AK, Tcllib 1.0 has example apps for csv and nntp. The csv apps include some benchmark apps, a csv to html converter, a csv cut, diff, join, sort and uniq app.


ZLM says: Excel uses the separator defined on Control Panel | Regional Options | Numbers | List Separator. "List Separator" cannot be the same as "Decimal Symbol" or else Excel will silently keep using the previous value. This is undocumented in the Excel Help system, of course. (07-01-2003)


Category Concept