csv

Comma Separated Values, or CSV, is a textual format for table/record data.

Description

The venerable CSV format has been in use at least since the days of the first spreadsheets, such as VisiCalc, as a common export format for tabular data. It uses commas to separate fields, and the native line-ending character(s) of the host system to separate records. Fields containing commas or newlines can be enclosed in double quotes, and a double quote within such a field is in turn escaped by doubling it. Two consecutive commas indicate an empty field. These plain text files are a common way to export data from spreadsheets or row-based databases.

CSV started out as an ad-hoc format, and many ad-hoc flavours of CSV exist in the wild. The Microsoft Excel implementation is widely considered to be the de facto reference implementation.

Some locales require the use of other delimiter characters. In German locales, for example, the semicolon is used as the field delimiter in CSV files because the comma serves as the decimal separator. Some applications allow the field and/or record delimiters to be specified, giving rise to the more general term, Delimiter-Separated Values.

Some applications have special quoting defined for commas and line ending characters.

An example of CSV data:

Cabbage,1.49,1,Kroger
Celery,1.19,1,Meijer

Unlike XML, which is more popular today, CSV carries no schema or markup, so the consuming application must make many assumptions about the record layout, field format, etc.

Where a comma character is part of the record value, it can be escaped using double quotes:

"Cream cheese, flavored",2.68,12,"Giant Eagle"

Some programs quote values only when necessary, others quote all values containing commas or spaces, and some simply quote all values. Still other programs don't support value quoting at all, in which case values containing a comma or the record delimiter are not allowed.
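These quoting policies can be seen side by side in, for example, Python's standard csv module, which exposes them as writer options (this is an illustration in Python, not a Tcl interface):

```python
import csv
import io

row = ["Cream cheese, flavored", "2.68", "12", "Giant Eagle"]

def write_with(quoting, **kw):
    # Serialize one row under the given quoting policy and return the line.
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, **kw).writerow(row)
    return buf.getvalue().rstrip("\r\n")

# Quote only fields that need it (the common default):
print(write_with(csv.QUOTE_MINIMAL))
# Quote every field unconditionally:
print(write_with(csv.QUOTE_ALL))
# No quoting at all; special characters must be escaped instead:
print(write_with(csv.QUOTE_NONE, escapechar="\\"))
```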

ZB: That's why the "tab-delimited" variant of CSV is soooo much better... next to nothing to worry about, since there's no need to quote anything at all. AMG: Until you wind up with a tab or newline inside your data. The true test of any encoding is whether it can be nested, so picking different delimiters will not solve the problem, short of using MIME-style randomized boundaries.

The record delimiter can also be quoted:

"Cream
   cheese, flavored",2.68,12,"Giant Eagle"

Many CSV parsers, particularly quickly-coded, ad-hoc ones, don't support such complexities.

Double quotes generally encapsulate an entire value, although parsers surely exist that interpret double quotes embedded in values.

A double quote that is part of a value can be escaped by prepending another double quote to it:

"""I can't believe it's not butter""",3.45,12,"Big Bear"

AMG: ... it will be necessary to double up quotes internal to the element even when the element isn't contained in quotes.

Some programs indicate an empty element by no data, while others indicate it by an empty pair of quotes:

Batteries,4.47,,Kroger
Batteries,4.47,"","Big Bear"

AMG: ... but this is ambiguous between an empty cell and a cell consisting of a single " character. Better not to use "" to indicate an empty cell. However, you're still in trouble if you want to distinguish between an empty row and a row consisting of a single empty cell.
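Python's csv reader, for instance, treats both spellings identically, which illustrates the ambiguity (a Python demonstration, not part of any Tcl code on this page):

```python
import csv
import io

data = 'Batteries,4.47,,Kroger\nBatteries,4.47,"","Big Bear"\n'
rows = list(csv.reader(io.StringIO(data)))

# Both the bare empty field and the "" field come back as the same empty
# string, so "" cannot also round-trip a cell containing a lone " character.
print(rows[0][2] == rows[1][2] == "")
```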

ESR's criticism of CSV

AMG: The Art of Unix Programming, by Eric S. Raymond, on CSV:

"In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field."

Sigh!

Perhaps as a demonstration of how confusing CSV is, ESR is wrong to say that fields containing double quotes must themselves be enclosed in double quotes.
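For comparison, here is a minimal sketch (in Python, not Tcl) of the backslash-escape scheme ESR advocates. escape_field and split_escaped are hypothetical names, and this is not a standard CSV dialect; it just shows the "one escape character, one special case" idea:

```python
def escape_field(field, sep=","):
    # One escape character handles the separator, literal backslashes,
    # and newlines uniformly.
    out = []
    for ch in field:
        if ch == "\\":
            out.append("\\\\")
        elif ch == sep:
            out.append("\\" + sep)
        elif ch == "\n":
            out.append("\\n")
        else:
            out.append(ch)
    return "".join(out)

def split_escaped(line, sep=","):
    # Parsing has a single special case to check for: the escape character.
    fields, cur, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == "\\":
            i += 1
            nxt = line[i]
            cur.append("\n" if nxt == "n" else nxt)
        elif ch == sep:
            fields.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
        i += 1
    fields.append("".join(cur))
    return fields
```

A round trip through the two functions recovers the original fields, including embedded separators, backslashes, and newlines.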

regexp-based CSV parser, by AMG

AMG: Here's a CSV splitter implemented using [regexp]:

proc csvRowSplit {row} {
    set result {}
    foreach {_ unquoted quoted} [regexp -all -inline -expanded {
        ,((?:[^,\"]|"")*)|,\"((?:[^\"]|"")*)\"
    } ,$row] {
        if {$quoted ne {}} {
            lappend result [string map {\"\" \"} $quoted]
        } else {
            lappend result [string map {\"\" \"} $unquoted]
        }
    }
    return $result
}

The following splits a file into a list of rows, where each row is in CSV format and ready to be split further by [csvRowSplit]:

proc csvFileSplit {file} {
    set result {}
    foreach row [regexp -all -inline -expanded {
        \n(?:|(?:(?:[^\n\",]|"")*|\"(?:[^"]|"")+\")
          (?:,(?:(?:[^\n\",]|"")*|\"(?:[^"]|"")+\"))*)
    } \n$file] {
        lappend result [string range $row 1 end]
    }
    return $result
}

Here is a combined form:

proc csvSplit {file} {
    set result {}
    foreach row [csvFileSplit $file] {
        lappend result [csvRowSplit $row]
    }
    return $result
}

Or using [lmap]:

proc csvSplit {file} {
    lmap row [csvFileSplit $file] {csvRowSplit $row}
}

Or spelled out:

proc csvSplit {file} {
    set fileResult {}
    foreach row [regexp -all -inline -expanded {
        \n(?:|(?:(?:[^\n\",]|"")*|\"(?:[^"]|"")+\")
          (?:,(?:(?:[^\n\",]|"")*|\"(?:[^"]|"")+\"))*)
    } \n$file] {
        set rowResult {}
        foreach {_ unquoted quoted} [regexp -all -inline -expanded {
            ,((?:[^,\"]|"")*)|,\"((?:[^\"]|"")*)\"
        } [string replace $row 0 0 ,]] {
            if {$quoted ne {}} {
                lappend rowResult [string map {\"\" \"} $quoted]
            } else {
                lappend rowResult [string map {\"\" \"} $unquoted]
            }
        }
        lappend fileResult $rowResult
    }
    return $fileResult
}

Character-by-character CSV Parser, by AMG

proc csv_split {line} {
    # Process each input character.
    set result [list]
    set beg 0
    while {$beg < [string length $line]} {
       if {[string index $line $beg] eq "\""} {
          incr beg
          set quote false
          set word {} 
          foreach char [concat [split [string range $line $beg end] {}] {{}}] {
             # Search forward for the closing quote, one character at a time.
             incr beg
             if {$quote} {
                if {$char in {, {}}} {
                   # Quote followed by comma or end-of-line indicates the end of
                   # the word.
                   break
                } elseif {$char eq "\""} {
                   # Pair of quotes is valid.
                   append word $char
                } else {
                   # No other characters can legally follow quote.  I think.
                   error "extra characters after close-quote"
                }
                set quote false
             } elseif {$char eq {}} {
                # End-of-line inside quotes indicates embedded newline.
                error "embedded newlines not supported"
             } elseif {$char eq "\""} {
                # Treat the next character specially.
                set quote true
             } else {
                # All other characters pass through directly.
                append word $char
             }
          }
          lappend result $word
       } else {
          # Use all characters up to the comma or line ending.
          regexp -start $beg {.*?(?=,|$)} $line word
          lappend result $word
          set beg [expr {$beg + [string length $word] + 1}]
       }
    }

    # If the line ends in a comma, add a blank word to the result.
    if {[string index $line end] eq ","} {
       lappend result {}
    }

    # Done.  Return the result list.
    return $result
}

AMG: Partially reverted PYK's edit [1] to avoid triggering the deficiencies of this wiki's syntax highlighter.

PYK: Note to self: keep the wiki syntax highlighter happy. AMG, I see that you also corrected an error I introduced by not paying attention to the difference between the empty string and a list containing a single empty string -- sorry about that!

Examples:

% csv_split {a b c,"d e f,",}
{a b c} {d e f,} {}
% csv_split {a b c,"d e f,",g}
{a b c} {d e f,} g
% csv_split {a b c,"d e""f,",g}
{a b c} {d e"f,} g
% csv_split {a b c,"d e""f,","g"}
{a b c} {d e"f,} g
% csv_split {a b c,"d e""f,",""}
{a b c} {d e"f,} {}
% csv_split {a b c,"d e""f,","",""}
{a b c} {d e"f,} {} {}
% csv_split {a b c,"d e""f,","","",","}
{a b c} {d e"f,} {} {} ,
% csv_split {a b c,"d e""f,","","",",",,,}
{a b c} {d e"f,} {} {} , {} {} {}

As you can see, csv_split is hard-wired to support comma as the delimiter and double quote as the quoting operator. This can be changed easily if needed.

csv_split does not support embedded newlines, but at least it recognizes them. It's forced to error on embedded newlines because presumably the caller won't pass it enough characters to complete the record. The caller is assumed to be a gets loop, which has no way of knowing whether any particular newline is quoted or not. To fix, make a version of csv_split that takes all input characters at once and returns a list of lists.
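As a point of comparison, Python's standard csv module takes exactly this whole-input approach: the reader consumes a stream rather than individual lines, so it can tell a quoted newline (data) from a record separator and return a list of lists:

```python
import csv
import io

data = 'Cabbage,1.49,1,Kroger\n"Cream\ncheese, flavored",2.68,12,"Giant Eagle"\n'

# Because the reader sees the whole stream, the newline inside the quoted
# field is kept as part of the value instead of ending the record.
rows = list(csv.reader(io.StringIO(data)))
print(rows)
```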

AMG: Hmm, or create a stateful coroutine! Pass it an incomplete row, it returns nothing. Pass it the rest of the row, it returns the complete row. Pass it one and a half rows, it returns the complete row and buffers the rest.
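A rough Python analogue of that idea, sketched as a stateful closure rather than a Tcl coroutine (make_feeder is a hypothetical name): it buffers partial input and emits only those rows whose terminating newline falls outside quotes. Doubled quotes toggle the quote parity twice, so they leave it unchanged.

```python
import csv
import io

def make_feeder():
    buf = ""
    def feed(chunk):
        # Append the new chunk, then split off every record whose newline
        # lies outside quotes; whatever remains stays buffered.
        nonlocal buf
        buf += chunk
        rows, start, in_quotes = [], 0, False
        for i, ch in enumerate(buf):
            if ch == '"':
                in_quotes = not in_quotes
            elif ch == "\n" and not in_quotes:
                rows.append(buf[start:i])
                start = i + 1
        buf = buf[start:]
        # Delegate per-row field splitting to the csv module.
        return [next(csv.reader(io.StringIO(r + "\n"))) for r in rows]
    return feed
```

Pass it an incomplete row and it returns nothing; pass it the rest and it returns the complete row; pass it one and a half rows and it returns the complete row while buffering the remainder.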

CSV Parser, by andy

andy: Hi, this proc can read complex CSV files containing escaped quotes (""), embedded newlines, and embedded commas, and split them into rows and cells.

proc readCSV {channel {header 1} {symbol ,}} {
    set quote 0

    set data [split [read -nonewline $channel] \n]
    set line_count [llength $data]
    if {$line_count != 0} {
        for {set k 0} {$k < $line_count} {incr k} {
            set line [lindex $data $k]
            set quote [expr {$quote + [regexp -all \" $line]}]
            if {$quote % 2 == 0} {
                set quote 0
                append row_temp $line
                set row_temp [split $row_temp $symbol]
                set col_count [llength $row_temp]
                for {set i 0} {$i < $col_count} {incr i} {
                    set section [lindex $row_temp $i]
                    set quote [expr {$quote + [regexp -all \" $section]}]
                    if {$quote % 2 == 0} {
                        append cell_temp $section
                        if {[regexp {"(.*)"} $cell_temp x y]} {
                            set cell_temp $y
                        }
                        lappend cell $cell_temp
                        unset cell_temp
                        set quote 0
                    } else {
                        append cell_temp $section,
                    }
                }
                lappend final [regsub -all {""} $cell \"]
                unset cell
                unset row_temp
            } else {
                append row_temp $line\n
            }
        }
    }
    # generate array if needed, or return $final here
    set row [llength $final]
    set column [llength [lindex $final 0]]
    if {$header == 1} {
        for {set i 0} {$i < $row} {incr i} {
            for {set j 0} {$j < $column} {incr j} {
                set csvData([lindex [lindex $final 0] $j],$i) [
                    lindex [lindex $final $i] $j]
            }
        }
    } else {
        for {set i 0} {$i < $row} {incr i} {
            for {set j 0} {$j < $column} {incr j} {
                set csvData($i,$j) [lindex [lindex $final $i] $j]
            }
        }
    }
    return [array get csvData]
}

Example, get cell by header & row number:

set csv [open c:/testcase.csv RDWR]
array set csvData [readCSV $csv]

# assume there is a cell containing "Name" at first row.
puts $csvData(Name,1)

Example, get cell by row & column number:

set csv [open c:/testcase.csv {RDWR}]
array set csvData [readCSV $csv 0]
puts $csvData(3,1)

Contrast with csv package:

package require csv
package require struct::queue

set csv [ open c:/testcase.csv {RDWR} ]
::struct::queue q
::csv::read2queue $csv q
set final [ q peek [ q size ]]
Capacity    readCSV    csv package    file size
2000*4      103ms      170ms          768KB
2000*8      200ms      335ms          1534KB
2000*16     382ms      770ms          3065KB
2000*32     760ms      2088ms         6127KB
2000*64     1501ms     6411ms         12252KB
2000*128    2995ms     21841ms        24501KB

AMG: This is not my code. It was contributed by another andy. I used andy to mark my pages long ago but now use AMG to avoid confusion. But perhaps in this case I engendered confusion by having changed!

Misc

LV: So, anyone have a pointer to examples of visually displaying a csv file? There are at least two different styles I can think of:

  • display a csv file as data to be edited and then resaved
  • display a csv file where each "row" is an item (with description, etc.) that might be selected for subsequent use

JMN: Contrary to the example in the man page (as at 2009-04, v 0.7.1), it appears that the default for csv::split is to use the (IMHO not exactly intuitive) convention of converting a quoted empty string to a single quote. Using the same data as the csv_split example above, you need to use -alternate to get the same output.

% csv_split {a b c,"d e""f,",""}
{a b c} {d e"f,} {}
% csv::split -alternate {a b c,"d e""f,",""}
{a b c} {d e"f,} {}

As opposed to the default:

% csv::split {a b c,"d e""f,",""}
{a b c} {d e"f,} {"}

At the very least there is an inconsistency in the documentation between the example and the description. I suppose converting "" to " may be useful in some cases to distinguish between Null and the empty string.

LV: note that the csv 0.7.2 man page says, as point 4 under FORMAT, that "" represents ". So at least the doc now lists that as the situation. Right below that point is a note that -alternate treats "" as an empty string.

2014-11-12: note that RFC4180 does not mention null values or how to distinguish them from empty strings. The "" to " convention is horribly ugly IMHO, but it has some heavy-weight support (e.g. in OWC) so it's better than nothing.

Tools

Tcllib CSV
CSV processing package included with Tcllib
tclcsv
Binary extension for CSV parsing
CSV Sql
implements SQL-like queries on CSV files
What: catdoc
Where: http://mac-www.fe.msk.ru/%7Evitus/catdoc/ 
Description: Extracts text from MS-Word (up through Word 97) files and
       tries to preserve as many special printable characters as possible.
       Includes a small Tk based GUI called wordview as well as a tool called
       xls2csv for getting comma separated value files out of Excel
       spreadsheets.
Updated: 02/1999
Contact: mailto:[email protected]  (Bjorn Brenander)
What: Wagner extensions
Where: http://www.ice.ru/%7Evitus/works/tcl.html 
       http://www.fe.msk.ru/%7Evitus/works/ 
       http://www.ice.ru/%7Evitus/tcl/locale-tcl.html 
Description: Series of Tcl extensions.  Includes an interface to syslog
       (which supports Tcl 7.x/8.0.x and now Tcl 8.1 and newer),
       Cyrillic recoding, plain DBF reading, CSV file handling, tcl split
       procedure that works like perl's split, a patch to Tcl's locale
       handling,
       command line option parsing (Tk getopt like), and conversion of
       nroff output into text chunks.  Users should check the web site
       as new releases occur as needed.
Updated: 06/2000
Contact: mailto:[email protected]  (Victor Wagner)

References

RFC 4180 , Common Format and MIME Type for Comma-Separated Values (CSV) Files
formally specifies a CSV format and a corresponding MIME type.
Wikipedia

See Also

XML
RapidFile
a non-relational flat-file database written in MMSFORTH in the early days of personal computers
Formatting file information in CSV
Parsing csv strings
Parsing csv file
How to plot a graph
report
Sqawk
Anyone interested in doing a comma delimited database?
CGI script to edit csv file