Version 66 of csv

Updated 2015-03-25 16:52:43 by AMG

Comma Separated Values, or CSV, is a textual format for table/record data.

Tools

Tcllib csv

Also see below for a list of other tools

See Also

XML
RapidFile
a non-relational flat-file database written in MMSFORTH in the early days of personal computers
Formatting file information in CSV
Parsing csv strings
Parsing csv file
How to plot a graph
report
Anyone interested in doing a comma delimited database?
CGI script to edit csv file

Documentation

RFC 4180 , Common Format and MIME Type for Comma-Separated Values (CSV) Files
formally specifies a CSV format and a corresponding MIME type.
Wikipedia
CSV Sql
implements SQL-like queries on CSV files

Description

The venerable CSV format has been used at least since the days of the first spreadsheets such as VisiCalc as a common export format for table data such as spreadsheets. It uses commas to separate fields, and native line-ending character(s) of the host system to separate records. Commas and newlines can be escaped with double quotes, which in turn can be escaped by prepending another double quote character to. Two consecutive commas indicate an empty field. These plain text files are a common way to export data from spreadsheets or row-based databases.

CSV started out as an ad-hoc format, and many ad-hoc flavours of CSV exist in the wild. The Microsoft Excel implementation is widely considered as the reference implementation.

Some locales require the use of other delimiter characters. In the German locales the semicolon is used in csv-files as field delimiter as the comma is used as decimal separator. Some applications permit allow the field and/or record delimiters to be specified, giving rise to the general term, Delimiter-Separated Values.

Some applications have special quoting defined for commas and line ending characters.

An example of a csv format:

Cabbage,1.49,1,Kroger
Celery,1.19,1,Meijer

Unlike the more popular today XML, the consuming application will be making a lot of assumptions about the record layout, element format, etc.

Where a comma character is part of the record value, it can be escaped using double quotes:

"Cream cheese, flavored",2.68,12,"Giant Eagle"

Some programs only quote values when necessary, but other programs quote all values containing commas or spaces, and some programs simply quote all values. Other programs don't support value quoting at all, and values containing comma or the record delimiter are not allowed.

ZB: That's why CSV "tab-delimited" variant is soooo much better... quite nothing to worry about, since there's no need to quote anything at all

The record delimiter can also be quoted:

"Cream
   cheese, flavored",2.68,12,"Giant Eagle"

CSV parsers, particularly quickly-coded, ad-hoc parsers, don't support such complexities.

Double quotes generally encapsulate an entire value, although parsers surely exist that interpret double quotes embedded in values.

A double quote that is part of a value can be escaped by prepending another double quote to it:

"""I can't believe it's not butter""",3.45,12,"Big Bear"

AMG: ... it will be necessary to double up quotes internal to the element even when the element isn't contained in quotes. To be safe, elements containing quotes should be themselves surrounded in quotes.

Some programs indicate an empty element by no data, while others indicate it by an empty pair of quotes:

Batteries,4.47,,Kroger
Batteries,4.47,"","Big Bear"

AMG: The Art of Unix Programming , by Eric S Raymond, on CSV: "In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field."

Sigh!

Case Study: Folk Music in Sweden

PL: Tcl + Tcllib's csv module saved the largest amateur database of folk music in Sweden (>100k tunes). The collector was using an old database custom-written in Turbo Pascal, and as the computers were beginning to go decrepit, he was wishing that he could migrate to Access. Enter me with my Tcl CD. I wrote a probe to find out the database format as he was watching (all sources and specifications were lost), then a dumper to translate it into a CSV file, and imported into Access. Now he thinks that content management is really easy.

Other Tools

Other CSV-related Tcl packages include:

What: catdoc
Where: http://mac-www.fe.msk.ru/%7Evitus/catdoc/ 
Description: Extract text from MS-Word (Up through Word 97) files and
       tries to preserve as many special printable characters as possible.
       Includes a small Tk based GUI called wordview as well as a tool called
       xls2csv for getting comma separated value files out of Excel
       spreadsheets.
Updated: 02/1999
Contact: mailto:[email protected]  (Bjorn Brenander)
What: CSV (comma-separated value) parsing
Where: Parsing csv strings
Description: This Wiki page covers the topic of parsing CSV strings.
Updated: 06/2000
Contact: Various authors
What: Wagner extensions
Where: http://www.ice.ru/%7Evitus/works/tcl.html 
       http://www.fe.msk.ru/%7Evitus/works/ 
       http://www.ice.ru/%7Evitus/tcl/locale-tcl.html 
Description: Series of Tcl extensions.  Includes an interface to syslog
       (which supports Tcl 7.x/8.0.x and now Tcl 8.1 and newer),
       Cyrillic recoding, plain DBF reading, CSV file handling, tcl split
       procedure that works like perl's split, a patch to Tcl's locale
       handling,
       command line option parsing (Tk getopt like), and conversion of
       nroff output into text chunks.  Users should check the web site
       as new releases occur as needed.
Updated: 06/2000
Contact: mailto:[email protected]  (Victor Wagner)

CSV Parser, by AMG

proc csv_split {line} {
    # Process each input character.
    set result [list]
    set beg 0
    while {$beg < [string length $line]} {
       if {[string index $line $beg] eq "\""} {
          incr beg
          set quote false
          set word {} 
          foreach char [concat [split [string range $line $beg end] {}] {{}}] {
             # Search forward for the closing quote, one character at a time.
             incr beg
             if {$quote} {
                if {$char in {, {}}} {
                   # Quote followed by comma or end-of-line indicates the end of
                   # the word.
                   break
                } elseif {$char eq "\""} {
                   # Pair of quotes is valid.
                   append word $char
                } else {
                   # No other characters can legally follow quote.  I think.
                   error "extra characters after close-quote"
                }
                set quote false
             } elseif {$char eq {}} {
                # End-of-line inside quotes indicates embedded newline.
                error "embedded newlines not supported"
             } elseif {$char eq "\""} {
                # Treat the next character specially.
                set quote true
             } else {
                # All other characters pass through directly.
                append word $char
             }
          }
          lappend result $word
       } else {
          # Use all characters up to the comma or line ending.
          regexp -start $beg {.*?(?=,|$)} $line word
          lappend result $word
          set beg [expr {$beg + [string length $word] + 1}]
       }
    }

    # If the line ends in a comma, add a blank word to the result.
    if {[string index $line end] eq ","} {
       lappend result {}
    }

    # Done.  Return the result list.
    return $result
}

AMG: Partially reverted PYK's edit [L1 ] to avoid angering the deficiencies of this wiki's syntax highlighter

PYK: Note to self: keep the wiki syntax highlighter happy. AMG, I see that you also corrected an error I introduced by not paying attention to the difference between the empty string and a list containing a single empty string -- sorry about that!

Examples:

% csv_split {a b c,"d e f,",}
{a b c} {d e f,} {}
% csv_split {a b c,"d e f,",g}
{a b c} {d e f,} g
% csv_split {a b c,"d e""f,",g}
{a b c} {d e"f,} g
% csv_split {a b c,"d e""f,","g"}
{a b c} {d e"f,} g
% csv_split {a b c,"d e""f,",""}
{a b c} {d e"f,} {}
% csv_split {a b c,"d e""f,","",""}
{a b c} {d e"f,} {} {}
% csv_split {a b c,"d e""f,","","",","}
{a b c} {d e"f,} {} {} ,
% csv_split {a b c,"d e""f,","","",",",,,}
{a b c} {d e"f,} {} {} , {} {} {}

As you can see, csv_split is hard-wired to support comma as the delimiter and double quote as the quoting operator. This can be changed easily if needed.

csv_split does not support embedded newlines, but at least it recognizes them. It's forced to error on embedded newlines because presumably the caller won't pass it enough characters to complete the record. The caller is assumed to be a gets loop, which has no way of knowing whether any particular newline is quoted or not. To fix, make a version of csv_split that takes all input characters at once and returns a list of lists.

AMG: Hmm, or create a stateful coroutine! Pass it an incomplete row, it returns nothing. Pass it the rest of the row, it returns the complete row. Pass it one and a half rows, it returns the complete row and buffers the rest.

AMG: Here's a tiny CSV splitter I just made, built around [regexp]:

proc csvSplit {line} {
    set result {}
    foreach {_ unquoted quoted} [regexp -all -inline\
            {,((?:[^,\"]|"")*)|,\"((?:[^\"]|"")*)\"} ,$line] {
        if {$quoted ne {}} {
            lappend result [string map {\"\" \"} $quoted]
        } else {
            lappend result [string map {\"\" \"} $unquoted]
        }
    }
    return $result
}

Misc

LV: So, anyone have a pointer to examples of visually displaying a csv files? There are at least two different styles I can think of:

  • display a csv file as data to be editted and then resaved
  • display a csv file where each "row" is an item (with description, etc.) that might be selected for subsequent use

JMN:

Contrary to the example in the man page (as at 2009-04 v 0.7.1); It appears that the default for csv::split is to use the (IMHO not exactly intuitive) system of converting a quoted empty string to a single quote. Using the same data as the csv_split example above, you need to use -alternate to get the same output.

% csv_split {a b c,"d e""f,",""}
{a b c} {d e"f,} {}
%csv::split -alternate {a b c,"d e""f,",""}
{a b c} {d e"f,} {}

As opposed to the default:

%csv::split {a b c,"d e""f,",""}
{a b c} {d e"f,} {"}

At the very least there is an inconsistency in the documentation between the example and the description. I suppose converting "" to " may be useful in some cases to distinguish between Null and the empty string.

LV: note that the csv 0.7.2 man page says, as point 4 under FORMAT, that "" represents ". So at least the doc now lists that as the situation. Right below that point is a note that -alternate treats "" as an empty string.

PL 2014-11-12: note that RFC4180 does not mention null values or how to distinguish them from empty strings. The "" to " convention is horribly ugly IMHO, but it has some heavy-weight support (e.g. in OWC) so it's better than nothing.

CSV Parser, by andy

andy: Hi, this proc can read complex csv file which contains """, \n, ,, and split row and cell.

proc readCSV {channel {header 1} {symbol ,}} {
    set quote 0

    set data [split [read $channel nonewline] \n]
    set line_count [llength $data]
    if {$line_count != 0} {
        for {append k 0} {$k < $line_count} {incr k} {
            set line [lindex $data $k]
            set quote [expr {$quote + [regexp -all \" $line]}]
            if {[expr {$quote % 2}] == 0} {
                set quote 0
                append row_temp $line
                set row_temp [split $row_temp $symbol]
                set col_count [llength $row_temp]
                for {set i 0} {$i < $col_count} {incr i} {
                    set section [lindex $row_temp $i]
                    set quote [expr {$quote + [regexp -all \" $section]}]
                    if {[expr {$quote % 2 }] == 0} {
                        append cell_temp $section
                        if {[regexp {"(.*)"} $cell_temp x y]} {
                            set cell_temp $y
                        }
                        lappend cell $cell_temp
                        unset cell_temp
                        set quote 0
                    } else {
                        append cell_temp $section,
                    }
                }
                lappend final [regsub -all {""} $cell \"]
                unset cell
                unset row_temp
            } else {
                append row_temp $line\n
            }
        }
    }
    # generate array if needed, or return $final here
    set row [llength $final]
    set column [llength [lindex $final 0]]
    if {$header == 1} {
        for {set i 0} {$i < $row} {incr i} {
            for {set j 0} {$j < $column} {incr j} {
                set csvData([lindex [lindex $final 0] $j],$i) [
                    lindex [lindex $final $i] $j]
            }
        }
    } else {
        for {set i 0} {$i < $row} {incr i} {
            for {set j 0} {$j < $column} {incr j} {
                set csvData($i,$j) [lindex [lindex $final $i] $j]
            }
        }
    }
    return [array get csvData]
}

Example, get cell by header & row number:

set csv [open c:/testcase.csv RDWR]
array set csvData [readCSV $csv]

# assume there is a cell containing "Name" at first row.
puts $csvData(Name,1)

Example, get cell by row & column number:

set csv [open c:/testcase.csv {RDWR}]
array set csvData [readCSV $csv 0]
puts $csvData(3,1)

Contrast with csv package:

package require csv
package require struct::queue

set csv [ open c:/testcase.csv {RDWR} ]
::struct::queue q
::csv::read2queue $csv q
set final [ q peek [ q size ]]
%|Capacity|readCSV|csv package|file size|%
&|2000*4|103ms|170ms|768KB|&
&|2000*8|200ms|335ms|1534KB|&
&|2000*16|382ms|770ms|3065KB|&
&|2000*32|760ms|2088ms|6127KB|&
&|2000*64|1501ms|6411ms|12252KB|&
&|2000*128|2995ms|21841ms|24501KB|&