Version 48 of csv

Updated 2013-01-18 04:17:14 by pooryorick

CSV = Comma Separated Value

This is a file format that got very popular without being formally described anywhere. It has been used for years, since at least the olden days of the first spreadsheets such as VisiCalc.

It's "golden days", but I like the term you coined: "olden days"!

Most people today expect that it means either compatible with [their own] application's definition or compatible with Microsoft's Excel program. There is, however, as of 2005 an RFC (#4180 [L1 ]) that formally specifies a CSV format (and a corresponding MIME type).

In general, it is a file format in which each line, aka record (a string of characters terminated by the native text line ending), consists of zero or more elements (strings of non-separator characters) separated by commas. NOTE: some applications permit you to change the separator character from a comma to something else, such as ; or !, for cases where the comma is often part of the element contents.

Some applications have special quoting defined for commas and line ending characters.

An example of a csv record would be:

Cabbage,1.49,1,Kroger
Celery,1.19,1,Meijer

Note that, unlike XML, which is more popular today, the application will be making a lot of assumptions about the record layout, element format, etc.

What if the element needs to have more than one word? Typically, the convention is to use double quotes:

"Cream cheese, flavored",2.68,12,"Giant Eagle"

AMG: I have noticed that quoting is only necessary when the element contains a separator character (comma or newline). In other words, "Giant Eagle" doesn't need quotes, but "Cream Cheese, flavored" does. But as CSV isn't really standardized, some programs may need quotes where others don't. For that matter, some programs may not support quotes at all, instead forbidding the use of commas inside elements. Also it's likely that some programs fail to support newlines inside quotes. I do not know if quotes are only significant at the start of the element (as in Tcl) or can be used to quote substrings. If the latter is the case, ... see below

CyrilRoux: @AMG Indeed, according to RFC 4180 (p2-6), "Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes", but unfortunately the line break feature is not yet implemented in the Tcllib 1.13 csv package (v0.7.2). It is a real issue.

What if the element has to have a quote (")? Then one specifies two quotes side by side:

"""I can't believe it's not butter""",3.45,12,"Big Bear"

AMG: ... it will be necessary to double up quotes internal to the element even when the element isn't contained in quotes. To be safe, elements containing quotes should be themselves surrounded in quotes.

Some programs indicate an empty element by no data, while others indicate it by an empty pair of quotes:

Batteries,4.47,,Kroger
Batteries,4.47,"","Big Bear"
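
For the curious, here is a small untested sketch of how Tcllib's csv package (described further down this page) handles records like these. After a package require csv, an interactive session might look like this; the last call shows the optional separator argument for semicolon-separated data:

% csv::split {"Cream cheese, flavored",2.68,12,"Giant Eagle"}
{Cream cheese, flavored} 2.68 12 {Giant Eagle}
% csv::split {Batteries,4.47,,Kroger}
Batteries 4.47 {} Kroger
% csv::split {Cabbage;1.49;1;Kroger} \;
Cabbage 1.49 1 Kroger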

See also the Wikipedia article on CSV.


Other Relevant Pages

* CSV Sql shows code that implements SQL-like queries on CSV files.


Tcllib module

 What: csv module of Tcllib
 Where: http://tcllib.sf.net/ 
        http://tcllib.sourceforge.net/doc/csv.html 
 Description: The tcllib extension has a csv module.
        This module provides support for Comma Separated Value (aka csv)
        formatted files.  These plain text files are a common way to
        export data from spreadsheets or row-based databases.
        Tcllib also comes with a few sample programs demonstrating the
        usefulness of the csv package.
        See the tcllib/examples/csv/ directory for code to convert csv files
        to html, to cut out csv columns, to join csv data from two files,
        to sort csv files by column,
        to do a 'uniq' type function on csv columns, etc.
        Currently at version 0.0.
 Updated: 07/2007
 Contact: See web site

The csv utility commands in tcllib/examples/csv/ are

csv2html

csv2html ?-sep sepchar? ?-title string? file...

        Reads CSV data from the files and returns it as an HTML table
        on stdout.

csvcut

csvcut  ?-sep sepchar? LIST file...

        Like "cut", but for CSV files. Print selected parts of CSV
        records from each FILE to standard output.

        LIST is a comma separated list of column specifications. The
        allowed forms are:

        N       numeric specification of single column
        N-M     range specification, both parts numeric,
                N < M required.
        -M      See N-M, N defaults to 0.
        N-      See N-M, M defaults to last column

        If there are no files, or a file is "-", input is read from stdin.

csvdiff

csvdiff ?-n? ?-sep sepchar? ?-key LIST? file1 file2

        Like "diff", but for CSV files. Compare selected columns of CSV
        records from each FILE to standard output.

        -n indicates that line numbers should be output

        -sep sepchar allows one to indicate that, instead of a comma,
        the sepchar will be separating the CSV columns.

        LIST is a comma separated list of column specifications. The
        allowed forms are:

        N       numeric specification of single column
        N-M     range specification, both parts numeric,
                N < M required.
        -M      See N-M, N defaults to 0.
        N-      See N-M, M defaults to last column

        file1 and file2 are the files to be compared.

Example of use:

$ cat > f1
a|b|c|d|e|f|g|h|i|j|
1|2|3|d|e|F|g|h|i|j|
x|y|z|d|e|f|g|h|i|j|
^D
$ cat > f2
a|b|c|d|e|f|g|h|i|j|
1|2|3|d|e|f|g|h|i|j|
x|y|z|d|e|f|g|h|i|j|
^D
$ csvdiff -sep '|' -key '0 5 8 9' f1 f2
-|1|2|3|d|e|F|g|h|i|j|
+|1|2|3|d|e|f|g|h|i|j|

Note that if you want to compare several fields, I find that I have to use spaces to separate them rather than commas, contrary to what the comments imply. Also, if there are multiple lines in one of the files that are identical in the columns specified, a warning similar to this will appear:

warning: 0 2942 0000 R occurs multiple times in f1 (lines 2634 and 2633)

Also, all the first file's lines are output first, then the second file's lines are output.

csvjoin

csvjoin ?-sep sepchar? ?-outer? keycol1 file1.in keycol2 file2.in file.out|-

        Joins the two CSV input tables using the specified columns as
        keys to compare and associate. The result will contain all
        columns from both files with the exception of the second key
        column (the result needs only one key column, the other is
        identical by definition and therefore superfluous).

        Options:

        -sep    specifies the separator character used in the input file.
                Default is comma.

        -outer  Flag, perform outer join. Means that if the key is
                missing in file2 a record is nevertheless written,
                extended with empty values.

csvsort

csvsort ?-sep sepchar? ?-f? ?-n? ?-r? ?-skip cnt? column file.in|- file.out|-

        Like "sort", but for CSV files. Sorts after the specified
        column. Input and output are from and to a file or stdin
        and stdout (Any combination is possible).

        Options:

        -sep    specifies the separator character used in the input file.
                Default is comma.

        -n      If specified, integer sorting is used.
        -f      If specified, floating point sorting is used.
                (-n and -f are mutually exclusive; if both are given,
                the last option decides the mode.)

        -r      If specified, reverse sorting is used (largest first).

        -skip   If specified, that number of rows is skipped at the
                beginning, i.e. excluded from sorting. This allows
                sorting of CSV files with header lines.

csvuniq

csvuniq ?-sep sepchar? column file.in|- file.out|-

        Like "uniq", but for CSV files. Uniq's the specified column.
        Writes the first record it encounters for a value. Input and
        output are from and to a file or stdin and stdout (Any
        combination is possible).

        Options:

        -sep    specifies the separator character used in the input file.
                Default is comma.

[Examples of how to use the above commands would be helpful]
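
Until someone adds real ones, here are a few hypothetical invocations (the file names are made up; the options simply follow the syntax descriptions above):

$ csvcut -sep '|' 0-2 f1                              # first three columns of f1
$ csvsort -skip 1 -n 2 groceries.csv sorted.csv       # numeric sort on column 2, keeping the header row
$ csvuniq 3 groceries.csv stores.csv                  # first record for each distinct value in column 3
$ csvjoin -outer 0 prices.csv 0 stock.csv joined.csv  # outer join on column 0 of both files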

These commands are in the tcllib source tree. If you want to use them, however, you have to install them by hand.


Additional utilities and csv functions are also of interest for future updates to the package.


Other csv related Tcl packages include:

 What: catdoc
 Where: http://mac-www.fe.msk.ru/%7Evitus/catdoc/ 
 Description: Extracts text from MS-Word files (up through Word 97) and
        tries to preserve as many special printable characters as possible.
        Includes a small Tk based GUI called wordview as well as a tool called
        xls2csv for getting comma separated value files out of Excel
        spreadsheets.
 Updated: 02/1999
 Contact: mailto:[email protected]  (Bjorn Brenander)
 What: CSV (comma-separated value) parsing
 Where: Parsing csv strings
 Description: This Wiki page covers the topic of parsing CSV strings.
 Updated: 06/2000
 Contact: Various authors
 What: Wagner extensions
 Where: http://www.ice.ru/%7Evitus/works/tcl.html 
        http://www.fe.msk.ru/%7Evitus/works/ 
        http://www.ice.ru/%7Evitus/tcl/locale-tcl.html 
 Description: Series of Tcl extensions.  Includes an interface to syslog
        (which supports Tcl 7.x/8.0.x and now Tcl 8.1 and newer),
        Cyrillic recoding, plain DBF reading, CSV file handling, tcl split
        procedure that works like perl's split, a patch to Tcl's locale
        handling,
        command line option parsing (Tk getopt like), and conversion of
        nroff output into text chunks.  Users should check the web site
        as new releases occur as needed.
 Updated: 06/2000
 Contact: mailto:[email protected]  (Victor Wagner)

Peter Lewerin Tcl + Tcllib's csv module saved the largest amateur database of folk music in Sweden (>100k tunes). The collector was using an old database custom-written in Turbo Pascal, and as the computers were beginning to go decrepit, he was wishing that he could migrate to Access. Enter me with my Tcl CD. I wrote a probe to find out the database format as he was watching (all sources and specifications were lost), then a dumper to translate it into a CSV file, and imported it into Access. Now he thinks that content management is really easy.

unperson Well these old standards can prove to be greatly useful, ain't they?

The conversion operation seemed easy after you put in a lot of hours and effort! This is what great programmers manage to convey: a sense that a task is easy.

A CSV file indeed looks really terrible with the commas and the quotation marks. But when the programmer is smart, the user never sees the commas or the quotation marks; the program converts the comma-delimited file into columns right away at entry. That is what the excellent Forth programmer did in a DOS program called Rapid File. See: http://c2.com/cgi/wiki?RapidFile .


Example

How to use the csv package to write data in CSV format:

  package require csv

  # Make the lists to convert to csv-format
  set exampleList1 {123 123,521.2}
  lappend exampleList1 {Mary says "Hello, I am Mary"}
  lappend exampleList1 {}
  set exampleList2 {a b c d e f}
  set exampleList3 {}
  for {set i 0} {$i < 10} {incr i} {
     lappend exampleList3 $i
  }
  # Make a list of lists...
  set exampleLists [list $exampleList1 $exampleList2 $exampleList3]

  # Write the data to a file
  set f [open exampleLists.csv w]
  puts $f [csv::joinlist $exampleLists]
  close $f

The result of running this program is 4 lines - one for each example list, and an empty line.


JDW The "empty line" (mentioned above) is a nuisance for some applications. It is the result of [csv::joinlist] including a newline at the end of every line, rather than as a delimiter between lines. Then the [puts] adds another newline. The extra newline can be avoided by using the following construct:

  % puts -nonewline $f [csv::joinlist $exampleLists]

Of course, the [write_file] command from Tclx would make the [open]/[puts]/[close] sequence all one line:

  % write_file -nonewline exampleLists.csv [csv::joinlist $exampleLists]

However, -nonewline isn't supported on write_file. My first thought was that the extra newline shouldn't be added by [csv::joinlist], but perhaps the real deficiency is that [write_file] should support -nonewline. One way or the other, it would be handy to make [write_file] and [csv::joinlist] work together.

The (very ugly) workaround I've come up with is:

  % write_file exampleLists.csv [string range [csv::joinlist $exampleLists] 0 end-1]

Of course, that's probably not efficient for writing non-trivial file sizes.

In case this behavior is version-dependent, this was tested using ActiveTcl 8.4.19.1 on Linux.
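
A plain-Tcl workaround that avoids both the string range trick and the Tclx dependency is to wrap the open / puts -nonewline / close sequence in a tiny helper. This is just a sketch, not an existing command:

  proc write_csv_file {name data} {
      # Hypothetical helper: write the data without adding a trailing newline.
      set f [open $name w]
      puts -nonewline $f $data
      close $f
  }

  write_csv_file exampleLists.csv [csv::joinlist $exampleLists]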


AMG: ESR on CSV [L2 ]: "In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field."

Sigh!


See also Formatting file information in CSV, Parsing csv strings, Parsing csv file, How to plot a graph, report, Anyone interested in doing a comma delimited database?


AMG: If for some contorted reason you don't have access to tcllib but still need CSV reading capability, here's some code that might just do the job:

proc csv_split {line} {
    # Process each input character.
    set result [list]
    set beg 0
    while {$beg < [string length $line]} {
       if {[string index $line $beg] eq "\""} {
          incr beg
          set quote false
          set word ""
          foreach char [concat [split [string range $line $beg end] ""] [list ""]] {
             # Search forward for the closing quote, one character at a time.
             incr beg
             if {$quote} {
                if {$char in {, ""}} {
                   # Quote followed by comma or end-of-line indicates the end of
                   # the word.
                   break
                } elseif {$char eq "\""} {
                   # Pair of quotes is valid.
                   append word $char
                } else {
                   # No other characters can legally follow quote.  I think.
                   error "extra characters after close-quote"
                }
                set quote false
             } elseif {$char eq ""} {
                # End-of-line inside quotes indicates embedded newline.
                error "embedded newlines not supported"
             } elseif {$char eq "\""} {
                # Treat the next character specially.
                set quote true
             } else {
                # All other characters pass through directly.
                append word $char
             }
          }
          lappend result $word
       } else {
          # Use all characters up to the comma or line ending.
          regexp -start $beg {.*?(?=,|$)} $line word
          lappend result $word
          set beg [expr {$beg + [string length $word] + 1}]
       }
    }

    # If the line ends in a comma, add a blank word to the result.
    if {[string index $line end] eq ","} {
       lappend result ""
    }

    # Done.  Return the result list.
    return $result
}

Examples:

% csv_split {a b c,"d e f,",}
{a b c} {d e f,} {}
% csv_split {a b c,"d e f,",g}
{a b c} {d e f,} g
% csv_split {a b c,"d e""f,",g}
{a b c} {d e"f,} g
% csv_split {a b c,"d e""f,","g"}
{a b c} {d e"f,} g
% csv_split {a b c,"d e""f,",""}
{a b c} {d e"f,} {}
% csv_split {a b c,"d e""f,","",""}
{a b c} {d e"f,} {} {}
% csv_split {a b c,"d e""f,","","",","}
{a b c} {d e"f,} {} {} ,
% csv_split {a b c,"d e""f,","","",",",,,}
{a b c} {d e"f,} {} {} , {} {} {}

As you can see, [csv_split] is hard-wired to support comma as the delimiter and double quote as the quoting operator. This can be changed easily if needed.

[csv_split] does not support embedded newlines, but at least it recognizes them. It's forced to error on embedded newlines because presumably the caller won't pass it enough characters to complete the record. The caller is assumed to be a [gets] loop, which has no way of knowing whether any particular newline is quoted or not. To fix, make a version of [csv_split] that takes all input characters at once and returns a list of lists.

AMG: Hmm, or create a stateful coroutine! Pass it an incomplete row, it returns nothing. Pass it the rest of the row, it returns the complete row. Pass it one and a half rows, it returns the complete row and buffers the rest.
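
To make the coroutine idea concrete, here is a small untested sketch. It assumes Tcl 8.6 coroutines, the csv_split proc above, and a hypothetical data.csv; feed the coroutine one physical line per call, and it returns the completed record, or an empty string while a quoted field is still being buffered:

proc csv_collector {} {
    set buffer ""
    set result ""
    while 1 {
        set line [yield $result]
        append buffer $line
        if {[regexp -all {"} $buffer] % 2 == 0} {
            # Quotes are balanced, so the logical record is complete.
            set result [csv_split $buffer]
            set buffer ""
        } else {
            # An open quoted field continues on the next line; keep the newline.
            append buffer \n
            set result ""
        }
    }
}

coroutine collect csv_collector
set rows {}
set f [open data.csv r]
while {[gets $f line] >= 0} {
    set row [collect $line]
    if {$row ne ""} {lappend rows $row}
}
close $f

(A record that parses to an empty list is indistinguishable from "still buffering" in this toy version; a real implementation would signal completion separately.)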


LV So, anyone have a pointer to examples of visually displaying a csv file? There are at least two different styles I can think of:

  • display a csv file as data to be edited and then resaved
  • display a csv file where each "row" is an item (with description, etc.) that might be selected for subsequent use; a rough sketch of this style appears below
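
For the second style, a rough untested sketch (assuming Tcl/Tk 8.5+, Tcllib's csv package, and a hypothetical data.csv whose first row is a header) could put each record into a ttk::treeview:

package require Tk
package require csv

# Read the file one record per line (no embedded newlines assumed).
set f [open data.csv r]
set rows {}
while {[gets $f line] >= 0} {
    lappend rows [csv::split $line]
}
close $f

# The first row supplies the column headings; the rest become selectable items.
set names [lindex $rows 0]
set ids {}
for {set i 0} {$i < [llength $names]} {incr i} {lappend ids c$i}
ttk::treeview .t -columns $ids -show headings
foreach id $ids name $names {.t heading $id -text $name}
foreach row [lrange $rows 1 end] {.t insert {} end -values $row}
pack .t -fill both -expand 1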

JMN Contrary to the example in the man page (as of 2009-04, v0.7.1), it appears that the default for csv::split is to use the (IMHO not exactly intuitive) convention of converting a quoted empty string to a single quote. Using the same data as the csv_split example above, you need to use the '-alternate' flag to get the same output.

% csv_split {a b c,"d e""f,",""}
{a b c} {d e"f,} {}
% csv::split -alternate {a b c,"d e""f,",""}
{a b c} {d e"f,} {}

As opposed to the default:

% csv::split {a b c,"d e""f,",""}
{a b c} {d e"f,} {"}

At the very least there is an inconsistency in the documentation between the example and the description. I suppose converting "" to " may be useful in some cases to distinguish between Null and the empty string.

LV note that the csv 0.7.2 man page says, as point 4 under FORMAT, that "" represents ". So at least the doc now lists that as the situation. Right below that point is a note that -alternate treats "" as an empty string.


andy: Hi, this proc can read a complex CSV file whose cells contain embedded double quotes (""), newlines, and commas, and it splits the data into rows and cells.

proc readCSV {channel {header 1} {symbol ,}} {
    set quote 0

    # Read the whole file and walk it line by line, gluing physical lines
    # back together whenever a quoted cell contains an embedded newline
    # (detected by an odd number of double quotes seen so far).
    set data [split [read -nonewline $channel] \n]
    set line_count [llength $data]
    if {$line_count != 0} {
        for {set k 0} {$k < $line_count} {incr k} {
            set line [lindex $data $k]
            set quote [expr {$quote + [regexp -all \" $line]}]
            if {$quote % 2 == 0} {
                set quote 0
                append row_temp $line
                # Split the logical row on the separator, then re-join any
                # sections that belong to the same quoted cell.
                set row_temp [split $row_temp $symbol]
                set col_count [llength $row_temp]
                for {set i 0} {$i < $col_count} {incr i} {
                    set section [lindex $row_temp $i]
                    set quote [expr {$quote + [regexp -all \" $section]}]
                    if {$quote % 2 == 0} {
                        append cell_temp $section
                        # Strip the surrounding quotes, if any.
                        if {[regexp {"(.*)"} $cell_temp x y]} {
                            set cell_temp $y
                        }
                        lappend cell $cell_temp
                        unset cell_temp
                        set quote 0
                    } else {
                        # The separator was inside a quoted cell; put it back.
                        append cell_temp $section$symbol
                    }
                }
                # Undouble embedded quotes ("" -> ") and store the row.
                lappend final [regsub -all {""} $cell \"]
                unset cell
                unset row_temp
            } else {
                # The row continues on the next physical line.
                append row_temp $line\n
            }
        }
    }
    # generate array if needed, or return $final here
    set row [llength $final]
    set column [llength [lindex $final 0]]
    if {$header == 1} {
        # Index cells by header name and row number.
        for {set i 0} {$i < $row} {incr i} {
            for {set j 0} {$j < $column} {incr j} {
                set csvData([lindex [lindex $final 0] $j],$i) [lindex [lindex $final $i] $j]
            }
        }
    } else {
        # Index cells by row and column number.
        for {set i 0} {$i < $row} {incr i} {
            for {set j 0} {$j < $column} {incr j} {
                set csvData($i,$j) [lindex [lindex $final $i] $j]
            }
        }
    }
    return [array get csvData]
}

Example, get cell by header & row number:

set csv [ open c:/testcase.csv {RDWR} ]
array set csvData [ readCSV $csv ]
puts $csvData(Name,1)    ;# assume there is a cell containing "Name" at first row.

Example, get cell by row & column number:

set csv [open c:/testcase.csv {RDWR}]
array set csvData [readCSV $csv 0]
puts $csvData(3,1)

Contrast with csv package:

package require csv
package require struct::queue

set csv [ open c:/testcase.csv {RDWR} ]
::struct::queue q
::csv::read2queue $csv q
set final [ q peek [ q size ]]

  Capacity    readCSV    csv package    file size
  2000*4      103ms      170ms          768KB
  2000*8      200ms      335ms          1534KB
  2000*16     382ms      770ms          3065KB
  2000*32     760ms      2088ms         6127KB
  2000*64     1501ms     6411ms         12252KB
  2000*128    2995ms     21841ms        24501KB

See: CGI script to edit csv file