**CSV = '''C'''omma '''S'''eparated '''V'''alue**

This is a file format that became very popular without ever being formally described anywhere. It has been in use for years — since at the very least the olden days of the first spreadsheets, such as VisiCalc. ''Its golden days, but I like the term you forged: olden days!''

Most people today expect it to mean either ''compatible with [[their own]] application's definition'' or ''compatible with [Microsoft]'s [Excel] program''. There is, however, as of 2005, an [RFC] (#4180 [http://tools.ietf.org/html/rfc4180]) that formally specifies a CSV format (and a corresponding [MIME type]).

In general, it is a file format in which each ''line'' (aka '''record''', a string of characters terminated by the native text line ending) consists of zero or more '''elements''' (strings containing no commas) separated by commas.

'''NOTE: some applications permit you to change the separator character from a comma to something else, like a ; or !, for cases where the comma is often a part of the element contents.'''

Some applications have special quoting defined for commas and line ending characters. An example of csv records would be:

 Cabbage,1.49,1,Kroger
 Celery,1.19,1,Meijer

Note that, unlike [XML], which is more popular today, the application will be making a lot of assumptions about the record layout, element format, etc.

What if an element needs to contain more than one word, or even a comma? Typically, the convention is to use double quotes:

 "Cream cheese, flavored",2.68,12,"Giant Eagle"

''[AMG]: I have noticed that quoting is only necessary when the element contains a separator character (comma or newline). In other words, "Giant Eagle" doesn't need quotes, but "Cream Cheese, flavored" does. But as CSV isn't really standardized, some programs may need quotes where others don't. For that matter, some programs may not support quotes at all, instead forbidding the use of commas inside elements. Also it's likely that some programs fail to support newlines inside quotes. I do not know if quotes are only significant at the start of the element (as in Tcl) or can be used to quote substrings. If the latter is the case, ... see below''

''[CyrilRoux]: @[AMG] Indeed, according to [http://tools.ietf.org/html/rfc4180%|%rfc4180 p2-6%|%], "Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes" '''but''' unfortunately the line-break feature is not yet implemented in the [tcllib] 1.13 csv package (v0.7.2). It is a real issue.''

What if the element has to contain a quote (")? Then one specifies two quotes side by side:

 """I can't believe it's not butter""",3.45,12,"Big Bear"

''[AMG]: ... it will be necessary to double up quotes internal to the element even when the element isn't contained in quotes. To be safe, elements containing quotes should themselves be surrounded in quotes.''

Some programs indicate an empty element by no data, while others indicate it by an empty pair of quotes:

 Batteries,4.47,,Kroger
 Batteries,4.47,"","Big Bear"

See also the [http://en.wikipedia.org/wiki/Comma-separated_values%|%Wikipedia article%|%].
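To see these rules in action from Tcl, here is a minimal sketch using the [tcllib] csv package described further down this page; the output shown in the comments is what the default settings are expected to produce:

======
package require csv

# A field containing the separator must be quoted, and an embedded
# double quote is doubled; csv::join applies these rules for us.
set record [csv::join {{Cream cheese, flavored} 2.68 12 {Giant Eagle}}]
puts $record
# -> "Cream cheese, flavored",2.68,12,Giant Eagle

# csv::split reverses the operation and returns a Tcl list of elements.
puts [csv::split $record]
# -> {Cream cheese, flavored} 2.68 12 {Giant Eagle}
======

Note that csv::join only quotes the fields that need it, so a round trip through the package does not necessarily reproduce the original quoting character for character.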
----

** Other Relevant Pages **

   * [CSV Sql] shows code that implements SQL-like queries on CSV files.

----

** Tcllib module **

 What:        csv module of Tcllib
 Where:       http://tcllib.sf.net/
              http://tcllib.sourceforge.net/doc/csv.html
 Description: The tcllib extension has a csv module. This module provides support
              for Comma Separated Value (aka csv) formatted files. These plain
              text files are a common way to export data from spreadsheets or
              row-based databases. Tcllib also comes with a few sample programs
              demonstrating the usefulness of the csv package. See the
              tcllib/examples/csv/ directory for code to convert csv files to
              html, to cut out csv columns, to join csv data from two files,
              to sort csv files by column, to do a 'uniq' type function on csv
              columns, etc. Currently at version 0.0.
 Updated:     07/2007
 Contact:     See web site

The csv utility commands in tcllib/examples/csv/ are:

***csv2html***

===
csv2html ?-sep sepchar? ?-title string? file...

Reads CSV data from the files and writes it as an HTML table on stdout.
===

***csvcut***

===
csvcut ?-sep sepchar? LIST file...

Like "cut", but for CSV files. Print selected parts of CSV records from
each FILE to standard output. LIST is a comma separated list of column
specifications. The allowed forms are:

   N     numeric specification of a single column
   N-M   range specification, both parts numeric, N < M required
   -M    see N-M, N defaults to 0
   N-    see N-M, M defaults to the last column

If there are no files, or file = "-", read from stdin.
===

***csvdiff***

===
csvdiff ?-n? ?-sep sepchar? ?-key LIST? file1 file2

Like "diff", but for CSV files. Compare selected columns of CSV records
from each FILE to standard output.

   -n            indicates that line numbers should be output
   -sep sepchar  indicates that sepchar, instead of a comma, separates
                 the CSV columns

LIST is a comma separated list of column specifications. The allowed
forms are:

   N     numeric specification of a single column
   N-M   range specification, both parts numeric, N < M required
   -M    see N-M, N defaults to 0
   N-    see N-M, M defaults to the last column

file1 and file2 are the files to be compared.
===

Example of use:

===
$ cat > f1
a|b|c|d|e|f|g|h|i|j|
1|2|3|d|e|F|g|h|i|j|
x|y|z|d|e|f|g|h|i|j|
^D
$ cat > f2
a|b|c|d|e|f|g|h|i|j|
1|2|3|d|e|f|g|h|i|j|
x|y|z|d|e|f|g|h|i|j|
^D
$ csvdiff -sep '|' -key '0 5 8 9' f1 f2
-|1|2|3|d|e|F|g|h|i|j|
+|1|2|3|d|e|f|g|h|i|j|
===

Note that if you want to compare several fields, I find that I have to use spaces to separate them, rather than commas as the comments imply. Also, if there are multiple lines in one of the files that are identical in the columns specified, a warning similar to this will appear:

===
warning: 0 2942 0000 R occurs multiple times in f1 (lines 2634 and 2633)
===

Also, all of the first file's lines are output first, then the second file's lines.

***csvjoin***

===
csvjoin ?-sep sepchar? ?-outer? keycol1 file1.in keycol2 file2.in file.out|-

Joins the two CSV input tables using the specified columns as keys to
compare and associate. The result will contain all columns from both
files, with the exception of the second key column (the result needs
only one key column; the other is identical by definition and therefore
superfluous).

Options:

   -sep    specifies the separator character used in the input files.
           Default is comma.
   -outer  Flag, perform an outer join. Means that if the key is missing
           in file2, a record is nevertheless written, extended with
           empty values.
===
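If all you need is a quick join from inside a script rather than the csvjoin command-line tool, a sketch along these lines may do. The file names items.csv and prices.csv and the readAll helper are invented for the example, both files are assumed to be keyed on their first column, and embedded newlines in fields are not handled:

======
package require csv

# Hypothetical helper: read a whole CSV file into a list of records
# (one Tcl list per row).  Assumes no embedded newlines in fields.
proc readAll {path} {
    set f [open $path r]
    set rows {}
    foreach line [split [read -nonewline $f] \n] {
        lappend rows [csv::split $line]
    }
    close $f
    return $rows
}

# Inner join of the two files on their first columns, written to stdout.
set keyed [dict create]
foreach row [readAll prices.csv] {
    dict set keyed [lindex $row 0] [lrange $row 1 end]
}
foreach row [readAll items.csv] {
    set key [lindex $row 0]
    if {[dict exists $keyed $key]} {
        puts [csv::join [concat $row [dict get $keyed $key]]]
    }
}
======

The csvjoin tool additionally takes care of separator selection and outer joins; the sketch above only shows the core idea.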
***csvsort***

===
csvsort ?-sep sepchar? ?-f? ?-n? ?-r? ?-skip cnt? column file.in|- file.out|-

Like "sort", but for CSV files. Sorts by the specified column. Input and
output are from and to a file, or stdin and stdout (any combination is
possible).

Options:

   -sep   specifies the separator character used in the input file.
          Default is comma.
   -n     If specified, integer sorting is used.
   -f     If specified, floating point sorting is used. (-n and -f
          exclude each other. If both are used, the last option decides
          the mode.)
   -r     If specified, reverse sorting is used (largest first).
   -skip  If specified, that number of rows is skipped at the beginning,
          i.e. excluded from sorting. This allows sorting of CSV files
          with header lines.
===

***csvuniq***

===
csvuniq ?-sep sepchar? column file.in|- file.out|-

Like "uniq", but for CSV files. Removes duplicates based on the specified
column, writing the first record it encounters for each value. Input and
output are from and to a file, or stdin and stdout (any combination is
possible).

Options:

   -sep   specifies the separator character used in the input file.
          Default is comma.
===

[[Examples of how to use the above commands would be helpful]]

These commands are in the [tcllib] source tree. If you want to use them, however, you have to install them by hand.

----

Additional utilities and csv functions are also of interest for future updates to the package.

----

Other csv related Tcl packages include:

 What:        catdoc
 Where:       http://mac-www.fe.msk.ru/%7Evitus/catdoc/
 Description: Extracts text from MS-Word (up through Word 97) files and tries to
              preserve as many special printable characters as possible. Includes
              a small Tk based GUI called wordview, as well as a tool called
              xls2csv for getting comma separated value files out of Excel
              spreadsheets.
 Updated:     02/1999
 Contact:     mailto:bjorn@debian.org (Bjorn Brenander)

 What:        CSV (comma-separated value) parsing
 Where:       http://purl.org/tcl/wiki/721.html
 Description: This Wiki page covers the topic of parsing CSV strings.
 Updated:     06/2000
 Contact:     Various authors

 What:        Wagner extensions
 Where:       http://www.ice.ru/%7Evitus/works/tcl.html
              http://www.fe.msk.ru/%7Evitus/works/
              http://www.ice.ru/%7Evitus/tcl/locale-tcl.html
 Description: Series of Tcl extensions. Includes an interface to syslog (which
              supports Tcl 7.x/8.0.x and now Tcl 8.1 and newer), Cyrillic
              recoding, plain DBF reading, CSV file handling, a Tcl split
              procedure that works like Perl's split, a patch to Tcl's locale
              handling, command line option parsing (Tk getopt like), and
              conversion of nroff output into text chunks. Users should check
              the web site as new releases occur as needed.
 Updated:     06/2000
 Contact:     mailto:vitus@ice.ru (Victor Wagner)

----

[Peter Lewerin] Tcl + Tcllib's csv module saved the largest amateur database of folk music in Sweden (>100k tunes). The collector was using an old database custom-written in Turbo Pascal, and as the computers were beginning to go decrepit, he was wishing that he could migrate to Access. Enter me with my Tcl CD. I wrote a probe to find out the database format as he was watching (all sources and specifications were lost), then a dumper to translate it into a CSV file, and imported it into Access. Now he thinks that content management is ''really'' easy.

[unperson] Well, these old standards can prove to be greatly useful, can't they? The conversion operation only seemed easy after you put in a lot of hours and effort! This is what great programmers manage to convey: a sense that a task is easy. A CSV file indeed looks really terrible, with all the commas and quotation marks. But when the programmer is smart, the user never sees the commas or the quotation marks; the program converts the comma-delimited file into columns right away at entry. The same way the excellent Forth programmer did it in a DOS program called RapidFile. See http://c2.com/cgi/wiki?RapidFile .

----

'''EXAMPLE'''

How to use the '''csv''' package to write the format.
======
package require csv

# Make the lists to convert to csv-format
set exampleList1 {123 123,521.2}
lappend exampleList1 {Mary says "Hello, I am Mary"}
lappend exampleList1 {}

set exampleList2 {a b c d e f}

set exampleList3 {}
for {set i 0} {$i < 10} {incr i} {
    lappend exampleList3 $i
}

# Make a list of lists...
set exampleLists [list $exampleList1 $exampleList2 $exampleList3]

# Write the data to a file
set f [open exampleLists.csv w]
puts $f [csv::joinlist $exampleLists]
close $f
======

The result of running this program is a file of four lines: one for each of the three example lists, plus an empty line.

----

[JDW] The "empty line" (mentioned above) is a nuisance for some applications. It is the result of [[csv::joinlist]] including a newline at the end of every line, rather than as a delimiter between lines. Then the [[puts]] adds another newline. The extra newline can be avoided by using the following construct:

======
% puts -nonewline $f [csv::joinlist $exampleLists]
======

Of course, the [[write_file]] command from TclX would make the [[open]]/[[puts]]/[[close]] sequence all one line:

======
% write_file -nonewline exampleLists.csv [csv::joinlist $exampleLists]
======

However, -nonewline isn't supported by write_file. My first thought was that the extra newline shouldn't be added by [[csv::joinlist]], but perhaps the real deficiency is that [[write_file]] should support -nonewline. One way or the other, it would be handy to make [[write_file]] and [[csv::joinlist]] work together. The (very ugly) workaround I've come up with is:

======
% write_file exampleLists.csv [string range [csv::joinlist $exampleLists] 0 end-1]
======

Of course, that's probably not efficient for writing non-trivial file sizes. In case this behavior is version-dependent, this was tested using ActiveTcl 8.4.19.1 on Linux.

----

[AMG]: ESR on CSV [http://www.faqs.org/docs/artu/ch05s02.html#id2901882]: "In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field."

Sigh!

----

See also [Formatting file information in CSV], [Parsing csv strings], [Parsing csv file], [How to plot a graph], [report], [Anyone interested in doing a comma delimited database?]

----

[AMG]: If for some contorted reason you don't have access to [tcllib] but still need CSV reading capability, here's some code that might just do the job:

======
proc csv_split {line} {
    # Process each input character.
    set result [list]
    set beg 0
    while {$beg < [string length $line]} {
        if {[string index $line $beg] eq "\""} {
            incr beg
            set quote false
            set word ""
            foreach char [concat [split [string range $line $beg end] ""] [list ""]] {
                # Search forward for the closing quote, one character at a time.
                incr beg
                if {$quote} {
                    if {$char in {, ""}} {
                        # Quote followed by comma or end-of-line indicates the
                        # end of the word.
                        break
                    } elseif {$char eq "\""} {
                        # Pair of quotes is valid.
                        append word $char
                    } else {
                        # No other characters can legally follow quote.  I think.
                        error "extra characters after close-quote"
                    }
                    set quote false
                } elseif {$char eq ""} {
                    # End-of-line inside quotes indicates embedded newline.
                    error "embedded newlines not supported"
                } elseif {$char eq "\""} {
                    # Treat the next character specially.
                    set quote true
                } else {
                    # All other characters pass through directly.
                    append word $char
                }
            }
            lappend result $word
        } else {
            # Use all characters up to the comma or line ending.
            regexp -start $beg {.*?(?=,|$)} $line word
            lappend result $word
            set beg [expr {$beg + [string length $word] + 1}]
        }
    }

    # If the line ends in a comma, add a blank word to the result.
    if {[string index $line end] eq ","} {
        lappend result ""
    }

    # Done.  Return the result list.
    return $result
}
======

Examples:

======
% csv_split {a b c,"d e f,",}
{a b c} {d e f,} {}
% csv_split {a b c,"d e f,",g}
{a b c} {d e f,} g
% csv_split {a b c,"d e""f,",g}
{a b c} {d e"f,} g
% csv_split {a b c,"d e""f,","g"}
{a b c} {d e"f,} g
% csv_split {a b c,"d e""f,",""}
{a b c} {d e"f,} {}
% csv_split {a b c,"d e""f,","",""}
{a b c} {d e"f,} {} {}
% csv_split {a b c,"d e""f,","","",","}
{a b c} {d e"f,} {} {} ,
% csv_split {a b c,"d e""f,","","",",",,,}
{a b c} {d e"f,} {} {} , {} {} {}
======

As you can see, [[csv_split]] is hard-wired to support comma as the delimiter and double quote as the quoting operator. This can be changed easily if needed.

[[csv_split]] does not support embedded newlines, but at least it recognizes them. It's forced to error on embedded newlines because presumably the caller won't pass it enough characters to complete the record. The caller is assumed to be a [[[gets]]] loop, which has no way of knowing whether any particular newline is quoted or not. To fix, make a version of [[csv_split]] that takes all input characters at once and returns a list of lists.

[AMG]: Hmm, or create a stateful [coroutine]! Pass it an incomplete row, it returns nothing. Pass it the rest of the row, it returns the complete row. Pass it one and a half rows, it returns the complete row and buffers the rest.

----

[LV] So, does anyone have a pointer to examples of visually displaying a csv file? There are at least two different styles I can think of:

   * display a csv file as data to be edited and then resaved
   * display a csv file where each "row" is an item (with description, etc.) that might be selected for subsequent use
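A rough sketch of the second style, using the csv package and a [ttk::treeview]: the file name data.csv is just a placeholder, the first record is assumed to hold unique column names, and embedded newlines in fields are not handled.

======
package require Tk
package require csv

# Read the whole file and split each line with the csv package.
set f [open data.csv r]
set rows {}
foreach line [split [read -nonewline $f] \n] {
    lappend rows [csv::split $line]
}
close $f

# The first record supplies the column headings; the remaining records
# become rows in a read-only ttk::treeview.
set header [lindex $rows 0]
ttk::treeview .t -columns $header -show headings
foreach col $header {
    .t heading $col -text $col
}
foreach row [lrange $rows 1 end] {
    .t insert {} end -values $row
}
pack .t -fill both -expand 1
======

For the first style (editing and resaving), something like [Tktable] plus csv::joinlist on the way back out would be a starting point.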
----

[JMN] Contrary to the example in the man page (as of 2009-04, v0.7.1), it appears that the default for csv::split is to use the (IMHO not exactly intuitive) system of converting a quoted empty string to a single quote. Using the same data as the csv_split example above, you need to use the '-alternate' flag to get the same output.

======
% csv_split {a b c,"d e""f,",""}
{a b c} {d e"f,} {}
% csv::split -alternate {a b c,"d e""f,",""}
{a b c} {d e"f,} {}
======

As opposed to the default:

======
% csv::split {a b c,"d e""f,",""}
{a b c} {d e"f,} {"}
======

At the very least there is an inconsistency in the documentation between the example and the description. I suppose converting "" to " may be useful in some cases to distinguish between null and the empty string.

[LV] Note that the csv 0.7.2 man page says, as point 4 under FORMAT, that "" represents ". So at least the doc now lists that as the situation. Right below that point is a note that -alternate treats "" as an empty string.

----

[andy]: Hi, this proc can read complex csv files whose cells contain embedded double quotes, newlines, and commas, and splits them into rows and cells.

======
proc readCSV { channel { header 1 } { symbol , }} {
    set quote 0
    set row_temp ""
    set final ""                ;# data list
    set data [ split [ read -nonewline $channel ] \n ]
    set line_count [ llength $data ]
    if { $line_count != 0 } {
        for { set k 0 } { $k < $line_count } { incr k } {
            set line [ lindex $data $k ]
            set quote [ expr $quote + [ regexp -all \" $line ]]
            if { [ expr $quote % 2 ] == 0 } {
                # Even number of quotes seen: the record is complete.
                set quote 0
                append row_temp $line
                set row_temp [ split $row_temp $symbol ]
                set col_count [ llength $row_temp ]
                for { set i 0 } { $i < $col_count } { incr i } {
                    set section [ lindex $row_temp $i ]
                    set quote [ expr $quote + [ regexp -all \" $section ]]
                    if { [ expr $quote % 2 ] == 0 } {
                        # Even number of quotes seen: the cell is complete.
                        append cell_temp $section
                        if { [ regexp {"(.*)"} $cell_temp x y ] } {
                            set cell_temp $y
                        }
                        lappend cell $cell_temp
                        set cell_temp ""
                        set quote 0
                    } else {
                        # Quoted cell containing the separator: keep collecting.
                        append cell_temp $section,
                    }
                }
                lappend final [ regsub -all {""} $cell \" ]
                set cell ""
                set row_temp ""
            } else {
                # Quoted cell containing a newline: keep collecting the row.
                append row_temp $line\n
            }
        }
    }
    # generate array if needed, or return $final here
    set row [ llength $final ]
    set column [ llength [ lindex $final 0 ]]
    if { $header == 1 } {
        for { set i 0 } { $i < $row } { incr i } {
            for { set j 0 } { $j < $column } { incr j } {
                set csvData([ lindex [ lindex $final 0 ] $j ],$i) [ lindex [ lindex $final $i ] $j ]
            }
        }
    } else {
        for { set i 0 } { $i < $row } { incr i } {
            for { set j 0 } { $j < $column } { incr j } {
                set csvData($i,$j) [ lindex [ lindex $final $i ] $j ]
            }
        }
    }
    return [ array get csvData ]
}
======

Example: get a cell by header name and row number.

======
set csv [ open c:/testcase.csv {RDWR} ]
array set csvData [ readCSV $csv ]
puts $csvData(Name,1)   ;# assumes there is a cell containing "Name" in the first row
======

Get a cell by row and column number.

======
set csv [ open c:/testcase.csv {RDWR} ]
array set csvData [ readCSV $csv 0 ]
puts $csvData(3,1)
======

Contrast with the csv package:

======
package require csv
package require struct::queue

set csv [ open c:/testcase.csv {RDWR} ]
::struct::queue q
::csv::read2queue $csv q
set final [ q peek [ q size ]]
======

%| Size (rows*columns) | readCSV | csv package | file size |%
&| 2000*4   | 193ms   | 168ms    | 768KB   |&
&| 2000*8   | 368ms   | 330ms    | 1534KB  |&
&| 2000*16  | 720ms   | 758ms    | 3065KB  |&
&| 2000*32  | 1424ms  | 2066ms   | 6127KB  |&
&| 2000*64  | 2815ms  | 6347ms   | 12252KB |&
&| 2000*128 | 5656ms  | 21806ms  | 24501KB |&

----

See: [CGI script to edit csv file]

----

!!!!!!
%| [Category Package], subset of [Tcllib] | [Category Acronym] |%
!!!!!!