'''Comma Separated Values''', or '''CSV''', is a textual format for table/record data.

** Description **

The venerable CSV format has been used at least since the days of the first spreadsheets such as VisiCalc as a common export format for table data. It uses commas to separate '''fields''', and the native line-ending character(s) of the host system to separate '''records'''. Commas and newlines can be escaped by enclosing the field in double quotes, which in turn can be escaped by prepending another double quote character to each. Two consecutive commas indicate an empty field. These plain text files are a common way to export data from spreadsheets or row-based databases.

CSV started out as an ad-hoc format, and many ad-hoc flavours of CSV exist in the wild. The [Microsoft] [Excel] implementation is widely considered the reference implementation.

Some locales require other delimiter characters. In German locales the semicolon is used as the field delimiter in CSV files, since the comma serves as the decimal separator. Some applications allow the field and/or record delimiters to be specified, giving rise to the general term '''Delimiter-Separated Values'''. Some applications have special quoting defined for commas and line-ending characters.

An example of the CSV format:

======none
Cabbage,1.49,1,Kroger
Celery,1.19,1,Meijer
======

Unlike with [XML], which is more popular today, the consuming application must make many assumptions about the record layout, element format, etc.

Where a comma character is part of a field value, it can be escaped using double quotes:

======none
"Cream cheese, flavored",2.68,12,"Giant Eagle"
======

Some programs quote values only when necessary, others quote all values containing commas or spaces, and some simply quote all values. Still other programs don't support value quoting at all, and values containing the comma or the record delimiter are not allowed.
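In Tcl, these quoting rules are implemented by Tcllib's csv package (see the Tools section below); a minimal sketch, with an illustrative input string:

======
# A minimal sketch using Tcllib's csv package; the input record is illustrative.
package require csv

# The embedded comma survives because the field is quoted,
# and the doubled-quote escape is undone:
puts [::csv::split {"Cream cheese, flavored",2.68,12,"Giant Eagle"}]
======

`::csv::split` returns a proper Tcl list, so fields containing spaces or commas come back as single list elements.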
''[ZB]: That's why the "tab-delimited" variant of CSV is soooo much better... next to nothing to worry about, since there's no need to quote anything at all. [AMG]: Until you wind up with a tab or newline inside your data. The true test of any encoding is whether it can be nested, so picking different delimiters will not solve the problem, short of using [MIME]-style randomized boundaries.''

The record delimiter can also be quoted:

======
"Cream cheese,
flavored",2.68,12,"Giant Eagle"
======

Many CSV parsers, particularly quickly-coded, ad-hoc parsers, don't support such complexities. Double quotes generally encapsulate an entire value, although parsers surely exist that interpret double quotes embedded in values. A double quote that is part of a value can be escaped by prepending another double quote to it:

======none
"""I can't believe it's not butter""",3.45,12,"Big Bear"
======

''[AMG]: ... it will be necessary to double up quotes internal to the element even when the element isn't contained in quotes.''

Some programs indicate an empty element by no data, while others indicate it by an empty pair of quotes:

======none
Batteries,4.47,,Kroger
Batteries,4.47,"","Big Bear"
======

''[AMG]: ... but this is ambiguous between an empty cell and a cell consisting of a single `"` character. It's better not to use `""` to indicate an empty cell. However, you're still in trouble if you want to distinguish between an empty row and a row consisting of a single empty cell.''

**[ESR]'s criticism of CSV**

[AMG]: [http://www.catb.org/esr/writings/taoup/html/ch05s02.html#id2901882%|%The Art of Unix Programming], by Eric S Raymond, on CSV:

: "In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash.
This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field."

Sigh! Perhaps as a demonstration of how confusing CSV is, [ESR] is wrong to say that fields containing double quotes must themselves be enclosed in double quotes.

** Case Study: Folk Music in Sweden **

[PL]: Tcl + Tcllib's csv module saved the largest amateur database of folk music in Sweden (>100k tunes). The collector was using an old database custom-written in Turbo Pascal, and as the computers were beginning to go decrepit, he was wishing that he could migrate to Access. Enter me with my Tcl CD. I wrote a probe to find out the database format as he was watching (all sources and specifications were lost), then a dumper to translate it into a CSV file, and imported the result into Access. Now he thinks that content management is ''really'' easy.
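For comparison, the backslash-escape scheme ESR describes above can be sketched in a few lines of Tcl. This is an illustration, not a standard CSV parser, and `bsSplit` is a hypothetical name:

======
# Hypothetical splitter for the backslash-escape scheme ESR describes
# (not standard CSV): \ makes the next character literal, so \, is a
# comma inside a field and \\ is a literal backslash.
proc bsSplit {line} {
    set fields {}
    set word {}
    set escaped false
    foreach char [split $line {}] {
        if {$escaped} {
            # Single action: take the character literally.
            append word $char
            set escaped false
        } elseif {$char eq "\\"} {
            # Single special case: the escape character.
            set escaped true
        } elseif {$char eq ","} {
            lappend fields $word
            set word {}
        } else {
            append word $char
        }
    }
    lappend fields $word
    return $fields
}
======

As ESR notes, the parser has exactly one special case to check for, which is what makes this style attractive compared to CSV's quote-doubling rules.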
**[regexp]-based CSV parser, by [AMG]**

[AMG]: Here's a CSV splitter implemented using [[[regexp]]]:

======none
proc csvRowSplit {row} {
    set result {}
    foreach {_ unquoted quoted} [regexp -all -inline -expanded {
        ,((?:[^,\"]|"")*)|,\"((?:[^\"]|"")*)\"
    } ,$row] {
        if {$quoted ne {}} {
            lappend result [string map {\"\" \"} $quoted]
        } else {
            lappend result [string map {\"\" \"} $unquoted]
        }
    }
    return $result
}
======

The following splits a file into a list of rows, where each row is in CSV format and ready to be split further by [[csvRowSplit]]:

======none
proc csvFileSplit {file} {
    set result {}
    foreach row [regexp -all -inline -expanded {
        \n(?:|(?:(?:[^\n\",]|"")*|\"(?:[^"]|"")+\")
        (?:,(?:(?:[^\n\",]|"")*|\"(?:[^"]|"")+\"))*)
    } \n$file] {
        lappend result [string range $row 1 end]
    }
    return $result
}
======

Here is a combined form:

======
proc csvSplit {file} {
    set result {}
    foreach row [csvFileSplit $file] {
        lappend result [csvRowSplit $row]
    }
    return $result
}
======

Or using [[[lmap]]]:

======
proc csvSplit {file} {
    lmap row [csvFileSplit $file] {csvRowSplit $row}
}
======

Or spelled out:

======none
proc csvSplit {file} {
    set fileResult {}
    foreach row [regexp -all -inline -expanded {
        \n(?:|(?:(?:[^\n\",]|"")*|\"(?:[^"]|"")+\")
        (?:,(?:(?:[^\n\",]|"")*|\"(?:[^"]|"")+\"))*)
    } \n$file] {
        set rowResult {}
        foreach {_ unquoted quoted} [regexp -all -inline -expanded {
            ,((?:[^,\"]|"")*)|,\"((?:[^\"]|"")*)\"
        } [string replace $row 0 0 ,]] {
            if {$quoted ne {}} {
                lappend rowResult [string map {\"\" \"} $quoted]
            } else {
                lappend rowResult [string map {\"\" \"} $unquoted]
            }
        }
        lappend fileResult $rowResult
    }
    return $fileResult
}
======

**Character-by-character CSV Parser, by [AMG]**

======
proc csv_split {line} {
    # Process each input character.
    set result [list]
    set beg 0
    while {$beg < [string length $line]} {
        if {[string index $line $beg] eq "\""} {
            incr beg
            set quote false
            set word {}
            foreach char [concat [split [string range $line $beg end] {}] {{}}] {
                # Search forward for the closing quote, one character at a time.
                incr beg
                if {$quote} {
                    if {$char in {, {}}} {
                        # Quote followed by comma or end-of-line indicates the
                        # end of the word.
                        break
                    } elseif {$char eq "\""} {
                        # Pair of quotes is valid.
                        append word $char
                    } else {
                        # No other characters can legally follow quote. I think.
                        error "extra characters after close-quote"
                    }
                    set quote false
                } elseif {$char eq {}} {
                    # End-of-line inside quotes indicates embedded newline.
                    error "embedded newlines not supported"
                } elseif {$char eq "\""} {
                    # Treat the next character specially.
                    set quote true
                } else {
                    # All other characters pass through directly.
                    append word $char
                }
            }
            lappend result $word
        } else {
            # Use all characters up to the comma or line ending.
            regexp -start $beg {.*?(?=,|$)} $line word
            lappend result $word
            set beg [expr {$beg + [string length $word] + 1}]
        }
    }

    # If the line ends in a comma, add a blank word to the result.
    if {[string index $line end] eq ","} {
        lappend result {}
    }

    # Done. Return the result list.
    return $result
}
======

''[AMG]: Partially reverted [PYK]'s edit [http://wiki.tcl.tk/_/revision?N=2215&V=55] to avoid angering the deficiencies of this wiki's syntax highlighter''

''[PYK]: Note to self: keep the wiki syntax highlighter happy.
[AMG], I see that you also corrected an error I introduced by not paying attention to the difference between the empty string and a list containing a single empty string -- sorry about that!''

Examples:

======none
% csv_split {a b c,"d e f,",}
{a b c} {d e f,} {}
% csv_split {a b c,"d e f,",g}
{a b c} {d e f,} g
% csv_split {a b c,"d e""f,",g}
{a b c} {d e"f,} g
% csv_split {a b c,"d e""f,","g"}
{a b c} {d e"f,} g
% csv_split {a b c,"d e""f,",""}
{a b c} {d e"f,} {}
% csv_split {a b c,"d e""f,","",""}
{a b c} {d e"f,} {} {}
% csv_split {a b c,"d e""f,","","",","}
{a b c} {d e"f,} {} {} ,
% csv_split {a b c,"d e""f,","","",",",,,}
{a b c} {d e"f,} {} {} , {} {} {}
======

As you can see, `csv_split` is hard-wired to use comma as the delimiter and double quote as the quoting character. This can be changed easily if needed.

`csv_split` does not support embedded newlines, but at least it recognizes them. It's forced to raise an error on embedded newlines because presumably the caller won't pass it enough characters to complete the record. The caller is assumed to be a `[gets]` loop, which has no way of knowing whether any particular newline is quoted or not. To fix this, make a version of `csv_split` that takes all input characters at once and returns a list of lists.

[AMG]: Hmm, or create a stateful `[coroutine]`! Pass it an incomplete row, and it returns nothing. Pass it the rest of the row, and it returns the complete row. Pass it one and a half rows, and it returns the complete row and buffers the rest.

** CSV Parser, by [andy] **

[andy]: This proc can read a complex CSV file containing `"""`, `\n`, and `,`, and split it into rows and cells.
======none
proc readCSV {channel {header 1} {symbol ,}} {
    set quote 0
    set data [split [read -nonewline $channel] \n]
    set line_count [llength $data]
    if {$line_count != 0} {
        for {set k 0} {$k < $line_count} {incr k} {
            set line [lindex $data $k]
            set quote [expr {$quote + [regexp -all \" $line]}]
            if {$quote % 2 == 0} {
                set quote 0
                append row_temp $line
                set row_temp [split $row_temp $symbol]
                set col_count [llength $row_temp]
                for {set i 0} {$i < $col_count} {incr i} {
                    set section [lindex $row_temp $i]
                    set quote [expr {$quote + [regexp -all \" $section]}]
                    if {$quote % 2 == 0} {
                        append cell_temp $section
                        if {[regexp {"(.*)"} $cell_temp x y]} {
                            set cell_temp $y
                        }
                        lappend cell $cell_temp
                        unset cell_temp
                        set quote 0
                    } else {
                        append cell_temp $section,
                    }
                }
                lappend final [regsub -all {""} $cell \"]
                unset cell
                unset row_temp
            } else {
                append row_temp $line\n
            }
        }
    }
    # generate array if needed, or return $final here
    set row [llength $final]
    set column [llength [lindex $final 0]]
    if {$header == 1} {
        for {set i 0} {$i < $row} {incr i} {
            for {set j 0} {$j < $column} {incr j} {
                set csvData([lindex [lindex $final 0] $j],$i) [lindex [lindex $final $i] $j]
            }
        }
    } else {
        for {set i 0} {$i < $row} {incr i} {
            for {set j 0} {$j < $column} {incr j} {
                set csvData($i,$j) [lindex [lindex $final $i] $j]
            }
        }
    }
    return [array get csvData]
}
======

Example, get a cell by header & row number:

======
set csv [open c:/testcase.csv RDWR]
array set csvData [readCSV $csv]
# assume there is a cell containing "Name" in the first row.
puts $csvData(Name,1)
======

Example, get a cell by row & column number:

======
set csv [open c:/testcase.csv {RDWR}]
array set csvData [readCSV $csv 0]
puts $csvData(3,1)
======

Contrast with the csv package:

======
package require csv
package require struct::queue
set csv [open c:/testcase.csv {RDWR}]
::struct::queue q
::csv::read2queue $csv q
set final [q peek [q size]]
======

%|Capacity|readCSV|csv package|file size|%
&|2000*4|103ms|170ms|768KB|&
&|2000*8|200ms|335ms|1534KB|&
&|2000*16|382ms|770ms|3065KB|&
&|2000*32|760ms|2088ms|6127KB|&
&|2000*64|1501ms|6411ms|12252KB|&
&|2000*128|2995ms|21841ms|24501KB|&

[AMG]: This is not my code. It was contributed by another [andy]. I used [andy] to mark my pages long ago but now use [AMG] to avoid confusion. But perhaps in this case I engendered confusion by having changed!

** Misc **

[LV]: So, does anyone have a pointer to examples of visually displaying a CSV file? There are at least two different styles I can think of:

* display a CSV file as data to be edited and then resaved
* display a CSV file where each "row" is an item (with description, etc.) that might be selected for subsequent use

----

[JMN]: Contrary to the example in the man page (as of 2009-04, v 0.7.1), it appears that the default for `csv::split` is to use the (IMHO not exactly intuitive) convention of converting a quoted empty string to a single quote. Using the same data as the csv_split example above, you need to use ''-alternate'' to get the same output.

======none
% csv_split {a b c,"d e""f,",""}
{a b c} {d e"f,} {}
======

======none
% csv::split -alternate {a b c,"d e""f,",""}
{a b c} {d e"f,} {}
======

As opposed to the default:

======none
% csv::split {a b c,"d e""f,",""}
{a b c} {d e"f,} {"}
======

At the very least there is an inconsistency in the documentation between the example and the description. I suppose converting `""` to `"` may be useful in some cases to distinguish between null and the empty string.
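For the writing direction, Tcllib's `csv::join` applies the doubled-quote convention automatically; a quick sketch, with illustrative values:

======
# A sketch of the writing direction using Tcllib's csv package;
# the field values are illustrative.
package require csv

# Fields containing the separator are quoted, and embedded double
# quotes are doubled, per the convention discussed above.
puts [::csv::join {{d e"f,} {} {a b c}}]
======

Round-tripping the result through `::csv::split` (or `::csv::split -alternate`) recovers the original list, modulo the `""`-versus-empty-string ambiguity discussed here.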
[LV]: note that the csv 0.7.2 man page says, as point 4 under FORMAT, that `""` represents `"`. So at least the doc now lists that as the situation. Right below that point is a note that ''-alternate'' treats `""` as an empty string.

[PL] 2014-11-12: note that RFC 4180 does not mention null values or how to distinguish them from empty strings. The `""` to `"` convention is horribly ugly IMHO, but it has some heavy-weight support (e.g. in OWC), so it's better than nothing.

** Tools **

[Tcllib CSV]: CSV processing package included with [Tcllib]

[CSV Sql]: implements SQL-like queries on CSV files

[Parsing CSV Strings]:

===none
What: '''catdoc'''
Where: http://mac-www.fe.msk.ru/%7Evitus/catdoc/
Description: Extracts text from MS-Word (up through Word 97) files and tries to preserve as many special printable characters as possible. Includes a small Tk-based GUI called wordview as well as a tool called xls2csv for getting comma-separated value files out of Excel spreadsheets.
Updated: 02/1999
Contact: mailto:bjorn@debian.org (Bjorn Brenander)
===

===none
What: '''Wagner extensions'''
Where: http://www.ice.ru/%7Evitus/works/tcl.html
       http://www.fe.msk.ru/%7Evitus/works/
       http://www.ice.ru/%7Evitus/tcl/locale-tcl.html
Description: Series of Tcl extensions. Includes an interface to syslog (which supports Tcl 7.x/8.0.x and now Tcl 8.1 and newer), Cyrillic recoding, plain DBF reading, CSV file handling, a Tcl split procedure that works like Perl's split, a patch to Tcl's locale handling, command line option parsing (Tk getopt-like), and conversion of nroff output into text chunks. Users should check the web site as new releases occur as needed.
Updated: 06/2000
Contact: mailto:vitus@ice.ru (Victor Wagner)
===

**References**

[RFC] [https://www.rfc-editor.org/info/rfc4180%|%4180], Common Format and MIME Type for Comma-Separated Values (CSV) Files: formally specifies a CSV format and a corresponding [MIME type].
[http://en.wikipedia.org/wiki/Comma-separated_values%|%Wikipedia]:

** See Also **

[XML]:

[http://c2.com/cgi/wiki?RapidFile%|%RapidFile]: a non-relational flat-file database written in MMSFORTH in the early days of personal computers

[Formatting file information in CSV]:

[Parsing csv strings]:

[Parsing csv file]:

[How to plot a graph]:

[report]:

[Anyone interested in doing a comma delimited database?]:

[CGI script to edit csv file]:

<<categories>> Package | Tcllib | Glossary