Counting a million lines

(abstracted from http://www.deja.com/=dnc/viewthread.xp?AN=519463087 ); factored out to here by RS


Need a cross-platform way to count the lines in a file (thus avoiding anything like "exec wc -l")? This proc works, but is excruciatingly slow on large files:

  proc linecount {file} {
      set i 0
      set fid [open $file r]
      while {[gets $fid line] > -1} {incr i}
      close $fid
      return $i
  }

For a 1,000,000 line file this takes almost a minute to run on an Ultra 1, while wc -l takes 4-5 seconds. Increasing the buffer size with fconfigure does not help either.
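For clarity, the buffer-size tweak referred to above amounts to something like the following (the proc name and the 1 MB value are only illustrative); the per-line [gets] loop remains the bottleneck either way:

  proc linecount_bigbuf {file} {
      set i 0
      set fid [open $file r]
      # Enlarge the channel's internal buffer; this barely changes the timing.
      fconfigure $fid -buffersize 1048576
      while {[gets $fid line] > -1} {incr i}
      close $fid
      return $i
  }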

Donal Fellows replies:

Hmm. A million lines is a lot of data, and so we want to avoid copying it as much as we can (since copying is very slow when done loads of times). Working in large chunks is good too (since syscalls are slow as well).

   proc linecount2 {file {eolchar "\n"}} {
       set i 0
       set f [open $file]
       # Use a 512K buffer.
       fconfigure $f -buffersize 524288 -translation binary
       while {![eof $f]} {
           incr i [regsub -all $eolchar [read $f 524288] $eolchar junk]
       }
       close $f
       return $i
   }

Note that when the input data comes from a Mac, you will need to pass eolchar as "\r" instead of "\n", since I disable the normal end-of-line conversion for speed. The above code seems to be substantially faster than your version, though not quite as fast as [exec wc -l]. On a handy 4MB/100kLine file (with a very wide mix of line lengths), my timings (Sun Ultra 5/Solaris 2.7/Tcl 8.0.4, with the data coming over NFS) were:

   % time {linecount  $testfile}
   3883065 microseconds per iteration
   % time {linecount2 $testfile}
   565478 microseconds per iteration
   % time {exec wc -l $testfile}
   472118 microseconds per iteration

I don't know about speeds with later versions, since I've not upgraded yet...
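For reference, the classic-Mac case just means passing the second argument explicitly (the variable names here are only placeholders):

   # Unix or DOS data: the default "\n" end-of-line character is fine.
   set n [linecount2 $testfile]
   # Classic Mac data: lines end in "\r", so pass that instead.
   set n [linecount2 $mac_testfile "\r"]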

The main tunable parameter is how much data to slurp in at once. That is probably best set to a multiple of the filesystem chunk size (e.g. the size of a cluster on FAT systems), and 512 KB fits the bill fairly well without taking too much memory. Larger sizes are potentially faster (and slurping in the whole lot in one call is ideal), but you've probably got too much data for that to be a serious proposition.
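For files that do fit comfortably in memory, the one-call version of the same idea might look like this (an untimed sketch; the proc name is just illustrative):

   proc linecount_slurp {file {eolchar "\n"}} {
       set f [open $file]
       fconfigure $f -translation binary
       # Read the whole file in one call and count the end-of-line characters.
       set n [regsub -all $eolchar [read $f] $eolchar junk]
       close $f
       return $n
   }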

Unfortunately, my larger datasets (5M lines and more) are all compressed, and thus wouldn't make fair tests since most of the wall-clock time would be spent elsewhere.

DKF

Eep! If you actually ran these just like that, with the data coming over NFS, then at least *part* of the first pass was spent reading the file over NFS, and the other two probably read it out of memory. My suspicion, however, is that it was read out of memory for all three, since I doubt you ran this *cold*.

I repeated your test on a file I made by concatenating copies of Moby Dick until I got 7MB / 100,000 lines, and my timings were not so dramatically different when using tcl-8.3.3:

   % time {linecount  $testfile}
   4564520 microseconds per iteration
   % time {linecount2 $testfile}
   3414038 microseconds per iteration
   % time {exec wc -l $testfile}
   115588 microseconds per iteration

Sun Ultra 10/Solaris 2.8/Tcl 8.3.3, local filesystem, but the file was certainly cached in memory. -PSE


As of Tcl 8.3 (or so) you can trim a few more milliseconds and an extra copy out of this by using regexp -all:

    incr i [regexp -all $eolchar [read $f 524288]]

This saves shuffling around the substituted string that regsub writes into junk, which is discarded anyway.

Dave Griffin
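Spelled out as a complete proc, that substitution gives something like this (linecount3 is just an illustrative name):

    proc linecount3 {file {eolchar "\n"}} {
        set i 0
        set f [open $file]
        # Same 512K buffer as above, end-of-line translation disabled.
        fconfigure $f -buffersize 524288 -translation binary
        while {![eof $f]} {
            # With -all and no match variables, regexp returns the match count.
            incr i [regexp -all $eolchar [read $f 524288]]
        }
        close $f
        return $i
    }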


I tried that approach on http://www.bagley.org/~doug/shootout/bench/wc/wc.tcl but it was slower than

    incr i [llength [split [read $f 524288] "\n"]]

Note that the buffer there is much smaller (4K), which may change the relative speeds of these two approaches.

MS
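A complete chunked version along those lines might look like the following (a sketch; the proc name and default chunk size are only illustrative). Note the -1 per chunk: split always returns one element more than there are separators, a point that comes up again at the bottom of this page:

    proc linecount_split {file {chunk 524288}} {
        set i 0
        set f [open $file]
        fconfigure $f -buffersize $chunk -translation binary
        while {![eof $f]} {
            set data [read $f $chunk]
            if {![string length $data]} break
            # split returns one more element than there are newlines in the
            # chunk, so subtract 1 to count the newlines themselves.
            incr i [expr {[llength [split $data "\n"]] - 1}]
        }
        close $f
        return $i
    }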


I tested the two approaches using data stored in memory. "lines" is a file containing 10 million lines.

 proc CountLines {s} {
    return [llength [split $s "\n"]]
 }

 proc linecount2 {s} {
    return [regsub -all "\n" $s "\n" junk]
 }

 set fh [open lines r]
 set tdata [read $fh]
 close $fh
 puts [time {CountLines $tdata}]
 puts [time {linecount2 $tdata}]

I ran it twice, the second time with the two procedures in the opposite order. The figures shown are averages of the two runs.

CountLines: 848931 microseconds

linecount2: 1256970 microseconds

The split-based approach takes only 2/3 the time of the regsub-based approach. Regular expressions are very useful but they are often not the most efficient approach. WJP


I haven't tried it, but perhaps some of the tricks in Counting characters in a string might help. BAS


RonInHouston - 2013-03-14 19:26:18

When the "split" function is used, it adds an empty item to the list after the last newline, causing llength to over-state the line count by one.