Version 11 of TArray

Updated 2013-05-25 17:13:58 by APN

tarray - Typed Array extension for Tcl

Note that this extension is very much in an experimental stage. It should be robust enough with a fairly comprehensive test suite. However, the API is incomplete and subject to change.

Project is hosted at SourceForge [L1 ].

Web site and documentation is at http://tarray.sourceforge.net .

The tarray extension implements a new Tcl collection data type - typed array - and associated commands column and table. A typed array stores elements of a specific data type in native format. The primary motivation for this extension is efficient memory utilization and speed of certain operations in applications dealing with very large numbers of elements. Typed columns and tables do not really provide any additional capabilities over Tcl's built-in commands, particularly list and dict and the script libraries based on these. Their primary use is in applications with large number of data items where memory consumption and cost of operations, such as searching and sorting, are greatly reduced through the use of native types and parallelization (see preliminary numbers below).

The tarray extension was inspired in part by Speed Tables and to a lesser extent by TclRal.

The philosophy behind tarray is to provide efficient facilities on top of which more sophisticated data structures, possibly customized for specific applications, can be easily scripted and experimented with. Therefore, unlike Speed Tables, tarray does not require creation and recompilation of a new extension for each table definition. Moreover, tarray provides value-based semantics so that columns and tables can be used as basic building blocks. Additional facilities that Speed Tables provides, like remote access, are expected to be implemented at the script level.

2013-05-25 The repository now has V0.4 (though the downloads area has not been updated accordinly) which includes an experimental parallelized sort. Initial non-scientific benchmarks results show tarray sorts being more than 6 times as fast as array sizes grow larger.

Version:  0.4
Run date: Sat May 25 22:16:21 IST 2013
     Size       Type      lsort     singlethread      multithread         lsort+mt
     1000    strings        936        987 (0.95)       1264 (0.74)       1076 (0.87)
     1000    doubles        755        409 (1.85)        400 (1.89)        493 (1.53)
     1000       ints        719        407 (1.77)        321 (2.24)        415 (1.73)

    10000    strings      13189      12216 (1.08)       8205 (1.61)       8465 (1.56)
    10000    doubles       9966       4664 (2.14)       2808 (3.55)       3070 (3.25)
    10000       ints       9092       4114 (2.21)       2504 (3.63)       3201 (2.84)

   100000    strings     217192     195663 (1.11)     150872 (1.44)     140256 (1.55)
   100000    doubles     139873      60475 (2.31)      39316 (3.56)      34345 (4.07)
   100000       ints     126951      46944 (2.70)      25462 (4.99)      33753 (3.76)

  1000000    strings    3274015    2884962 (1.13)    1910492 (1.71)    1951949 (1.68)
  1000000    doubles    2003150     679919 (2.95)     374274 (5.35)     411236 (4.87)
  1000000       ints    1822994     512294 (3.56)     282488 (6.45)     371050 (4.91)

The lsort column is the baseline showing the performance of lsort for each data type (using the -real, -integer etc. options). The numbers in parenthesis show performance relative to the lsort baseline. The next two columns show tarray performance without parallelizing and parallelized to 2 threads. The last column shows the performance of a wrapper that converts a list to a tarray and back. Note this last column shows that even if you prefer to keep your data as standard lists, a two way transform to do a sort is very much worthwhile from a performance perspective.

SEH -- It would be useful to be able to do dumps of raw binary data once a TArray had been constucted. Then one could, for example, use it with a reflected channel to duplicate the function of memchan. Or to do device I/O. (Is this already possible?)

APN Direct I/O files/databases to and from typed arrays is on the to-do list but low down for a couple of reasons. First, there are still a bunch of basic operations and optimizations that have to be implemented to make the package more useful as a building block. Second, it is not clear what the output / input format should be. Even just binary dumps raise questions of endianness etc.


AK - 2013-04-18 22:18:04

Tcllib already has a memchan emulation based on reflected channels. Documentation @ https://core.tcl.tk/tcllib/doc/trunk/embedded/www/tcllib/files/modules/virtchannel_base/tcllib_memchan.html

SEH -- Since that package is pure Tcl, I was hoping a TArray-based solution would be faster.

AK -- What makes this then different from the original memchan ?

SEH -- Nothing at all, except that the author states above that TArray is intended to be a modular base for a range of specific solutions. If the function of memchan could be duplicated, that would be one less purpose-specific package to maintain, replaced by a flexible multi-purpose tool. I think it would be healthier for the Tcl ecosystem to have fewer of the former and more of the latter, for an equivalent range of applications.