'''tarray - Typed Array extension for Tcl'''

Project is hosted at SourceForge [http://sourceforge.net/projects/tarray]. The latest download is V0.6, which adds new commands, bug fixes, and more parallelized operations. The web site and documentation are at http://tarray.sourceforge.net

**Overview**

The tarray extension implements a new Tcl collection data type - the typed array - and the associated commands `column` and `table`. A typed array stores elements of a specific data type in native format. The primary motivation for this extension is efficient memory utilization and speed of certain operations in applications dealing with very large numbers of elements. This is achieved through native storage formats and parallelization of operations on multi-core CPUs. See the benchmarks below.

The tarray extension was inspired in part by [Speed Tables] and to a lesser extent by [TclRal]. The philosophy behind tarray is to provide efficient facilities on top of which more sophisticated data structures, possibly customized for specific applications, can be easily scripted and experimented with. Therefore, unlike [Speed Tables], tarray does not require creation and recompilation of a new extension for each table definition. Moreover, tarray provides value-based semantics so that columns and tables can be used as basic building blocks. Additional facilities that [Speed Tables] provides, like remote access, are expected to be implemented at the script level.

**Benchmarks**

As expected, the benefits of native format and parallelization can be significant (more than two orders of magnitude for searches) as shown in the benchmarks below.

***Sorting***

The `lsort` column is the baseline showing the performance of `lsort` for each data type (using the `-real`, `-integer` etc. options). The next two columns show tarray performance without parallelization and parallelized to 2 threads. The numbers in parentheses show performance relative to the `lsort` baseline.

======
Version: 0.6
Run date: Thu Aug 14 10:20:06 IST 2014

    Size     Type      lsort    singlethread     multithread
   10000  strings       4159    3776 (1.10)     2367 (1.76)
   10000  any           4104    3787 (1.08)     2675 (1.53)
   10000  doubles       3087    1377 (2.24)      895 (3.45)
   10000  ints          2888    1100 (2.62)      733 (3.94)
  100000  strings      85702   55741 (1.54)    40372 (2.12)
  100000  any          84130   74371 (1.13)    63067 (1.33)
  100000  doubles      47588   17283 (2.75)    10354 (4.60)
  100000  ints         44223   13729 (3.22)     7770 (5.69)
 1000000  strings    1407869  816009 (1.73)   635882 (2.21)
 1000000  any        1379399 1204986 (1.14)  1015875 (1.36)
 1000000  doubles     781096  214341 (3.64)   132427 (5.90)
 1000000  ints        743424  148959 (4.99)    93574 (7.94)
======

***Searching***

Similar to the above, with the `lsearch` baseline using `-all` together with `-exact`, `-integer` or `-real` as appropriate for the data type.

======
    Size     Type    lsearch    singlethread      multithread
   10000  strings        200      48 (4.17)        82 (2.44)
   10000  any            238     270 (0.88)       253 (0.94)
   10000  doubles      10414      30 (347.13)     126 (82.65)
   10000  ints          1460      12 (121.67)      53 (27.55)
  100000  strings       6088     631 (9.65)       521 (11.69)
  100000  any           5756    8453 (0.68)      4926 (1.17)
  100000  doubles      92572     253 (365.90)     197 (469.91)
  100000  ints         22501     123 (182.93)     132 (170.46)
 1000000  strings      71039    5345 (13.29)     4315 (16.46)
 1000000  any          74739  107890 (0.69)     70007 (1.07)
 1000000  doubles     951024    2115 (449.66)    1164 (817.03)
 1000000  ints        280801     986 (284.79)     558 (503.23)
======
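As a quick illustration of the column-level commands the benchmarks above exercise, here is a minimal sketch. The `tarray::column create` and `tarray::column sort` forms are assumed from the project documentation; the `search` call mirrors the one used in the real-world example below. Check http://tarray.sourceforge.net for the exact syntax and options.

======
package require tarray

# Create a typed column of integers from a Tcl list.
set c [tarray::column create int {7 3 11 3 42 -5}]

# Sorted copy of the column. Columns are values, so $c itself
# is left unchanged (assuming sort returns a new column).
set sorted [tarray::column sort $c]

# Indices of all elements greater than 5, analogous to
# lsearch -all -integer on a plain Tcl list.
set matches [tarray::column search -all -gt $c 5]
======

Because columns have value semantics, the results above are new values; nothing is modified in place.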
***Real world example***

The following example compares storing the geographical data for cities from http://geonames.org as a list, as a [sqlite] in-memory database, and as a tarray table. The data set has just over 142000 records. The corresponding cities table definition is shown below.

======
tarray::table create {
    geonameid  int
    name       string
    country    string
    latitude   double
    longitude  double
    population wide
    elevation  int
}
======

The table is queried for all cities with more than a million people. Memory usage and timing are shown below. Even from the memory usage point of view, lists are not suitable for more than a few hundred thousand records. Note that tarray uses twice as much memory as sqlite but is two orders of magnitude faster on the search. '''This is not to say sqlite and tarray are comparable! sqlite is a database, tarray is not!''' Nevertheless, for use as an in-memory data structure, tarray can replace some uses of sqlite.

[APN] is somewhat surprised by the difference in speed. Any hints on optimizing the sqlite query would be appreciated. It would also have been nice to compare [Speed Tables], but it does not build on Windows and I do not have a Linux benchmarking host.

======
list results:
    Virtual Bytes: 112 MB
    Working Set:   111 MB
    Page file:     112 MB

sqlite results: {db eval {select name from geo where population > 1000000}}
    Virtual Bytes: 17 MB
    Working Set:    9 MB
    Page file:      9 MB
    23142.2 microseconds per iteration

tarray results: {table get -columns name $tab [column search -all -gt [table column $tab population] 1000000]}
    Virtual Bytes: 40 MB
    Working Set:   18 MB
    Page file:     19 MB
    187.7 microseconds per iteration
======

----

[SEH] -- It would be useful to be able to do dumps of raw binary data once a TArray has been constructed. Then one could, for example, use it with a [reflected channel] to duplicate the function of [memchan]. Or to do device I/O. (Is this already possible?)

[APN] Direct I/O of files/databases to and from typed arrays is on the to-do list but low down for a couple of reasons. First, there are still a bunch of basic operations and optimizations that have to be implemented to make the package more useful as a building block. Second, it is not clear what the output/input format should be. Even plain binary dumps raise questions of endianness etc.

----

'''[AK] - 2013-04-18 22:18:04'''

Tcllib already has a memchan emulation based on reflected channels. Documentation @ https://core.tcl.tk/tcllib/doc/trunk/embedded/www/tcllib/files/modules/virtchannel_base/tcllib_memchan.html

[SEH] -- Since that package is pure Tcl, I was hoping a TArray-based solution would be faster.

[AK] -- What then makes this different from the original memchan?

[SEH] -- Nothing at all, except that the author states above that TArray is intended to be a modular base for a range of specific solutions. If the function of memchan could be duplicated, that would be one less purpose-specific package to maintain, replaced by a flexible multi-purpose tool. I think it would be healthier for the Tcl ecosystem to have fewer of the former and more of the latter, for an equivalent range of applications.

<<categories>> Command | Package | Data Structure