Version 10 of Some observations on behind-the-scenes actions of Tcl

Updated 2008-02-24 23:38:26 by hans

2008-02-24

1. I keep a large collection of text data as a list in memory (lappend x {text...} etc)

2. I want to search this data.

Here's what happens.

  set match [lsearch -regexp $x {needle}]

-> memory usage of the Tcl process more than doubles (before: 80MB, after: 200MB) (EDIT 2008-02-25: stays at 80MB using -glob)

  foreach k $x { if {[regexp -nocase {needle} $k]} {puts "match"} }

-> ditto

  foreach k $x { if {[regexp -nocase {needle} [list $k]]} {puts "match"} }

-> Eureka! Total memory usage stays at 80MB.

I'm still not quite sure what's going on, it's about keeping lists 'pure' I guess. I'm now consulting these pages:

list, shimmering, pure list.

6am EDIT: I think I almost get it now. regexp treats the elements of $x as strings and forces every element into a string representation AS WELL as a list representation. Sigh. lsearch likewise forces a string representation of all elements. Seems unavoidable. I could also lappend items as strings, not lists, but I'm saving my data as Tcl source for various reasons, and of course { } is much cleaner in that case.

-hans
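
One way to check guesses like the above is to ask Tcl what representations a value currently carries. The following is only a sketch: it assumes Tcl 8.6 or later, where the unsupported command ::tcl::unsupported::representation exists (it was not available in 2008), and show_rep is a hypothetical helper name. The exact wording of the report differs between Tcl versions, so no expected output is given.

```tcl
# Sketch only: assumes Tcl 8.6 or later, where the unsupported
# introspection command ::tcl::unsupported::representation is available.
# show_rep is a hypothetical helper name, not part of Tcl.
proc show_rep {label value} {
    set rep [::tcl::unsupported::representation $value]
    puts "$label: $rep"
    return $rep
}

set x {}
lappend x {some needle text}
set k [lindex $x 0]

# Report the element's representation before and after regexp touches it.
show_rep "before regexp" $k
regexp -nocase {needle} $k
show_rep "after regexp" $k
```

Comparing the two reports for an element of the real list should show whether regexp left an extra internal representation behind.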


I see a common misconception about Tcl here. Please notice using curly braces to quote a command argument *does not* turn the contents of the {...} into a list. IOW, your lappend x {text...} etc should result in exactly the same value as lappend x "text..." etc. As to the original issue of memory growth, I would be curious to read what the core maintainers have to say. Also interesting would be a -glob search. - RT 24Feb08
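
A minimal sketch of the point about quoting, using made-up sample text: braces and double quotes are two ways of writing the same argument, so both lappend calls below append one identical element.

```tcl
set x {}
set y {}
lappend x {some text}
lappend y "some text"

puts [expr {$x eq $y}]                     ;# prints 1: both variables hold the same value
puts [llength $x]                          ;# prints 1: one element was appended
puts [expr {[lindex $x 0] eq "some text"}] ;# prints 1: the element is the plain text
```

The difference between the two quoting styles is only in substitution: inside double quotes $, [ ] and backslashes are processed, inside braces they are not.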

Ah, as for the glob search, indeed there's no memory doubling with that; thanks for the hint. Also, I'm not talking about values but about storage here. At the very least, "lappend x {ABC}" results in only an internal list representation of the 3 bytes ABC being stored, with no string rep, which is exactly what I want: maximum memory efficiency. See below for more experimenting, especially

    eval "lappend a {$t}"

-hans 2008-02-25


Lars H: Are you talking about the object getting a string representation, or getting a String internal representation? (They're not the same.) For

  foreach k $x { if {[regexp -nocase {needle} [list $k]]} {puts "match"} }

to make a difference, it seems it'd have to be the latter (in order to get a string rep for [list $k], one first needs the string rep of $k, but since [list $k] is not the same Tcl_Obj as $k, an intrep imposed upon [list $k] by regexp will not be retained in the elements of $x). That String intreps use 2 bytes for every character is consistent with a jump from 80MB to 200MB, if you have about 60M ASCII characters without intrep in your big list.
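
The arithmetic behind that estimate, as a rough sketch. It uses the figures quoted above and assumes 1MB = 1,000,000 bytes; the variable names are only for illustration.

```tcl
set chars          60000000 ;# about 60M ASCII characters in the big list
set bytes_per_char 2        ;# size of one character in a String intrep, as stated above
set base_mb        80       ;# memory usage before the search

# Extra memory needed if every character gains a 2-byte copy.
set extra_mb [expr {$chars * $bytes_per_char / 1000000}]
puts "extra: $extra_mb MB, total: [expr {$base_mb + $extra_mb}] MB"
# prints: extra: 120 MB, total: 200 MB
```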

I once had a similar difficulty, but the other way round, with a program that dumped a list of lists of integers to a text file. In order to generate a stringrep for a list of integers, Tcl first had to generate stringreps for all the integers, and that similarly doubled the memory usage during the final dump. In that case I solved it by feeding each integer through format before I did anything stringy to it, since that gave a Tcl_Obj that wasn't shared with the big list of lists of integers...
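
A sketch of that format trick, not the original program: rows_to_text is a hypothetical name, and the sample data is made up. Each integer goes through format, which returns a fresh Tcl_Obj, so the string work is done on the copies and the integers stored in the big list should not be given string reps by the dump.

```tcl
# Hypothetical helper: turn a list of lists of integers into text,
# one row per line, without doing string operations on the stored values.
proc rows_to_text {rows} {
    set lines {}
    foreach row $rows {
        set copy {}
        foreach n $row {
            # format returns a new object; $n itself is only read as a number.
            lappend copy [format %d $n]
        }
        lappend lines [join $copy " "]
    }
    return [join $lines "\n"]
}

puts [rows_to_text {{1 2 3} {40 50}}]
# prints:
# 1 2 3
# 40 50
```

The copies are dropped after each row, so the extra memory at any moment is roughly one row's worth of strings.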


Very well explained, makes sense to me. Let's play around a little bit more.

   set t [string repeat "a" 1000]

Nr.1

   for {set x 0} {$x<50000} {incr x} { lappend a $t  }

-> mem allocated: 413696 (ok, only references)

Nr.2

   for {set x 0} {$x<50000} {incr x} { lappend a [list $t] }

-> mem allocated: 3153920 (not what I thought it would do)

Nr.3

   for {set x 0} {$x<50000} {incr x} { lappend a [split $t "\x000"] }

-> mem allocated: 55214080 (closer)

Nr.4

   for {set x 0} {$x<50000} {incr x} { eval "lappend a {$t}" }

-> mem allocated: 52420608 (nice)

-hans 2008-02-24

--- See also: