Some observations on behind-the-scenes actions of Tcl

2008-02-24

1. I keep a large collection of text data as a list in memory (lappend x {text...} etc)

2. I want to search this data.

Here's what happens.

  set match [lsearch -regexp $x {needle}]

-> memory usage of the tcl process more than doubles (before: 80MB, after: 200MB) (EDIT 2008-02-25: stays at 80 using -glob)

  foreach k $x { if {[regexp -nocase {needle} $k]} {puts "match"} }

-> ditto

  foreach k $x { if {[regexp -nocase {needle} [list $k]]} {puts "match"} }

-> Eureka! Total memory usage stays at 80MB.

I'm still not quite sure what's going on; I guess it's about keeping lists 'pure'. I'm now consulting these pages:

list, shimmering, pure list.

EDIT: please ignore the nonsense in the following paragraph; the confusion about what is a list and what isn't is gone now.

6am EDIT: I think I almost get it now. regexp treats $x as a string and forces every element into a string representation AS WELL as a list representation. Sigh. lsearch is forcing a string representation of all elements. Seems unavoidable. I could also lappend items as strings, not lists, but I'm saving my data as Tcl source for various reasons, and of course { } is much cleaner in that case.

-hans


I see a common misconception about Tcl here. Please notice using curly braces to quote a command argument *does not* turn the contents of the {...} into a list. IOW, your lappend x {text...} etc should result in exactly the same value as lappend x "text..." etc. As to the original issue of memory growth, I would be curious to read what the core maintainers have to say. Also interesting would be a -glob search. - RT 24Feb08
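
A minimal sketch of the point about quoting, using made-up sample text: braces and double quotes are only two ways of writing the same argument, so both lappend calls below append one identical element.

```tcl
set a {}
set b {}
lappend a {some text}      ;# braces: no substitution, one argument
lappend b "some text"      ;# quotes: substitution allowed, still one argument

puts [expr {$a eq $b}]     ;# prints 1: the two lists are the same value
puts [llength $a]          ;# prints 1: the braces did not split the text into words
```

The two forms only differ when the text contains $, [ ] or backslashes, which are substituted inside quotes but not inside braces.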

Ah, as for the glob search: indeed there's no memory doubling with that. Thanks for the hint. -hans 2008-02-25
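
For reference, a sketch of what the -glob variant might look like. The list contents here are invented sample data, and a glob pattern is only a substitute for the regexp when the needle is a simple literal.

```tcl
# Hypothetical sample data standing in for the large in-memory list.
set x {}
lappend x {first line of text} {a needle in a haystack} {more text}

# -glob matches against the whole element, so wrap the literal in * wildcards.
set idx [lsearch -glob $x {*needle*}]
puts $idx

# -all -inline returns every matching element instead of the first index.
set hits [lsearch -all -inline -glob $x {*needle*}]
puts $hits
```

For a case-insensitive search like the regexp -nocase loops above, lsearch also accepts -nocase, but only from Tcl 8.5 on.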


Lars H: Are you talking about the object getting a string representation, or getting a String internal representation? (They're not the same.) For

  foreach k $x { if {[regexp -nocase {needle} [list $k]]} {puts "match"} }

to make a difference, it seems it'd have to be the latter (in order to get a string rep for [list $k], one first needs the string rep of $k, but since [list $k] is not the same Tcl_Obj as $k, an intrep imposed upon [list $k] by regexp will not be retained in the elements of $x); that String intreps use 2 bytes for every character is consistent with a jump from 80MB to 200MB, if you have about 60M ASCII characters without intrep in your big list.
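
One way to look at this directly, sketched under the assumption that Tcl 8.6 or later is available: the ::tcl::unsupported::representation command (not present in 8.5, and with no guaranteed output format) reports the current internal representation of a value. The helper showRep and the sample strings are made up for illustration.

```tcl
# Sketch only: relies on an unsupported introspection command (Tcl 8.6+).
proc showRep {label value} {
    if {[llength [info commands ::tcl::unsupported::representation]]} {
        puts "$label: [::tcl::unsupported::representation $value]"
    } else {
        puts "$label: (representation command not available)"
    }
}

set x [list {first needle text} {second needle text}]
set a [lindex $x 0]
set b [lindex $x 1]

showRep "a before regexp" $a
regexp -nocase {needle} $a          ;# matches against the list element itself
showRep "a after direct regexp" $a

regexp -nocase {needle} [list $b]   ;# matches against a temporary object
showRep "b after regexp on a list-wrapped copy" $b
```

If the explanation above holds, the first element should report a changed internal representation after the direct regexp, while the second should be left as it was. The exact wording of the report differs between Tcl versions.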

I once had a similar difficulty, but the other way round, with a program that dumped a list of lists of integers to a text file. In order to generate a stringrep for a list of integers, Tcl first had to generate stringreps for all the integers, and that similarly doubled the memory usage during the final dump. In that case I solved it by feeding each integer through format before I did anything stringy to it, since that gave a Tcl_Obj that wasn't shared with the big list of lists of integers...
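
A rough sketch of that format trick, with a hypothetical proc name and toy data: each integer is passed through format, so the string is built on a fresh object rather than on the integer stored in the big list.

```tcl
# Hypothetical helper: turn a list of lists of integers into text,
# one row per line, without asking the stored integers for a string rep.
proc rowsToText {rows} {
    set lines {}
    foreach row $rows {
        set out {}
        foreach n $row {
            # format returns a new object; $n itself is only read as a number.
            lappend out [format %d $n]
        }
        lappend lines [join $out " "]
    }
    return [join $lines \n]
}

set data [list [list 1 2 3] [list 40 50 60]]
puts [rowsToText $data]
```

Writing row by row to the output channel instead of collecting all lines first would keep the peak memory lower still.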


I can see a lot clearer now. I think. Maybe. Internal string reps are always 2 bytes/char. But a "string intrep" is not an "int rep", right? I have to admit all this stuff and terminology (int reps, intreps, string intreps....) is mildly confusing.

YA EDIT 2:30am: Digging around Google Groups, I found this excellent explanation:

Objects may potentially have two different string reps, one of which is the UTF8 rep and the other of which is the UNICODE (UCS16) rep. What you say above is true only about the UTF8 rep (for various compatibility reasons); the other rep is only calculated in response to operations which are more efficient when performed on data with a uniform character size (e.g. getting the length of a string, taking a substring, or performing a regexp match.)

NOW things start to make sense. IMHO stuff like this belongs in the Tcl user docs.
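
To put rough numbers on the two reps, here is a back-of-the-envelope sketch. The proc name repSizes is made up, and the factor of 2 bytes per character is the figure assumed in this discussion, not something the script measures.

```tcl
# Estimate the payload size of a string in each representation.
proc repSizes {s} {
    set chars [string length $s]
    # Byte count of the UTF-8 encoding (equals the char count for pure ASCII).
    set utf8 [string length [encoding convertto utf-8 $s]]
    # Assumed fixed-width rep: 2 bytes for every character.
    set fixed [expr {2 * $chars}]
    return [list utf8 $utf8 fixedWidth $fixed]
}

puts [repSizes needle]    ;# prints: utf8 6 fixedWidth 12
```

Scaled up to about 60 million ASCII characters, this gives roughly 60MB for the UTF8 rep plus 120MB for the fixed-width rep, which fits the jump from 80MB to 200MB reported above.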

-hans 2008-02-25

Actually, I have been grepping around in the Tcl sources a bit now, and it seems the only place Tcl_UtfToUniCharDString is used is in the regexp module. String functions are not affected, I think.

-hans 2008-02-26

--- See also: