2008-02-24

   1. I keep a large collection of text data as a list in memory (lappend x {text...} etc.)
   2. I want to search this data.

Here's what happens.

    set match [lsearch -regexp $x {needle}]

-> memory usage of the Tcl process more than doubles (before: 80MB, after: 200MB) (EDIT 2008-02-25: stays at 80MB using -glob)

    foreach k $x { if {[regexp -nocase {needle} $k]} {puts "match"} }

-> ditto

    foreach k $x { if {[regexp -nocase {needle} [list $k]]} {puts "match"} }

-> Eureka! Total memory usage stays at 80MB.

I'm still not quite sure what's going on; it's about keeping lists 'pure', I guess. I'm now consulting these pages: [list] [shimmering] [pure list].

EDIT: please ignore the nonsense in the following paragraph; the confusion about what is a list and what isn't is gone now.

6am EDIT: I think I almost get it now. regexp treats $x as a string and forces every element into a string representation AS WELL as a list representation. Sigh. lsearch is forcing a string representation of all elements. Seems unavoidable. I could also lappend items as strings, not lists, but I'm saving my data as Tcl source for various reasons, and of course { } is much cleaner in that case. -[hans]

----

I see a common misconception about Tcl here. Please notice that using curly braces to quote a command argument *does not* turn the contents of the {...} into a list. IOW, your ''lappend x {text...} etc'' should result in exactly the same value as ''lappend x "text..." etc''.

As to the original issue of memory growth, I would be curious to read what the core maintainers have to say. Also interesting would be a -glob search. - [RT] 24Feb08

Ah, as for the glob search, indeed there's no memory doubling with that, thanks for the hint. -[hans] 2008-02-25

----

[Lars H]: Are you talking about the object getting a string representation, or getting a String internal representation? (They're not the same.)
For

    foreach k $x { if {[regexp -nocase {needle} [list $k]]} {puts "match"} }

to make a difference, it seems it'd have to be the latter. In order to get a string rep for `[[list $k]]`, one first needs the string rep of `$k`, but since `[[list $k]]` is not the same [Tcl_Obj] as `$k`, an intrep imposed upon `[[list $k]]` by [regexp] will not be retained in the elements of `$x`. That String intreps use 2 bytes for every character is consistent with a jump from 80MB to 200MB, if you have about 60M ASCII characters without intrep in your big list (60M characters at 2 bytes each is the extra 120MB).

I once had a similar difficulty, but the other way round, with a program that dumped a list of lists of integers to a text file. In order to generate a stringrep for a list of integers, Tcl first had to generate stringreps for all the integers, and that similarly doubled the memory usage during the final dump. In that case I solved it by feeding each integer through [format] before I did anything stringy to it, since that gave a [Tcl_Obj] that wasn't shared with the big list of lists of integers...

----

I can see a lot clearer now. I think. Maybe. '''Internal''' stringreps are always 2 bytes/char. But a "string intrep" is not an "int rep", right? I have to admit all this stuff and terminology (int reps, intreps, string intreps...) is mildly confusing.

YA EDIT 2:30am: Digging around Google Groups, I found this excellent explanation:

''Objects may potentially have two different string reps, one of which is the UTF8 rep and the other of which is the UNICODE (UCS16) rep. What you say above is true only about the UTF8 rep (for various compatability reasons); the other rep is only calculated in response to operations which are more efficient when performed on data with a uniform character size (e.g. getting the length of a string, taking a substring, or performing a regexp match.)''

NOW things start to make sense. IMHO stuff like this belongs in the Tcl user docs.
-[hans] 2008-02-25

Actually, I have been grepping around in the Tcl sources a bit now, and it seems the only place ''Tcl_UtfToUniCharDString'' is used is in the regexp module. String functions are not affected, I think. -[hans] 2008-02-26

----

See also:

   * [Compact Data Storage]

<>Uncategorized
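----

A minimal sketch of the workaround discussed on this page: run [regexp] against a fresh copy of each element rather than the element itself, so that whatever internal representation [regexp] creates is attached to the throwaway copy and not to the value stored in the big list. The proc names `searchDirect` and `searchCopy` and the sample data are hypothetical, made up for illustration. The copy is made with `format %s`, following the [format] trick mentioned above. Whether this actually saves memory is an assumption that depends on the Tcl version; the sketch only shows that both variants return the same matches, it does not measure anything.

```tcl
# Hypothetical sample data standing in for the large in-memory list.
set x {}
lappend x {first line of text}
lappend x {a line with a Needle in it}
lappend x {another line}

# Variant from the original post: regexp is handed the list element itself,
# so any internal representation it creates may stay attached to that element.
proc searchDirect {data pattern} {
    set hits {}
    foreach k $data {
        if {[regexp -nocase -- $pattern $k]} {
            lappend hits $k
        }
    }
    return $hits
}

# Workaround variant: regexp is handed a fresh copy made by [format],
# which is a different Tcl_Obj from the element stored in the list.
proc searchCopy {data pattern} {
    set hits {}
    foreach k $data {
        if {[regexp -nocase -- $pattern [format %s $k]]} {
            lappend hits $k
        }
    }
    return $hits
}

puts [searchDirect $x {needle}]
puts [searchCopy $x {needle}]
```

Unlike `[[list $k]]`, `format %s` does not add list quoting (braces or backslashes) to the copy, so anchored patterns such as `^a line` behave the same in both variants.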