This page collects tips and tricks for hunting memory leaks at the C level, in the Tcl core or in extensions. To track script-level leaks (lingering keys in global arrays, objects, channels, etc.), see the sibling page "Leak Hunt (Tcl level)".

-----

** Basic Valgrind Setup **

Valgrind is a very powerful tool. Its oldest component, "memcheck", is key to hunting leaks, buffer overflows, uninitialized values, etc. One nice thing is that it works on unmodified executables (no need to recompile in a dedicated mode). The basic incantation is:

======

  valgrind executable arguments...

======

When the process exits, valgrind prints a summary that classifies the dynamically allocated memory still in use: definitely lost, indirectly lost, possibly lost, or still reachable. Additional flags give further details; see valgrind(1).

** Valgrinding Tcl **

In the context of Tcl, a few things are worth noting for an efficient leak hunt:

   * The block allocator ("zippy") is very efficient, but also efficiently hides all individual Tcl_Obj leaks from valgrind... So it is important to use -DPURIFY, which switches back to a per-object malloc scheme.

   * Memdebug builds ([memory] command) instrument allocation in such a way that many things that should appear as "definitely lost" show up as "still reachable" (because they are all linked through the user-level heap metadata). So avoid --enable-symbols=mem|all.

   * Exit has been streamlined in 8.6 (see bug 2001201). That 'exit reform' follows the idea that when a program says [exit], it essentially wants the OS's exit(), rather than a complicated unwinding of all its data structures. So many finalization tasks in Tcl are now outside the normal [exit] codepath, to the benefit of both performance and stability (far fewer deadlocks on exit). As a result, valgrind on a normal build will usually show lots of "still reachable" blocks. However, as of 20110809, a -DPURIFY build requests a full finalization on exit: whatever valgrind reports after that is a true leak.

Taking these facts into account, the typical setup becomes:

======

  env CFLAGS=-DPURIFY ./configure --enable-symbols && make clean && make
  valgrind --leak-check=yes ./tclsh somescript.tcl args...

======

** Dealing the deathblow **

Once a leak is spotted, and valgrind has given you a detailed C-level stack trace of the point of allocation... of course it's time to switch to gdb. Different kinds of allocation call for different techniques, but a rather frequent class is that of refcounted things (like Tcl_Obj's). Here gdb's watchpoint tool is very useful:

   * set a breakpoint just after the allocation
   * set a watchpoint on ((ProperType *)0xHEX_ADDRESS_OF_OBJECT)->refCount
   * hit "cont" repeatedly and follow each Incr and Decr of the refcount
   * the count never hits zero (it's a leak, remember); it usually ends up at 1
   * replay the scenario, taking note of where each extra reference is stored on Incr, and which reference is discarded on each Decr
   * In the end you'll know whether:
   ** a Decr is simply missing
   ** a reference is mistakenly overwritten
   ** a reference is kept in other stuff that leaks too (in that case the block is "indirectly lost")
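
The second outcome above can be reproduced in miniature. The following is a minimal sketch, not Tcl core code: Obj, Slot, NewObj, Incr, Decr, SlotSet and RunScenario are all hypothetical names, standing in for a refcounted structure such as Tcl_Obj and for its refcount macros. It shows how overwriting a stored reference without a matching Decr leaves an object alive with a refCount of 1.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted structure such as Tcl_Obj. */
typedef struct Obj {
    int refCount;
    int value;
} Obj;

static int liveObjs = 0;        /* objects allocated and not yet freed */

static Obj *NewObj(int value) {
    Obj *o = malloc(sizeof *o);
    if (o == NULL) abort();
    o->refCount = 0;            /* new objects start unreferenced */
    o->value = value;
    liveObjs++;
    return o;
}

static void Incr(Obj *o) { o->refCount++; }

static void Decr(Obj *o) {
    assert(o->refCount > 0);
    if (--o->refCount == 0) {   /* last reference gone: free the block */
        free(o);
        liveObjs--;
    }
}

/* A slot owns one reference to the object it holds (or holds NULL). */
typedef struct Slot { Obj *held; } Slot;

/* Buggy: the previous reference is overwritten without a Decr. */
static void SlotSetLeaky(Slot *s, Obj *o) {
    Incr(o);
    s->held = o;
}

/* Correct: take the new reference first, then release the old one. */
static void SlotSet(Slot *s, Obj *o) {
    Incr(o);
    if (s->held != NULL) Decr(s->held);
    s->held = o;
}

static void SlotClear(Slot *s) {
    if (s->held != NULL) {
        Decr(s->held);
        s->held = NULL;
    }
}

/* Stores two objects in one slot, clears the slot, and returns how many
 * objects are still alive afterwards. */
int RunScenario(int leaky) {
    Slot s = { NULL };
    int before = liveObjs;
    Obj *a = NewObj(1);
    Obj *b = NewObj(2);
    if (leaky) {
        SlotSetLeaky(&s, a);
        SlotSetLeaky(&s, b);    /* a's reference is silently dropped */
    } else {
        SlotSet(&s, a);
        SlotSet(&s, b);         /* a is released here */
    }
    SlotClear(&s);
    return liveObjs - before;
}
```

RunScenario(1) leaves one object behind, stuck at refCount 1 with no pointer to it left anywhere; RunScenario(0) leaves none. A watchpoint on the leaked object's refCount would fire once (the Incr) and never again.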

** Reference Loops **

One Thing That Should Never Occur in Tcl is a cycle in the graph of references among Tcl_Objs. Indeed, since we depend on refcounting for memory management, a cycle is an absolute show stopper. Fortunately, by design, the language is strictly unable to produce such "reference loops": the copy-on-write principle prevents all in-place operations which would "close the loop". Any exception to this rule is a bug, not in the script, but in the core.

But here we're discussing core debugging, right? So these things may happen. It happened to me in Bug 3386417, where the newly introduced [info errorstack] (TIP #348) was somehow plugged back into a compiled scriptlet that it referred to. The interesting generalization is as follows:

If:
   * you get a "definitely lost" object
   * you know that references to it remain
   * you know that these referrers are still alive too, though they shouldn't be
   * these referrers end up in "indirectly lost" (not "still reachable", nor "definitely lost")

Then it is very likely that you have a Reference Loop.
Of course you're on your own to actually track it down, but merely knowing that such loops can exist may save you hours (it would have, in my case :/ ).
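
To see why refcounting alone cannot reclaim a loop, here is a minimal sketch under stated assumptions: Node, NewNode, Link, Release and LeakedAfterDrop are hypothetical names for illustration, not Tcl core structures. Each node owns one reference to another node; dropping all outside references frees a plain chain, but leaves a closed loop alive forever.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted node that may own a reference to another node. */
typedef struct Node {
    int refCount;
    struct Node *ref;           /* owned reference, or NULL */
} Node;

static int liveNodes = 0;       /* nodes allocated and not yet freed */

static Node *NewNode(void) {
    Node *n = malloc(sizeof *n);
    if (n == NULL) abort();
    n->refCount = 1;            /* the caller owns the initial reference */
    n->ref = NULL;
    liveNodes++;
    return n;
}

/* Drop one reference; on reaching zero, free the node and release
 * the reference it owned. */
static void Release(Node *n) {
    assert(n->refCount > 0);
    if (--n->refCount > 0) return;
    Node *next = n->ref;
    free(n);
    liveNodes--;
    if (next != NULL) Release(next);
}

/* Make 'from' own a reference to 'to'. */
static void Link(Node *from, Node *to) {
    to->refCount++;
    if (from->ref != NULL) Release(from->ref);
    from->ref = to;
}

/* Builds a -> b (and optionally b -> a), drops both outside references,
 * and returns how many nodes are still alive. */
int LeakedAfterDrop(int closeLoop) {
    int before = liveNodes;
    Node *a = NewNode();
    Node *b = NewNode();
    Link(a, b);                 /* a -> b */
    if (closeLoop) Link(b, a);  /* b -> a closes the loop */
    Release(b);
    Release(a);
    return liveNodes - before;
}
```

Without the loop, releasing a cascades into b and both are freed. With the loop, each node keeps the other at refCount 1, so LeakedAfterDrop(1) leaves two nodes that nothing outside can reach, which is the situation the checklist above is meant to detect.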

-----

** When valgrind isn't enough **

Sometimes, the above techniques are insufficient. For example, valgrind may pinpoint the full stack trace of the offending allocation... but if the same stack occurs many times in the program's lifespan, it is hard to identify the interesting occurrence.
In that case, switching back to Tcl's own memdebug mode is helpful: reverse the above advice and do

======
./configure --enable-symbols=mem
======

The [memory] command is then the tool of choice to get to individual leaked blocks. Particularly interesting is [memory onexit], since it takes a snapshot at roughly the same time as valgrind does. However, the result is slightly noisier than with valgrind, because the dump occurs slightly before the end of the universe (some guts must remain alive to allow the dump to occur).
<<categories>>Debugging