Leak Hunt (C level)

This page collects tips and tricks for hunting memory leaks at the C level, in the Tcl core or in extensions. To track script-level leaks (like lingering keys in global arrays, objects, channels, etc.), see sibling page "Leak Hunt (Tcl level)".

Description

Once Valgrind has spotted a leak and given you a detailed C-level stack trace of the point of allocation, it's time to switch to gdb. Different techniques are needed for different kinds of allocations, but a rather frequent class is that of refcounted things (like Tcl_Objs). Here gdb's watchpoints are very useful:

  • set a breakpoint just after the allocation
  • set a watchpoint on ((ProperType *)0xHEX_ADDRESS_OF_OBJECT)->refCount
  • hit "cont" repeatedly and follow each Incr and Decr of the refcount
  • the refcount never hits zero (it's a leak, remember); it usually ends up at 1
  • replay the scenario while taking note of where the extra refs are kept on Incr, and which are discarded on Decr
  • In the end you'll know whether:
    • a Decr is simply missing
    • a reference is mistakenly overwritten
    • a reference is kept in other stuff that leaks too (in that case the block is "indirectly lost")
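
To make the three outcomes concrete, here is a minimal sketch of the first two combined: a reference that is forgotten without its Decr. It is a toy model, not Tcl code. Thing, thing_incr, thing_decr and leaky_scenario are hypothetical names standing in for a refcounted core structure and its Incr/Decr operations. The messages on stderr are what you would reconstruct by hand from successive watchpoint hits.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a refcounted structure such as a Tcl_Obj. */
typedef struct Thing {
    int refCount;
    int freed;   /* set instead of calling free(), so the outcome can be inspected */
} Thing;

/* Take one reference; 'who' names the holder, as you would note it in gdb. */
static void thing_incr(Thing *t, const char *who) {
    t->refCount++;
    fprintf(stderr, "Incr by %-8s -> %d\n", who, t->refCount);
}

/* Drop one reference; the object is released when the count reaches zero. */
static void thing_decr(Thing *t, const char *who) {
    assert(t->refCount > 0);
    t->refCount--;
    fprintf(stderr, "Decr by %-8s -> %d\n", who, t->refCount);
    if (t->refCount == 0) {
        t->freed = 1;
    }
}

/* Replays a leaky scenario on *t and returns the final refcount. */
static int leaky_scenario(Thing *t) {
    t->refCount = 0;
    t->freed = 0;
    thing_incr(t, "creator");   /* reference held by the allocating code */
    thing_incr(t, "cache");     /* a second holder keeps its own reference */
    thing_decr(t, "creator");   /* the creator lets go, as it should */
    /* BUG (deliberate): the cache forgets its pointer without a Decr. */
    return t->refCount;
}
```

Calling leaky_scenario() on a Thing leaves the count stuck at 1 and the object never released, which is the typical end state described in the list above.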

Reference Loops

One thing that should never occur in Tcl is a cycle in the graph of references among Tcl_Objs. Indeed, since we depend on refcounting for memory management, a cycle is an absolute show stopper. Fortunately, by design, the language is strictly unable to produce such "reference loops": the copy-on-write principle prevents all in-place operations which would "close the loop". Any exception to this rule is a bug, not in the script, but in the core.

But here we're discussing core debugging, right? So these things may happen. It happened to me in issue 3386417, where the newly introduced info errorstack (TIP #348) was somehow plugged back into a compiled scriptlet that it referred to. The interesting generalization is as follows:

If:

  • you get a "definitely lost" object
  • you know that references to it remain
  • you know that these referrers are still alive too, though they shouldn't be
  • these referrers end up in "indirectly lost" (not "still reachable", nor "definitely lost")

Then it is very likely that you have a Reference Loop. Of course you're on your own to actually track it down, but merely knowing that such loops can exist may save you hours (it would have, in my case :/ ).
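
Why a loop defeats refcounting can be seen in a toy model. The sketch below is not Tcl code: Node, node_incr, node_decr, node_set_ref and drop_external_refs are hypothetical names, and each node owns at most one outgoing reference. Without the loop, dropping the external references frees both nodes; with the loop closed, each node keeps the other at refcount 1 forever.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted node owning at most one outgoing reference. */
typedef struct Node {
    int refCount;
    struct Node *ref;   /* owned reference to another node, or NULL */
    int freed;          /* set instead of calling free(), for inspection */
} Node;

static void node_incr(Node *n) {
    n->refCount++;
}

/* Drop one reference; on reaching zero, release the node and what it refers to. */
static void node_decr(Node *n) {
    assert(n->refCount > 0);
    if (--n->refCount == 0) {
        n->freed = 1;
        if (n->ref != NULL) {
            node_decr(n->ref);
        }
    }
}

/* Make 'from' hold a counted reference to 'to' (assumes from->ref was NULL). */
static void node_set_ref(Node *from, Node *to) {
    node_incr(to);
    from->ref = to;
}

/* Builds a -> b (and b -> a if closeLoop), then drops the external references.
 * Returns how many of the two nodes were released. */
static int drop_external_refs(int closeLoop, Node *a, Node *b) {
    Node zero = {0, NULL, 0};
    *a = zero;
    *b = zero;
    node_incr(a);               /* external reference to a */
    node_incr(b);               /* external reference to b */
    node_set_ref(a, b);
    if (closeLoop) {
        node_set_ref(b, a);     /* this closes the loop */
    }
    node_decr(a);               /* the outside world lets go of both */
    node_decr(b);
    return a->freed + b->freed;
}
```

With closeLoop set to 0 the function returns 2; with closeLoop set to 1 it returns 0 and both refcounts stay at 1, even though nothing outside the pair refers to them any more.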

When Valgrind Is Not Enough

Sometimes, the above techniques are insufficient. For example, Valgrind may pinpoint the full stack trace of the offending allocation... but if the same stack occurs many times in the program's lifespan, it is hard to identify the interesting one. In that case, switching back to Tcl's own memdebug mode is helpful: reverse the above advice and do

./configure --enable-symbols=mem

The memory command is then the tool of choice for getting at individual leaked blocks. Particularly interesting is memory onexit, since it takes a snapshot at roughly the same time as Valgrind does. However, the result is slightly noisier than with Valgrind, because it occurs slightly before the end of the universe (some guts remain to allow the dump to occur).

PYK 2018-05-20: It can also be helpful to call memory active right before a suspected trouble spot, once again immediately after, and then inspect a diff of the two results.

Case Histories

Re: <fossil-users> Version 1.33 - /reports failing, 2015-06-02
Not a leak, but a tricky bug that was initially attributed to a C compiler optimisation bug. It turned out to be an array with automatic storage duration, declared in a conditional block and then passed to another function which squirrels the pointer away in a variable with static duration. The automatic storage then goes out of scope and is reclaimed, leaving a dangling pointer in the static variable. Compilers don't catch this situation and valgrind often doesn't either, leaving human alertness as nearly the only defense. The fix is here.

AMG: Actually, Valgrind can show this class of problem.

#include <stdio.h>
int *global;
void f(void) {int local = 42; global = &local;}
void g(void) {printf("%d\n", *global);}
int main(void) {f(); g(); return 0;}

Build and run this, and "42" is probably displayed. Maybe. Call something else between f() and g(), and who knows what you'll get.

Run using valgrind --tool=memcheck, and this is the result:

==22929== Conditional jump or move depends on uninitialised value(s)
==22929==    at 0x4E834E0: vfprintf (in /lib64/libc-2.21.so)
==22929==    by 0x4E8A8E8: printf (in /lib64/libc-2.21.so)
==22929==    by 0x400675: g (in /home/andy/test)
==22929==    by 0x400685: main (in /home/andy/test)
==22929==
==22929== Use of uninitialised value of size 8
==22929==    at 0x4E7F4BB: _itoa_word (in /lib64/libc-2.21.so)
==22929==    by 0x4E837C8: vfprintf (in /lib64/libc-2.21.so)
==22929==    by 0x4E8A8E8: printf (in /lib64/libc-2.21.so)
==22929==    by 0x400675: g (in /home/andy/test)
==22929==    by 0x400685: main (in /home/andy/test)
==22929==
==22929== Conditional jump or move depends on uninitialised value(s)
==22929==    at 0x4E7F4C5: _itoa_word (in /lib64/libc-2.21.so)
==22929==    by 0x4E837C8: vfprintf (in /lib64/libc-2.21.so)
==22929==    by 0x4E8A8E8: printf (in /lib64/libc-2.21.so)
==22929==    by 0x400675: g (in /home/andy/test)
==22929==    by 0x400685: main (in /home/andy/test)
==22929==
==22929== Conditional jump or move depends on uninitialised value(s)
==22929==    at 0x4E83839: vfprintf (in /lib64/libc-2.21.so)
==22929==    by 0x4E8A8E8: printf (in /lib64/libc-2.21.so)
==22929==    by 0x400675: g (in /home/andy/test)
==22929==    by 0x400685: main (in /home/andy/test)
==22929==
==22929== Conditional jump or move depends on uninitialised value(s)
==22929==    at 0x4E835BA: vfprintf (in /lib64/libc-2.21.so)
==22929==    by 0x4E8A8E8: printf (in /lib64/libc-2.21.so)
==22929==    by 0x400675: g (in /home/andy/test)
==22929==    by 0x400685: main (in /home/andy/test)
==22929==
==22929== Conditional jump or move depends on uninitialised value(s)
==22929==    at 0x4E8364A: vfprintf (in /lib64/libc-2.21.so)
==22929==    by 0x4E8A8E8: printf (in /lib64/libc-2.21.so)
==22929==    by 0x400675: g (in /home/andy/test)
==22929==    by 0x400685: main (in /home/andy/test)

Lots of badness. But change the variable local to be static, so that it outlives the call to f(), and all the above warnings go away.
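
For reference, a sketch of the variant with a static local. It differs from the program above in two assumed details that are not in the original: g() also returns the value it prints, so that the result can be checked, and main() is left out.

```c
#include <assert.h>
#include <stdio.h>

static int *global;

/* 'local' has static storage duration here, so its address stays valid
 * after f() returns. */
static void f(void) {
    static int local = 42;
    global = &local;
}

/* Prints the pointed-to value and also returns it (the return value is an
 * addition for checking purposes). */
static int g(void) {
    assert(global != NULL);
    printf("%d\n", *global);
    return *global;
}
```

Calling f() and then g() now reads storage that is still alive, so there is nothing left for memcheck to complain about.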