[Techtalk] tracking down resource leaks

Fri Jan 5 08:35:21 UTC 2007

On Wed, Jan 03, 2007 at 10:06:48AM -0600, N Hospodarsky wrote:
> Hi All,
> 
> I have a question about resource monitoring. I have a RH server that
> is running proprietary RH software on it. It seems that one of the
> processes in that software is slowly sucking up CPU resources. I've
> been following it using Cricket (http://cricket.sourceforge.net/), and
> over the last month there's been a steady upward trend of CPU
> resources being used by %User.
> 
> I have been trying to get as much information as possible before
> opening a ticket with the vendor; I'm curious what you all generally
> do when attempting to track down resource leaks...so far I've
> narrowed it down to a python process, using the typical looking
> through logs, getting information from PS....and have used strace to
> minimally look for information...strace wasn't all that illuminating
> to me because the output was just a huge stream of:
> 
> futex(0x9fbc1c0, FUTEX_WAKE, 1)         = 0
> 
> which means nothing to me.
> 
> What else can I use to get information about a leaky process? Or is
> this information the best I can hope for with my non-python-programmer
> skillset?

Hi,

not sure I can contribute anything useful to this problem, but as no
one else has said anything so far, I'll just say something :)

If you had reported memory leakage problems, I would have recommended
tools like valgrind [1] -- but you haven't, so I won't ;)  And problems
with CPU usage are quite a different beast. I'm not aware of any general
purpose debugging tools for this, except maybe some profiler. It could
tell you how much time the program is spending in individual parts, like
subroutines, etc. But this typically only makes sense if you have the
sources (as you mention it's proprietary software, I presume you don't).
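To give an idea of what a profiler reports, here is a minimal sketch
using Python's standard cProfile and pstats modules. It only applies if
you can run the Python code in question yourself, which may not be the
case for the vendor's software. The function busy_work and the helper
profile_call are made up purely for illustration.

```python
import cProfile
import io
import pstats


def busy_work(n):
    """Hypothetical stand-in for the code whose CPU time is of interest."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func under the profiler; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # List the five entries with the highest cumulative time first.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buf.getvalue()


if __name__ == "__main__":
    result, report = profile_call(busy_work, 100000)
    print(report)
```

The report lists, per function, the number of calls and the time spent,
which is the kind of breakdown meant above.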

In case strace and ltrace aren't providing any useful info, there's
essentially only a general purpose debugger like gdb to resort to.
However, in order to debug a program that's essentially working, with
only some slow gradual degradation over time (caused by some yet
unknown part of the program), a lot of patience and expertise in the
proper handling of the debugger would be required. (Also, you can never
be sure that the debugging itself won't have a significant impact on
what you're observing, i.e. how the application is behaving.)

The fact that you're observing a gazillion futex calls [2] probably
doesn't mean much. They're most likely just reflecting the regular
synchronization activity of some threaded program, or some such.
And just in case there should really be a problem at this level, it's
not something you'd want to debug yourself, most certainly not without
access to the source code of the application that's causing the
problems...
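If you want to see what else is hiding in that stream of futex lines,
one option is to save the strace output to a file and count the calls
by name. The following is a rough sketch, assuming lines in the usual
"name(args) = result" form, optionally prefixed with "[pid N]"; the
function name count_syscalls and the file name strace.log are made up
for illustration. (strace's own -c option produces a similar summary.)

```python
import re
from collections import Counter

# Matches e.g. "futex(0x9fbc1c0, FUTEX_WAKE, 1) = 0", with an optional
# "[pid 1234] " prefix as printed when strace follows several processes.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?([A-Za-z_]\w*)\(")


def count_syscalls(lines):
    """Count system calls by name in an iterable of strace output lines."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
        # Anything else (signals, "<... resumed>" lines) is ignored.
    return counts


if __name__ == "__main__":
    # Assumption: the trace was saved to a file called strace.log.
    with open("strace.log") as handle:
        for name, number in count_syscalls(handle).most_common():
            print(number, name)
```

If futex still dominates after counting, that supports the reading that
the trace mostly shows thread synchronization, and the few remaining
calls (reads, writes, selects) are the ones worth a closer look.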

What you could try is the following: based on your knowledge of what
the software needs to accomplish, and how it might go about doing it at
the implementation level, come up with ideas about which aspects external to
the program might be involved, and whether those aspects might either
help shed some light on what the program is doing wrong, or whether it
simply is something outside of the program that's forcing it to do more
and more work the longer it is running. It might not be the program's
fault after all. Then develop test scenarios to verify those hypotheses.
Sounds a little abstract, I know, but as I haven't got the foggiest
idea what the software is for, you'll understand I can't help you
generate hypotheses...

Anyway, maybe it's best to just go right ahead and delegate the issue
to RH. They should be the experts for their software.  So, what I would
do is

* make sure I'm not overlooking something silly (I think you're past
  this step already)

* collect evidence that there is a problem, and how it manifests

* open a call with the vendor, and pass on the collected info

As to step two, you could periodically run top from a cronjob to gather
general resource usage info (if you haven't done so already). Something
like "(date; top -b -n 1) >>top.stats". (The -b runs top in batch mode,
i.e. non-interactively, and -n 1 makes it exit after one iteration.)
It's probably helpful to extend this to also log
other info you _suspect_ might have to do with the problem (for example
size of files being used, number/status of open sockets, etc.).
Then, at the end of the observation period, maybe run some script over
the results to filter out irrelevant stuff.
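Such a filter script could look roughly like the sketch below. It
assumes the log was written by the "(date; top -b -n 1)" command above,
so that each snapshot consists of one line from date followed by top's
output, and that top prints a header line containing PID, %CPU and
COMMAND. The function name cpu_per_snapshot is made up for illustration,
and the column layout of your top version may differ.

```python
def cpu_per_snapshot(text, command):
    """Return (timestamp, summed %CPU) per snapshot for a command name."""
    samples = []
    columns = None
    previous = ""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("top - "):
            # New snapshot; the line written by date comes right before.
            samples.append([previous, 0.0])
            columns = None
        elif stripped.startswith("PID") and "%CPU" in stripped:
            # Header of the process table; remember the column names.
            columns = stripped.split()
        elif columns and stripped and samples:
            fields = stripped.split(None, len(columns) - 1)
            if len(fields) == len(columns):
                row = dict(zip(columns, fields))
                # Compare only the first word of COMMAND.
                if row["COMMAND"].split()[0] == command:
                    try:
                        samples[-1][1] += float(row["%CPU"])
                    except ValueError:
                        pass  # not a number; skip this row
        if stripped:
            previous = stripped
    return [tuple(sample) for sample in samples]


if __name__ == "__main__":
    with open("top.stats") as handle:
        for timestamp, cpu in cpu_per_snapshot(handle.read(), "python"):
            print(timestamp, cpu)
```

This prints one line per snapshot with the %CPU of all matching
processes added up, which is easy to plot or to eyeball for a trend.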

And, if all else fails to resolve the issue, there's still the good ol'
"restart once a day" strategy.  (Under Linux you usually only need to
restart the problematic process, not the entire OS -- but you never
know...).  Not really satisfactory for those with a keen sense of
beauty-in-IT, but at least pragmatic.

Well, as I warned you up-front, nothing really enlightening :)
But good luck anyway,

Almut

[1] http://valgrind.org/

[2] futexes are typically used to synchronize usage of memory and other
resources that are shared between threads or processes.
Just in case you want to learn more about this tricky kind of stuff,
there's a good article by Ulrich Drepper (not an easy read, though):
http://people.redhat.com/drepper/futex.pdf