[Techtalk] Diagnosing server problems

Rachel McConnell rachel at xtreme.com
Tue Sep 27 09:43:07 EST 2005


Hi all,

I have a server machine which periodically ... hangs, slows, or 
something.  For a minute here, or ten minutes there, I and others cannot:

* access web applications running on it
* ssh into it (times out)

I can't tell if these are times when NOTHING anywhere can get through to 
it, or if they are times when some users can get through after a bit of 
a wait, but others can't, as if it were under extremely heavy load. 
I haven't done any real server management before, but there isn't
anyone else left to do it anymore, just me.
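One way to separate "nobody can get through" from "some people can, slowly" is to log timed connection attempts from two or three machines on different networks (say, one at the colo and one at the office) and compare the logs afterwards. Below is a minimal sketch of such a logger, assuming Python 3 on the probing machines. The script name probe_log.py, the default ports, the timeout and the interval are all hypothetical choices, not anything from the original setup.

```python
import socket
import sys
import time
from datetime import datetime
from typing import Tuple

DEFAULT_PORTS = (22, 80)  # ssh and plain HTTP; hypothetical, add your web apps' ports
TIMEOUT = 10.0            # seconds before an attempt counts as a failure
INTERVAL = 30.0           # seconds between rounds of probes


def probe(host: str, port: int, timeout: float) -> Tuple[bool, float]:
    """Try one TCP connect; return (succeeded?, elapsed seconds)."""
    start = time.monotonic()
    try:
        # Only the completed handshake matters; the socket is closed
        # again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:  # refused, unreachable, or timed out
        ok = False
    return ok, time.monotonic() - start


def watch(host: str, ports: Tuple[int, ...]) -> None:
    """Probe every port once per INTERVAL, forever, printing one line each."""
    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        for port in ports:
            ok, elapsed = probe(host, port, TIMEOUT)
            status = "ok" if ok else "FAIL"
            print(f"{stamp} {host}:{port} {status} {elapsed:.2f}s", flush=True)
        time.sleep(INTERVAL)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 probe_log.py HOST [PORT ...]
    ports = tuple(int(p) for p in sys.argv[2:]) or DEFAULT_PORTS
    watch(sys.argv[1], ports)
```

For example, `python3 probe_log.py 192.0.2.10 22 80 > probe.log` (192.0.2.10 is a placeholder address). If every vantage point logs FAIL at the same timestamps, the box or its uplink is likely stalling for everyone. If only some do, or the elapsed times merely climb, the problem is more likely on the path or in load.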

Anyway, I have some vague thoughts on why this might be happening, but 
no real idea how to test any of my theories.

For example, does the box have enough memory?  The following is from the 
headers of top, shortly after one of these "slow" times:

  16:20:24  up 110 days, 12:13,  1 user,  load average: 0.00, 0.00, 0.00
238 processes: 237 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
            total    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%   99.8%
            cpu00    0.0%    0.0%    0.3%   0.0%     0.0%    0.1%   99.4%
            cpu01    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
            cpu02    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
            cpu03    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
Mem:  3094228k av, 3002716k used,   91512k free,       0k shrd,  234328k buff
                    1937528k actv,    5820k in_d,   46812k in_c
Swap: 4192956k av,  130932k used, 4062024k free                 1343584k cached

Obviously the CPUs aren't being strained at all.  But do the memory data 
indicate heavy usage or is 91512k free actually perfectly adequate?  Am 
I even reading this correctly?
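A common rule of thumb for reading these numbers on Linux is that "buff" and "cached" are memory the kernel lends to disk caching and hands back when programs need it. On that reading, "free" alone understates what is available. The `free` command's "-/+ buffers/cache" line reports the same sum. Below is a small sketch of that arithmetic, using the figures from the top header above. The helper names are hypothetical, and `parse_meminfo` assumes the usual "Key:  12345 kB" lines of /proc/meminfo.

```python
def effective_free_kb(free_kb: int, buffers_kb: int, cached_kb: int) -> int:
    """Free memory plus reclaimable buffers and page cache, in kB.

    Assumes buffers and page cache can be reclaimed on demand, which is
    the usual Linux behaviour but not a hard guarantee for every page.
    """
    return free_kb + buffers_kb + cached_kb


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style lines ("Key:   12345 kB") into {key: kB}."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


if __name__ == "__main__":
    # Figures copied from the top header above (all in kB).
    total, free, buffers, cached = 3094228, 91512, 234328, 1343584
    avail = effective_free_kb(free, buffers, cached)
    print(f"effectively free: {avail}k of {total}k ({100 * avail / total:.0f}%)")
    # prints: effectively free: 1669424k of 3094228k (54%)
```

On a live machine, `parse_meminfo(open("/proc/meminfo").read())` should supply the `MemFree`, `Buffers` and `Cached` values directly. By this estimate, a little over half the RAM was reclaimable at the time of that snapshot, and only about 128 MB of the 4 GB of swap was in use.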

Some of the other possible things I can think of are
* insufficiently frequent garbage collection by the Java web apps 
running on it
* heavy usage on other machines at the colo that share bandwidth
* misconfigured DNS somewhere that might be causing delay for some users
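The DNS theory can be checked by timing the name lookup separately from the TCP connect. If users who see delays spend seconds in the lookup but milliseconds in the connect, the resolver is a better suspect than the server. Below is a minimal sketch, assuming Python 3 on the client side. The function and script names are hypothetical, and it only measures the resolver of whichever machine runs it.

```python
import socket
import sys
import time
from typing import Tuple


def timed_lookup_and_connect(name: str, port: int,
                             timeout: float = 10.0) -> Tuple[float, float]:
    """Return (seconds resolving name, seconds completing the TCP connect).

    Lookup failures raise socket.gaierror; connect failures and timeouts
    raise OSError, so a caller can tell which phase went wrong.
    """
    t0 = time.monotonic()
    infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    t1 = time.monotonic()
    # Use the first address returned, as most simple clients would.
    family, socktype, proto, _canonname, addr = infos[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(addr)
    t2 = time.monotonic()
    return t1 - t0, t2 - t1


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python3 dns_vs_connect.py HOSTNAME PORT
    lookup_s, connect_s = timed_lookup_and_connect(sys.argv[1], int(sys.argv[2]))
    print(f"lookup {lookup_s:.3f}s, connect {connect_s:.3f}s")
```

Having a user who sees the delay and one who doesn't each run it against the server's hostname should show which phase the wait is in.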

Surely there are other possibilities as well.  Any thoughts of any kind 
are appreciated!

Rachel

