[Techtalk] Diagnosing server problems

Tue Sep 27 10:21:47 EST 2005

I would be very interested in people's take on how to track this
down. I have some similar issues on a lightly loaded box - and the
monitoring I am doing on it is inadequate for finding the
problem. Despite that negative recommendation, Rachel, if you have
another box you can use for collecting and graphing the stats, I would
suggest installing Orcaware (http://www.orcaware.com/orca/) on the
problem box and running the appropriate data gatherer (procallator if
this is a Linux box), then rsyncing or scping the generated files off
to another machine for analysis.

From what I have read, just because you don't have a lot of memory
listed as explicitly "free" doesn't necessarily mean you are short
on memory. If there is plenty of memory, then the OS should use it as
disk cache to avoid IO time. But I am not sure exactly what stat one
should monitor to distinguish appropriate cache use vs. a real lack of
memory.
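
As a rough illustration of that arithmetic, here is a small Python
sketch using the figures from the top header in Rachel's message quoted
below. It assumes that buffers and page cache can be reclaimed on
demand, which is only approximately true (dirty pages and shared memory
counted as cache cannot simply be dropped), so treat the result as an
upper bound. The function name is just something I made up for the
example.

```python
def reclaimable_kb(free_kb, buffers_kb, cached_kb):
    """Rough upper bound on memory obtainable without swapping.

    Assumption: buffers and page cache are fully reclaimable, which
    overstates things when some cache is dirty or shared memory.
    """
    return free_kb + buffers_kb + cached_kb


# Figures copied from the quoted top header, in kB.
total_kb = 3094228
free_kb = 91512
buffers_kb = 234328
cached_kb = 1343584

available_kb = reclaimable_kb(free_kb, buffers_kb, cached_kb)
pct = 100.0 * available_kb / total_kb
print(f"{available_kb} kB of {total_kb} kB ({pct:.0f}%) is free or cache")
```

By that estimate roughly half the RAM is either free or holding cache,
so the small "free" number alone would not suggest a memory shortage.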

http://www.orcaware.com/orca/stats/procallator/procallator_gauge_mem_used_pct,procallator_gauge_mem_free_pct,procallator_gauge_mem_shrd_pct,procallator_gauge_mem_buff_pct,procallator_gauge_mem_cchd_pct-daily.html

Quoting Rachel McConnell <rachel at xtreme.com>:
> Hi all,
> 
> I have a server machine which periodically ... hangs, slows, or 
> something.  For a minute here, or ten minutes there, I and others cannot:
> 
> * access web applications running on it
> * ssh into it (times out)
> 
> I can't tell if these are times when NOTHING anywhere can get through to 
> it, or if they are times when some users can get through after a bit of 
> a wait, but others can't, as if it were under extremely heavy load. 
> I've not previously done any real server management, but there isn't 
> anyone else any more to do it, just Me.
> 
> Anyway, I have some vague thoughts on why this might be happening, but 
> no real idea how to test any of my theories.
> 
> For example, does the box have enough memory?  The following is from the 
> headers of top, shortly after one of these "slow" times:
> 
>  16:20:24  up 110 days, 12:13,  1 user,  load average: 0.00, 0.00, 0.00
> 238 processes: 237 sleeping, 1 running, 0 zombie, 0 stopped
> CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
>            total    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%   99.8%
>            cpu00    0.0%    0.0%    0.3%   0.0%     0.0%    0.1%   99.4%
>            cpu01    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
>            cpu02    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
>            cpu03    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
> Mem:  3094228k av, 3002716k used,   91512k free,       0k shrd,  234328k buff
>                    1937528k actv,    5820k in_d,   46812k in_c
> Swap: 4192956k av,  130932k used, 4062024k free                 1343584k cached
> 
> Obviously the CPUs aren't being strained at all.  But do the memory data 
> indicate heavy usage or is 91512k free actually perfectly adequate?  Am 
> I even reading this correctly?
> 
> Some of the other possible things I can think of are
> * insufficiently frequent garbage collection by the Java web apps 
> running on it
> * heavy usage on other machines at the colo that share bandwidth
> * misconfigured DNS somewhere that might be causing delay for some users
> 
> Surely there are other possibilities as well.  Any thoughts of any kind 
> are appreciated!
> 
> Rachel

-- 
Cynthia N. Kiser
cnk at ugcs.caltech.edu