[Techtalk] Diagnosing server problems

Thu Sep 29 09:07:17 EST 2005

Thanks for the advice all!  I've done some of it and have questions on 
other bits...  No solution yet, that I can see.  If anyone has further 
recommendations, please post them.

Katherine Spice wrote:
> 1. check that no other machine on the network has the same IP address as
> your server - this can cause all kinds of odd behaviour!

Nope, each machine has several IP addresses but there are no duplicates 
amongst them.

> 2. check the process limit set for the machine. In the 2.2 kernel the
> default for this was set to 512 process total (256 per user) - which
> isn't loads on a busy server and if the process table fills up, no new
> processes can spawn.

How would I do this?

> Do you have console access? If so, are you seeing any messages to it
> during the times when the problem occurs?

I'm not sure what console access means.  It's a headless box at a 
colo.  When I am notified by a user that they're having a latency issue, 
I usually try to ssh to it.  Often I can't.

R. Daneel Olivaw wrote:
 > Try logging in locally. This will tell you if the problem is network
 > related or system related.

By locally, do you mean go to the colo, put a head on it, and log in? 
The outages tend to last two to ten minutes, I'd have to be very lucky 
to get to the colo during an event.  During an event, or to be precise, 
shortly after I've been notified that others are having problems, I can 
usually ssh to other machines at the same colo on the same network - but 
occasionally I can't even do that.

 > Try 'atop', it's a more advanced program that also shows you network
 > throughput by interface and quite some more details.

Is it worth installing for someone who has never installed anything on a 
Linux box before (that's me, yep)?  The server doesn't currently have atop.

 >>Obviously the CPUs aren't being strained at all.  But do the memory
 >>data indicate heavy usage or is 91512k free actually perfectly
 >>adequate?  Am I even reading this correctly?
 >
 > This is only the 'real' free memory, the system uses free memory for
 > caching and buffering so the maximum amount of physical ram is used to
 > enhance performance. Usually, a memory shortage is indicated by heavy
 > use of swap. In and Out swapping also reduces system performance.

Any suggestions on where I might read up on swap and general memory 
info?  I don't entirely follow the above but I figure I ought to learn it.

 >>* heavy usage on other machines at the colo that share bandwidth
 >
 > using hubs or switches ?

Have eliminated this, at least for now.  The colo provides some logging 
and our peak bandwidth usage is about 10% of our allowance.  There isn't 
heavy usage on any of the boxes there.

 > Try also looking into /var/log/messages ...

I see a lot of stuff like this:

Sep 28 08:59:21 elcapitan kernel: IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 DST=69.59.189.127 LEN=229 TOS=0x00 PREC=0x00 TTL=128 ID=28654 PROTO=UDP SPT=138 DPT=138 LEN=209
Sep 28 09:04:00 elcapitan kernel: IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 DST=69.59.189.127 LEN=235 TOS=0x00 PREC=0x00 TTL=128 ID=28711 PROTO=UDP SPT=138 DPT=138 LEN=215
Sep 28 09:11:20 elcapitan kernel: IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 DST=69.59.189.127 LEN=229 TOS=0x00 PREC=0x00 TTL=128 ID=28800 PROTO=UDP SPT=138 DPT=138 LEN=209

which I don't know how to interpret.  There is a message every 5-10 
minutes, though, which implies that this is Normal.
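
For what it's worth, these lines have the shape of netfilter (iptables 
LOG rule) entries, and a rough Python sketch like the one below can pull 
the KEY=value fields apart.  The function names are made up for 
illustration.  Reading port 138 as NetBIOS datagram traffic relies on 
the standard port assignment, and reading the first six bytes of MAC= as 
the destination address relies on the usual layout of that field; 
neither has been confirmed for this particular server.

```python
def parse_netfilter_line(line):
    """Split one kernel log line into a dict of its KEY=value fields."""
    fields = {}
    for token in line.split():
        if "=" in token:
            key, _, value = token.partition("=")
            # LEN appears twice (IP length, then UDP length); keep the first.
            fields.setdefault(key, value)
    return fields


def describe(fields):
    """Return a few hedged observations about a parsed line."""
    notes = []
    # Assumption: MAC= lists the destination address first, then the source.
    if fields.get("MAC", "").lower().startswith("ff:ff:ff:ff:ff:ff"):
        notes.append("link-layer broadcast")
    # UDP port 138 is the registered NetBIOS datagram service port.
    if fields.get("PROTO") == "UDP" and fields.get("DPT") == "138":
        notes.append("NetBIOS datagram service (UDP port 138)")
    return notes


sample = (
    "Sep 28 08:59:21 elcapitan kernel: IN=eth0 OUT= "
    "MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 "
    "DST=69.59.189.127 LEN=229 TOS=0x00 PREC=0x00 TTL=128 ID=28654 PROTO=UDP "
    "SPT=138 DPT=138 LEN=209"
)

fields = parse_netfilter_line(sample)
print(fields["SRC"], "->", fields["DST"], describe(fields))
```

If that reading holds, the entries would be periodic broadcasts from 
another host on the same subnet rather than anything the server itself 
is doing, which would fit with them showing up every few minutes.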

 > Else, use webmin's "system status" module to monitor local services and
 > network connectivity (ping/http/...) from inside the server and raise
 > mail alerts automatically (the server will queue e-mails if not
 > connected). Also, make sure the server hasn't just 'booted' (use
 > 'uptime' command).

Uptime is over 100 days.  Can you elaborate on the other bits?

Kenneth Gonsalves wrote:
 > distro? Redhat9 by any chance?

<displays ignorance>
How would I tell what distro?  I've seen some systems that tell you what 
they are on login, and I'm sure it would say onscreen if I went there & 
restarted it... but surely there's an easier way!

Mary wrote:
 > Have a look at the output of vmstat, particularly to see if there's a
 > lot of IO activity. That can mean that things are being swapped in and
 > out to the hard disk a lot, which slows things down very dramatically.

I wrote a small shell wrapper around vmstat, to write its output to a 
file about every 15 seconds.  An example of its output is this:

   procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa
 0  1 130928 171116 262032 1387556    0    0     0   754  351   611  1  0 92  7
 0  0 130928 170580 262032 1387764    0    0     0   851  368   637  1  0 91  7
 0  1 130928 171048 262032 1387964    0    0     0   749  350   595  1  0 93  6
 0  0 130928 170092 262032 1388148    0    0     0   868  372   655  1  1 90  8
 0  0 130928 169020 262032 1388348    0    0     0   749  351   645  1  0 93  6
 0  0 130928 168824 262032 1388488    0    0     0   820  370   660  1  0 92  7
 0  0 130928 170276 262036 1388636    0    0     0  1752  350   917  1  0 86 12
 0  0 130928 170436 262036 1388884    0    0     0   772  350   642  1  1 92  7
 0  0 130928 170268 262036 1389068    0    0     0   896  383   703  1  1 90  8
 0  0 130928 169768 262036 1389284    0    0     0   747  349   652  1  0 92  6

If I'm interpreting this right, it's not swapping (the si and so 
columns are nearly always 0, with very occasional values of 1 - less 
than 1% of the time).  I don't really get how to interpret the memory 
values, as noted above.
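
To make that check less of an eyeball exercise, here is a minimal Python 
sketch that summarizes logged vmstat rows.  It assumes the 16-column 
layout shown above, the names parse_vmstat and summarize are 
hypothetical, and treating free + buff + cache as "memory that could be 
made available" is only a rough approximation, since not all cache can 
necessarily be reclaimed.

```python
COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
           "bi", "bo", "in", "cs", "us", "sy", "id", "wa"]


def parse_vmstat(text):
    """Return one dict per data row, skipping header or malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == len(COLUMNS) and all(p.isdigit() for p in parts):
            rows.append(dict(zip(COLUMNS, map(int, parts))))
    return rows


def summarize(rows):
    """Summarize swap activity, rough available memory, and I/O wait."""
    if not rows:
        raise ValueError("no vmstat data rows found")
    swapping = sum(1 for r in rows if r["si"] or r["so"])
    return {
        "samples": len(rows),
        # Fraction of samples with any swap-in or swap-out activity.
        "swapping_fraction": swapping / len(rows),
        # Rough estimate only: free memory plus buffers plus page cache.
        "min_available_kb": min(r["free"] + r["buff"] + r["cache"] for r in rows),
        "max_wa": max(r["wa"] for r in rows),
    }


# Three rows copied from the output above.
SAMPLE = """\
 r  b   swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa
 0  1 130928 171116 262032 1387556    0    0     0   754  351   611  1  0 92  7
 0  0 130928 170276 262036 1388636    0    0     0  1752  350   917  1  0 86 12
 0  0 130928 169768 262036 1389284    0    0     0   747  349   652  1  0 92  6
"""

print(summarize(parse_vmstat(SAMPLE)))
```

Run over a whole log file, a swapping_fraction near zero together with a 
large min_available_kb would support the reading that memory is not the 
bottleneck, while max_wa points at the samples worth lining up against 
the times users reported problems.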

It was suggested to me that I might be seeing a database lock problem, 
but surely if this were causing my problem, it would be because it was 
taking up all the system resources?  Does anyone think this is worth 
investigating?

Thanks much,
Rachel