[Techtalk] Diagnosing server problems
Rachel McConnell
rachel at xtreme.com
Thu Sep 29 09:07:17 EST 2005
Thanks for the advice, all! I've tried some of it and have questions about
the rest. No solution yet, as far as I can see. If anyone has further
recommendations, please post them.
Katherine Spice wrote:
> 1. check that no other machine on the network has the same IP address as
> your server - this can cause all kinds of odd behaviour!
Nope, each machine has several IP addresses but there are no duplicates
amongst them.
> 2. check the process limit set for the machine. In the 2.2 kernel the
> default for this was set to 512 processes total (256 per user) - which
> isn't loads on a busy server, and if the process table fills up, no new
> processes can spawn.
How would I do this?
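One rough way to check this from a shell account, sketched in Python. It assumes a Linux box with /proc mounted; the file /proc/sys/kernel/threads-max is only present on kernels that expose it, and the helper names here are made up for illustration.

```python
import os


def count_processes(proc_entries):
    """Count names that look like PIDs (all digits), as listed in /proc."""
    return sum(1 for name in proc_entries if name.isdigit())


def read_int_file(path):
    """Return the first integer stored in a file, or None if it is unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def describe_limit(value, infinity):
    """Render an rlimit value, showing the 'no limit' marker as text."""
    return "unlimited" if value == infinity else str(value)


if __name__ == "__main__":
    # Unix-only module, imported here so the helpers above stay portable.
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    inf = resource.RLIM_INFINITY
    print("per-user process limit (soft):", describe_limit(soft, inf))
    print("per-user process limit (hard):", describe_limit(hard, inf))

    # System-wide ceiling, if this kernel exposes it (an assumption).
    print("threads-max:", read_int_file("/proc/sys/kernel/threads-max"))

    # How many processes exist right now, for comparison with the limits.
    print("processes running now:", count_processes(os.listdir("/proc")))
```

If the running count sits close to either limit during an outage, the process table is a plausible suspect; if it stays far below, it probably is not.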
> Do you have console access? If so, are you seeing any messages to it
> during the times when the problem occurs?
I'm not sure what console access means.... It's a headless box at a
colo. When I am notified by a user that they're having a latency issue,
I usually try to ssh to it. Often I can't.
R. Daneel Olivaw wrote:
> Try logging in locally. This will tell you if the problem is
> network-related or system-related.
By locally, do you mean go to the colo, put a head on it, and log in?
The outages tend to last two to ten minutes, so I'd have to be very lucky
to get to the colo during an event. During an event, or to be precise,
shortly after I've been notified that others are having problems, I can
usually ssh to other machines at the same colo on the same network - but
occasionally I can't even do that.
> Try 'atop', it's a more advanced program that also shows you network
> throughput by interface and quite some more details.
Is it worth attempting for someone who's never installed anything on a
Linux box before (that's me, yep)? The server doesn't currently have atop.
>>Obviously the CPUs aren't being strained at all. But do the memory
>>data indicate heavy usage or is 91512k free actually perfectly
>>adequate? Am I even reading this correctly?
>
> This is only the 'real' free memory; the system uses otherwise free
> memory for caching and buffering, so that as much physical RAM as
> possible is used to enhance performance. Usually, a memory shortage is
> indicated by heavy use of swap. Swapping in and out also reduces system
> performance.
Any suggestions on where I might read up on swap and general memory
info? I don't entirely follow the above but I figure I ought to learn it.
>>* heavy usage on other machines at the colo that share bandwidth
>
> Using hubs or switches?
Have eliminated this, at least for now. The colo provides some logging
and our peak bandwidth usage is about 10% of our allowance. There isn't
heavy usage on any of the boxes there.
> Try also looking into /var/log/messages ...
I see a lot of stuff like this:
Sep 28 08:59:21 elcapitan kernel: IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 DST=69.59.189.127 LEN=229 TOS=0x00 PREC=0x00 TTL=128 ID=28654 PROTO=UDP SPT=138 DPT=138 LEN=209
Sep 28 09:04:00 elcapitan kernel: IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 DST=69.59.189.127 LEN=235 TOS=0x00 PREC=0x00 TTL=128 ID=28711 PROTO=UDP SPT=138 DPT=138 LEN=215
Sep 28 09:11:20 elcapitan kernel: IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 DST=69.59.189.127 LEN=229 TOS=0x00 PREC=0x00 TTL=128 ID=28800 PROTO=UDP SPT=138 DPT=138 LEN=209
which I don't know how to interpret. There is a message every 5-10
minutes, though, which implies that this is Normal.
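For reading entries like these, a small sketch that splits one line into its KEY=value fields. It assumes the lines come from a kernel packet-filter logging rule, whose MAC field is the destination MAC, then the source MAC, then the EtherType; the function names are invented for illustration. UDP port 138 is the NetBIOS datagram service, and a destination MAC of ff:ff:ff:ff:ff:ff is an Ethernet broadcast, so these look like routine broadcast chatter from another host that the firewall happens to log.

```python
import re

# One KEY=value pair; the value may be empty, as in "OUT=".
FIELD = re.compile(r"(\w+)=(\S*)")


def parse_packet_log_line(line):
    """Return the KEY=value fields of one log line as a dict."""
    fields = {}
    for key, value in FIELD.findall(line):
        # LEN appears twice (IP length, then UDP length); keep the first.
        fields.setdefault(key, value)
    return fields


def split_mac(mac_field):
    """Split the combined MAC field into destination, source and EtherType."""
    parts = mac_field.split(":")
    return {
        "dst": ":".join(parts[0:6]),
        "src": ":".join(parts[6:12]),
        "ethertype": "".join(parts[12:14]),
    }


SAMPLE = (
    "Sep 28 08:59:21 elcapitan kernel: IN=eth0 OUT= "
    "MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 "
    "DST=69.59.189.127 LEN=229 TOS=0x00 PREC=0x00 TTL=128 ID=28654 "
    "PROTO=UDP SPT=138 DPT=138 LEN=209"
)

if __name__ == "__main__":
    fields = parse_packet_log_line(SAMPLE)
    mac = split_mac(fields["MAC"])
    print("from", fields["SRC"], "to", fields["DST"])
    print("protocol", fields["PROTO"], "destination port", fields["DPT"])
    print("broadcast frame:", mac["dst"] == "ff:ff:ff:ff:ff:ff")
```

Counting how often each SRC and DPT pair appears around the time of an outage would show whether the logged traffic changes when the latency does.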
> Else, use webmin's "system status" module to monitor local services and
> network connectivity (ping/http/...) from inside the server and raise
> mail alerts automatically (the server will queue e-mails if not
> connected). Also, make sure the server hasn't just 'booted' (use
> 'uptime' command).
Uptime is over 100 days. Can you elaborate on the other bits?
Kenneth Gonsalves wrote:
> distro? Redhat9 by any chance?
<displays ignorance>
How would I tell what distro? I've seen some systems that tell you what
they are on login, and I'm sure it would say onscreen if I went there &
restarted it... but surely there's an easier way!
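Most distributions leave an identifying file under /etc, so one approach is simply to look for it. The sketch below is a guess at the common file name patterns (something-release, something_version, and /etc/issue); which of them exist depends on the distro, and the function name is made up.

```python
import glob
import os

# File name patterns that commonly identify a distribution (an assumption;
# not every system has all of them, and some have none).
CANDIDATE_PATTERNS = ("etc/*-release", "etc/*_version", "etc/issue")


def find_distro_hints(root="/"):
    """Map each identifying file found under root to its first line."""
    hints = {}
    for pattern in CANDIDATE_PATTERNS:
        for path in sorted(glob.glob(os.path.join(root, pattern))):
            try:
                with open(path, errors="replace") as handle:
                    first_line = handle.readline().strip()
            except OSError:
                # Unreadable file or a directory that matched the pattern.
                continue
            if first_line:
                hints[path] = first_line
    return hints


if __name__ == "__main__":
    for path, line in find_distro_hints().items():
        print(path + ":", line)
```

Running `uname -r` alongside this gives the kernel version, which is a separate question from the distribution.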
Mary wrote:
> Have a look at the output of vmstat, particularly to see if there's a
> lot of IO activity. That can mean that things are being swapped in and
> out to the hard disk a lot, which slows things down very dramatically.
I wrote a small shell wrapper around vmstat, to write its output to a
file about every 15 seconds. An example of its output is this:
procs            memory              swap        io        system       cpu
 r  b   swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa
 0  1 130928 171116 262032 1387556    0    0     0   754  351   611  1  0 92  7
 0  0 130928 170580 262032 1387764    0    0     0   851  368   637  1  0 91  7
 0  1 130928 171048 262032 1387964    0    0     0   749  350   595  1  0 93  6
 0  0 130928 170092 262032 1388148    0    0     0   868  372   655  1  1 90  8
 0  0 130928 169020 262032 1388348    0    0     0   749  351   645  1  0 93  6
 0  0 130928 168824 262032 1388488    0    0     0   820  370   660  1  0 92  7
 0  0 130928 170276 262036 1388636    0    0     0  1752  350   917  1  0 86 12
 0  0 130928 170436 262036 1388884    0    0     0   772  350   642  1  1 92  7
 0  0 130928 170268 262036 1389068    0    0     0   896  383   703  1  1 90  8
 0  0 130928 169768 262036 1389284    0    0     0   747  349   652  1  0 92  6
If I'm interpreting this right, it's not doing any IO swapping (the si
and so columns are nearly always 0, with very occasional values of 1 -
less than 1%). I don't really Get how to interpret the memory values,
as noted above.
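As a rough aid to reading the memory columns: buff and cache are memory the kernel is using for disk buffers and file cache, most of which it can hand back when programs need it, so free + buff + cache is a loose upper bound on what is really available. The sketch below assumes the 16-column layout shown in the output above (values in kB) and uses invented function names.

```python
# Column names in the order they appear in the vmstat output above.
COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()


def parse_vmstat_row(row):
    """Turn one whitespace-separated vmstat data row into a dict of ints."""
    values = [int(v) for v in row.split()]
    if len(values) != len(COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(COLUMNS), len(values)))
    return dict(zip(COLUMNS, values))


def summarize(sample):
    """Pull out the numbers that matter for a memory or swap diagnosis."""
    return {
        "free_kb": sample["free"],
        # Buffers and cache are mostly reclaimable, so count them as usable.
        "free_plus_cache_kb": sample["free"] + sample["buff"] + sample["cache"],
        # Any non-zero si/so means pages moved to or from swap in this interval.
        "swapping": sample["si"] > 0 or sample["so"] > 0,
        "io_wait_pct": sample["wa"],
    }


if __name__ == "__main__":
    first_row = "0 1 130928 171116 262032 1387556 0 0 0 754 351 611 1 0 92 7"
    for name, value in summarize(parse_vmstat_row(first_row)).items():
        print(name, value)
```

For the first sample row this comes to about 1.8 million kB of free-plus-cache memory with no swap traffic, which would suggest memory is not the bottleneck in that interval; the steady bo figures and the 6-12% wa column point at disk writes rather than swapping.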
It was suggested to me that I might be seeing a database lock problem,
but surely if this were causing my problem, it would be because it was
taking up all the system resources? Does anyone think this is worth
investigating?
Thanks much,
Rachel