[Techtalk] Diagnosing server problems

Thu Sep 29 10:05:19 EST 2005

On Wed, Sep 28, 2005, Rachel McConnell wrote:
> Thanks for the advice all!  I've done some of it and have questions on 
> other bits...  No solution yet, that I can see.  If anyone has further 
> recommendations, please post them.
> 
> Katherine Spice wrote:
> >1. check that no other machine on the network has the same IP address as
> >your server - this can cause all kinds of odd behaviour!
> 
> Nope, each machine has several IP addresses but there are no duplicates 
> amongst them.
> 
> 
> >2. check the process limit set for the machine. In the 2.2 kernel the
> >default for this was set to 512 processes total (256 per user) - which
> >isn't loads on a busy server and if the process table fills up, no new
> >processes can spawn.
> 
> How would I do this?
> 
> 
> >Do you have console access? If so, are you seeing any messages to it
> >during the times when the problem occurs?
> 
> I'm not sure what console access means.... It's a headless box at a 
> colo.  When I am notified by a user that they're having a latency issue, 
> I usually try to ssh to it.  Often I can't.
> 
> 
> R. Daneel Olivaw wrote:
> > Try logging in locally. This will tell you if the problem is network
> > related or system related.
> 
> By locally, do you mean go to the colo, put a head on it, and log in? 
> The outages tend to last two to ten minutes, I'd have to be very lucky 
> to get to the colo during an event.  During an event, or to be precise, 
> shortly after I've been notified that others are having problems, I can 
> usually ssh to other machines at the same colo on the same network - but 
> occasionally I can't even do that.
> 
> > Try 'atop', it's a more advanced program that also shows you network
> > throughput by interface and quite some more details.
> 
> Is it worth installing by someone who's never installed anything on a 
> Linux box before (that's me, yep)?  The server doesn't have atop currently.

What distribution and what version? If it's a reasonably recent one,
it will be as easy as typing "[something] install atop" but you'll need
to tell us what distro before we can tell you what 'something' is :)

> > Try also looking into /var/log/messages ...
> 
> I see a lot of stuff like this:
> 
> Sep 28 08:59:21 elcapitan kernel: IN=eth0 OUT= 
> MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 
> DST=69.59.189.127 LEN=229 TOS=0x00 PREC=0x00 TTL=128 ID=28654 PROTO=UDP 
> SPT=138 DPT=138 LEN=209
> Sep 28 09:04:00 elcapitan kernel: IN=eth0 OUT= 
> MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 
> DST=69.59.189.127 LEN=235 TOS=0x00 PREC=0x00 TTL=128 ID=28711 PROTO=UDP 
> SPT=138 DPT=138 LEN=215
> Sep 28 09:11:20 elcapitan kernel: IN=eth0 OUT= 
> MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 
> DST=69.59.189.127 LEN=229 TOS=0x00 PREC=0x00 TTL=128 ID=28800 PROTO=UDP 
> SPT=138 DPT=138 LEN=209
> 
> which I don't know how to interpret.  There is a message every 5-10 
> minutes, though, which implies that this is Normal.

That looks like iptables (the firewall program) doing some logging.

You interpret it like this:
> Sep 28 09:11:20 elcapitan kernel: IN=eth0 OUT= 
> MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 
> DST=69.59.189.127 LEN=229 TOS=0x00 PREC=0x00 TTL=128 ID=28800 PROTO=UDP 
> SPT=138 DPT=138 LEN=209

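IP address 69.59.189.89 sent a UDP packet of length 209 (the second
LEN; the first, LEN=229, is the whole IP packet) to destination port 138
at IP address 69.59.189.127. That's probably you (you may not want to
send actual IP addresses to a public list again...), although the
all-ff destination at the start of the MAC field marks this as an
Ethernet broadcast, so 69.59.189.127 may instead be your subnet's
broadcast address.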
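
If you'd rather not pick the fields out by eye, here is a rough Python
sketch that splits one of these log entries into its KEY=VALUE parts.
The function name and the UDP_LEN, DST_MAC and SRC_MAC keys are my own
inventions for illustration; iptables itself only writes the raw line.

```python
import re

def parse_iptables_log(entry: str) -> dict:
    """Split an iptables LOG entry into its KEY=VALUE fields."""
    fields = {}
    for key, value in re.findall(r"(\w+)=(\S*)", entry):
        # LEN appears twice: first for the IP packet, then for the UDP datagram.
        # UDP_LEN is a made-up key so the second value does not overwrite the first.
        if key == "LEN" and "LEN" in fields:
            key = "UDP_LEN"
        fields[key] = value
    # The MAC field is destination MAC (6 bytes), source MAC (6 bytes),
    # then the 2-byte EtherType (08:00 means IPv4).
    mac = fields.get("MAC", "").split(":")
    if len(mac) == 14:
        fields["DST_MAC"] = ":".join(mac[:6])
        fields["SRC_MAC"] = ":".join(mac[6:12])
    return fields

entry = ("Sep 28 09:11:20 elcapitan kernel: IN=eth0 OUT= "
         "MAC=ff:ff:ff:ff:ff:ff:00:0f:1f:03:e9:1d:08:00 SRC=69.59.189.89 "
         "DST=69.59.189.127 LEN=229 TOS=0x00 PREC=0x00 TTL=128 ID=28800 PROTO=UDP "
         "SPT=138 DPT=138 LEN=209")
fields = parse_iptables_log(entry)
print(fields["SRC"], "->", fields["DST"], fields["PROTO"], "port", fields["DPT"])
```

An empty OUT= just means the packet was addressed to the machine itself
rather than being routed through it.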

/etc/services on my machine tells me that UDP port 138 is the
"netbios-dgm" service, the NetBIOS datagram service. It is one of the
protocols that Windows filesharing uses.
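
For the curious, the lookup can be sketched in Python. The three sample
lines below are my assumption of what the relevant part of a typical
/etc/services looks like (yours may differ), and parse_services is a
made-up helper name. On a real machine, Python's own
socket.getservbyport(138, "udp") should give the same answer without
any parsing.

```python
def parse_services(text: str) -> dict:
    """Map (port, protocol) to service name from /etc/services-style text."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        parts = line.split()
        if len(parts) < 2 or "/" not in parts[1]:
            continue
        port, proto = parts[1].split("/", 1)
        if port.isdigit():
            # Keep the first name seen for each (port, protocol) pair.
            table.setdefault((int(port), proto), parts[0])
    return table

# Assumed excerpt; not copied from any particular machine.
SAMPLE = """\
netbios-ns      137/udp     # NETBIOS Name Service
netbios-dgm     138/udp     # NETBIOS Datagram Service
netbios-ssn     139/tcp     # NETBIOS session service
"""
services = parse_services(SAMPLE)
print(services[(138, "udp")])
```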

iptables setups are completely configurable, but it is usual to only log
packets when they are going to be blocked, so that someone can analyse
the logs later and see what attacks people have used against you. It
also isn't usual to log *every* connection failure, because it makes it
easy for someone to make your machine grind to a halt: they just send
you a million packets in a minute, and you try to log all million of
them. So all sensible iptables setups limit their logging to a packet or
two a minute, like it seems yours is doing.
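
To show why limiting helps, here is a small Python sketch of the general
"token bucket" idea behind this kind of rate limit. It is not how your
firewall is actually configured (iptables has its own "limit" match for
that), and the class name, the rate of 2 per minute and the burst of 2
are all assumptions picked for illustration.

```python
class LogLimiter:
    """Token bucket: allow at most `rate_per_minute` log entries on average."""

    def __init__(self, rate_per_minute: float, burst: int):
        self.interval = 60.0 / rate_per_minute  # seconds needed to earn one token
        self.burst = burst
        self.tokens = float(burst)              # start with a full bucket
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill tokens for the time elapsed, capped at the burst size.
        earned = (now - self.last) / self.interval
        self.tokens = min(self.burst, self.tokens + earned)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

limiter = LogLimiter(rate_per_minute=2, burst=2)
# Simulate a flood: 1000 packets arriving within one minute.
logged = sum(limiter.allow(now=i * 0.06) for i in range(1000))
print(logged, "of 1000 packets logged")
```

However fast the packets arrive, only a handful get written to the log,
so a flood can't fill the disk or tie up the machine with logging.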

Why would someone be trying to connect to the Windows filesharing port
on your server? It could be an accident: someone has a misconfigured
Windows server nearby that is trying to reach shared files on your
machine. It could also be that 69.59.189.89 has a virus or is hosting a
hack attempt. People who run Windows, connect it to the net and leave
filesharing open are very vulnerable, so this port is "scanned" (checked
by crackers) all the time. They don't necessarily know you're running
Linux.

So, in a way this is normal in that almost every machine on the Internet
gets these connection attempts. And it almost certainly isn't the source
of the slowdown.

> Kenneth Gonsalves wrote:
> > distro? Redhat9 by any chance?
> 
> <displays ignorance>
> How would I tell what distro?  I've seen some systems that tell you what 
> they are on login, and I'm sure it would say onscreen if I went there & 
> restarted it... but surely there's an easier way!

Have a look at the contents of any of these files:

/etc/lsb-release
/etc/debian_version
/etc/fedora-release
/etc/redhat-release
/etc/mandrake-release

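Those are the most common distro files. The last four identify
particular distros, i.e. if you have /etc/debian_version that means it's
Debian (though some Red Hat derivatives also ship /etc/redhat-release,
so look for the more specific file first). The file will contain the
version number, e.g. 3.1 or 3.0 or whatever.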
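
If the box has Python on it, the check can be sketched like this. The
labels, the order of the list and the detect_distro name are my own
choices; the order assumes that some distros ship more than one of
these files, so the more specific ones are tried first.

```python
from pathlib import Path

# (file, label) pairs, most specific first. The labels are illustrative only.
RELEASE_FILES = [
    ("/etc/fedora-release", "Fedora"),
    ("/etc/mandrake-release", "Mandrake"),
    ("/etc/redhat-release", "Red Hat"),
    ("/etc/debian_version", "Debian"),
    ("/etc/lsb-release", "see file contents"),
]

def detect_distro(root: str = "/"):
    """Return (label, file contents) for the first release file found, else None."""
    for path, label in RELEASE_FILES:
        candidate = Path(root) / path.lstrip("/")
        if candidate.is_file():
            return label, candidate.read_text().strip()
    return None

print(detect_distro())
```

Simply running "cat" on each of the files by hand over ssh tells you
the same thing.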

You can find out more about telling them apart here:
http://www.novell.com/coolsolutions/feature/11251.html

-Mary