[Techtalk] sudden, strange DNS/server issues...

Wim De Smet kromagg at gmail.com
Tue Feb 19 15:25:18 UTC 2008


On Feb 19, 2008 2:30 PM, Walt <pippin at freeshell.org> wrote:
> Hi folks...
>
> After struggling for three days with this odd issue, I'm
> here to beg for help!
>
> Friday of last week I got a call from my office that our
> Java/Tomcat/Apache-based job tracker stopped working.
> (Server's running Fedora Core 7, 2.6.22.9-91 SMP on a
> Core 2 Duo processor)

Possibly not the most stable setup. Is the kernel as up to date as it
can be? I've been hearing rumblings about odd compiled-in kernel
options in the FC builds. Desktop distributions tend to merge somewhat
experimental patches and call the result stable.

> Logged into the server and found that 'uptime' displayed
> a 1002.00 server load! I ran 'top' and it said that CPU use
> was at .2% and there didn't seem to be anything strange
> at the top of the list or anything utilizing much RAM.
> I don't know the command (if there is one) to monitor
> disk utilization, so it's possible that something could've
> been writing massively to disk.

If I recall correctly, the load average on Linux counts the processes
that are runnable (waiting on CPU time) plus those in uninterruptible
sleep, which usually means blocked on I/O. With CPU use at .2%, a load
of 1002 points at the second kind: lots of processes, or one process
with a lot of threads, stuck waiting for I/O. If they were spinning on
a lock (i.e. continuously testing whether the resource is free) you
would see that as CPU use. Next time, look for state D in the ps
output to see which processes are stuck. If your server is processing
lots of requests and the kernel is compiled with a couple of desktop
latency options, that might also be destroying your throughput.
Or it might be a bug in your Java application...

All of this is highly hypothetical of course.
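
To make the load figure a bit less abstract, here is a rough Python
sketch that counts how many processes sit in each state, going by the
/proc files (R is runnable, D is uninterruptible sleep, which usually
means waiting on I/O). It assumes a Linux /proc layout. The function
names are made up for this example, and the sample lines are invented,
not taken from your server.

```python
import glob
from collections import Counter


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(f) for f in fields[:3])


def state_of(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' instead of on whitespace.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def count_states(stat_lines):
    """Count processes per state letter (R = runnable, D = uninterruptible)."""
    return Counter(state_of(line) for line in stat_lines)


def read_live_stat_lines():
    """Read every /proc/<pid>/stat that is still there (Linux only)."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                lines.append(handle.read())
        except OSError:
            pass  # process exited between listing and reading
    return lines


# Invented sample data, only to show the parsing.
sample_loadavg = "1002.00 998.41 870.12 1/1200 4242"
sample_stats = [
    "101 (java) D 1 101 101 0 -1 4194304",
    "102 (java) D 1 102 102 0 -1 4194304",
    "103 (my (odd) name) S 1 103 103 0 -1 4194304",
    "104 (top) R 1 104 104 0 -1 4194304",
]
print(parse_loadavg(sample_loadavg))
print(dict(count_states(sample_stats)))
```

On a live box, count_states(read_live_stat_lines()) gives the real
numbers. Note that this only looks at the main thread of each process;
per-thread states live under /proc/<pid>/task/. A pile of D entries
next to an idle CPU would fit the I/O theory.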

> At any rate, I rebooted the server, job app ran fine
> again, and all seemed to be well. However, ever
> since then our DNS resolution seems to be screwed
> up and painfully, extremely slow! I've been unable to
> find any cause for this. This system is a caching
> nameserver for our network and handles some
> internal network name resolution. I've tried changing
> its named delegation to OpenDNS and that makes no
> difference, but if I change an individual workstation to
> point to OpenDNS everything seems to work fine.

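Try testing how fast the disk is with hdparm (the -t and -T options
run read timing tests). For watching disk activity while the problem
is happening, vmstat or iostat (from the sysstat package) will do.
Check syslog et al. to see whether anything is logging errors.
Make sure bind's log level is high enough for it to complain
about problems. (not sure how to do this)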
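
Since pointing a workstation straight at OpenDNS works fine, it might
also help to time the same lookup against both servers. Below is a
rough Python sketch that sends one A query over UDP and measures how
long the answer takes. The function names are made up for this
example, and it does not check that the reply matches the query, so
treat it as a stopwatch only.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of name."""
    # Header: id, flags (only "recursion desired" set), 1 question,
    # no answer, authority or additional records.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    question = b""
    for label in name.rstrip(".").split("."):
        encoded = label.encode("ascii")
        question += bytes([len(encoded)]) + encoded
    question += b"\x00"                    # root label ends the name
    question += struct.pack(">HH", 1, 1)   # QTYPE=A, QCLASS=IN
    return header + question


def time_query(server, name, timeout=5.0):
    """Return seconds until server answers, or None if no answer arrives."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(packet, (server, 53))
            sock.recvfrom(512)
        except OSError:
            return None  # timeout or socket error
        return time.monotonic() - start


def compare(servers, name):
    """Print one timing line per server."""
    for server in servers:
        elapsed = time_query(server, name)
        result = "no answer" if elapsed is None else "%.3f s" % elapsed
        print("%-16s %s" % (server, result))
```

Something like compare(["192.168.1.1", "208.67.222.222"], "example.com")
would run it, where 192.168.1.1 is a placeholder for your own
nameserver and 208.67.222.222 is, as far as I know, one of the OpenDNS
resolvers. Run it twice, since the first lookup of a name will not be
cached yet.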

HTH,
Wim

