[Techtalk] msec what is it, and why is it in an hourly cron?

Magni Onsoien magnio+lc-techtalk at pvv.ntnu.no
Tue Nov 5 21:36:30 EST 2002


On Tue, Nov 05, 2002 at 06:34:22PM +0000, E. Sterling Wall said:
> OK, I've got this server which keeps crashing, and I've tried to find
> ANY reason at all that it won't stay up. The good news is that this
> machine has been tuned like no other server I've ever run (not that
> that's saying much...). The bad news is that it still doesn't want to
> stay up. On average it crashes twice per day. Occasionally more, and
> sometimes it goes a whole day without crashing. The logs are of little
> help. There is nothing obvious left which happens any time in the
> vicinity of the crashes. I'm truly boggled.

I'd check the hardware first if I were you. Specifically all the fans -
make sure they are not filled with dust and hair and jam and sugar and
stuff. Reboot the server and listen for any unusual noise at startup and
during the first minute or two. A change in fan noise is a common sign of
a bad fan. Remember there are several fans: on the power supply, the CPU
and maybe more. At least clean them (remove them, blow them clean far away
from the server, mount them again) and replace them if the sound changes
after this :-)

Check if there is a correlation between the air temperature and the
crashes - try to cool the server down and see if it gets better. Also look
for segmentation faults of random programs in the logs. Usually only a
kernel crash will make the whole server crash - if just perl or apache or
something else dies it doesn't kill the server - but random programs
segfaulting is another indication of a heat problem. Again: the fans.
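
If the logs are long, a small script can do the looking for you. This is
only a sketch in Python, and it assumes the segfault lines look something
like "perl[1234]: segfault at ..." (the way some Linux kernels write them).
The pattern and the function name count_segfaults are made up for this
example, so adjust them to whatever your logs actually contain:

```python
import re
from collections import Counter

# ASSUMPTION: segfault lines look like "... perl[1234]: segfault at ...".
# Change this pattern to match the format of your own logs.
SEGFAULT_RE = re.compile(r"(?P<prog>[\w.+-]+)\[\d+\]:\s+segfault\b", re.IGNORECASE)


def count_segfaults(lines):
    """Return a Counter mapping program name -> number of segfault lines."""
    counts = Counter()
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            counts[match.group("prog")] += 1
    return counts


def report(path):
    """Print the segfault count per program for one log file."""
    with open(path, errors="replace") as log:
        for prog, n in count_segfaults(log).most_common():
            print(f"{n:5d}  {prog}")
```

Call report() with the path of a log file. If many unrelated programs show
up in the list, that points towards hardware (heat, for instance) rather
than one buggy program.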

Make sure you have a STABLE power supply (I mean the mains supply in
your geographic area, not the PSU of the server). No peaks and no dips.
Ask the local energy company about it. If in doubt, get a UPS - it doesn't
have to keep your system up for a long time, it only has to handle peaks
and dips. If the UPS can log the input power that's great, then you can
see what your power supply is like.
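
To get an overview from such a log, something like the following might
help. It is a sketch only: it assumes you have already turned the UPS log
into (timestamp, volts) pairs, which depends entirely on your UPS software,
and the function name find_excursions and the limits in the example are my
own illustrative choices, not values from any particular UPS.

```python
def find_excursions(readings, low, high):
    """Group consecutive out-of-range readings into events.

    readings: iterable of (timestamp, volts) pairs in time order
              (ASSUMPTION: you have parsed these out of the UPS log yourself).
    Returns a list of (first_timestamp, last_timestamp, worst_volts) tuples,
    where worst_volts is the reading furthest from the middle of the band.
    """
    middle = (low + high) / 2
    events = []
    current = None  # [first_timestamp, last_timestamp, worst_volts]
    for timestamp, volts in readings:
        if low <= volts <= high:
            # Back inside the band: close any open event.
            if current is not None:
                events.append(tuple(current))
                current = None
            continue
        if current is None:
            current = [timestamp, timestamp, volts]
        else:
            current[1] = timestamp
            if abs(volts - middle) > abs(current[2] - middle):
                current[2] = volts
    # Close an event that is still open at the end of the log.
    if current is not None:
        events.append(tuple(current))
    return events


# Example with made-up readings and illustrative limits of 207-253 V.
readings = [("10:00", 230), ("10:01", 190), ("10:02", 180),
            ("10:03", 229), ("10:04", 260)]
for start, end, worst in find_excursions(readings, low=207, high=253):
    print(f"{start} - {end}: worst reading {worst} V")
```

If the events line up with the times of the crashes, you have a strong
hint that the power is the problem.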

If the fans and the electricity seem to be in order, try running some
test programs on the motherboard, CPU and memory. Or swap ONE of these
components at a time. Start with the memory - remove memory modules so
you only have one left; if that helps, you have faulty memory. If it doesn't
help, try another memory module (if you had several in the server at the
beginning) and repeat until all the modules have been tested. Then change
the motherboard and see if it still crashes. Do the same with the CPU etc.
The important thing is that you only change ONE component at a time, so
you can eliminate possible causes one by one.

All this doesn't have to take several days to go through. Start with the
fans; this should be done in an hour or so and they are a rather probable
source of your problems. Continue with a phone call to the local electricity
company and, if possible, a UPS, then check the rest of the hardware
(which may be harder).

My SO's computer kept crashing - if he ran SETI@home the process would
die almost instantly, if he ran perl it would die 50% of the time.
After a while he cleaned the fans and the computer became stable as
solid rock (or something).

At work we had an incident with the power supply yesterday. I arrived at
work only to see the UPS monitor on our server claiming the UPS was on
battery. Strange, I thought, since I saw no indications of a power
cut. All lamps were on, the monitors worked fine and all the
workstations not on the UPS were up. But the UPS claimed it was on battery
and after a while it halted the servers as it was supposed to. After about
15 more minutes it announced that it was up again (no longer on battery)
and we could reboot the servers. It then turned out that the power had
been out in parts of the city and at low voltage in the rest of the
area, so the UPS obviously got too low a voltage in. The strange thing was
that we didn't notice anything on the lamps and the workstations, and the
UPS is pretty liberal about what it calls an acceptable input voltage
(150 - 265V at 50-60Hz, I think), so I don't know what happened.

Anyway, sorry about this long mail. Hopefully it'll be useful for
someone :)


Magni :)
-- 
sash is very good for you.


