[Techtalk] Build it! redux, and AMD woes

Sat Nov 10 16:01:02 EST 2001

On Sat, Nov 10, 2001 at 08:05:07PM +1100, jenn at simegen.com wrote:
> Akkana wrote:
> 
> > The install went smoothly, but the problem now is that I'm getting
> > random glitches during compiles.  It'll get a few (5-20) minutes into
> > a build, then die with a segv from as or gcc or something, or it'll
> > just hang while ld'ing a library.  Restart the build, and it completes
> > whatever it had problems with but dies somewhere else a few minutes later.  
> 
> Dancer told me of a way to do a reliable Linux memory test.
> Recompile kernels. If they segfault, there's a bad stick of RAM.

I'll definitely second that. gcc seems to be a reliable memory tester :)
I can't tell exactly why, but practical experience shows that, when
there's a memory problem, gcc is always the first program to throw
weird error messages (or segfault or hang), while other programs still
seem to run smoothly. (It's probably because it makes extensive use of
memory to build up delicate data structures which react rather
allergically to external modification...)
At work, where we do a lot of compiling/building, such problems almost
always turned out to be caused by a bad memory stick. We've had about
15 such cases, roughly 10% of the machines in use here, which shows
that memory problems are not too unlikely in real life -- well, we
humans occasionally exhibit similar problems too, though we don't tend
to lock up as quickly ;)

So, as you've probably guessed, my suggestion is not much different
from Jenn's: get a low-level memory tester (memtest86, as suggested by
Mary) or take out/swap memory sticks to narrow down the problem.
Swapping, of course, is only possible if you have several
hardware-compatible machines within reach. Occasionally, I've had cases
where a stick was not bad per se, but just didn't want to work in a
specific board -- after swapping it with one from another machine, both
machines worked fine... OTOH, be careful not to swap around wildly,
even if the connectors fit mechanically. Always keep an eye on the
electrical specifications and type of RAM.
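
To illustrate what a memory tester does in principle, here is a minimal
sketch in Python: write a known bit pattern, read it back, and report
every byte that changed. The function names, buffer size and patterns
are my own choices for illustration. Note that this is NOT a substitute
for memtest86: a user-space program only touches whatever memory the OS
happens to hand it, and caches can hide a flaky cell.

```python
def verify(buf, pattern):
    """Return (offset, expected, found) for every byte that differs."""
    if buf.count(pattern) == len(buf):
        return []  # fast path: every byte still holds the pattern
    return [(i, pattern, b) for i, b in enumerate(buf) if b != pattern]


def pattern_test(size_bytes=16 * 1024 * 1024,
                 patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each pattern in turn and check it reads back.

    Size and patterns are illustrative assumptions, not tuned values.
    """
    errors = []
    for pattern in patterns:
        buf = bytearray([pattern]) * size_bytes  # write phase
        errors.extend(verify(buf, pattern))      # read-back phase
    return errors
```

An empty list from pattern_test() only means that this particular
buffer survived this particular run; it says nothing about the rest of
the RAM.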

Although, going by likelihood alone, I'd say that it's the RAM, there
are of course many other possibilities, which are often a bit tricky to
diagnose.  I don't think that it's the CPU, though. The temperature you
mentioned is perfectly okay -- and processors in general seem to be
tested much more thoroughly than memory sticks (the company has a
reputation to lose, which is not so much an issue with memory chips).

It could be the power supply, too, in case you have many components in
your system. Especially with Athlon systems, this is an often overlooked
source of all kinds of problems. The processor itself already
pulls a fairly high current, so if there are also enough other power
consuming parts in the system (like HDs, CD-burner, a video card with
the latest gee-whiz graphics engine, several NICs, soundcard, TV-tuner,
and who knows what...) then there's a chance that this is just too much
for the power supply.  It usually is a bit difficult to correctly
estimate the required total power, so it's always a good idea to have a
bit more available. As a general rule of thumb, anything that gets
warm/hot consumes noticeable power -- so your HD probably is one such
candidate. For a typical Athlon system, I wouldn't go below a 300W
power supply, though that number alone isn't the whole truth -- as you
might expect, there are differences in quality. Before actually failing
completely, some of them tend to exhibit less desirable properties,
like voltages dropping below the tolerances, spikes/ripples in the
output, etc... (unfortunately, I can't recommend a specific model or
manufacturer, and availability changes quickly, anyway -- hopefully you
know someone in a computer shop you can trust to give proper advice).
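
The "have a bit more available" rule can be written down as simple
arithmetic. The following is only a sketch: the function name, the 20%
margin and all the per-component wattages are hypothetical placeholders,
not measurements. Real figures have to come from the data sheets of the
actual parts.

```python
def psu_headroom(component_watts, psu_watts, margin=0.2):
    """Sum the estimated draw and compare it with the supply rating.

    margin is the extra fraction kept in reserve (0.2 = 20%); the
    default is an assumption, not a recommendation from any spec.
    Returns (total, required, is_enough).
    """
    total = sum(component_watts.values())
    required = total * (1 + margin)
    return total, required, psu_watts >= required


# Placeholder numbers for illustration only -- NOT real measurements.
example_system = {
    "cpu": 70,
    "hard disks": 20,
    "cd burner": 25,
    "video card": 40,
    "nics, sound, tv tuner": 15,
}
total, required, is_enough = psu_headroom(example_system, psu_watts=300)
```

Keep in mind that such a sum only looks at the total rating. It says
nothing about the quality of the supply or about how the load is
distributed over the individual voltage rails.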

Personally, I'd get out my oscilloscope and check the level/smoothness
of the power supply's output lines... yet I understand that not
everyone is an ex-electronics junkie, and thus may not have one of
those devices in the attic ;)  But maybe you have a friend who knows
someone who owns such equipment... For the less tricky problems of this
kind, a simple multimeter (around $10) will do as well, though it will
not detect spikes and such...  Many modern BIOSes provide similar
voltage monitoring functionality, but the problem with this is that you
can't watch it in vivo, while compiling...
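
If the kernel exposes the board's sensor chip, the voltages can be
logged from a second terminal while the build runs. The sketch below
assumes the Linux hwmon sysfs layout, where each sensor directory
(something like /sys/class/hwmon/hwmon0) contains in*_input files
holding millivolts. Whether your board has such a directory, and which
inN belongs to which rail, depends on the chip and driver, so treat the
path as an assumption. Like the multimeter, polling once a second will
not catch spikes or ripple.

```python
import glob
import os
import time


def read_voltages(hwmon_dir):
    """Return {file name: volts} for every in*_input file in hwmon_dir."""
    readings = {}
    for path in sorted(glob.glob(os.path.join(hwmon_dir, "in*_input"))):
        with open(path) as f:
            millivolts = int(f.read().strip())
        readings[os.path.basename(path)] = millivolts / 1000.0
    return readings


def watch(hwmon_dir, samples, interval=1.0):
    """Poll the sensors and keep the lowest and highest value seen."""
    lows, highs = {}, {}
    for _ in range(samples):
        for name, volts in read_voltages(hwmon_dir).items():
            lows[name] = min(volts, lows.get(name, volts))
            highs[name] = max(volts, highs.get(name, volts))
        time.sleep(interval)
    return lows, highs
```

Start something like watch("/sys/class/hwmon/hwmon0", samples=600) and
kick off the compile; a rail whose low value sags noticeably under load
would point towards the power supply.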

Does the problem always (reproducibly) occur only *after* a period of
compiling activity, or does it also occasionally happen at random,
without a 'power consuming history' preceding it?  This minor
distinction might help to tell whether it's a temperature-related or a
more static 'just-a-bad-RAM' problem.
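
One way to collect that information is to run the same build several
times and note how long each run survives. Here is a rough sketch; the
function name and the timeout handling are my own, and the command you
pass in should be whatever starts your build from a clean tree.

```python
import subprocess
import time


def repeat_build(cmd, runs, timeout=None):
    """Run cmd `runs` times; return a list of (exit code, seconds).

    On POSIX a negative exit code means the process was killed by a
    signal (-11 is SIGSEGV). A run that exceeds `timeout` seconds is
    recorded with exit code None and counted as a hang.
    """
    results = []
    for _ in range(runs):
        start = time.monotonic()
        try:
            code = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            ).returncode
        except subprocess.TimeoutExpired:
            code = None
        results.append((code, time.monotonic() - start))
    return results
```

If the failures only ever show up after several minutes of load, heat
or power is the more likely suspect; if they are scattered, including
right at the start of a run on a cold machine, that points more towards
the static bad-RAM case.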

Okay, I think I've rambled enough again, so just one last remark on the
temperature of your hard disk:  I'm not so sure about the very latest
WD models, but other WD hard disks I've come across (which are
beginning to gather dust by now) also get quite warm, but work fine
otherwise. In short, I don't think that this is the problem.

Good luck,

- Almut

PS: which kernel are you using?  I've heard faint rumours that with the
most recent kernel(s) there may be a subtle bug in memory management
which could strike the innocent user, who has got used to not expecting
such troubles with Linux. Not sure, though... so don't pin me down on
that...  Anyway, to rule out software-side issues, have you tried
running a different kernel version for testing purposes?
(Or is that what you are trying to do at the moment, but can't complete
as a result of that very problem...? ;)