[wellylug] hardware errors

Ewen McNeill wellylug at ewen.mcneill.gen.nz
Wed Jun 11 22:40:25 NZST 2014


On 11/06/14 20:35, Richard Hector wrote:
> It's a Sun Fire X2100 M2 - dual core opteron 1218, 2600MHz with 4G of RAM

According to:

http://docs.oracle.com/cd/E19121-01/sf.x2100m2/819-6591-11/Chap4.html#17924
http://docs.oracle.com/cd/E19121-01/sf.x2100m2/819-6591-11/Chap1.html

it takes unbuffered DDR2 RAM, in pairs, and needs to be populated from 
slot 0 to slot 3 -- supporting 0.5/1/2 GB DIMMs.   Some memory suppliers 
seem to think faster DDR2 (eg, DDR2-667 -- 
http://www.memoryxsun.com/mtx5278aa.html) will work as well as DDR2-400, 
but it does need to be ECC Unbuffered.

If the RAM was triggering ECC issues there's a reasonable chance
memtest86+ would show it -- it does hook into the reporting mechanism of
several servers and should, eg, catch the NMI reports at least.  If they
don't show up there, RAM ECC issues should also appear in the server's
out of band management logs.  (And the out of band management should be
able to identify the specific DIMM with issues, since it knows the
mapping from memory layout to physical DIMM.)
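
For a quick look from the running OS as well, a minimal sketch along
these lines might help.  It assumes a Linux kernel with an EDAC driver
loaded for this chipset, which exposes per-memory-controller error
counters under /sys/devices/system/edac/mc; the function name
read_edac_counts is just illustrative, and whether the driver is
actually loaded on this box is an assumption.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs.

    Assumes the kernel's EDAC driver is loaded; returns an empty dict
    if the directory is missing (eg, no driver for this chipset).
    """
    root = Path(edac_root)
    if not root.is_dir():
        return {}
    counts = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected count that keeps climbing between runs would point at the
RAM (or something upsetting it); an empty result only means the
counters aren't exposed, not that the RAM is fine.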

However given the reports, and the age of the hardware (IIRC the Sun Fire
X2100 M2 was sold from around 8 years ago to around 5 years ago), I'd
also be wondering about mechanical causes.  For instance, a CPU that
is running a bit hot due to a fan that's Rather Old (and, eg, dusty)
might periodically "miss a bit" when working harder than usual.  Either
checking the out of band management (IIRC that'll display fan speeds) or
a physical inspection might be warranted.  (It could be back of case
fans or mid-case fans, the CPU fan, or even a power supply fan causing the
rails to dip a bit under load.)  If it gets worse when, eg, something is
keeping the CPU busy, that'd definitely be my pick.
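
If the kernel happens to expose the sensors directly, something like the
following sketch could log fan speeds and temperatures while the CPU is
kept busy.  It assumes the standard Linux hwmon sysfs layout
(/sys/class/hwmon/hwmon*/fan*_input in RPM, temp*_input in millidegrees
Celsius); read_hwmon is an illustrative name, and on this server the
sensors may well only be visible through the out of band management
rather than hwmon.

```python
from pathlib import Path


def read_hwmon(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN/sensor": value} for fan (RPM) and temp (deg C) inputs."""
    root = Path(hwmon_root)
    if not root.is_dir():
        return {}
    readings = {}
    for chip in sorted(root.glob("hwmon*")):
        sensors = sorted(chip.glob("fan*_input")) + sorted(chip.glob("temp*_input"))
        for sensor in sensors:
            try:
                raw = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor present but not readable right now
            label = sensor.name[: -len("_input")]
            # hwmon reports temperatures in millidegrees Celsius, fans in RPM
            value = raw / 1000.0 if label.startswith("temp") else float(raw)
            readings[f"{chip.name}/{label}"] = value
    return readings


if __name__ == "__main__":
    for name, value in read_hwmon().items():
        print(f"{name}: {value}")
```

A fan reading of 0 (or one well below its neighbours), or a temperature
that climbs steadily under load, would fit the heat theory.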

I'd be inclined to check for fans/heat issues before doing anything more 
with the RAM, given the age.

It may be that the upgrade to Wheezy simply provided a mechanism to see 
an issue that's been happening for a while, but hasn't been logged 
outside the out of band management until now.

Ewen


