Hacker News

How are you able to "see" the single bit errors? I had an ECC machine but I couldn't find any indication that ECC was working or that any bit errors happened.

After reading this, I figured the best I could do was to trust that a workstation with ECC modules and a Xeon will have functioning ECC:

"Unfortunately, we have found that there is no consistent, conclusive way to determine if ECC RAM is working properly. ...We've actually asked Intel, Kingston and Asus over the years for their recommendations for methods to confirm that ECC is working, but we haven't gotten much more back than a blank stare."

https://www.pugetsystems.com/labs/articles/How-to-Check-ECC-...



The BIOS has to correct them. You get what is essentially a machine check; the BIOS fixes the error, does the writeback, and in my case logs the fix over the System Management Bus (SMBus), which shows up in the IPMI system event log[1]. I'm not sure who the BIOS would tell if there weren't some sort of BMC chip to hear about it; it might keep an internal log.

I'm surprised the board/BIOS manufacturer couldn't answer the question; it's a pretty straightforward system.

[1] They show up as 'SBE' events (single-bit error). An 'MBE' event (multi-bit error, detectable but not correctable) would presumably result in an unhandled machine check and cause the kernel to reboot.
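
If you want to check the event log from a script, here is a rough Python sketch. It assumes `ipmitool` is installed and that the BMC labels memory events with "ECC", "SBE" or "MBE"; the exact wording varies by vendor, so the pattern is a guess you would adjust. The helper names `ecc_events` and `read_sel` are made up for illustration.

```python
import re
import subprocess

# Assumed keywords; the wording of memory events varies by BMC vendor.
ECC_PATTERN = re.compile(r"\b(ECC|SBE|MBE)\b", re.IGNORECASE)


def ecc_events(sel_lines, pattern=ECC_PATTERN):
    """Return the SEL lines that look like memory ECC events."""
    return [ln.strip() for ln in sel_lines if pattern.search(ln)]


def read_sel():
    """Return lines from `ipmitool sel elist`, or [] if it cannot be run."""
    try:
        out = subprocess.run(
            ["ipmitool", "sel", "elist"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return out.stdout.splitlines()


if __name__ == "__main__":
    for event in ecc_events(read_sel()):
        print(event)
```

An empty result only means nothing matched. It does not prove ECC is working, since the BMC may not have logged anything yet or may use different wording.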


That's interesting. Is it possible to use IPMI and log BMC events on a regular workstation from Dell or Lenovo, perhaps via some expansion card? Or is this only a server-grade feature?


They're also logged by a portable mechanism on normal x86 boards, viewable at least with the "mcelog" tool under Linux, and probably also written by the kernel to the normal kernel event log. I think the MCE log is supposed to persist across reboots, so you could see the multi-bit error afterwards if you got a machine-check-induced reboot.
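
A quick way to look for these in the kernel log, sketched in Python. It assumes `dmesg` is readable by your user and that the kernel tags these events with "[Hardware Error]" or an "EDAC " prefix; treat the marker list as a starting point rather than a complete one. The function names are hypothetical.

```python
import subprocess

# Substrings assumed to mark hardware/ECC reports in the kernel log.
MARKERS = ("[Hardware Error]", "EDAC ")


def hardware_error_lines(lines, markers=MARKERS):
    """Return the log lines that contain any of the marker substrings."""
    return [ln.rstrip("\n") for ln in lines if any(m in ln for m in markers)]


def read_kernel_log():
    """Return kernel log lines from `dmesg`, or [] if it cannot be run."""
    try:
        out = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return out.stdout.splitlines()


if __name__ == "__main__":
    for line in hardware_error_lines(read_kernel_log()):
        print(line)
```

Note that `dmesg` only covers the current boot, so this would not show an error that caused the previous reboot.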


Linux has EDAC drivers that receive ECC notifications from the hardware, and the edac-util command should display them.

You can cause bit errors with a hair dryer and see if Linux recognizes them: http://bluesmoke.sourceforge.net/heat_gun.html
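
If edac-util isn't installed, the same counters can be read straight from sysfs. A minimal Python sketch, assuming an EDAC driver is loaded for your memory controller and exposes the usual `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc`; `read_edac_counts` is just an illustrative name.

```python
from pathlib import Path

# Usual location of the EDAC memory-controller counters on Linux.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs.

    An empty dict means no mc* directories were found, which usually
    means no EDAC driver is loaded for this memory controller.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are unreadable
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers found (driver not loaded?)")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Seeing an `mc0` entry at all is a decent hint that the kernel thinks ECC reporting is available; a corrected count that goes up after the heat test would be the stronger evidence.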


Thanks a lot! I'll probably try both after I get my machine working again.


If you think that link is about hairdryers, I really suggest you do some shopping.

You'll find less smoldering, better smell, and less pain if you use an actual hair dryer to dry your hair (rather than a heat gun, which the page is about).



