Actually everything should be ECCed but it is a consumer/professional segmentati...

thanatos519 · on Aug 6, 2022

I didn't really appreciate this until I got an ECC workstation. It's noticeably more stable than my previous machines, even under abusive levels of load.

Noticeably more stable, as in, "never ever crashes" as opposed to "almost never crashes" without ECC. Thanks Linux! :D

PragmaticPulp · on Aug 6, 2022

Did you actually check the ECC error counters?

Memory errors in data centers tend to be concentrated in a small number of bad sticks of RAM rather than evenly distributed across all memory. If you have a machine crashing regularly due to memory errors, it’s likely a bad stick of RAM, not random errors due to lack of ECC.
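
On Linux, one place to look is the kernel's EDAC sysfs tree. A minimal sketch, assuming an EDAC driver is loaded for your memory controller and that it exposes the usual `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc/`; the helper name `read_edac_counts` is made up for illustration:

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the EDAC sysfs tree.

    An empty dict means no memory controller is registered with EDAC
    (no ECC, driver not loaded, or not a Linux system).
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are unreadable
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A nonzero corrected count that keeps climbing on one controller would point toward a bad stick rather than random background errors. Whether the firmware passes every corrected error through to the OS is a separate question.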

sliken · on Aug 10, 2022

Agreed, but ECC will prevent most of the errors from crashing the machine and tell you exactly which DIMM is causing the problem. So you can replace the DIMM (out of pocket or under warranty) instead of playing the "replace a random part and hope it improves" game.

freemint · on Aug 6, 2022

The ECC error counters are not really honest anymore, as repaired faults are often not reported, from what I've heard.

ChuckMcM · on Aug 7, 2022

This depends on the BIOS and the OS. Correctable (and corrected) errors are typically logged into the baseboard management controller (the BMC) and the OS "should" periodically dump the logs from the controller to maintain a record of those errors longer term. That said, eventually they just "fall off" the list of errors the BMC is holding because it stores them round robin.

Uncorrectable errors will cause a machine check, unless the BIOS disables it. Which some do.
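
To make the correctable/uncorrectable distinction concrete, here is a toy SECDED (single-error-correct, double-error-detect) code: an extended Hamming code over 4 data bits. This is only a sketch of the principle. Real ECC DIMMs use much wider codes (commonly 64 data bits plus 8 check bits), and the function names here are illustrative, not any vendor's implementation:

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 is the overall parity bit; 1, 2, 4 are Hamming parity
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid codeword
    overall = sum(code) % 2  # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume exactly one and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome but even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

The "corrected" branch is the kind of event that ends up in a log; the "uncorrectable" branch is the case where hardware has no good data to hand back, which is why it escalates to a machine check.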

freemint · on Aug 7, 2022

I tell you that the reporting of correctable errors is not always honest. On some systems you can see uncorrectable errors before you see any correctable errors, which statistically can't be right. That it should be honest is beside the point.

ChuckMcM · on Aug 7, 2022

There is a question of agency here.

You are absolutely correct that reporting of errors may not be "honest". I prefer to use the word "correct" rather than "honest" because honesty implies an intention to lie, and the computer logic is generally designed to be correct.

The market reality is that reporting correctable errors can generate unnecessary service tickets from people who are not technically sophisticated. Those users may file a support ticket wondering why their system is telling them it saw an error but then corrected it. Dealing with support tickets costs money, money that is taken away from margin, and so the reporting is sometimes "optimized out": some manager makes their numbers better by ordering the software team to "hide" correctable errors that were corrected.

It is absolutely fair to describe that management choice as "dishonest."

It can be difficult sometimes to find out where the dishonesty was applied however. It has been my experience that between the BIOS and the kernel, if the chipset supports ECC and the BIOS recognizes that you have ECC memory installed, there is a means for extracting all ECC events reliably. It isn't always well documented and sometimes requires several levels of escalation in support to get the information you need, but when ECC is important to you it can be worth it. It can also inform your choices for vendors to use in the future. :-)

dannyw · on Aug 7, 2022

On hard drives and SSDs, yes, but has this been happening to RAM too?

undersuit · on Aug 7, 2022

I guess you could say it has begun with DDR5 on-die ECC. I don't think you can query those counters.

NegativeLatency · on Aug 6, 2022

Also probably some perception-related bias involved here.

tinus_hn · on Aug 7, 2022

Perhaps a high quality machine contains devices with high quality drivers.

mad182 · on Aug 6, 2022

I have several non-ECC machines running 24/7 for years without any crashes.

sliken · on Aug 10, 2022

Sure, not particularly surprising. However do you track when systemd restarts a daemon that died? See any strange dmesg errors? Had fsck report filesystem corruptions?

Many errors won't cause a crash, but said errors can accumulate in processes, filesystems, and files and you won't know why. 10 years from now you may find a photo, music file, or binary that's somehow corrupt and have no idea how it happened.
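
One low-tech way to at least notice this kind of silent corruption is to keep a checksum manifest of files that should never change (photos, music, archives) and compare it later. A minimal sketch, with made-up helper names; note that a changed digest only tells you the bytes differ, not why, and the hashing itself runs through the same RAM:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old, new):
    """Paths present in both manifests whose digest differs."""
    return sorted(p for p in old if p in new and old[p] != new[p])
```

Build a manifest once, store it somewhere safe, and rerun `build_manifest` periodically; anything `changed_files` reports for a file you never edited deserves a closer look.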

Enginerrrd · on Aug 7, 2022

The only machine I've had consistent crashes with due to memory issues IS my ECC memory workstation. Everything else has NEVER crashed due to a memory issue.

veidr · on Aug 7, 2022

But... how do you know that? Without ECC, the only way would be that none of your other machines have ever crashed at all, right?

Enginerrrd · on Aug 7, 2022

Yes, that's basically it. Except it's more like, "ugh my ECC laptop blue screened again!" vs. me not even being able to remember the last crash like that on my non-ECC machines.

My ECC workstation had lots of memory issues where it blue-screened on the Windows side. For some reason that wouldn't happen on the Linux side.

Enginerrrd · on Aug 7, 2022

I've actually had the opposite experience. My ECC workstation laptop is the only device I've ever had consistent issues with crashes due to memory issues.

Interestingly, this never happened when I was running Linux and abusing the RAM with photogrammetry workloads. It would only happen with Windows when I wasn't using anything beyond routine levels of RAM.

freemint · on Aug 6, 2022

DDR5 includes built-in ECC. However, recoverable errors are not reported, and it provides slightly less redundancy than DDR4 ECC because some of the redundancy budget is eaten up by less reliable cells. In total, though, I expect improved reliability over non-ECC DDR4.

SirGiggles · on Aug 7, 2022

To my understanding, DDR5 on-die ECC is there because of the high frequencies the individual dies run at. You are still vulnerable to anything between the RAM and the CPU.

freemint · on Aug 7, 2022

That's correct. Still, the likelihood of errors in data at rest is much reduced. Memory controllers might be able to detect a lot of errors between the RAM and CPU because those errors are in the analog domain, while errors in memory are in the digital domain, as the signal gets digitised and sent over an analog channel.

rhinoceraptor · on Aug 6, 2022

Yes, but you should still use ZFS if you care about your data, even if you don't have ECC.