> If you run a scrub/whatever to check validity of old data and your system has a new memory issue, it could destroy everything very quickly even without new file writes.
> Let’s assume that we have RAM that not only isn’t working 100% properly, but is actively goddamn evil and trying its naive but enthusiastic best to specifically kill your data during a scrub. First, you read a block. This block is good. It is perfectly good data written to a perfectly good disk with a perfectly matching checksum. But that block is read into evil RAM, and the evil RAM flips some bits. Perhaps those bits are in the data itself, or perhaps those bits are in the checksum. Either way, your perfectly good block now does not appear to match its checksum, and since we’re scrubbing, ZFS will attempt to actually repair the “bad” block on disk. Uh-oh! What now?
> Next, you read [a copy of the same block from another disk]. Now, if your evil RAM leaves this block alone, ZFS will see that the second copy matches its checksum, and so it will overwrite the first block with the same data it had originally – no data was lost here, just a few wasted disk cycles. OK. But what if your evil RAM flips a bit in the second copy? Since it doesn’t match the checksum either, ZFS doesn’t overwrite anything. It logs an unrecoverable data error for that block, and leaves both copies untouched on disk.
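(To make the quoted description concrete: the decision logic it describes boils down to roughly the sketch below. This is only an illustration of the article's description in Python, not actual ZFS code; the function and variable names are made up and SHA-256 is just a stand-in checksum.)

```python
import hashlib

def scrub_block(copies, stored_checksum, repair=True):
    """Illustrative sketch of the scrub logic described above; not ZFS code.

    copies: the redundant on-disk copies of one block, as read into RAM.
    stored_checksum: the checksum recorded in the block's metadata.
    """
    def verifies(buf):
        # If RAM flips bits in this buffer (or in the checksum), a perfectly
        # good on-disk block will appear not to match here.
        return hashlib.sha256(buf).hexdigest() == stored_checksum

    good = [i for i, buf in enumerate(copies) if verifies(buf)]
    bad = [i for i in range(len(copies)) if i not in good]

    if not bad:
        return "all copies verified"
    if good and repair:
        # Rewrite the copies that failed verification from one that passed.
        for i in bad:
            copies[i] = copies[good[0]]   # stand-in for the on-disk rewrite
        return f"rewrote copies {bad} from copy {good[0]}"
    if not good:
        # No copy verifies: log an unrecoverable error, leave the disk alone.
        return "unrecoverable error logged; nothing written"
    return "mismatch detected; repair disabled"
```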
I did not mention ZFS specifically. If ZFS has better handling of this kind of thing, that's great, but if you can't trust your memory to be correct, you can't trust the data in buffers, the data being hashed, or the data being read from or written out to disk. Additionally, you can't trust the filesystem to behave in the ways that it should. There are many kinds of memory errors: some may, for example, impact certain data sequences in a fairly deterministic way; some are completely random; some can be triggered by users or attackers.
Unless the filesystem is behaving in a way that is overwhelmingly stupid, the basic logic should still apply. I don't understand how error checking could ever cause data corruption. It might let you know about data corruption which would otherwise have gone unnoticed, but that's not the same thing.
If there is a filesystem that is dumb enough to cause corruption during the checksumming process, please let me know which one, so I can be sure to never ever ever go anywhere near it. :)
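To spell out what I mean by "error checking": a pure verification pass only reads, recomputes, and compares; it never writes anything, so there is no way for it to damage what's on disk. A minimal sketch (Python, made-up names, purely illustrative):

```python
import hashlib

def verify_only_scrub(blocks):
    """Pure error *checking*: read, recompute, compare, report.
    There is no code path here that writes anything back."""
    bad = []
    for name, (data, expected) in blocks.items():
        if hashlib.sha256(data).hexdigest() != expected:
            bad.append(name)   # detection only; the block is left untouched
    return bad

blocks = {"blk0": (b"hello", hashlib.sha256(b"hello").hexdigest())}
print(verify_only_scrub(blocks))   # [] -- nothing to report, nothing modified
```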
A lot of things in computing are overwhelmingly stupid or assume everything will work as expected. I have experienced several data corruption events related to parity data being read incorrectly, not in ZFS, but with hardware and software RAID controllers. In one case the hardware RAID controller even had ECC memory, but that memory was overheating and introducing bad data into calculations when multi-bit errors were not correctable. A similarly horrific error condition saw a controller confuse disk IDs in memory and start mirroring one drive to every other drive in the system.
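Roughly, the mechanism in that first case: the consistency check recomputes parity from data buffered in the controller's (faulty) memory, and when the result doesn't match the parity on disk, it writes the recomputed value back. Here is a simplified, hypothetical single-stripe sketch of that kind of check-and-repair path in Python, using RAID-5-style XOR parity; it is obviously not the controller's firmware, just the shape of the failure:

```python
from functools import reduce

def xor_parity(chunks):
    # RAID-5-style parity: byte-wise XOR across the data chunks of a stripe.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

def consistency_check(stripe_data, parity_on_disk, repair=True, ram_fault=None):
    """Simplified stripe check: recompute parity from data buffered in
    controller memory; if it disagrees with the on-disk parity and repair
    is enabled, the recomputed value is what gets written back.

    ram_fault simulates faulty controller memory by corrupting the buffered
    data before the parity calculation."""
    buffered = [bytearray(chunk) for chunk in stripe_data]   # data now in RAM
    if ram_fault:
        ram_fault(buffered)                                  # bits flip in memory
    recomputed = xor_parity(buffered)
    if recomputed != parity_on_disk and repair:
        # The "check" itself writes: correct on-disk parity is replaced with
        # a value derived from corrupted memory.
        return recomputed          # stand-in for the parity written to disk
    return parity_on_disk

def flip_a_bit(bufs):
    bufs[0][0] ^= 0x80             # a single bit flip in the buffered data

data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"]
parity = xor_parity(data)
assert consistency_check(data, parity) == parity                        # healthy RAM: harmless
assert consistency_check(data, parity, ram_fault=flip_a_bit) != parity  # bad parity "written"
```

With healthy memory the check changes nothing; with a fault injected and repair enabled, previously correct parity is replaced with a value derived from corrupted memory, which is how a "check" ends up writing bad data.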
Those are not instances of error checking causing data corruption. As I said, "I don't understand how error checking could ever cause data corruption."
Error checking will only ever help you, not hurt you. It doesn’t matter how bad your memory or disk or RAID controller is. Error checking won't necessarily save you from those things, but it can in some cases, and it’ll never make things worse.
But they are, though: the corrupted parity calculations in that first example caused data corruption during a scheduled array check while the system was under unusually heavy load. Error checking is good, and when things are working right it can only help. That is true, but it can't always be counted on if the hardware, software, etc. is untrustworthy for whatever reason.
Okay, well, I am totally and utterly confused as to how that could ever be possible, regardless of the hardware. You're confident that, if not for the data validation, the problem wouldn't have occurred?
No, it won't.
https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-y...