Don’t Be Afraid of RAID

Exmoor · on May 24, 2020

The overall premise of this article is good and needs to be said, but the author missed an opportunity by lumping traditional hardware RAID solutions with software solutions like ZFS. The general consensus I've seen from people spinning up large storage arrays in recent years is that the software solutions are far superior for many of the reasons the author mentions.

swills · on May 24, 2020

I agree and think it also needs to be pointed out that "traditional hardware RAID solutions" are software (firmware) on a separate processor, software which is often in my experience troublesome, buggy, inscrutable, difficult to update, when updates are available, etc.

ajsnigrutin · on May 25, 2020

But... nobody ever got fired for choosing IBM!

If you set up "your own solution", you're always the "guilty one" if something fails. If you buy IBM (or HP or whatever), it's their fault if something fails.

(i did some of that stuff in enterprise enviroments.... honestly, IBM/HP/... + paid support was always the better option for me, mainly due to "liability" issues)

zrm · on May 25, 2020

"Nobody ever got fired for buying IBM." -IBM, just before PC servers ate their lunch.

My experience has always been that it's better to do your homework so you don't have problems than to have a scapegoat to blame the problems on.

And the extent to which it's "their fault" tends to be quite inadequate. If you build a 16-drive RAID5 on some enterprise vendor's equipment and then suffer a multiple disk failure during a rebuild because that was a Bad Idea, what's the support contract going to get you? An overnight replacement for both failed drives? All your data is still gone.

ajsnigrutin · on May 25, 2020

> If you build a 16-drive RAID5 on some enterprise vendor's equipment and then suffer a multiple disk failure during a rebuild because that was a Bad Idea...

If I do that with a linux/solaris/mdadm/zfs/btrfs/whatever, and it fails, who will be the one blamed for the failure? Me of course.

If i select a propper raid setup, and it still failes, it's better if it fails on an IBM than on my duct-tape linux/mdadm/zfs/btrfs/whatever solution, because again, I personally can say, that i did what the vendor (IBM) recommended in their brochure, and their solution failed. If the duct-tape solution fails, there's noone else to blame, than me.

Either way, you have to restore from backup.

And if there is some random bug (eg. kernel oops when mounting volume), there's IBM support on the call. Working with standard options, means standard, well-known bugs, and that means that support can help fix them.

I know, that I (personally) sleep better, knowing there is a professional solution from a big-brand running in a "boring, enterprise" production.

zrm · on May 25, 2020

> If I do that with a linux/solaris/mdadm/zfs/btrfs/whatever, and it fails, who will be the one blamed for the failure? Me of course.

So don't create a 16-drive RAID5 at all.

> If i select a propper raid setup, and it still failes, it's better if it fails on an IBM than on my duct-tape linux/mdadm/zfs/btrfs/whatever solution, because again, I personally can say, that i did what the vendor (IBM) recommended in their brochure, and their solution failed.

But you're the one who chose the vendor, and evaluated their solution, and accepted it. So you're still fired.

> Either way, you have to restore from backup.

The point is there's a third option. Use a level of redundancy appropriate to the importance of the system, and then it doesn't fail to begin with and it isn't necessary to divert blame. Meanwhile you get to show your boss how you saved the company millions of dollars.

> And if there is some random bug (eg. kernel oops when mounting volume), there's IBM support on the call. Working with standard options, means standard, well-known bugs, and that means that support can help fix them.

Which you can still get for linux/solaris/mdadm/zfs/btrfs/whatever from a variety of vendors if you really want it. But because they're generic technologies, this also means you can get that support from multiple competing providers (so the cost is lower), including per-incident support once something happens that requires it instead of having to buy a support contract you may not ever need.

Which you generally won't need, because the likes of mdadm are more widely used (and so better tested) than any individual vendor's proprietary hardware RAID solution, and can also recover more easily from things like controller failures because the replacement controller needs no state and is not required to match in firmware version or even model.

1bc29b36f623ba8 · on May 25, 2020

The grandparent does have a point when it comes to assigning blame; when you have a hardware-based solution then one can always blame the hardware manufacturer. If you use an open-source solution then there is typically no warranties, and then blame therefore falls upon whomever implemented the solution.

Management likes accountability so the above isn't going to change any time soon.

Now, obviously, the way around is to use enterprise-level solutions based on software. It is often more expensive than the hardware route, but as you say, there are plenty of vendors who do offer such solutions with a proper SLA.

wodenokoto · on May 25, 2020

> But you're the one who chose the vendor, and evaluated their solution, and accepted it. So you're still fired.

No, that is the whole point of the "Nobody got fired for buying IBM" meme. If you buy IBM and it fails, even your corporate, non-tech boss knows (or believes) that you couldn't have done any better than what you did.

If you build your own solution or buy from a "lesser" vendor, when things fail, you will be questioned about the solution.

> The point is there's a third option. Use a level of redundancy appropriate to the importance of the system, and then it doesn't fail to begin with and it isn't necessary to divert blame.

That doesn't make any sense. The hypothetical here is that we are comparing the failure of the best homebrew solution to the failure of the appropriate big corporation solution.

It doesn't make sense to say that the solution to failures is to not have any.

zrm · on May 25, 2020

> No, that is the whole point of the "Nobody got fired for buying IBM" meme. If you buy IBM and it fails, even your corporate, non-tech boss knows (or believes) that you couldn't have done any better than what you did.

"Nobody got fired for buying IBM" isn't a meme, it's an IBM marketing slogan. In real life your boss cares a lot more about whether something failed on your watch than why.

> It doesn't make sense to say that the solution to failures is to not have any.

It does when the open source solutions have a lower failure rate because they're more widely used, better tested and less "integrated" which dramatically increases flexibility in quickly implementing solutions to unforeseen problems. It also leaves more budget for real improvements, because RAID6 with a hot spare on commodity hardware costs less than RAID5 with no spare from "enterprise" vendors.

Reliability engineering is science. Hardware will fail. Software will fail. So you include redundancy for the hardware with the highest failure rates, use more redundancy for systems with higher importance, and use modular widely-available commodity parts that can be quickly replaced and sourced from multiple independent vendors. For systems with even higher availability requirements, use system-level redundancy and keep regular snapshots on access-restricted independent backup systems, etc.

Money spent actually preventing failures is more effective than money spent to acquire a scapegoat.

newacct583 · on May 25, 2020

Still, those enterprise junior executives didn't get fired for choosing mainframes! Their companies made bad decisions in hindsight, but they (and we) blame those bad choices on IBM having failed to meet the needs of the market, not on whatever random Fortune 500 companies bought S/390's instead of PC servers.

Which is the point. IBM was the conservative choice, and they remained so even past the point where they were the wrong choice. If you're a junior executive trying to climb the ladder, that kind of "decision security" has real value.

ajsnigrutin · on May 25, 2020

> Still, those enterprise junior executives didn't get fired for choosing mainframes!

This!

> Their companies made bad decisions in hindsight, but they (and we) blame those bad choices on IBM having failed to meet the needs of the market, not on whatever random Fortune 500 companies bought S/390's instead of PC servers.

Those companies didn't make bad decisions... the mainframes worked. The might have been more expensive, and slower than a bunch of PC servers, but they did the job. We now know, in hindsight, that PC servers were a better choice, but which one of us is willing to risk our own money/job for something that is currently something new?

There was even a time when tablet computers (think ipad, not x series thinkpads) were "the future of computing", and that "they would replace the need for PCs for everyone"... now (in hindsight) we know that tablets are useless for most people (except for consuming entertainment), and that work computers are still boring PCs with keyboards and mice.

emsign · on May 25, 2020

That's funny I use an IBM rebranded RAID card. In AHCI mode with ZFS.

tw04 · on May 25, 2020

Software RAID has always been superior - it's just that the vendors who were doing it right built enterprise arrays on their software stack and printed money. There aren't many standalone entities anymore, but the remains still exist. 3Par and Nimble (now HPe), EMC (now Dell), Solidfire (now NetApp), NetApp, Infinidat, IBM, etc.

And ironically, the biggest competitor in the NAS space to NetApp back in the day essentially went bankrupt BECAUSE they used hardware RAID and ran into a TFDL situation that cause customers to flee in droves.

https://en.wikipedia.org/wiki/Auspex_Systems

arsome · on May 25, 2020

Software raid is better, but it is worth picking up some LSI cards, just to put them in IT mode and pass them all through for software raid.

Most of the cheap SATA cards you can buy have absolute garbage controllers and if you mix them with absolute garbage drives, some terrible failure scenarios occur, I've seen one failing drive take out a whole card, constantly knocking several drives offline in my RAID6 array. This was an absolute hell staring at smartctl and RAID repairs every few weeks until I finally gave up and swapped the controller, once that was done, the single drive failed cleanly and got kicked out without knocking the whole array offline.

StillBored · on May 26, 2020

Just make sure they aren't of the megaraid variety. The data integrity/speed trade-off on those boards is probably the worst in the industry.

What you are describing though sounds like a SATA card with a port expander. Most normal AHCI cards are fairly decent and the ports are independent. So loss of a single port won't take the whole thing offline. This is basically the same setup you have for the motherboard ports.

louwrentius · on May 24, 2020

I kept my article neutral and I realise I should have stated that I consider Linux software RAID and ZFS in the same boat in terms of perceptions about RAID.

People forgo on both Linux software RAID and ZFS because of scary stories.

ZFS is an interesting alternative to Linux software RAID that discussion is a different topic all together.

garmaine · on May 24, 2020

The two fundamentally are different in ways that you are conflating. A hardware raid card will eject a disk that shows a single URE. Your zfs volume will happily correct the error and just report it as a minor statistic that is cleared on your next reboot.

Yes zfs is immune to the problem the zdnet article hyped up. Because it is immune by design in a way that is purposefully different from raid.

StillBored · on May 26, 2020

That completely depends on the RAID controller/software.

Smarter ones, with sufficient slots will leave the "failing" drive online long enough to assure that should a sector fail to be read from the remaining drives an attempt can be made to read it from the "Bad" drive as well. In cases where the drive is being kicked for excessive SMART relocation, or hard read errors this could be the diff between the array going offline and the rebuild succeeding.

For something like mdadm, you can simulate this by upgrading /reshaping from RAID5 to RAID6 (or similar) and then pulling the failing drive and reshaping back to raid5. Or as did a few times a decade+ ago when we had a really bad batch of drives, pick the bad sector off the failed drive, compute the corrected sector "by hand" and write it back to the drive with a read error.

layoutIfNeeded · on May 24, 2020

Scary stories of ZFS? I thought it was the pinnacle of file systems in terms of reliability.

dehrmann · on May 25, 2020

I think this has been fixed, but at one point, I used a log device with a one-time encryption key, thinking I didn't care that much about the data if there's a power loss. For the most part, ZFS didn't care either, but what it did care about was not mounting a pool with a missing disk. I had to edit metadata on the pool so it would mount with a different log device. Tools for this didn't exist--I was reading specs and source code, learning how ZFS know which devices belong in a pool.

magicalhippo · on May 24, 2020

I mean there are some people who have lost ZFS pools or a lot of data.

But for each one of those, there are tons more that have lost entire hardware RAIDs.

There's also tons more of those that had some horrible error and imagined their ZFS pool was gone for sure, only to find they just lost a couple of files, if anything at all.

deelowe · on May 25, 2020

The only issues I've had are with fragmentation and worrying about losing a disk during rebuilds.

dehrmann · on May 25, 2020

I, too, have had this experience, but how is this possible? So much other experience has taught me that hardware is better. The main downside I've had with ZFS (over hardware RAID) is performance, but it's been incredibly reliable.

kietdlam · on May 25, 2020

It's because ZFS is much more than RAID. RAID just exposes a disk to the OS. ZFS exposes a filesystem and knows more metadata, e.g. the checksum, so it knows when to repair the data using the redundancy from the disks.

heavenlyblue · on May 25, 2020

Which other experience?

dehrmann · on May 25, 2020

Softmodems, various virtualization technologies, emulators.

heavenlyblue · on May 26, 2020

You mean hardware emulators are better than software emulators?

KaiserPro · on May 24, 2020

VFX sysadmin here

> Setup alerting if you care about your data

Yes. and make sure that you have the correct signal to noise ratio. Otherwise you won't act on it.

> scrub everything

Yes, this too.

One thing that I would add is avoid mdadm like the plague. Its tricky, poorly documented and requires a whole bunch of other crap tools to make it work.

ZFS or proper hardware raid, there is no inbetween.

On the subject of loosing a raid during rebuild, I've had that, but I can't be sure that it was because of simultaneous disk failures. The raid in question was a 60 drive array with 8TB drives. There were 4 raid 7 groups with 4 spare drivers.

After each rebuild, a new drive would fail. over the space of two weeks we changed about 6 drives. Because the raid was under huge pressure, it didn't have the time needed to rebuild (24 hour rebuild time jumps up to 96 hours)

In the end we migrated the data to another array.

louwrentius · on May 24, 2020

> One thing that I would add is avoid mdadm like the plague. Its tricky, poorly documented and requires a whole bunch of other crap tools to make it work.

Strange, I have exactly the opposite view. I think ZFS is exceptionally well documented and easy to use.

But MDADM is just as simple. What other 'crap tools' are you talking about?

Regarding your RAID60 experience: it could be a bad batch of drives.

Also, I wonder if they had a good solid 24H break-in period with a significant load to weed out bad drives before the array was taken into production.

KaiserPro · on May 24, 2020

ZFS's documentation is as good as the docs for expensive storage. (HEllo IBM redbooks, they are _wonderful_)

To use mdadm properly you need lvm as well. Which means dealing with all the rubbish that come with it. (unrecoverable snapshots as a service? that might be fixed now.)

> Regarding your RAID60 experience: it could be a bad batch of drives.

Naaa it was the workload. We had 32 of these server that were largely identical, and bought in two batches. This one (it was actually two) server was running the blackhole fluid sim for interstella. These servers had all been up and running for well over a year.

I should point out that with about 2000 drives for these servers, we would be replacing at least three disks a week. Normally they'd be spread evenly over all servers, it was unusual that this array ate ~7 drives in two weeks, with back to back rebuilds.

Interestingly, the nearline machines which had 4x the drives, was much less likely to loose disks. (however they had a different usage pattern, virtually no random access, and it was one raid controller on the top 60 drive array, then three jbods daisy chained.)

bc4m · on May 24, 2020

What is the mdadm use-case that requires lvm? I've used mdadm happily for over a decade and never used lvm.

nine_k · on May 25, 2020

If you want snapshots on XFS, you want lvm, else (AFAICT) you can only make a snapshot while the XFS volume is offline.

louwrentius · on May 24, 2020

Ok clear story, thanks. LVM is a different story. To be frank, my article was focused on the smb / home nas user.

In enterprise environments, it’s a different story.

derefr · on May 24, 2020

> One thing that I would add is avoid mdadm like the plague. Its tricky, poorly documented and requires a whole bunch of other crap tools to make it work.

Do you feel that Linux MDRAID is a bad technology, or just a bad UX? I.e., are MDRAID-based appliances (like many NASes)—where all the sharp edges have been smoothed away—still bad in some fundamental way?

KaiserPro · on May 24, 2020

Oh no, its a fine technology, its just useless documentation and to make it really work you need LVM, which is just horrific to deal with.

I should say that I've worked proffeshionally with ZFS, mdamd/pv/lvm gluster(ugh) lustre(ok) clustered xfs and GPFS.

Along with a bunch of appliances.

mdadm/pv/lvm is about the bottom three. After gluster and ceph.

core-questions · on May 24, 2020

I find LVM2 to be easy to work with, but the weird marriage between MD and LVM confuses me, so I just use "hardware" raid beneath LVM.

(For server OS installs that I can recreate trivially, and not for actual data, which lives on a proper SAN because I value my employment)

joana035 · on May 24, 2020

How come mdadm needs lvm?

Do you mean device mapper?

KaiserPro · on May 25, 2020

I don't mind device mapper, for what I used it for, multipath scsi.

mdadm doesn't need lvm. But if you want snapshots, or to have flexible partitions (anything like the flexibility of zfs) you need lvm.

Also for the longest time redhat used to force lvm on you by default. I assume this was to help with "flexibility"

In practice, on virtual machines, or anything back with decent storage, I'd remove the partition table and just run on the block device directly.

joana035 · on May 24, 2020

mdadm is just fine. YMMV depending on the distro you are using and your knowledge/know-how about storage.

I recommend to try it out with loop devices, play around and build your own opinion about it.

dopylitty · on May 24, 2020

Don’t overlook the advice regarding setting up alerting.

A long time ago I was managing several devices which ingested logs for a large enterprise. One day one of the devices suddenly couldn’t handled the load and sent an alert that it was dropping log messages.

After a lot of digging and headscratching it turned out the battery backup on the device’s hardware RAID card had failed so (IIRC) it was operating in a mode without the onboard memory cache and couldn’t handle the write load anymore. We had to run to the data center and replace the entire 2RU device because of this.

If we had alerting set up on the backup batteries at least we might have been able to react quicker.

phire · on May 24, 2020

I've had worse.

Someone has set the machine up with a raid 10 of SSDs and no monitoring.

At some point the RAID controller battery died, increasing the load on the SSDs. Who knows when that happened, no monitoring.

Then I guess the SSDs failed one by one. I have no idea in which order, or how long the gaps were between the failures. Like I said, it would have been nice to have some kind of monitoring.

Eventually the 3rd SSD died, and the machine died, finally calling attention to the problem.

To top it all off, it had recently been discovered that the backups were incomplete, and everyone was still arguing over the best way to close the backup hole.

So we forced to pull the SSDs and send them off to professionals for recovery. I believe we spent several thousand for express recovery.

So yeah. Don't forget monitoring/alerting.

funnybeam · on May 24, 2020

As I find myself frequently repeating - redundancy without monitoring is not redundancy

louwrentius · on May 24, 2020

Thanks for sharing, that was probably a terrible experience.

6nf · on May 24, 2020

Monitoring is critical. Without monitoring it's just a time bomb.

You need to set alerts and you also need to audit and test the alerting system every few months. Is so easy to forget that your alerting system depends on your email server or some network port or something that gets changed or removed without updating the alert system.

m0xte · on May 24, 2020

RAID at home is dead IMHO unless you have a crazy huge amount of data. If you have a desktop PC then chucking a PCI/NVMe bridge card in it and populating that with Sabrent 1TiB NVMe M2 sticks works out cheaper, faster and more reliable. Unless you need several TiB of contiguous file storage keep them as separate volumes. Plus you don’t have to deal with separate hardware built by the lowest bidder and quite frankly terrible NAS or RAID implementations.

You also still have to back up both solutions to offline storage. From SSD this is a magnitude less painful at those data volumes.

sz4kerto · on May 24, 2020

I don't want to keep my PC on all the time, but I want my media to be available. I also don't want to completely rely on cloud. I also want to share data with my family members, without cloud sync. I also want my IP cameras to record stuff locally without another SaaS subscription.

I'm obviously too conservative.

So I'll always have a NAS.

Multicomp · on May 24, 2020

Maybe there is an alternative to a dedicated NAS via a syncthing machine?

Doesn't rely on cloud

Can share data with family members, no cloud sync needed

IP cameras recording locally is not clear whether it's to a UNC path or to local on device storage. I'm gonna take it as recording over lan to UNC path and say yeah syncthing isn't the best for this but it could sync that path elsewhere.

brnt · on May 24, 2020

This is what I do, heterogeneous synced storage through Resilio (but could have chosen Syncthing). One of my servers is even a phone (cracked screen).

m0xte · on May 24, 2020

I have roughly the same requirements. For me it worked out more time and cost efficient to just leave the PC on. Logistics, management and maintenance were a pain when I had a NAS.

Lammy · on May 24, 2020

I just split the difference and built my NAS PC with power efficiency in mind. Big fan of the ECC-supporting C3xxx Atoms. I have an eight-core version (which is total overkill here, even) in my NAS with ECC DDR4 and a 8x8TB RAID-Z2: https://www.supermicro.com/en/products/motherboard/A2SDi-8C+...

zbrozek · on May 24, 2020

I'm using a Xeon-D machine, quad cores, with ECC DDR. It's overkill for just storage, but I also run a bunch of VMs on it. It's my router (pfsense), my storage box (samba w/ZFS on SSD array), my ssh utility knife, my "run vendor crapware" (e.g. Lutron's RadioRa software) Windows box, etc. QEMU is fantastic.

My home's IT power draw is about 45 watts, using two access points, that server, and a bunch of little widgets (e.g., Rainforest power monitor, Lutron bridges, etc).

That will go up when I add a big switch, but it's not bad for the utility I get.

blattimwind · on May 24, 2020

The thing is, if you don't want to loose a lot of performance on that storage, either a bought or a built NAS will set you back 1000+ bucks; you can run an idling desktop PC for at least 5 years for that price (80 W idle power consumption, 30 ct/kWh electricity cost).

Lammy · on May 24, 2020

There are more dimensions to my decision than the raw cost. I sit next to it and enjoy it being relatively silent.

m0xte · on May 24, 2020

My PC is almost 100% silent.

blattimwind · on May 25, 2020

Did you manage to do that with hard drives?

m0xte · on May 25, 2020

No which was part of the original point.

ClumsyPilot · on May 25, 2020

Management is a pain? I have a Synology NAS, and in 8 years I can't remember anything I had to do to it beyong initial setup and port-forwarding. You could argue they are underpowered, not cost efficient, or whatever, but technically I had less issues with it than I did with OneDrive.

Exmoor · on May 24, 2020

Your definition of "crazy" huge amount of data is pretty conservative. Your proposed solution would top out at 4TB, while most people I know who spinning up their own RAID at home are looking to store multiple tens of TB. For the cost of the solution you propose (4x 1TB @ $150/ea + $50 NVMe) you could set up an RAID-type solution with 32TB of raw capacity (4x 8TB @ $140/ea + $90 for PCie card and cables).

Wistar · on May 24, 2020

My photography archive, including about 10 years of high-volume commercial photography, is about 14 years old and at about 23TB. At present, it grows about 5GB a day. I have stopped routinely shooting RAW because storing it all has become a mess.

m0xte · on May 24, 2020

Most RAID solutions I’ve seen in SOHO environments are 2 bay Qnap etc. Usually top out at 2TiB online. I’d argue anything larger is exceptional. <10% population.

loeg · on May 24, 2020

> tens of GB

typo -- tens of TB?

Exmoor · on May 24, 2020

Indeed, fixed. Thanks.

giantrobot · on May 24, 2020

It's still tens of GB, lots and lots of tens of GB.

loeg · on May 24, 2020

These things are orthogonal. You can RAID your NVMe storage, or not.

I think you're imagining a dedicated network storage appliance with spinning rust, which is one type of product that might use RAID. But really it's just a computer with some disks in it. You can use RAID in any computer with disks in it.

cwiggs · on May 24, 2020

Hm, I'm not sure i understand.

The 1tb NVMe Sabrent SSD on Amazon is $150: https://www.amazon.com/dp/B07LGF54XR/ref=cm_sw_r_other_apa_i...

You said you could put multiple in the PC so we can assume at least more than 1. So just for the SSDs, not including the NVM PCIe interface we are looking at $300 for 2TB.

In comparison for $100 you can get a 4TB WD red: https://www.amazon.com/dp/B083XVY99B/ref=cm_sw_r_other_apa_i...

That's a lot more $/GB than you could get with HDDs.

I have 4 2TB HDDs in a raid 5 array at home and all 4 disks didn't cost $300 and I have 6TB usable storage space.

So how is it cheaper to go with NVM SDD for home use?

loeg · on May 24, 2020

Even this discussion of storage media is entirely orthogonal to RAID. Flash fails too! There is value in redundancy in flash storage applications.

cwiggs · on May 24, 2020

Good point. I think it might be more important with SSDs actually when they fail they typically fail 100%. When HDDs fail you can sometimes get some data off or even use software tools to repair the disk(s) and get all your data back.

loeg · on May 24, 2020

There are a variety of real-world SSD failure modes that depend on the specific controller, in addition to environmental factors (NAND doesn't like heat). I would not suggest using this reductive failure model for storage planning.

m0xte · on May 24, 2020

It’s actually the other way round I have found. I suggest you have a look around at other sources. Rust tends to go clunk at spin up time and that’s it. SSDs soft fail first due to reallocations but are still readable.

blattimwind · on May 24, 2020

Both have very different failure modes depending on what is failing. Controller / electronics failure tends to instantly brick either. Some hard drives experience a single failure event that wipes out a bunch of sectors, but they work fine afterwards and if you overwrite the affected sectors those are back to being fine as well, because they are relocated. For other drives this is ongoing, were more and more of the drive becomes unreadable and even unwritable. I believe both of these are platter surface damage, the latter perhaps R/W head or amplifier degradation. Bearing and motor issues tend to make themselves known in advance, if you listen, but at some point it will stop spinning up or seize while running; instant loss (but a high chance of data recovery, since both media and controller are intact).

For solid state drives a lot of the issues are firmware problems. Things like drives becoming very slow over time have been caused by things like imbalanced internal data structures. Some are media related, e.g. some drives experience slow reads on old writes, because the voltage levels have decayed which necessitates internal retries or slower read speeds. Things like SSDs failing to enumerate are probably controller issues.

In my experience both can fail either slowly, with degraded performance, or fast and total. However, known-good SSDs (most of Samsung, Intel) tend to be very reliable. Good hard drives too.

kalleboo · on May 25, 2020

Purely anecdotal, but I've never had a hard disk just outright fail 100%, they always start going click click click on certain areas of the disk, and avoiding those files you can rescue the rest. SSDs on the other hand, I've never had one properly go into a "read-only" soft fail mode, they just fail to show up one day.

This is all among consumer drives though, enterprise stuff probably behaves better?

I'd love to see some kind of statistics on actual failure modes, but aside from Backblaze, everyone keeps their storage statistics private.

m0xte · on May 24, 2020

Add a NAS unit to that cost and the recovery time, dire random access time to your estimations and it works out.

I’ve had to bounce around terabytes of crap on spinning rust for years. One nice XFS to XFS copy job took 2 days due to small files. Same thing on SSD was 3 hours.

I’ve also dealt with restoring a hosed 4 TiB NAS from another hard disk array. 4 days recovery sound like fun?

loeg · on May 24, 2020

You're still conflating RAID and NAS. These things are orthogonal.

matheusmoreira · on May 24, 2020

> Unless you need several TiB of contiguous file storage keep them as separate volumes.

Then we have to manually keep track of information such as which volumes contain which files. Isn't that the job of a file system?

m0xte · on May 24, 2020

It’s not hard to manage it. I managed it before my computers even had hard disks.

matheusmoreira · on May 25, 2020

Yes, but we shouldn't have to do it. The computer should be tracking this minutiae for us. File systems should be able to automatically use new storage devices in flexible, fault tolerant ways. That way we can keep all our data under one root.

It's normal to have as many file systems as there are storage devices. That's backwards. There should be one file system on top of all devices. This is one reason why ZFS is so good.

philjohn · on May 24, 2020

And with services like Backblaze offering cheap offsite backup, it's a no-brainer.

My home internet connection is 10:1 upload, but that's still 38Mbps up which is more than enough to backup several TB - sure, the initial backup will take a while, but it then keeps up with incremental changes without any issues.

hnick · on May 25, 2020

Backblaze pricing is apparently per computer. Are there any licensing or technical issues with routing backups from all our devices through a single computer like my FreeNAS box?

m0xte · on May 24, 2020

Exactly that. Only 15mbits up here but that’s fine for incremental backup. Recovery story is better which is the often forgotten bit. Downtime during recovery is a proper pain.

user5994461 · on May 24, 2020

I find the article super confusing. It goes at length about how RAID is fine and array failure is a myth, but doesn't mention RAID 5 until three chapters down.

The only issue with RAID has only ever been with RAID 5 specifically. Don't use RAID 5. Definitely not with large drives or more than 4 drives.

The author runs one disk check that immediately detects one disk is dying. After replacing the disk without the whole array exploding, he concludes that the risk of losing data is a myth??? WTF. It was really one inch away from losing all the data. If anything it proves that having a single redundancy is not enough to store critical data, there is always at least one disk about to fail in a large multi disk array.

chiph · on May 25, 2020

Back during the dot-com era, we lost an array to the "Deathstar" drives. They were all from the same production batch (we bought them at the same time). So when one died the others were not far behind. It was actually comical, in a sick & sad way. When the first drive died, we tell the data center tech to go replace it. But before he could walk over to our cage with the replacement, the second drive died. We tell him to run. And then the third drive died and the data was toast.

jugg1es · on May 24, 2020

I have also only ever had problems with RAID 5. Many years ago, we ran a hardware RAID 5 and it was really very slow. I eventually bought some extra drives and switched to RAID 0+1 with much greater results and comparable availability.

bc4m · on May 24, 2020

If you can afford to purchase 2x the disks for RAID10 you will certainly get better performance and availability. RAID5 is an approach that gives you lower cost in return for slower writes and less fault tolerance. Everything is a trade-off.

jugg1es · on May 24, 2020

definitely true - but it's really hard to gauge what performance will actually be until you have the whole thing setup even with benchmarks.

6nf · on May 24, 2020

RAID 0+1, is that the same thing as RAID 10?

greenknight · on May 24, 2020

Nope... RAID 10 or RAID 1+0 is 2 or more RAID 1's that are RAID 0'd together

RAID 01 is 2 or more RAID 0's that are RAID 1'd together

jugg1es · on May 25, 2020

yea I had a mirror of striped drives. Gave you max throughput with some protection against failure.

user5994461 · on May 25, 2020

This article explains the difference in more details https://muetsch.io/why-raid-10-is-better-than-raid-01.html

RAID 10 works by stacking pairs of disks, as many pairs as you want, each pair is redundant by nature. It's common to do with 8 or 12 disks, filling the whole front bay of the server. It can survive many drives failing as long as it's not two drives from the same pair.

RAID 01 splits data differently. It's much more likely to lose the whole array if you lose any 2 disks, or god forbids, 3 disks.

RAID 10 is the proper level. RAID 01 is dead and almost nothing supports it, it's most likely a mistake or a typo if you heard about it used in production.

jugg1es · on May 25, 2020

Our primary concern was absolute maximum performance on huge numbers of tiny files (terabytes of video cut into single-frame TIFFs) for video processing while also having at least some protection against a single failed drive. We were on a shoestring budget, so we couldn't upgrade our network to use fiber, so a SAN wasn't really feasible at the speeds we needed (SSD's also didn't exist yet). You could be right, but I know that I chose the one with the fastest possible performance and made sure there was an audible alarm if a disk failed.

louwrentius · on May 24, 2020

No, my array was fine and all drives were fine, except that one drive that was affected by a lot of bad sectors.

The scrub did exactly what it was supposed to do. Because the drive seemed fine but it was not. It would have killed my array if any of the other drives would have failed.

number6 · on May 24, 2020

This sounds to me like: "It was really one inch away from losing all the data"

You discovered the failing disk by chance; if chance hadn't been in your favour it would be the other way round and an other drive had failed on you.

louwrentius · on May 24, 2020

No, I think you misunderstand: read about the scrubs, they run weekly or monthly. Nothing accidental about that.

I keep my server off most of the time. And that is a nice way of missing scrubs. In my case the scrub did what it needs to do: find disks with bad sectors. I should do something extra in the future to run scrubs if they miss their schedule.

nine_k · on May 25, 2020

RAID 5 is a reasonable way to add redundancy to three or (worse) four drives, sourced from different batches (or vendors).

Sometimes it's all you can afford, or could until ZFS became available for Linux.

For bigger setups, definitely other schemes are better.

crmrc114 · on May 25, 2020

The comments here are a treat to read. I have used hardware raid for years. But I stick to only raid 1/0 or combinations of those two.

With hardware from 3ware/lsi/broadcom or whatever their name is this week I have never had an issue or a rebuild failure. With software mdadm managed arrays in Linux I stick to the same raid types and have never had any issues.

One of my home arrays alone is over 50tb. Systems at work with an unnamed faang company are much larger. The cattle commonly have raid 10 arrays as part of their recipe.

I was unaware people were scared of raid.

As I tell everyone though raid=\=backup, common mode failure can strike anything.

maallooc · on May 24, 2020

RAID is for speed and avaliability. It's not backup.

I personally use home grade 2x 4TB SSD RAID 1 for my media and server grade 2x 1TB SSD RAID 1 for my documents. The reason I use RAID 1 is because if one drive fails I should be able to use the other drive while I order a replacement. I don't expect them to be my backup.

willis936 · on May 25, 2020

Tomatoe tomato. RAID uses redundant (ie backup) drives for availability.

seized · on May 25, 2020

Very wrong. Entire RAID arrays can be lost to fire, water/flooding, power surges, multiple drive failures, theft, etc. And if the controller dies you may be up the creek.

Backup is separate. Ideally following 3-2-1.

https://www.backblaze.com/blog/the-3-2-1-backup-strategy/

willis936 · on May 25, 2020

Again, this is semantics. Redundancy is backup. That doesn’t mean it’s a good idea to keep data in one physical box in one physical location. I also never implied that. Please follow HN guidelines paragraph 3.

seized · on May 27, 2020

Redundancy is not backup based on industry standard terms. Redundancy protects against some hardware failures but not accidental deletes, filesystem corruption, malware, or a whole host of other things.

Even filesystems with snapshots can be lost and snapshots are the only time you can start blurring the lines between redundancy and backups in terms of protecting against data loss via deletes etc. And they are still not proper backups as you can still have a hardware failure significant enough to wipe out all your data.

A backup is a separate copy of data, either on separate hardware or off-site or a other combinations. Based on best practices and standard terminology.

No one uses "redundancy" to refer to separate copies of data especially when discussing RAID in contexts like this discussion. So the only interpretation of your comment is the weak one that I responded to.

ars · on May 24, 2020

Suggestion to anyone setting up RAID on Linux: Use lvmraid, rather than mdraid directly (lvmraid will use mdraid behind the scenes).

The flexibility it gives you for later is very large - things like migrating to new hard disks, adding disk space, adding a cache layer (lvmcache), it's quite worth it.

remote_phone · on May 24, 2020

This article is surprisingly naive.

I’ve worked at a well known enterprise storage company. The biggest thing drilled into us was that we couldn’t trust anything reported to us by the hard drives. Hard drives can report back to the OS that the write succeeded even when it didn’t. When something like this happens you’re in for a world of hurt, especially over time. This isn’t a bad sector, it’s corrupt data.

louwrentius · on May 24, 2020

Sure, at scale this kind of thing will happen. At scale, even small risks become a certainty.

My article was more geared towards home / small SMB users.

Silent data corruption will happen. Will you hit that at home? Probably not.

But maybe people should stop buying Synologies, QNAPS and all that kind of hardware, I don't know ;-)

remote_phone · on May 24, 2020

“Probably not.”

How can you even begin to quantify this risk if you have no idea what the problem is, or if you’ve never actually checked? That’s irresponsible to tell people “nah you being run into this problem” or “Eh, RAID5 works great!” when you’ve never looked into the stats or supported people who end up getting all their data lost?

This is a real problem and anyone not running two drive redundancy in 2020 on data that they care about is going to lose their data eventually.

If they don’t care about the data, then sure it’s not a big deal. I myself have 2 Synologies where I back up one to the other. The backup Synology is raid 5 because I don’t care so much about losing data, but my main drive is RAID 6.

magnawave · on May 25, 2020

I think his point was about drives silently returning corrupt data, not that you shouldn’t have redundancy. Absolutely you should have redundancy and backups.

At the (tens of) thousands of drives scale (and where you treat drives as cattle) having some extra checks and balances for what you write is an excellent idea. Even better is doing that in a distributed way so multiple machines make that same decision. Occasionally you will run into a drive where the firmware has jumped the shark or some such. (Or CPUs or memory or bus issue for that matter).

But generally speaking drives DO know when they are returning bad data and will error before they will do that. The odds are about as good as other forms of hardware errors that will eat your data.

paulryanrogers · on May 24, 2020

What is the alternative? To read back the data immediately, periodically, separate-block-level check sums?

mjb · on May 24, 2020

Yes, that’s what you need to do if you want to approach the theoretical durability of any storage system. If you’ve got durability ‘budget’ elsewhere in your design, you might not have to do this.

johnklos · on May 24, 2020

Thank you for this. I'm tired of the FUD surrounding RAID. It seems that the same anecdotes are passed around without any vetting, and it helps nobody.

jbritton · on May 24, 2020

I had a RAID level 1 on a desktop Windows machine. My machine would frequently crash and then the RAID would have to fix itself which made the machine unusable for the entire day. You could barely move the mouse until the rebuild was finished.

to11mtm · on May 24, 2020

Sounds like something was wrong.

I don't know if it's anything like this still today, but ideally for RAID setups to work well the drives should be as close in spec as possible (ideally, the same drives). Mixing drives too far mismatched leaves it up to the controller to keep drives coordinated.

When we did RAID 1 for a setup at work (lol budget constraints but we needed something for storage) IT insisted that we at least had a proper hardware RAID card. I can't remember whether the rebuilds were fast, or if they ran online and just slowed us down for a bit, but we were never down restoring for more than a couple hours.

The irony of the situation was that we actually were blowing ALL the hard drives more frequently in that box. At first we thought it was just the power supply (Oh, that day was terrible. PS blew up alongside one drive, and when we replaced drive+PS the tech may have picked the wrong direction to restore :X)

But, as it turned out the case just was too dang hot to run 2 5400 and 2 7200RPM hard drives at once. (Yes, it had enough bays to fit it all.) Even after beefing up the power supply we would have to replace a drive every 9-12 months, and usually they were in the same location on the case.

FWIW it was an i7-920 with 12GB ram, nice SI Raid Card, running Server 2008 Terminal server. (Just... don't ask.)

singron · on May 24, 2020

> when we replaced drive+PS the tech may have picked the wrong direction to restore :X)

In my experience, we lost RAIDs mostly to similar human error. RAID protects against a relatively infrequent but catastrophic failure. Since it's not constantly exercised, by the time it's needed, you might realize that the configuration or processes used aren't correct anymore.

Other technologies force more automation and reduce human errors. E.g. if a disk fails in a distributed filesystem, the FS will have already re-replicated the data before anyone needs to swap drives (and without requiring a spare drive in every machine).

Izkata · on May 24, 2020

> ideally for RAID setups to work well the drives should be as close in spec as possible (ideally, the same drives).

Everything I've read says this is a bad idea, because it's much more likely they'd die at the same time...

funnybeam · on May 24, 2020

You want the same model but make sure they are from different batches.

When buying SANs the vendor will routinely mix the disks supplied to minimise the chance of any manufacturing defects taking out multiple drivers at once

to11mtm · on May 28, 2020

Yeah I should have been more specific there. For RAID you probably don't want the exact same batch for all drives, accident waiting to happen.

Now CD/DVD-Burners on the other hand, if you ever want to run 4 of those at once, at the shop we found your best bet was to get sequential serial numbers, back then even a firmware rev could make a multi-burner setup squirrely.

But I suppose that's a different form of Array...

pessimizer · on May 24, 2020

Same drives, good; same batch, bad.

secabeen · on May 25, 2020

The other weird thing about the standard RAID conversation is the assumption that every bit on your disk is equally valuable, and that losing a single bit is as catastrophic as losing an entire drive. Yes, RAID rebuilds are hard on a drive, and yes, you can get read failures during a rebuild that can cause data loss during a RAID5 rebuild. However, of the multiple TB stored on my drives, if I lose a few sectors to a read error on the rebuild, what are the chances that I will even notice? That error could be in unused space, or a movie I downloaded from iTunes, or any number of blocks that store necessary but non-personal data.

I wish that there was more data distinguishing total drive failures from single block errors and from everything in-between.

aSplash0fDerp · on May 24, 2020

RAID 1 is dead!! Long live RAID 1!!

Cloud had a meteoric rise due to the paltry size of storage at the time (edit: partly, compute too).

https://en.wikipedia.org/wiki/History_of_hard_disk_drives

10TB+ NVMe with an external interface changes the formula for large data transfers over slow pipes. Its faster to physically ship it or more likely, too cost prohibitive to use metered networks for distribution.

Long live RAID 1!!

013a · on May 24, 2020

I'm something of a data hoarder. 20TB array of four disks.

I tell everyone who is interested in this "hobby"; Synology. Their products are, bar none, the best in the industry. I'd estimate that their direct competition will never catch up. They're so easy to use and maintain, I would assert that if you can install a web browser and check your email, you could run a Synology NAS. And the prices don't even have much of a surcharge over DIY (maybe 30%? you could DIY a server for pretty cheap with hand-me-down enterprise hardware, so hard to compare, but Synology's prices are very fair).

Let me outline a few things Synology does that makes your life so much easier:

- All of the administration happens through a web portal. And its not a basic, geeky portal. Its literally a full desktop. They've implemented windowing, minimizing/maximizing, desktop shortcuts, a start menu, it feels like a literal desktop running in your web browser. Just YouTube some videos of "Diskstation Manager".

- But, its also just Linux. You can SSH into it. It supports everything Linux does.

- They don't use ZFS or mdadm; they've implemented their own raid system they call "synology hybrid raid" (SHR). It supports surprisingly smart heterogenous drive size utilization while still maintaining N-Drive redundancy; with, for example, a 2TB+4TB+6TB+8TB drive combination in SHR-1 (RAID-5 equivalent), only 2TB of drive storage is totally unused (12TB storage + 6TB redundancy). Traditional RAID-5 would leave 12TB unused [1]. They put btrfs on top of it, so its not totally custom.

- When a drive goes bad, the physical unit starts beeping at you. No configuration necessary. You can see which drive is dead via the LEDs on the front, swap it out without turning the unit off, and a rebuild begins; you don't even have to visit the amazing Admin UI if you don't want to. You can also have it email you if you want, which is far easier than setting up an email notification on a bare linux server.

- It has an app store. A server app store! Want Plex? One click and you're online. Want Minecraft? Done. Want backups? Their "hyper backup" software supports a dozen targets including another Synology NAS, their own cloud storage, Google Drive, S3, Azure, SFTP, RSync, a USB drive, Dropbox, bunch of other stuff, it supports encrypted backups, it supports incremental backups. Want dynamic dns? Built in, with multiple providers including Synology's own, no cost. Wordpress? One click. MariaDB? One click. Email server? Done. MediaWiki? LDAP? OAuth IDP? Easy. All of this is as simple as installing a web browser. Actually, its simpler.

- Want to share external links with friends, like Google Drive? Synology Drive. It has a document and spreadsheet editor, totally custom, and probably 70% as good as Google Drive (which is pretty impressive considering you've never heard about it). If you have internet which isn't NAT'ed in a way that blocks hosting a server, it'll work. Generate a public link. Send it. Done.

Synology is awesome. If you thought self-hosting a server was too hard, think again. Anyone can do this. The unit sitting under my TV has an uptime of three years (I do have a UPS that can power it for about an hour or two; this has been used many times). I literally cannot express how little I've touched it in that time. I maintain all of our company's production cloud workloads on AWS, and I'd feel comfortable moving some things to a Synology unit, if you can develop a better story around redundant networking and power (this is the real challenge with self-hosting; not the server itself, not data redundancy, but the pipes supplying it).

[1] https://www.synology.com/en-us/support/RAID_calculator

vetinari · on May 24, 2020

While I love my Synology too, but let's not paint too rose picture:

- They are not expensive; you are getting a bunch of software on top. Not everything is based on open source, there's lot of Synology own developed software (and services, like quickconnect) too.

- SHR and SHR2 are build on top of bog-standard mdraid. You can mount them in standard linux machine, you you ever need it.

- You have option of btrfs and ext4. On cheaper models, only ext4. Btrfs is running on top of mdraid, not in btrfs raid mode.

- I had a drive failure; it beeped at me, exactly as you described, but after swapping the drive, I had to run rebuild from the web UI. It didn't start rebuilding automatically.

- App store: some things are great - Synology Drive kicks Nextcloud's ass. The office, calendar, mail packages are very nice. Synology DNS is Bind and Synology Directory Server is Samba in AD mode. with some options disabled (you can't join AD domain as another domain controller; only one can exist). What's worse, many open-source packages are available in old versions, as if someone packaged it once, threw over the wall and they are seldomly updated. And you won't be able to install postgres, because DSM (the Synology system) uses one internally, ancient version at that (9.3.22 on my system). That's why so many use Docker (on models capable of running it. Not all models can do that!). Similarly to that, you are probably running an ancient kernel. Wireguard? Forget it.

- You don't need public IP or play with port forwarding (even though it is better to have public IP/port forward): you can use Synology quickconnect, which will either hole punch your NAT for you, or do forwarding for you. You can share your files with Drive, even if you are behind CG-NAT.

So is it perfect? No. But it is very nice.

matheusmoreira · on May 24, 2020

> They don't use ZFS or mdadm; they've implemented their own raid system they call "synology hybrid raid" (SHR). It supports surprisingly smart heterogenous drive size utilization while still maintaining N-Drive redundancy

This is their best feature in my opinion. It allows users to just buy random drives from time to time and add them to the NAS when they need to expand storage capacity. Unraid also offers a similar feature albeit with a different implementation.

Current open source solutions just don't compare. Expanding storage with Linux device mapper involves failing and replacing each storage device:

https://raid.wiki.kernel.org/index.php/Growing

Even btrfs wastes 6 TB when given its 2+4+6+8 TB example:

https://carfax.org.uk/btrfs-usage/index.html

I wonder why existing systems can't support this easy expansion. It'd be a huge help for home users who'd like to slowly expand their storage capacity.

gravitas · on May 24, 2020

> Expanding storage with Linux device mapper involves failing and replacing each storage device

The same process is generally true of mainstream enterprise controllers such as Dell PowerEdge/PERC and HPe ProLiant/SmartArray; each existing drive is failed out (in turn) and the RAID rebuilt onto the new larger drive, then the controller array container is expanded once all drives are replaced.

zozbot234 · on May 24, 2020

> It supports surprisingly smart heterogenous drive size utilization while still maintaining N-Drive redundancy: with, for example, a 2TB+4TB+6TB+8TB drive combination

You can do this yourself with ease by creating 2TB partitions on all drives and setting up RAID separately on each. Then you could use LVM to access them as unified storage, or keep them separate for extra flexibility when restoring.

fomine3 · on May 25, 2020

It's theoretically possible but I'm afraid of operation failure for such complex array. So It's nice to have simple user interface like Synology.

013a · on May 24, 2020

You can also replicate Dropbox quite trivially by getting an FTP account, mounting it locally with curlftpfs, then use SVN or CVS on the mounted filesystem [1]

[1] https://news.ycombinator.com/item?id=8863

singron · on May 24, 2020

What parameters are you using to make btrfs waste 6TB? With 2 copies (RAID1) and 2+4+6+8, that calculator shows it with 0 unusable space?

matheusmoreira · on May 24, 2020

I used RAID 6. That's 1 copy and 2 parity stripes.

singron · on May 24, 2020

Ah that explains it. Don't use parity with btrfs (i.e. RAID5/6). If you really want that, that's a good reason not to use btrfs.

matheusmoreira · on May 24, 2020

Because of the write hole, right? I've been avoiding btrfs because of that issue. Multiple copies are too wasteful and the parity implementation isn't perfect. Hope it gets fixed soon.

zrm · on May 25, 2020

That was recently fixed in mdadm. You can now specify a journal device (--write-journal) which closes the write hole. I don't see any reason btrfs couldn't do something equivalent.

ahnick · on May 24, 2020

The reason I avoided synology in the past was due to the lack of an ability to automatically detect and repair data corruption. It seems nowadays that their systems with BTRFS can automatically repair data corruption with the checksum. (https://blog.synology.com/how-data-scrubbing-protects-agains...)

Does anyone have any experience with this? Has it correctly caught data corruption for you? How does it compare to scrubbing on ZFS?

duskwuff · on May 25, 2020

> All of the administration happens through a web portal. And its not a basic, geeky portal. Its literally a full desktop. They've implemented windowing, minimizing/maximizing, desktop shortcuts, a start menu, it feels like a literal desktop running in your web browser.

That's a feature of Sencha ExtJS [1]. Synology didn't do the heavy lifting there.

> They don't use ZFS or mdadm; they've implemented their own raid system they call "synology hybrid raid" (SHR).

Which is just mdadm with extra steps. (Create a RAID array spanning all devices up to the size of the smallest device; repeat with the remaining space as long as there are at least two devices; create a JBOD of those arrays.) It's a kind of hokey approach, honestly.

[1]: https://examples.sencha.com/extjs/6.0.0/examples/classic/des...

louwrentius · on May 24, 2020

From what I can tell: Synology SHR is just MDADM RAID with some custom logic to slice up the drives so they can setup RAIDs with hard drives with different sizes.

cvubrugier · on May 24, 2020

Exactly! Suppose you have four HDDs: two 1 TB HDDs and two 2 TB HDDs. A single 1 TB partition is created on the 1 TB HDDs. Two partitions (1 TB each) are created on the 2 TB HDDs. A first RAID array (what you may call a slice) is created using the first partition of all four drives (4 x 1 TB, RAID5 = 3 TB). A second RAID array is created using the second partition of the 2 TB drives (2 x 1 TB, RAID 1 = 1 TB). Then the two RAID arrays are assembled with LVM (= 4 TB). SHR is basically LVM over MD.

magicalhippo · on May 25, 2020

Thank you for the explanation, I was wondering how that worked. How does it work if you then later add a fifth 4TB drive?

Anyway, I guess in this sense you could do something similar with ZFS, create a bunch of 1TB partitions and then make vdevs (slice) across a set of partitions on each drive.

Not sure if it's a good idea or not though. I do know you supposedly lose some performance when handing ZFS a partition rather than the whole drive, but maybe there's something else I can't think of right now.

cvubrugier · on June 1, 2020

> How does it work if you then later add a fifth 4TB drive?

If you later add a fifth 4 TB drive, two partitions (1 TB each) are created on it. The remaining 2 TB will be unused: a slice with a single element cannot be replicated.

The first slice is expanded from "4 x 1 TB, RAID5" to "5 x 1 TB, RAID5", so the logical capacity of that slice is now 4 TB.

The second slice is migrated from "2 x 1 TB, RAID1" to "3 x 1 TB, RAID5". MD is able to change the RAID level (aka the "personality") of a RAID array, which is pretty cool.

Then the LVM layer is made aware of the increased capacity of the MD layer with `pvresize`.

The solution of using "LVM over MD" suffers from its complexity because you have to manage two distinct layers. For instance, you have to run LVM / mdadm commands in the right sequence to replace a failed drive.

deepspace · on May 24, 2020

Agree. I used to do enterprise storage for a living and I deployed roll-your-own ZFS storage servers at home in the past. But these days I just use a Synology box. I have ~40TB spinning on 5+1 SHR and the thing has been rock-solid for years.

I tend to not run many apps on the NAS but I do use HyperBackup for local and remote backups, and it works pretty well.

chiph · on May 25, 2020

How do you have HyperBackup setup? I want to use external USB drives (so I can store copies offsite) and from what I've read it won't span backups across multiple drives - my Synology capacity is 20TB and the external drives are 8TB. So what I'm doing now is doing old-school file copies across to a desktop machine, which isn't ideal.

sz4kerto · on May 24, 2020

Qnap started to roll out ZFS-based NAS models, those also worth looking at.

nemo · on May 24, 2020

Yep, I have a Synology NAS and have been very happy with it. One other thing worth mentioning is that if you're using it as a media server, it supports streaming to a Chromecast so you can stream directly from the NAS.

barrkel · on May 25, 2020

Anything which can serve MP4 over http can stream to a Chromecast; the only trick is telling the Chromecast to start.

renox · on May 25, 2020

Chromecast doesn't support .srt subtitles :-(

danShumway · on May 24, 2020

General question for anyone who feels opinionated: should I be avoiding BTRFS in software Raids?

My understanding is that ZFS forces you to be a lot less flexible with which size drives you put into the array, which is what attracted me towards BTRFS in the first place. Aside from Raid 5/6 configurations, I've never had someone tell me to explicitly avoid BTRFS.

But whenever I see any articles about Raid, they're always using ZFS.

It's gotten noticeable enough to me that I'm left wondering if the general consensus is that everyone should just be using ZFS, even if that's something that I don't see spoken out loud very often.

ahnick · on May 24, 2020

I would say it is more like ZFS is the gold standard. You can use other filesystems for Raid, but none of them have the track record of ZFS.

You can use different sized drives in ZFS, but you will be limited to the utilization of the lowest capacity drive on the larger drives in the pool. In practice this isn't much of a problem, because you can just replace the lowest capacity drives first to grow your pool size over time.

magicalhippo · on May 24, 2020

> but you will be limited to the utilization of the lowest capacity drive on the larger drives in the pool

Small, but important, correction: the drive size limiting is per vdev. A pool consist of one or more vdevs.

So say you have a pool consisting of two mirrored vdevs, one with a pair of 3TB and one with a pair of 4TB drives, and one 3TB drive dies. If your local supplier don't have the same 3TB disk in stock, you can take a 4TB or similar and pair it with the old 3TB.

The pool will then still have a capacity of 3+4 = 7TB, but at least each mirrored pair will be intact.

magicalhippo · on May 24, 2020

> My understanding is that ZFS forces you to be a lot less flexible with which size drives you put into the array

That is true, however you can be somewhat more flexible if you go for plain mirroring. Then "only" pairs of disks need to be the same size.

Until not too long ago it wasn't advised to have wildly different sized drive-pairs in the same pool, but IIRC some work was done which alleviates this concern.

Mirroring also allows for more affordable upgrades in the sense that you "only" need to to replace a single pair with bigger drives if you want to increase capacity without adding more drives.

fomine3 · on May 25, 2020

ZFS RAIDZ expansion is in development here. https://github.com/openzfs/zfs/pull/8853

magicalhippo · on May 25, 2020

From what I can gather that will not change the fact that the space is determined by the smallest disk in the vdev.

However watching his presentation about RAID-Z expansion I couldn't shake the feeling that that is a rather arbitrary decision. I mean, if you take his slide visualizing the reflow, it seems that as long as you can put the data + parity blocks on different devices, the size of each device in the vdev could be different.

Of course if you only write full stripes (ie one block for each disk) then you would still be limited by the smallest disk, however the beauty is that you can write shorter stripes. So couldn't ZFS just spread those blocks out?

Though it's late, so maybe I missed something obvious here :)

louwrentius · on May 24, 2020

Yes, RAID 5/6 in BTRFS is not ready/stable AFAIK.

ValentineC · on May 24, 2020

Synology NASes get around this by using Btrfs with MD-RAID.

I have two such setups since 2016 or so, and they haven't given me any problems.

danShumway · on May 24, 2020

Is there a strong reason why I would want to use Raid 5/6 over Raid 10 anyway though? I guess Raid 10 is a bit more expensive, but my (naive) understanding was that everything else was almost pure upside.

louwrentius · on May 24, 2020

RAID6 is safer: any two drives can fail. RAID 10 can sustain more drive failures, but can tolerate only 'certain' drives to fail. And the drive that really should not fail is taxed during a rebuild.

Do you need the performance or capacity? What are your needs? How many drives can your chassis hold. How much space do you need?

And so on. Maybe RAID10 is the best option for you but it's not a straight rule-of-thumb.

cyphar · on May 24, 2020

> And the drive that really should not fail is taxed during a rebuild.

That is also true for RAID 5/6 (since all drives are stressed during a resilver of a RAID 5/6).

> RAID6 is safer: any two drives can fail. RAID 10 can sustain more drive failures, but can tolerate only 'certain' drives to fail

You should consider two-way mirrors as having the same redundancy as RAID5, but with faster resilver times (a byte-for-byte copy is faster than parity calculations -- and you should want your pool to be in a degraded state for as short a time as possible). You might survive more disk failures, but you shouldn't count on it.

If you want RAID6-like redundancy, use 3-way mirrors. But that's obviously more expensive than RAID6 (even though I would strongly argue that 3-way mirrors are far safer than RAID6).

barrkel · on May 25, 2020

To rebuild RAID10, only one disk needs to be read completely; to rebuild RAID6, n-1 disks.

For RAID10 to fail, two disks need to fail, and they need to be a mirrored pair, so it's a conditional probability. It's possible to lose up to n/2 disks and for the array to stay up.

For RAID6 to fail, three disks need to fail, but once three disks fail, that's it, you're out.

This all means that whether RAID6 is better than RAID10 is dependent on the number of disks and the actual failure rate. The more disks you have in your array, the more likely RAID10 is to be safer than RAID6.

RAID10 is much, much safer than RAID5. They are not similar in reliability.

Most of the time, RAID6 is safer than RAID10, but RAID10 gets you a lot better performance.

cyphar · on May 25, 2020

I didn't say that RAID5 and RAID10 are similar in terms of reliability (most of my comment said that RAID10 had many upsides over RAID5 in terms of reliability). I said they you should consider them to have the same level of redundancy -- unless you like to play Russian roulette with your data. Yeah, if you have more drives there are less bullets in the revolver but I'd prefer to not play that game in the first place. If you need a system that can survive 2 independent disk failures, use 3-way mirrors.

ggm · on May 25, 2020

Dell and the special/magic firmware.

MegaRaidCtl and 1001 --options none of them very clear.

"sorry, that drive is FOREIGN"

"sorry, I can't do that unless you drop to BIOS and do seventeen things in a crude GUI nobody understands"

reaperhulk · on May 24, 2020

Ubuntu 20.04's mdadm package doesn't appear to have a crontab entry for scrubbing actually (while 18.04 does). It does have e2scrub_all, but does that serve the same purpose?

bc4m · on May 24, 2020

No, e2scrub_all is a filesystem-level check which serves a different purpose.

mdadm scrubbing is in various states of broken across major distros. It used to be that each distro had their own cron-based scrubbing scripts. At some point mdadm introduced its own checks using systemd timers and there have been bugs upstream and in packaging.

I only noticed because I'm preparing to upgrade an old home server and carefully double checking everything. I'll probably end up grabbing one of the old cron scripts and setting it up manually. Very disappointing.

Debian/Ubuntu: https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/1852747

RHEL/Centos: https://bugzilla.redhat.com/show_bug.cgi?id=1774354

reaperhulk · on May 24, 2020

Thanks for the context. Unfortunate that this silent failure is present in 20.04 with no fix so far.

mbauman · on May 24, 2020

> Especially for home users and small businesses, RAID arrays are still a reliable and efficient way of storing a lot of data in a single place.

Is this really true? These are the sorts of situations where you don’t have a dedicated IT (or dedicated know-how) to avoid the sorts of “user” problems TFA assumes are at the root of many online problems.

I know I’ve always been terrified of rebuilding arrays after failures —- and it’s not just because of the remaining drives’ reliability. It’s because the tools I’ve used were really hard to use.

louwrentius · on May 24, 2020

I should write a follow up article: you should test drive replacement / simulate failure before you start using storage.

This makes you comfortable using it and do things like drive replacements.

Products from Synology/QNAP or other brands have nice interfaces that will make things really easy for you.

Replacing the failed drive in my array was one simple command:

mdadm --add /dev/md6 /dev/sde

Then the rebuild started.

cyphar · on May 24, 2020

And similarly, in ZFS it'd be

  % zpool replace <pool> <bad-drive> <new-drive>

And you can also set up hot spares in both systems so that replacing the drive happens automatically.

One thing I would suggest is that you use /dev/disk/by-id when configuring mdadm or ZFS. Device renumbering over reboots under Linux isn't an issue for either system (they look at the metadata on disk rather than the drive name in /dev), but as an admin it helps to be able to type out the actual serial number on the drive when you're doing reboots (especially if you have the serial numbers written on the hotswap bays).

RedShift1 · on May 25, 2020

The problem is this article has a sample size of 1. This story is basically anecdotal. Come back to us once you've dealt with 100 RAID 5 setups and tell us how that worked out.

fomine3 · on May 25, 2020

This post says that mdraid bitmap solves write-hole problem but is it really? I thought that RAID journaling solves the problem.

https://raid.wiki.kernel.org/index.php/Write-intent_bitmap

https://lwn.net/Articles/665299/

jscipione · on May 24, 2020

"I ran a scrub on my 8-disk RAID 5 array (based on 2 TB drives)" Replace with RAID 1 of 2 12TB drives and most of the article won't apply to you.

dhdhhdd · on May 24, 2020

What do you mean? (I'm in the process of setting up a 2 mirrored 12tb drives)

6nf · on May 24, 2020

A mirror set of two 12TB drives will perform much better than a eight disk RAID 5 array

bc4m · on May 24, 2020

Yeah, but what's the cost? RAID5's striping with parity is a cost-saving approach compared to mirroring. If you can afford to mirror all your data that will always be the better option.

fomine3 · on May 25, 2020

8-bay NAS also costs. 2x 12TB and 2-bay NAS makes sense for speed, reliability and not too expensive.

cgijoe · on May 25, 2020

Missed an opportunity on the title there, bud. Should be: "Don't Be Af-RAID"

nikanj · on May 25, 2020

Not everything needs to be a pun.

StillBored · on May 26, 2020

I'm a pretty big user of RAID5 these days, despite a decade where I specialized in storage. There are plenty of reasons for doing RAID6+hotspares+replication (and a ton of other fancy choices), but RAID5 nails 99% of the use/problem cases in my book when managed properly. Trust me here, i've seen plenty of raid5 failures... :)

RAID5 is one of the best trade-offs in price/perf/availability you can make. Just about no one runs RAID+ on their desktop/laptop because the core storage technology is considered reliable enough that catastrophic failures are rare. Instead good backup hygiene is practiced.

For most use cases RAID5 adds a couple digits to your reliability/availability. When properly cared for (scrubbed and smart monitored) the most exciting thing that will happen is you will drop a drive every few ten thousand drive hours, and swapping a replacement in within a day or two, everything will be fine.

The cases that manage to take out a well managed RAID5 array, will likely take out just about any other mechanism as well. Having RAID6, RAID10, ZFS or whatever adds additional complexity and won't take care of firmware bugs on the disks, a batch of disks that all blow the bearings within a couple hours of each other, something going horribly wrong with the SATA/SAS controller/expander (think voltage spikes, chassis fires (well i've seen just about everything!)), viruses or bad filesystem bugs. The latter is scary high, a lot higher than most people think, and one of the reasons i'm skeptical of ZFS/BTRFS/etc.

But of course, just like your laptop/desktop you need a solid backup/recovery plan, hopefully offsite, multiply redundant and with a reasonable recovery speed. So I strongly encourage everyone to invest and maximize your backup story before you go beyond raid5.

Now, there are certain kinds of super critical data, and for those there are fancier solutions (remote replication/etc), but those solutions overwhelmingly aren't as robust as you imagine. Its the old "a plane with two engines has twice as many engine failures" mantra. The more layers of fancy filesystem, replication, deduplicatin, encryption, T10 checksuming, etc you add, the more places that can fail. Which is why you shouldn't even be thinking about them until you have a multivendor offsite (including offline) backup story that can meet you TOS numbers for recovery. It will take exactly 1 building fire to wipe out all that fancy hardware/sofeware, or at least take it offline for days. The last thing you want is the dying breath of your local storage stack to tell the remote one "ok wipe everything" or "here is write for XXXXXXXX sectors....." and trash a big chunk of the remote replicant.

Bottom line, RAID5 provides a fair amount of additional assurance against generally unlikely single drive failures. Nothing you can do on a single array/filessytem will save you from all those other bad things that can happen, so backup, backup backup!!!!

mrjin · on May 25, 2020

I've read through the article and it turns out to me that he wrote it to show he was lucky enough not to experience a multi-disk RAID failure. Personally I owned dozens of HDDs across almost all brands from desktop/laptop HDDS to Enterprise HDDs or NAS HDDs and almost half of them had bad sectors with a life span of a couple of months to a few years, some of which even broke in a row in a few days. The only exception is that in my last batch of 5 Seagate NAS drives, there was no single failure after around 4 years and I think I'm just been lucky with those as I did hear quite a few people were complaining about the same model.

Due to the nature of how HDD works, the surfaces of the disks have more or less defects leaving the factory and were graded by the count of defects, and even the enterprise grade one are not free of defects. Those defects and adjacent areas then are masked by the firmware so that magnetic heads won't trying to read/write to/from them. But that won't prevent them from growing. The factories are very clean but the dusts in the air are simply impossible to avoid and if one lands on the disk, it's a time bomb. Also as the disks' capacity grows, the tracks get narrower and narrower. The disks spins from 5400 rpm to 15000 rpm maybe even faster and the magnetic heads are gliding micros from the disk and if the heads touch the disk surface, it almost for sure to create a bad sector event bad tracks. And again, the vibrations are impossible to avoid. Making it even worse, the contact are almost guaranteed to generate debris to be scattered to the whole disk, event goes to other disks sealed in the same case. The bigger capacity, the narrower the track will be, the less tolerant to such contact. Based on the fact, I really don't see where he got the confidence the disks people bought were not going to have bad sectors during their expected life span.

Also the disks people use to build the RAID are most likely the same brand, the same model and the same batch and thus if one started to have bad sectors, most likely the others are going to have them soon or even worse may already have them. Even if the other disks in the array do not have bad sectors at the time of the failure, they are highly likely going to have during the rebuilding process, why? The load. Take a 10TB drive as an example, my Seagate 10 NAS drive can deliver ~180MB/s peak writing and ~80MB/s at the trough. Let's assume it can deliver 100MB/s all the way, and the disk is half full it will almost take 14 hours to finish, this load is highly likely to push other drives to the limit. For RAID6 allows another 1 to fail. For RAID5, if another drive failed during rebuilding, you data is gone. If you had to rebuild a 12 bays NAS, you would know how frustrating it could be, especially when there were secondary failures.

Software RAID may somehow alleviate the problem a little bit but it will also cover up the issue. My experience was that once there is one single bad sector, there will be lots of them soon, if you don't back up the data and replace the faulty drive on spot, most likely it will be too late.

All above does not event consider the setup and quirks of NAS or RAID cards. Put all those issue aside, what's are the benefits of RAID for home users? IMO, almost NIL. Our home network device most likely are Gigabyte ones with 1 Gbps ports, which can only deliver about 100MB/s, you can see that a single drive can almost saturate that bandwidth even with link aggregation. And most people does not need the random IO boost or extra large volumes come with RAID.

My suggestion for home users is to go single drive and stay away from RAID.

StillBored · on May 26, 2020

Modern disks are like cd's they have layers and layers of ECC (both spinning and flash at this point) and a certain assumed BER.

What you describe is the usual case with an unscrubbed drive. As it slowly bitrots everything appears just fine, until that day you happen to read a file on a part of the disk that hasn't been accessed in a couple years. The suddenly the controller's ECC can't correct the error and you take a read fault. Then you start listing the directory and discover that yah there are bad sectors everywhere.

So, sure you can have catastrophic failure, but the usual case with a well scrubbed drive is a corrected/relocated error once in a while. As long as they remain intermittent your fine, if suddenly you get 100 errors on one scrub and then the next one gets another 100, you better replace that disk fast. Or the other case is that you go from just fine, to pretty much nothing can be read in a single scrub cycle. You don't really have a choice in that case, the controller will likely just kick it out.