I agree and think it also needs to be pointed out that "traditional hardware RAI...

ajsnigrutin · on May 25, 2020

But... nobody ever got fired for choosing IBM!

If you set up "your own solution", you're always the "guilty one" if something fails. If you buy IBM (or HP or whatever), it's their fault if something fails.

(I did some of that stuff in enterprise environments... honestly, IBM/HP/... + paid support was always the better option for me, mainly due to "liability" issues.)

zrm · on May 25, 2020

"Nobody ever got fired for buying IBM." -IBM, just before PC servers ate their lunch.

My experience has always been that it's better to do your homework so you don't have problems than to have a scapegoat to blame the problems on.

And the extent to which it's "their fault" tends to be quite inadequate. If you build a 16-drive RAID5 on some enterprise vendor's equipment and then suffer a multiple disk failure during a rebuild because that was a Bad Idea, what's the support contract going to get you? An overnight replacement for both failed drives? All your data is still gone.
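To put rough numbers on why a wide RAID5 is risky during a rebuild, here is a minimal sketch. It assumes independent drive failures at a constant rate; the 5% annualized failure rate and 24-hour rebuild window are illustrative guesses, not measurements, and the function names are made up for this example. It ignores unrecoverable read errors and correlated failures (same batch, same age, rebuild stress), so real-world odds are likely worse.

```python
import math

HOURS_PER_YEAR = 365 * 24

def p_drive_fails(afr: float, hours: float) -> float:
    """Probability that one drive fails within `hours`, assuming a constant
    hazard rate derived from the annualized failure rate `afr`."""
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    return 1.0 - math.exp(-rate_per_hour * hours)

def p_rebuild_loss(total_drives: int, tolerated_failures: int,
                   afr: float, rebuild_hours: float) -> float:
    """Probability of losing the array during a rebuild, given that one
    drive has already failed.

    tolerated_failures: 1 for RAID5, 2 for RAID6.
    Data is lost once the surviving drives suffer more additional failures
    than the remaining parity can absorb.
    """
    survivors = total_drives - 1
    remaining_tolerance = tolerated_failures - 1
    p = p_drive_fails(afr, rebuild_hours)
    # Binomial probability that additional failures stay within tolerance.
    p_ok = sum(
        math.comb(survivors, k) * p**k * (1.0 - p) ** (survivors - k)
        for k in range(remaining_tolerance + 1)
    )
    return 1.0 - p_ok

if __name__ == "__main__":
    # Hypothetical inputs: 16 drives, 5% AFR, 24-hour rebuild.
    for name, parity in (("RAID5", 1), ("RAID6", 2)):
        print(name, p_rebuild_loss(16, parity, afr=0.05, rebuild_hours=24.0))
```

Under these assumptions the second parity drive cuts the rebuild-window risk by orders of magnitude, which is the gap no support contract closes after the fact.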

ajsnigrutin · on May 25, 2020

> If you build a 16-drive RAID5 on some enterprise vendor's equipment and then suffer a multiple disk failure during a rebuild because that was a Bad Idea...

If I do that with a linux/solaris/mdadm/zfs/btrfs/whatever, and it fails, who will be the one blamed for the failure? Me of course.

If I select a proper RAID setup and it still fails, it's better if it fails on an IBM than on my duct-tape linux/mdadm/zfs/btrfs/whatever solution, because again, I personally can say that I did what the vendor (IBM) recommended in their brochure, and their solution failed. If the duct-tape solution fails, there's no one else to blame but me.

Either way, you have to restore from backup.

And if there is some random bug (e.g. kernel oops when mounting a volume), there's IBM support on the call. Working with standard options means standard, well-known bugs, and that means that support can help fix them.

I know that I (personally) sleep better knowing there is a professional solution from a big brand running in "boring, enterprise" production.

zrm · on May 25, 2020

> If I do that with a linux/solaris/mdadm/zfs/btrfs/whatever, and it fails, who will be the one blamed for the failure? Me of course.

So don't create a 16-drive RAID5 at all.

> If I select a proper RAID setup and it still fails, it's better if it fails on an IBM than on my duct-tape linux/mdadm/zfs/btrfs/whatever solution, because again, I personally can say that I did what the vendor (IBM) recommended in their brochure, and their solution failed.

But you're the one who chose the vendor, and evaluated their solution, and accepted it. So you're still fired.

> Either way, you have to restore from backup.

The point is there's a third option. Use a level of redundancy appropriate to the importance of the system, and then it doesn't fail to begin with and it isn't necessary to divert blame. Meanwhile you get to show your boss how you saved the company millions of dollars.

> And if there is some random bug (e.g. kernel oops when mounting a volume), there's IBM support on the call. Working with standard options means standard, well-known bugs, and that means that support can help fix them.

Which you can still get for linux/solaris/mdadm/zfs/btrfs/whatever from a variety of vendors if you really want it. But because they're generic technologies, this also means you can get that support from multiple competing providers (so the cost is lower), including per-incident support once something happens that requires it instead of having to buy a support contract you may not ever need.

Which you generally won't need, because the likes of mdadm are more widely used (and so better tested) than any individual vendor's proprietary hardware RAID solution, and can also recover more easily from things like controller failures because the replacement controller needs no state and is not required to match in firmware version or even model.

1bc29b36f623ba8 · on May 25, 2020

The grandparent does have a point when it comes to assigning blame: when you have a hardware-based solution, you can always blame the hardware manufacturer. If you use an open-source solution, there are typically no warranties, so the blame falls on whoever implemented the solution.

Management likes accountability so the above isn't going to change any time soon.

Now, obviously, the way around this is to use enterprise-level solutions based on software. That is often more expensive than the hardware route, but as you say, there are plenty of vendors who offer such solutions with a proper SLA.

wodenokoto · on May 25, 2020

> But you're the one who chose the vendor, and evaluated their solution, and accepted it. So you're still fired.

No, that is the whole point of the "Nobody got fired for buying IBM" meme. If you buy IBM and it fails, even your corporate, non-tech boss knows (or believes) that you couldn't have done any better than what you did.

If you build your own solution or buy from a "lesser" vendor, when things fail, you will be questioned about the solution.

> The point is there's a third option. Use a level of redundancy appropriate to the importance of the system, and then it doesn't fail to begin with and it isn't necessary to divert blame.

That doesn't make any sense. The hypothetical here is that we are comparing the failure of the best homebrew solution to the failure of the appropriate big corporation solution.

It doesn't make sense to say that the solution to failures is to not have any.

zrm · on May 25, 2020

> No, that is the whole point of the "Nobody got fired for buying IBM" meme. If you buy IBM and it fails, even your corporate, non-tech boss knows (or believes) that you couldn't have done any better than what you did.

"Nobody got fired for buying IBM" isn't a meme, it's an IBM marketing slogan. In real life your boss cares a lot more about whether something failed on your watch than why.

> It doesn't make sense to say that the solution to failures is to not have any.

It does when the open source solutions have a lower failure rate because they're more widely used, better tested and less "integrated" which dramatically increases flexibility in quickly implementing solutions to unforeseen problems. It also leaves more budget for real improvements, because RAID6 with a hot spare on commodity hardware costs less than RAID5 with no spare from "enterprise" vendors.

Reliability engineering is science. Hardware will fail. Software will fail. So you include redundancy for the hardware with the highest failure rates, use more redundancy for systems with higher importance, and use modular widely-available commodity parts that can be quickly replaced and sourced from multiple independent vendors. For systems with even higher availability requirements, use system-level redundancy and keep regular snapshots on access-restricted independent backup systems, etc.
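As a rough illustration of the "more redundancy for more important systems" arithmetic, here is a minimal sketch. It assumes the replicas fail independently, which real deployments only approximate (shared power, shared bugs, shared operators), and the 99% single-system availability is a made-up figure.

```python
HOURS_PER_YEAR = 365 * 24

def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant systems where any one of them
    can carry the load, assuming independent failures."""
    return 1.0 - (1.0 - single) ** replicas

def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

if __name__ == "__main__":
    # Hypothetical: each commodity system is available 99% of the time.
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(n, a, expected_downtime_hours(a))
```

The point of the sketch is only that each added independent replica multiplies the unavailability by the single-system failure probability, so cheap parts in redundant configurations can beat one expensive box.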

Money spent actually preventing failures is more effective than money spent to acquire a scapegoat.

newacct583 · on May 25, 2020

Still, those enterprise junior executives didn't get fired for choosing mainframes! Their companies made bad decisions in hindsight, but they (and we) blame those bad choices on IBM having failed to meet the needs of the market, not on whatever random Fortune 500 companies bought S/390's instead of PC servers.

Which is the point. IBM was the conservative choice, and they remained so even past the point where they were the wrong choice. If you're a junior executive trying to climb the ladder, that kind of "decision security" has real value.

ajsnigrutin · on May 25, 2020

> Still, those enterprise junior executives didn't get fired for choosing mainframes!

This!

> Their companies made bad decisions in hindsight, but they (and we) blame those bad choices on IBM having failed to meet the needs of the market, not on whatever random Fortune 500 companies bought S/390's instead of PC servers.

Those companies didn't make bad decisions... the mainframes worked. They might have been more expensive and slower than a bunch of PC servers, but they did the job. We now know, in hindsight, that PC servers were the better choice, but which one of us is willing to risk our own money/job on something that is still new?

There was even a time when tablet computers (think iPad, not X series ThinkPads) were "the future of computing", and "they would replace the need for PCs for everyone"... now (in hindsight) we know that tablets are useless for most people (except for consuming entertainment), and that work computers are still boring PCs with keyboards and mice.

emsign · on May 25, 2020

That's funny, I use an IBM-rebranded RAID card. In AHCI mode with ZFS.