The point of AWS is to promise you the nines and make you feel good about it. Your typical "growth & engagement" startup CEO can feel good and make his own customers feel good about how his startup will survive a nuclear war.
Delivery of those nines is not a priority. Not for the cloud provider - because they can just lie their way out of it by not updating their status page - and even when they don't, they merely have to forgo some of their insane profit margin for a couple of hours in compensation. No provider will actually put their ass on the line and offer you anything beyond their own profit margin.
This is not an issue for most cloud clients either because they keep putting up with it (lying on the status page wouldn't be a thing if clients cared) - the unspoken truth is that nobody cares that your "growth & engagement" thing is down for an hour or so, so nobody makes anything more than a ceremonial stink about it (chances are, the thing goes down/misbehaves regularly anyway every time the new JS vibecoder or "AI employee" deploys something, regardless of cloud reliability).
Systems where nines actually matter generally come with self-managed disaster recovery plans that are regularly tested. This also means they will generally be built differently and kept far away from your typical "cloud native" dumpster fire. Depending on how many nines you actually need (i.e. what's the cost of not meeting that target - which directly controls how much budget you have to ensure you always meet it), you might be building something closer to aircraft avionics, with the same development practices, testing and rigor.
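To put numbers on "how many nines", here is a minimal sketch of the downtime budget each level allows. It assumes a 365-day year and that "N nines" means an availability of 1 - 10^-N; the function name is made up for illustration.

```python
def allowed_downtime_hours(nines: int, period_hours: float = 365 * 24) -> float:
    """Hours of downtime permitted per period at an availability of N nines.

    Assumption: "N nines" means availability = 1 - 10**(-N),
    e.g. 3 nines = 99.9%.
    """
    unavailability = 10.0 ** (-nines)
    return period_hours * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = allowed_downtime_hours(n) * 60
        print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

Two nines leaves roughly 87.6 hours a year, five nines leaves a bit over 5 minutes. That gap is why the budget and engineering practices change so much between the two.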
I can tell you from personal experience that improving/maintaining uptime (by doing root cause analysis, writing correction of error reports, going through application security reviews, writing/reviewing design docs for safely deploying changes, working on operational improvements to services) probably takes up a majority of most AWS engineers' time. I'm genuinely curious what you are basing the opinion "Delivery of those nines is not a priority" off of.
> what you are basing the opinion "Delivery of those nines is not a priority" off of.
Because I don't see the business pressure to do so? If problems happen they can 1) lie on the status page and hope nothing happens and 2) if they can't get away with lying, their downside is limited to a few hours of profit margin.
(which is not really a dig at AWS because no hosting provider will put their business on the line for you... it's more of a dig at people who claim AWS is some uptime unicorn while in reality it's nowhere near enough of an improvement over your usual hosting provider to justify the 1000x markup)
It's great if they're doing their best anyway, but I don't see it as anything more than "best effort", because nothing bad would happen even if they didn't do a good job at it.
It's usually true if you aren't in US-East-1, which is widely known to be the least reliable location. There's no reason anyone should be deploying anything new to it these days.
Actual multi-region replication is hard and forces you to think about complicated things like the CAP theorem/etc. It's easier to pretend AWS magically solves that problem for you.
Which is actually totally fine for the vast majority of things, otherwise there would be actual commercial pressures to make sure systems are resilient to such outages.
Last time I checked, the standard SLA was actually 99% and the only compensation you get for downtime is a refund. Which is why I don't use AWS for anything mission critical.
> Following my ill-defined "Tarsnap doesn't have an SLA but I'll give people credits for outages when it seems fair" policy, I credited everyone's Tarsnap accounts with 50% of a month's storage costs.
So in this case the downtime was roughly 26 hours, and the refund was for 50% of a month, so that's well over a 1:1 refund for the downtime.
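A rough sketch of that arithmetic, assuming a 30-day month and treating the credit as if it applied evenly across the month (the credit was for storage costs only, so this is a ballpark, not an exact comparison):

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def refund_ratio(downtime_hours: float, credited_fraction_of_month: float) -> float:
    """How many hours of service were credited per hour of downtime."""
    credited_hours = credited_fraction_of_month * HOURS_PER_MONTH
    return credited_hours / downtime_hours


# Figures from the comment above: ~26 hours down, 50% of a month credited.
ratio = refund_ratio(downtime_hours=26, credited_fraction_of_month=0.5)

if __name__ == "__main__":
    print(f"credit covers roughly {ratio:.1f}x the downtime")
```

Half a month is 360 hours against 26 hours of outage, so the credit is somewhere around 14 times the downtime under these assumptions.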
Most "legacy" hosts do yes. The norm used to be a percentage of your bill for every hour of downtime once uptime dropped below 99.9%. If the outage was big enough you'd get credit exceeding your bill, and many would allow credit withdrawal in those circumstances. There were still limits to protect the host but there was a much better SLA in place.
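For illustration, the kind of credit schedule described above could be sketched like this. Every number here is hypothetical: the per-hour percentage, the cap, and the choice to count all downtime hours once the threshold is crossed are assumptions, not any specific host's terms.

```python
def legacy_sla_credit(
    monthly_bill: float,
    downtime_hours: float,
    *,
    hours_in_month: float = 720.0,     # assumption: 30-day month
    uptime_threshold: float = 0.999,   # the 99.9% mentioned above
    credit_pct_per_hour: float = 0.05, # hypothetical: 5% of the bill per hour
    max_credit_multiple: float = 2.0,  # hypothetical cap protecting the host
) -> float:
    """Credit owed for a month, under the assumed schedule."""
    uptime = 1 - downtime_hours / hours_in_month
    if uptime >= uptime_threshold:
        return 0.0  # SLA met, no credit
    credit = monthly_bill * credit_pct_per_hour * downtime_hours
    # The cap can exceed the bill itself, matching "credit exceeding your bill".
    return min(credit, monthly_bill * max_credit_multiple)
```

With these made-up parameters, a 26-hour outage on a 100/month plan yields a credit of about 130, more than the bill, which is the behavior that cloud SLAs capped at a fraction of the bill don't have.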
Cloud providers just never adopted that and the "ha, sucks to be you" mentality they have became the norm.
Depends on which service you're paying for. For pure hosting the answer is no, which is why it rarely makes sense to go AWS for uptime and stability because when it goes down there's nothing you can do. As opposed to bare metal hosting with redundancy across data centers, which can even cost less than AWS for a lot of common workloads.
There are literally thousands of options. 99% of people on AWS do not need to be on AWS. VPS servers or load balanced cloud instances from providers like Hetzner are more than enough for most people.
It still baffles me how we ended up in this situation where you can almost hear people's disapproval over the internet when you say AWS / Cloud isn't needed and you're throwing money away for no reason.
There's nothing particularly wrong with AWS, other than the pricing premium.
The key is that you need to understand no provider will actually put their ass on the line and compensate you for anything beyond their own profit margin, and plan accordingly.
For most companies, doing nothing is absolutely fine, they just need to plan for and accept the occasional downtime. Every company CEO wants to feel like their thing is mission-critical but the truth is that despite everything being down the whole thing will be forgotten in a week.
For those that actually do need guaranteed uptime, they need to build it themselves using a mixture of providers and test it regularly. They should be responsible for it themselves, because the providers will not. The stuff that is actually mission-critical already does that, which is why it didn't go down.
Been using AWS too, but for a critical service we mirrored across three Hetzner datacenters with master-master replication as well as two additional locations for cluster node voting.
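A minimal sketch of why the two extra voting locations help, assuming one vote per site and a plain strict-majority quorum. The site names are hypothetical and this is not tied to any particular database or cluster manager.

```python
from itertools import combinations

# Hypothetical site names: three data-bearing datacenters plus two
# voting-only locations, as described in the comment above.
DATA_SITES = ("dc-a", "dc-b", "dc-c")
VOTER_ONLY_SITES = ("voter-1", "voter-2")
ALL_VOTERS = DATA_SITES + VOTER_ONLY_SITES


def has_quorum(reachable: set, voters: tuple = ALL_VOTERS) -> bool:
    """True if a strict majority of all configured voters is reachable."""
    votes = sum(1 for v in voters if v in reachable)
    return votes > len(voters) // 2


def max_tolerated_failures(voters: tuple) -> int:
    """Largest number of voters that can fail while a majority remains."""
    return (len(voters) - 1) // 2


# Check every possible pair of simultaneous site failures.
survives_any_two = all(
    has_quorum(set(ALL_VOTERS) - set(failed))
    for failed in combinations(ALL_VOTERS, 2)
)

if __name__ == "__main__":
    print("tolerated failures:", max_tolerated_failures(ALL_VOTERS))
    print("quorum survives any two site failures:", survives_any_two)
```

With only the three data sites voting, losing two of them loses the majority. With five voters, the cluster can still agree on who is in charge after any two sites drop out, although a surviving data site is still needed to actually serve traffic.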