Wasn't the point of AWS's hefty premium that you will always get at least 6 ...

torginus · 2025-10-20T07:58:11 1760947091

They guarantee the dashboard will be green 99.999% of the time

Ekaros · 2025-10-20T07:58:44 1760947124

I take it the dashboard is not covered by the SLA?

theshrike79 · 2025-10-20T08:12:17 1760947937

The dashboard is the SLA.

IIRC it takes WAY too many managers to approve the dashboard being anything other than green.

It's not a reflection of reality nor is it automated.

Nextgrid · 2025-10-20T10:24:17 1760955857

The point of AWS is to promise you the nines and make you feel good about it. Your typical "growth & engagement" startup CEO can feel good and make his own customers feel good about how his startup will survive a nuclear war.

Delivery of those nines is not a priority. Not for the cloud provider - because they can just lie their way out of it by not updating their status page - and even when they don't, they merely have to forgo some of their insane profit margin for a couple of hours in compensation. No provider will actually put their ass on the line and offer you anything beyond their own profit margin.

This is not an issue for most cloud clients either because they keep putting up with it (lying on the status page wouldn't be a thing if clients cared) - the unspoken truth is that nobody cares that your "growth & engagement" thing is down for an hour or so, so nobody makes anything more than a ceremonial stink about it (chances are, the thing goes down/misbehaves regularly anyway every time the new JS vibecoder or "AI employee" deploys something, regardless of cloud reliability).

Things where nines actually matter will generally invest in self-managed disaster recovery plans that are regularly tested. This also means it will generally be built differently and far away from your typical "cloud native" dumpster fire. Depending on how many nines you actually need (aka what's the cost of not meeting that target - which directly controls how much budget you have to ensure you always meet it), you might be building something closer to aircraft avionics with the same development practices, testing and rigor.

mightyham · 2025-10-21T21:01:08 1761080468

I can tell you from personal experience that improving/maintaining uptime (by doing root cause analysis, writing correction of error reports, going through application security reviews, writing/reviewing design docs for safely deploying changes, working on operational improvements to services) probably takes up a majority of most AWS engineers' time. I'm genuinely curious what you are basing the opinion "Delivery of those nines is not a priority" off of.

Nextgrid · 2025-10-22T15:11:06 1761145866

> what you are basing the opinion "Delivery of those nines is not a priority" off of.

Because I don't see the business pressure to do so. If problems happen they can 1) lie on the status page and hope nothing happens and 2) if they can't get away with lying, their downside is limited to a few hours of profit margin.

(which is not really a dig at AWS because no hosting provider will put their business on the line for you... it's more of a dig at people who claim AWS is some uptime unicorn while in reality they're nowhere near enough of an improvement over your usual hosting provider to justify their 1000x markup)

It's great if they're doing their best anyway, but I don't see it as anything more than "best effort", because nothing bad would happen even if they didn't do a good job at it.

esskay · 2025-10-20T08:02:38 1760947358

It's usually true if you aren't in us-east-1, which is widely known to be the least reliable location. There's no reason anyone should be deploying anything new to it these days.

miohtama · 2025-10-20T07:57:46 1760947066

1. Competitors are not any better, or worse

2. Trusted brand

that_guy_iain · 2025-10-20T08:10:43 1760947843

You can. You just need to do the work to make it work. That's the bit where everyone but Netflix and Amazon fails.

Unroasted6154 · 2025-10-20T08:09:40 1760947780

You are supposed to build multi regional services if you need higher resilience.

Nextgrid · 2025-10-20T10:49:56 1760957396

Actual multi-region replication is hard and forces you to think about complicated things like the CAP theorem/etc. It's easier to pretend AWS magically solves that problem for you.

Which is actually totally fine for the vast majority of things, otherwise there would be actual commercial pressures to make sure systems are resilient to such outages.

hylaride · 2025-10-20T11:40:26 1760960426

You could also achieve this in practice by just not using us-east-1, though at the very least you should have another region going for DR.

Unroasted6154 · 2025-10-20T11:10:49 1760958649

Never said it was easy or desirable for most companies.

But there is only so much a cloud provider can guarantee within a region or whatever unit of isolation they offer.
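
To put a rough number on what multi-region buys you, here is a minimal sketch. It assumes region failures are fully independent, which is optimistic for real deployments, and the 99.9% per-region figure is only an illustrative input, not a quoted SLA.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Probability that at least one of `regions` regions is up.

    Assumes each region fails independently with the same probability,
    which is an idealisation: shared dependencies break this assumption.
    """
    return 1 - (1 - per_region) ** regions


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} region(s): {combined_availability(0.999, n):.9f}")
```

The model says nothing about the hard part raised above (keeping data consistent across regions); it only covers the "is anything reachable" side.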

shawabawa3 · 2025-10-20T08:16:48 1760948208

the highest availability service i think is S3 at 4 nines

you might be thinking of durability for s3 which is 11 nines, and i've never heard of anyone losing an object yet
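
For a sense of scale, availability nines translate into allowed downtime roughly as follows. This is a quick sketch assuming a 365-day year; it applies to availability only, not to the durability figure mentioned above.

```python
def allowed_downtime_minutes(nines: int, period_hours: float = 365 * 24) -> float:
    """Minutes of downtime permitted per period at an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return period_hours * 60 * unavailability


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

At 4 nines that works out to a little under an hour per year, so a single multi-hour outage uses up several years of budget.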

alex_suzuki · 2025-10-20T08:40:36 1760949636

no, it's probably Route 53, touted as having "100% availability" (https://en.wikipedia.org/wiki/Amazon_Route_53)

shawabawa3 · 2025-10-20T09:07:00 1760951220

hah, that's funny as the outage seems to be caused by DNS issues

kondro · 2025-10-20T18:01:11 1760983271

Route53 was still resolving DNS entries just fine. But it looked like someone/something removed the entries for DynamoDB.

crbaker · 2025-10-20T07:58:23 1760947103

us-east-1 is the worst region for availability

abujazar · 2025-10-20T07:57:53 1760947073

Last time I checked, the standard SLA is actually 99% and the only compensation you get for downtime is a refund. Which is why I don't use AWS for anything mission critical.

rustc · 2025-10-20T08:02:22 1760947342

Does any host provide more compensation than refund for downtime?

TheDong · 2025-10-20T09:16:34 1760951794

https://mail.tarsnap.com/tarsnap-announce/msg00050.html

> Following my ill-defined "Tarsnap doesn't have an SLA but I'll give people credits for outages when it seems fair" policy, I credited everyone's Tarsnap accounts with 50% of a month's storage costs.

So in this case the downtime was roughly 26 hours, and the refund was for 50% of a month, so that's more than a 1-1 downtime refund.
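
The arithmetic behind "more than 1-1" can be checked quickly. The 26 hours and the 50% credit come from the announcement; the 30-day month is an assumption made here for the calculation.

```python
HOURS_IN_MONTH = 30 * 24  # assumption: 30-day billing month

downtime_hours = 26       # from the quoted announcement
credit_fraction = 0.50    # 50% of a month's storage costs

# Share of the month the service was down, and how the credit compares
# to a strictly pro-rata refund for that downtime.
downtime_fraction = downtime_hours / HOURS_IN_MONTH
ratio = credit_fraction / downtime_fraction

print(f"downtime was {downtime_fraction:.1%} of the month")
print(f"credit was about {ratio:.1f}x a pro-rata refund")
```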

esskay · 2025-10-20T08:46:48 1760950008

Most "legacy" hosts do yes. The norm used to be a percentage of your bill for every hour of downtime once uptime dropped below 99.9%. If the outage was big enough you'd get credit exceeding your bill, and many would allow credit withdrawal in those circumstances. There were still limits to protect the host but there was a much better SLA in place.

Cloud providers just never adopted that and the "ha, sucks to be you" mentality they have became the norm.
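
A rough sketch of how that kind of per-hour credit scheme could be computed. Only the 99.9% threshold comes from the comment above; the 5% per hour rate, the cap of 3x the bill, the 720-hour month, and the choice to count every downtime hour once the threshold is breached are all made-up illustrative assumptions, not any specific host's terms.

```python
def legacy_sla_credit(
    monthly_bill: float,
    downtime_hours: float,
    hours_in_month: float = 720,     # assumption: 30-day month
    uptime_threshold: float = 0.999,  # from the comment above
    credit_per_hour: float = 0.05,    # hypothetical: 5% of the bill per hour
    cap_multiple: float = 3.0,        # hypothetical limit protecting the host
) -> float:
    """Credit owed for a month, under the hypothetical terms above."""
    uptime = 1 - downtime_hours / hours_in_month
    if uptime >= uptime_threshold:
        return 0.0  # still within the SLA, no credit
    credit = monthly_bill * credit_per_hour * downtime_hours
    return min(credit, monthly_bill * cap_multiple)


if __name__ == "__main__":
    for hours in (0.5, 4, 26, 100):
        print(f"{hours} h down on a 100 bill -> credit {legacy_sla_credit(100, hours):.2f}")
```

With these invented numbers, a 26-hour outage already yields a credit larger than the monthly bill, which is the behaviour the comment describes.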

abujazar · 2025-10-20T08:10:28 1760947828

Depends on which service you're paying for. For pure hosting the answer is no, which is why it rarely makes sense to go AWS for uptime and stability because when it goes down there's nothing you can do. As opposed to bare metal hosting with redundancy across data centers, which can even cost less than AWS for a lot of common workloads.

systemvoltage · 2025-10-20T08:00:00 1760947200

What do you do if not AWS?

esskay · 2025-10-20T08:48:28 1760950108

There are literally thousands of options. 99% of people on AWS do not need to be on AWS. VPS servers or load balanced cloud instances from providers like Hetzner are more than enough for most people.

It still baffles me how we ended up in this situation where you can almost hear people's disapproval over the internet when you say AWS / Cloud isn't needed and you're throwing money away for no reason.

Nextgrid · 2025-10-20T10:30:30 1760956230

There's nothing particularly wrong with AWS, other than the pricing premium.

The key is that you need to understand no provider will actually put their ass on the line and compensate you for anything beyond their own profit margin, and plan accordingly.

For most companies, doing nothing is absolutely fine, they just need to plan for and accept the occasional downtime. Every company CEO wants to feel like their thing is mission-critical but the truth is that despite everything being down the whole thing will be forgotten in a week.

For those that actually do need guaranteed uptime, they need to build it themselves using a mixture of providers and test it regularly. They should be responsible for it themselves, because the providers will not. The stuff that is actually mission-critical already does that, which is why it didn't go down.

abujazar · 2025-10-20T08:06:30 1760947590

Been using AWS too, but for a critical service we mirrored across three Hetzner datacenters with master-master replication as well as two additional locations for cluster node voting.
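
For readers wondering why the two extra voting locations help: under simple majority voting (an assumption here, since the comment does not say which clustering software or quorum rule was used), going from three voters to five raises the number of locations that can fail while the cluster still agrees on who is alive.

```python
def quorum_size(voters: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority is still reachable."""
    return voters - quorum_size(voters)


if __name__ == "__main__":
    for voters in (3, 5):
        print(
            f"{voters} voters: quorum {quorum_size(voters)}, "
            f"tolerates {tolerated_failures(voters)} failure(s)"
        )
```

An odd voter count also avoids a clean 50/50 split when the network partitions, which is a common reason to add vote-only nodes.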