So there is access to the "degraded functionality" status from the start (the "3-15" "degraded functionality" one) - people are asking: why not share THAT, then?
Nobody cares about internal escalations, or whether a manager is taking shit or not - that's not service status, that's the internal process for dealing with the shit - it can surface as extra timestamped comments next to the service STATUS.
When you've guaranteed 4 or 5 nines' worth of uptime to the customer, every acknowledged outage results in refunds (and potentially being sued for breach of contract).
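For a sense of scale, here is a rough sketch of how little downtime 4 or 5 nines actually leaves. The function name and the 365-day year are my own assumptions for illustration; real SLAs define their own measurement periods and exclusions.

```python
def downtime_budget_minutes(nines: int, period_days: float = 365.0) -> float:
    """Allowed downtime in minutes for an availability of N nines.

    Hypothetical helper: 4 nines means 99.99% availability, so the
    allowed unavailability is 10**-4 of the period.
    """
    unavailability = 10.0 ** (-nines)
    period_minutes = period_days * 24 * 60
    return period_minutes * unavailability


if __name__ == "__main__":
    for n in (4, 5):
        print(f"{n} nines: {downtime_budget_minutes(n):.3f} minutes per year")
```

That works out to roughly 52.6 minutes per year for 4 nines and roughly 5.3 minutes per year for 5 nines, so a single acknowledged incident can use up the whole budget.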
Meh, I’ve never seen an uptime (SLA) guarantee that was worth anything anyway. The publicly offered ones, at least, are consistently toothless (I can’t comment on privately negotiated ones). I’ve written about it a few times, with a couple of specific examples: https://hn.algolia.com/?type=comment&query=sla+chrismorgan.
But not acknowledging actual outages, yeah, that would open you up to accusations of fraud, which is probably in theory much more serious.
Because the systems are so complex and capable of emergent behavior that you need a human in the loop to truly interpret behavior and impact. Just because an alert is going off doesn't mean that the alert was written properly, or is measuring the correct thing, or the customer is interpreting its meaning correctly, etc.
Health probes are at the easy end of the software complexity spectrum. This has nothing to do with complexity and everything to do with managing reputational damage in a shady way.
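To illustrate how simple a probe rollup can be, here is a minimal sketch. Everything in it (the `ProbeResult` type, the `rollup_status` function, the status names, and the all-or-nothing thresholds) is hypothetical and not taken from any real status page; it does not settle whether raw probe output reflects actual customer impact.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    name: str  # hypothetical probe identifier, e.g. "api-login"
    ok: bool   # True if the probe's check passed


def rollup_status(results: list[ProbeResult]) -> str:
    """Collapse individual probe results into one public status string."""
    if not results:
        return "unknown"  # no data is not the same as healthy
    failed = sum(1 for r in results if not r.ok)
    if failed == 0:
        return "operational"
    if failed == len(results):
        return "outage"
    return "degraded"  # some probes pass, some fail


if __name__ == "__main__":
    probes = [ProbeResult("api-login", True), ProbeResult("api-upload", False)]
    print(rollup_status(probes))
```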