If I were to summarize how we attacked this when I was on AWS (different team):
formal methods. Some of this started a long time ago so not sure if it was TLA, TLA+, or something else. (I am a useless manager type)
fake clients / servers to make testing possible
strict invariants
A simulator to fuzz/fault the entire system. We didn't get this until later in the life of the service but flushed out race condition bugs that would have taken years to do.
We never got to replaying customer traffic patterns which was a pet idea of mine but probably the juice wasn't worth the squeeze.