API Gateway -> Lambda -> NATS -> Docker backend on GPU
The backend is C++ and needs to be shielded from the web as much as possible. It also takes a fair wedge of time to deal with requests (1-5 seconds). The rationale is that the Lambda front end validates and routes, and the backend replies.
Before we moved to NATS we had the added bonus that using SQS meant there was no direct link between the front end and the backend apart from SQS itself (both Lambda and SQS are outside the VPC by default).
The backend has a strict schema, and to keep the number of moving parts down it seemed wise to avoid having to plumb in a webserver as well. Webservers really are badly suited to this sort of thing, and they add a boatload of extra latency.
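Validating at the Lambda edge is roughly this shape. A minimal Go sketch, assuming a hypothetical envelope type; the field names and `validate` function are illustrative, not the actual schema described above:

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical envelope for the strict backend schema.
// The field names here are assumptions for illustration only.
type Request struct {
	Area    string // which cached area the request targets
	Payload string // opaque body forwarded to the C++ backend
}

// validate rejects anything the backend schema would choke on,
// so malformed input never crosses the Lambda boundary.
func validate(r Request) error {
	if r.Area == "" {
		return errors.New("missing area")
	}
	if r.Payload == "" {
		return errors.New("missing payload")
	}
	return nil
}

func main() {
	ok := Request{Area: "eu-west", Payload: `{"query":"..."}`}
	fmt.Println(validate(ok))        // <nil>
	fmt.Println(validate(Request{})) // missing area
}
```

The point of the split is that only requests that already satisfy the schema ever reach the C++ process, so it never has to parse hostile input.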
We have a large number of machines, each of which deals with certain areas. This allows us to intelligently cache hotter areas closer to clients.
Request-reply is a reasonable pattern for interacting with backends whose capacity is constantly changing based on demand. Done correctly, it can also be a way to optimise for latency over capacity.
With NATS you can ask for more than one response, which might also be useful. Some other systems use a "fastest reply wins" approach, which guarantees a fast response at the expense of capacity.
Why not stick with a load balancer? I've been struggling to find a good use case for request-response over message queues, so I'm curious.