To the extent that you're memory-bandwidth limited, you should be able to run multiple inferences at once: latency stays high, but getting multiple samplings can be extremely useful and compensates somewhat for that latency.
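A rough roofline sketch of why this works: in batched decoding, the weights are streamed from memory once per step and reused across every sequence in the batch, so tokens/sec scales with batch size until you hit the compute ceiling. All the hardware and model numbers below are illustrative assumptions, not measurements, and KV-cache traffic is ignored for simplicity.

```python
# Back-of-envelope roofline for batched decoding.
# Assumed figures: 7B fp16 model, 100 GB/s memory bandwidth, 20 TFLOP/s compute.

PARAMS = 7e9          # assumed 7B-parameter model
BYTES_PER_PARAM = 2   # fp16 weights
MEM_BW = 100e9        # assumed memory bandwidth, bytes/s
COMPUTE = 20e12       # assumed usable compute, FLOP/s

def tokens_per_sec(batch: int) -> float:
    weight_bytes = PARAMS * BYTES_PER_PARAM
    step_time_mem = weight_bytes / MEM_BW             # one weight pass per decode step
    step_time_flops = 2 * PARAMS * batch / COMPUTE    # ~2 FLOPs per param per token
    # The step takes as long as the slower of the two limits.
    return batch / max(step_time_mem, step_time_flops)

for b in (1, 4, 16, 64):
    print(f"batch {b:3d}: ~{tokens_per_sec(b):6.0f} tok/s")
```

Under these assumed numbers throughput grows nearly linearly with batch size, because each extra sequence rides along on the same weight reads.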
To an extent, but memory bandwidth soon becomes a bottleneck there too. The hidden state and the KV cache are large, so it becomes a matter of how fast you can move data in and out of your L2 cache. If you don't have a unified memory pool, it gets even worse.
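To put some numbers on that, here's a minimal KV-cache sizing sketch: each layer stores two tensors (K and V) of n_kv_heads * head_dim values per token. The config below is an assumption modeled on a typical 7B architecture without grouped-query attention.

```python
# Assumed 7B-style config: 32 layers, 32 KV heads, head_dim 128, fp16.
N_LAYERS = 32
N_KV_HEADS = 32
HEAD_DIM = 128
BYTES = 2  # fp16

def kv_cache_bytes(seq_len: int, batch: int = 1) -> int:
    # K and V each hold n_kv_heads * head_dim values per token per layer.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES
    return per_token * seq_len * batch

print(kv_cache_bytes(4096) / 2**20, "MiB per 4k-token sequence")  # ~2048 MiB
```

Under these assumptions that's about 512 KiB of cache per token, roughly 2 GiB for one 4k-context sequence, and it multiplies by the batch size, which is exactly why batching pushes you back into the bandwidth problem.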