If I understand correctly, a big problem is that the calculation isn't embarrassingly parallel: the various chunks are not independent, so you need to do a lot of IO to get the step-N results from your neighbours before you can calculate step N+1.
Using more, smaller nodes means your cross-node IO is going to explode. You might save money on your compute hardware, but I wouldn't be surprised if you ended up spending even more on the network hardware side.
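To make that concrete, here's a rough back-of-the-envelope sketch (plain Python; the grid size, bytes per cell, and halo depth are all made up for illustration): a 3D grid split across p nodes, where each node has to exchange a one-cell-deep halo with its neighbours every timestep.

```python
# Back-of-the-envelope halo-exchange traffic for a 3D grid split across p nodes.
# All numbers (grid size, bytes per cell, halo depth) are hypothetical.

def halo_bytes_per_step(n_side, num_nodes, bytes_per_cell=8, halo_depth=1):
    """Rough bytes one node exchanges with its neighbours each timestep."""
    cells_per_node = n_side ** 3 / num_nodes
    sub_side = cells_per_node ** (1 / 3)            # side length of each node's sub-cube
    surface_cells = 6 * sub_side ** 2 * halo_depth  # six faces to send/receive
    return surface_cells * bytes_per_cell

N = 4096  # cells per side of the global domain (hypothetical)
for p in (8, 64, 512):
    per_node = halo_bytes_per_step(N, p)
    print(f"{p:4d} nodes: ~{per_node / 1e9:.2f} GB per node per step, "
          f"~{per_node * p / 1e9:.1f} GB across the cluster per step")
```

Per-node traffic shrinks as you add nodes, but cluster-wide traffic keeps growing while each node's share of the compute falls off much faster, so the communication-to-compute ratio gets steadily worse. That's why the savings on nodes tend to get eaten by the interconnect.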
I'm not sure where else you can get a half TB of ~800 GB/s memory for under $10k. (Though that's the M3 Ultra, don't know about the M5.) Is there something competitive in the Nvidia ecosystem?
I wasn't aware that the M3 Ultra offered a half terabyte of unified memory, but an RTX 5090 has double that bandwidth, and that's before we even get into the B200 (~8 TB/s).
You could get one M3 Ultra with 512 GB of unified RAM for the price of two RTX 5090s totaling 64 GB of VRAM, and that's not counting the cost of a rig capable of actually running two RTX 5090s.
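To put rough numbers on that tradeoff (spec figures are approximate and from memory, so double-check them): a crude but useful metric for inference is how long it takes to stream the device's whole memory once, i.e. memory / bandwidth, since a dense model that fills the device has to be read roughly once per token.

```python
# Capacity vs bandwidth, back of the envelope. Spec figures are approximate;
# "full read" = memory / bandwidth, a crude lower bound on time per token
# for a dense model that fills the device.
devices = {
    "M3 Ultra (unified)": {"mem_gb": 512, "bw_gbps": 819},
    "RTX 5090 (GDDR7)":   {"mem_gb": 32,  "bw_gbps": 1792},
    "B200 (HBM3e)":       {"mem_gb": 192, "bw_gbps": 8000},
}
for name, d in devices.items():
    full_read_ms = d["mem_gb"] / d["bw_gbps"] * 1000
    print(f"{name:20s} {d['mem_gb']:4d} GB at ~{d['bw_gbps']:5d} GB/s "
          f"-> ~{full_read_ms:3.0f} ms to stream it all once")
```

So the Mac wins on what fits and loses on how fast it can read it, which is basically the two sides of the argument above.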
I don't think I can recommend the Mac Studio for AI inference until the M5 comes out. And even then, it remains to be seen how fast those GPUs are or if we even get an Ultra chip at all.