I follow the MLX team on Twitter and they sometimes post about using MLX on two ...

awnihannun · 2025-12-12T22:23:55 1765578235

For a bit more context, those posts are using pipeline parallelism. For N machines put the first L/N layers on machine 1, next L/N layers on machine 2, etc. With pipeline parallelism you don't get a speedup over one machine - it just buys you the ability to use larger models than you can fit on a single machine.
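
As a rough illustration of that layer split (a sketch only, not MLX's actual code; `partition_layers` is a hypothetical helper, and giving the remainder layers to the first machines is an assumption):

```python
def partition_layers(num_layers: int, num_machines: int) -> list[range]:
    """Assign contiguous blocks of layers to machines, as evenly as possible."""
    base, extra = divmod(num_layers, num_machines)
    stages = []
    start = 0
    for rank in range(num_machines):
        # The first `extra` machines take one additional layer each.
        size = base + (1 if rank < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


if __name__ == "__main__":
    for rank, layers in enumerate(partition_layers(32, 4)):
        print(f"machine {rank + 1}: layers {layers.start}..{layers.stop - 1}")
```

A token's activations still pass through the machines one after another, which is why this layout adds capacity rather than speed for a single request.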

The release in Tahoe 26.2 will enable us to do fast tensor parallelism in MLX. Each layer of the model is sharded across all machines. With this type of parallelism you can get close to N-times faster for N machines. The main challenge is latency since you have to do much more frequent communication.

dpe82 · 2025-12-13T00:24:22 1765585462

> The main challenge is latency since you have to do much more frequent communication.

Earlier this year I experimented with building a cluster to do tensor parallelism across large-cache CPUs (AMD EPYC 7773X have 768 MB of L3). My thought was to keep an entire model in SRAM and take advantage of the crazy memory bandwidth between CPU cores and their cache, and use Infiniband between nodes for the scatter/gather operations.

Turns out the sum of intra-core latency and PCIe latency absolutely dominates. The Infiniband fabric is damn fast once you get data to it, but getting it there quickly is a struggle. CXL would help but I didn't have the budget for newer hardware. Perhaps modern Apple hardware is better for this than x86 stuff.

wmf · 2025-12-13T01:50:00 1765590600

That's how Groq works. A cluster of LPUv2s would probably be faster and cheaper than an Infiniband cluster of Epycs.

dpe82 · 2025-12-13T09:33:16 1765618396

Yeah I'm familiar; I was hoping I could do something related on previous generation commodity(ish) hardware. It didn't work but I learned a ton.

fooblaster · 2025-12-13T04:36:43 1765600603

what is an lpuv2

wmf · 2025-12-13T04:51:47 1765601507

The chip that Groq makes.

aimanbenbaha · 2025-12-13T13:25:13 1765632313

Exo-Labs is an open source project that allows this too (pipeline parallelism, I mean, not the latter). It's device agnostic, meaning you can daisy-chain anything you have that has memory and the implementation will intelligently shard model layers across them. It's slow, though it scales linearly with concurrent requests.

Exo-Labs: https://github.com/exo-explore/exo

liuliu · 2025-12-12T22:51:55 1765579915

But that's only for prefilling right? Or is it beneficial for decoding too (I guess you can do KV lookup on shards, not sure how much speed-up that will be though).

zackangelo · 2025-12-12T23:00:10 1765580410

No you use tensor parallelism in both cases.

The way it typically works in an attention block is: smaller portions of the Q, K and V linear layers are assigned to each node and are processed independently. Attention, RoPE, norm, etc. are run on the node-specific output of that. Then, when the output linear layer is applied, an "all reduce" is computed, which combines the output of all the nodes.

EDIT: just realized it wasn't clear -- this means that each node ends up holding a portion of the KV cache specific to its KV tensor shards. This can change based on the specific style of attention (e.g., in GQA where there are fewer KV heads than ranks you end up having to do some replication etc)
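
A toy sketch of that shard-then-all-reduce pattern, in plain Python with no real networking. Everything here is an assumption for illustration: `matvec`, `all_reduce_sum` and `tensor_parallel_forward` are hypothetical names, the attention step itself is omitted, and the hidden size is assumed to divide evenly across ranks.

```python
def matvec(W, x):
    """y = W @ x for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def all_reduce_sum(partials):
    """Stand-in for a network all-reduce: elementwise sum across ranks."""
    return [sum(vals) for vals in zip(*partials)]


def tensor_parallel_forward(W_in, W_out, x, num_ranks):
    """Compute W_out @ (W_in @ x) with the hidden dimension sharded by rank."""
    hidden = len(W_in)  # rows of W_in play the role of heads / hidden units
    per_rank = hidden // num_ranks
    partials = []
    for rank in range(num_ranks):
        lo, hi = rank * per_rank, (rank + 1) * per_rank
        # Each rank owns a slice of the input projection (like a subset of heads).
        h_local = matvec(W_in[lo:hi], x)
        # ...and the matching columns of the output projection.
        W_out_local = [row[lo:hi] for row in W_out]
        partials.append(matvec(W_out_local, h_local))
    # Every rank holds only a partial sum until the all-reduce combines them.
    return all_reduce_sum(partials)


if __name__ == "__main__":
    W_in = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
    W_out = [[1, 0, 2, 1], [0, 1, 1, 3]]
    x = [1, 2, 3]
    print(tensor_parallel_forward(W_in, W_out, x, num_ranks=2))
    print(matvec(W_out, matvec(W_in, x)))  # unsharded reference
```

The two printed vectors should match. The only cross-rank communication is the single all-reduce after the output projection, and that is the step that happens once per block per token.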

liuliu · 2025-12-12T23:25:42 1765581942

I usually call it "head parallelism" (which is a type of tensor parallelism, but parallelized for small clusters, and specific to attention). That is what you described: sharding the input tensor by number of heads and sending each piece to its respective Q, K, V shard. Each shard can do Q / K / V projection, RoPE, QK norm, whatever, and attention, all inside that particular shard. The out projection will be done in that shard too, but then you need an all-reduce sum among shards to get the final out projection broadcast to every participating shard, after which each carries on with whatever else on its own.

What I am asking, however, is whether that will speed up decoding as linearly as it would for prefilling.

awnihannun · 2025-12-13T00:33:43 1765586023

Right, my comment was mostly about decoding speed. For prefill you can get a speed up but there you are less latency bound.

In our benchmarks with MLX / mlx-lm it's as much as 3.5x for token generation (decoding) at batch size 1 over 4 machines. In that case you are memory bandwidth bound so sharding the model and KV cache 4-ways means each machine only needs to access 1/4th as much memory.
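
A back-of-envelope way to read the 3.5x figure, as a sketch under a stated assumption: per-token time on N machines is modeled as the single-machine time divided by N (memory reads) plus a fixed sync cost. That model and the function names below are illustrative, not a measured breakdown from the benchmarks.

```python
def decode_speedup(num_machines: int, comm_fraction: float) -> float:
    """Speedup if per-token time is t/N for memory reads plus comm_fraction * t for sync."""
    return 1.0 / (1.0 / num_machines + comm_fraction)


def implied_comm_fraction(num_machines: int, speedup: float) -> float:
    """Invert the model: sync cost (as a fraction of t) implied by an observed speedup."""
    return 1.0 / speedup - 1.0 / num_machines


if __name__ == "__main__":
    n, observed = 4, 3.5
    frac = implied_comm_fraction(n, observed)
    print(f"parallel efficiency: {observed / n:.1%}")
    print(f"implied sync cost: {frac:.2%} of single-machine token time")
    print(f"model check: {decode_speedup(n, frac):.2f}x")
```

Under this model, 3.5x on 4 machines corresponds to 87.5% parallel efficiency, with sync costing roughly 1/28 of the single-machine per-token time.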

liuliu · 2025-12-13T01:20:40 1765588840

Oh! That's great to hear. Congrats! Now, I want to get the all-to-all primitives ready in s4nnc...

monster_truck · 2025-12-12T23:25:30 1765581930

Even if it wasn't outright beneficial for decoding by itself, it would still allow you to connect a second machine running a smaller, more heavily quantized version of the model for speculative decoding which can net you >4x without quality loss
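
A minimal greedy sketch of why speculative decoding need not cost quality. The toy `target` and `draft` callables are made up for illustration, and real implementations verify all drafted positions in one batched forward pass and handle sampling, which this omits.

```python
def greedy_decode(target, prompt, num_tokens):
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target(out))
    return out


def speculative_decode(target, draft, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    out = list(prompt)
    goal = len(prompt) + num_tokens
    passes = 0
    while len(out) < goal:
        # The cheap draft model proposes k tokens in a row.
        proposal = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # One verification pass: keep the longest prefix the target agrees with.
        passes += 1
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] != expected:
                out.extend(proposal[:i])
                out.append(expected)  # the target's own token replaces the miss
                break
        else:
            out.extend(proposal)
    return out[:goal], passes


if __name__ == "__main__":
    # Toy models: the draft agrees with the target except at some positions.
    target = lambda ctx: (ctx[-1] * 3 + 1) % 7
    draft = lambda ctx: (target(ctx) + 1) % 7 if len(ctx) % 5 == 0 else target(ctx)
    tokens, passes = speculative_decode(target, draft, [1], 20)
    print(tokens == greedy_decode(target, [1], 20), passes)
```

Every emitted token is one the target model would have produced anyway, so the output matches plain greedy decoding. The gain comes from needing fewer sequential target passes than tokens.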

anemll · 2025-12-13T03:51:59 1765597919

Tensor Parallel test with RDMA last week https://x.com/anemll/status/1996349871260107102

Note fast sync workaround

andy99 · 2025-12-12T22:25:11 1765578311

I’m hoping this isn’t as attractive as it sounds for non-hobbyists because the performance won’t scale well to parallel workloads or even context processing, where parallelism can be better used.

Hopefully this makes it really nice for people that want to experiment with LLMs and have a local model but means well funded companies won’t have any reason to grab them all vs GPUs.

api · 2025-12-12T23:31:23 1765582283

No way buying a bunch of minis could be as efficient as much denser GPU racks. You have to consider all the logistics and power draw, and high end nVidia stuff and probably even AMD stuff is faster than M series GPUs.

What this does offer is a good alternative to GPUs for smaller scale use and research. At small scale it’s probably competitive.

Apple wants to dominate the pro and serious amateur niches. Feels like they’re realizing that local LLMs and AI research is part of that, is the kind of thing end users would want big machines to do.

gumboshoes · 2025-12-12T23:59:18 1765583958

Exactly: The AI appliance market. A new kind of home or small-business server.

jabbywocker · 2025-12-13T00:30:49 1765585849

I’m expecting Apple to release a new Mac Pro in the next couple years whose main marketing angle is exactly this

firecall · 2025-12-13T00:48:20 1765586900

Seems like it could be a thing.

Also, I’m curious and in case anyone that knows reads this comment:

Apple say they can’t get the performance they want out of discrete GPUs.

Fair enough. But yet nVidia becomes the most valuable company in the world selling GPUs.

So…

Now I get that Apple’s use case is essentially sealed consumer devices built with power consumption and performance tradeoffs in mind.

But could Apple use its Apple Silicon tech to build a Mac Pro with its own expandable GPU options?

Or even other brand GPUs knowing they would be used for AI research etc…. If Apple ever make friends with nVidia again of course :-/

What we know of Tim Cook’s Apple is that it doesn’t like to leave money on the table, and clearly they are right now!

jabbywocker · 2025-12-13T01:02:41 1765587761

There’s been rumors of Apple working on M-chips that have the GPU and CPU as discrete chiplets. The original rumor said this would happen with the M5 Pro, so it’s potentially on the roadmap.

Theoretically they could farm out the GPU to another company but it seems like they’re set on owning all of the hardware designs.

nntwozz · 2025-12-13T03:19:36 1765595976

Apple always strives for complete vertical integration.

SJ loved to quote Alan Kay:

"People who are really serious about software should make their own hardware."

Qualcomm are the latest on the chopping block, history repeating itself.

If I were a betting man I'd say Apple's never going back.

jabbywocker · 2025-12-14T01:54:26 1765677266

Yeah outside of TSMC, I don’t see them ever going back to having a hardware partner.

storus · 2025-12-13T17:57:06 1765648626

TSMC has a new tech that allows seamless integration of mini chiplets, i.e. you can add as many CPU/GPU cores in mini chiplets as you wish and glue them seamlessly together, at least in theory. The rumor is that TSMC had some issues with it which is why M5P and M5M are delayed.

api · 2025-12-13T13:57:33 1765634253

It’s really the only common reason to buy a machine that big these days. I could see a Mac Pro with a huge GPU and up to a terabyte of RAM.

I guess there are other kinds of scientific simulation, very large dev work, etc., but those things are quite a bit more niche.

alwillis · 2025-12-13T18:32:17 1765650737

> I’m expecting Apple to release a new Mac Pro in the next couple years

I think Apple is done with expansion slots, etc.

You'll likely see M5 Mac Studios fairly soon.

jabbywocker · 2025-12-14T02:02:12 1765677732

I’m not saying a Mac Pro with expansion slots, I’m saying a Mac Pro whose marketing angle is locally running AI models. A hungry market that would accept moderate performance and is already used to bloated price tags has to have them salivating.

I think the hold up here is whether TSMC can actually deliver the M5 Pro/Ultra and whether the MLX team can give them a usable platform.

pjmlp · 2025-12-13T15:22:45 1765639365

I fear they no longer care about the workstation market; even the folks at ATP Podcast are on the verge of accepting it.

FuckButtons · 2025-12-13T04:36:46 1765600606

Power draw? An entire Mac Pro running flat out uses less power than 1 5090. If you have a workload that needs a huge memory footprint then the TCO of the Macs, even with their markup, may be lower.

codazoda · 2025-12-12T22:41:26 1765579286

I haven’t looked yet but I might be a candidate for something like this, maybe. I’m RAM constrained and, to a lesser extent, CPU constrained. It would be nice to offload some of that. That said, I don’t think I would buy a cluster of Macs for that. I’d probably buy a machine that can take a GPU.

ChrisMarshallNY · 2025-12-13T09:26:14 1765617974

I’m not particularly interested in training models, but it would be nice to have eGPUs again. When Apple Silicon came out, support for them dried up. I sold my old BlackMagic eGPU.

That said, the need for them also faded. The new chips have performance every bit as good as the eGPU-enhanced Intel chips.

andy_ppp · 2025-12-13T14:36:15 1765636575

eGPU with an Apple accelerator with a bunch of RAM and GPU cores could be really interesting honestly. I’m pretty sure they are capable of designing something very competitive especially in terms of performance per watt.

sroussey · 2025-12-13T23:36:53 1765669013

Really, that’s a place for the Mac Pro: slide-in SoCs with RAM modules / blades. Put 4, 8, 16 Ultra chips in one machine.

andy_ppp · 2025-12-14T11:39:02 1765712342

You honestly don’t need extra CPUs in this system at some point do you?

sroussey · 2025-12-14T16:10:57 1765728657

They are inseparable for Apple: CPUs/GPUs/memory. They can use chiplets to tweak ratios, but I doubt they will change the underlying module format—everything together.

My suggestion is to accept that format and just provide a way to network them at a low level via pci or better.

willtemperley · 2025-12-13T02:44:24 1765593864

I think it’s going to be great for smaller shops that want on premise private cloud. I’m hoping this will be a win for in-memory analytics on macOS.

bigyabai · 2025-12-12T23:21:11 1765581671

The lack of official Linux/BSD support is enough to make it DOA for any serious large-scale deployment. Until Apple figures out what they're doing on that front, you've got nothing to worry about.

mjlee · 2025-12-13T12:13:46 1765628026

Why? AWS manages to do it (https://aws.amazon.com/ec2/instance-types/mac/). Smaller companies too - https://macstadium.com

Having used both professionally, once you understand how to drive Apple's MDM, Mac OS is as easy to sysadmin as Linux. I'll grant you it's a steep learning curve, but so is Linux/BSD if you're coming at it fresh.

In certain ways it's easier - if you buy a device through Apple Business you can have it so that you (or someone working in a remote location) can take it out of the shrink wrap, connect it to the internet, and get a configured and managed device automatically. No PXE boot, no disk imaging, no having it shipped to you to configure and ship out again. If you've done it properly the user can't interrupt/corrupt the process.

The only thing they're really missing is an iLO. I can imagine how AWS solved that, but I'd love to know.

bigyabai · 2025-12-14T02:54:17 1765680857

Where in the world are you working where MDM is the limiting factor on Linux deployments? North Korea?

Macs are a minority in the datacenter even compared to Windows server. The concept of a datacenter Mac would disappear completely if Apple let free OSes sign macOS/iOS apps.

mjlee · 2025-12-14T09:15:44 1765703744

I’m talking about using MDM with Mac OS (to take advantage of Apple Silicon, not licensing) in contrast to the tools we already have with other OSes. Probably you could do it to achieve a large scale on prem Linux deployment, fortunately I’ve never tried.

bigyabai · 2025-12-15T17:43:33 1765820613

Well, be that as it may, it's quite unrelated to deploying Macs in the datacenter. It's definitely not a selling point to people putting Proxmox or k8s on their machines.

Eggpants · 2025-12-13T00:18:59 1765585139

Not sure I understand, Mac OS is BSD based. https://en.wikipedia.org/wiki/Darwin_(operating_system)

bigyabai · 2025-12-13T00:47:34 1765586854

macOS is XNU-based. There is BSD code that runs at the microkernel level and BSD tools in the userland, but the kernel does not resemble BSD's architecture or adopt BSD's license.

This is an issue for some industry-standard software like CUDA, which does provide BSD drivers with ARM support that just never get adopted by Apple: https://www.nvidia.com/en-us/drivers/unix/

7e · 2025-12-13T02:25:35 1765592735

If there were TCO advantages with this setup, CUDA would not be a blocker.

bigyabai · 2025-12-13T04:39:53 1765600793

CUDA's just one example; there's a lot of hardware support on the BSDs that Apple doesn't want to inherit.

ngcc_hk · 2025-12-13T05:54:34 1765605274

Why maintain others' code and carry the baggage?

bigyabai · 2025-12-13T18:08:10 1765649290

Because Apple already does...? There's still PowerPC and MIPS code that runs in macOS. Asking for CUDA compatibility is not somehow too hard for the trillion-dollar megacorp to handle.

CamperBob2 · 2025-12-13T02:09:56 1765591796

Almost the most impressive thing about that is the power consumption. ~50 watts for both of them? Am I reading it wrong?

wmf · 2025-12-13T04:07:06 1765598826

Yeah, two Mac Studios is going to be ~400 W.

CamperBob2 · 2025-12-13T04:15:29 1765599329

What am I missing? https://i.imgur.com/YpcnlCH.png

(Edit: interesting, thanks. So the underlying OS APIs that supply the power-consumption figures reported by asitop are just outright broken. The discrepancy is far too large to chalk up to static power losses or die-specific calibration factors that the video talks about.)

wmf · 2025-12-13T04:51:05 1765601465

https://www.youtube.com/watch?v=zCkbVLqUedg

m-s-y · 2025-12-13T05:22:42 1765603362

Can confirm. My M3 Ultra tops out at 210W when ComfyUI or ollama is running flat out. Confirmed via smart plug.