
> necessary for fast enough transfer speeds

Source?



When was the last time you saw a GPU with slottable memory?

For transfer speeds, look at the data sheets for the M series: the Pro, Max, and Ultra parts are several times faster than dual-channel DDR4 or DDR5, in the ballpark of GPU memory.
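
Rough back-of-the-envelope numbers, treating peak theoretical bandwidth as bus width (in bytes) times transfer rate. The specific parts and rates below are my own illustrative picks from public spec sheets, so take the output as approximate:

    # Theoretical peak bandwidth in GB/s = (bus width in bits / 8) * transfer rate (MT/s) / 1000.
    # Parts and rates are approximate public figures, chosen for illustration only.
    def peak_gbps(width_bits: int, rate_mts: int) -> float:
        return width_bits / 8 * rate_mts / 1000

    parts = [
        ("DDR4-3200, dual channel",      128, 3200),   # ~51 GB/s
        ("DDR5-5600, dual channel",      128, 5600),   # ~90 GB/s
        ("Apple M1, LPDDR4X-4266",       128, 4266),   # ~68 GB/s
        ("Apple M1 Pro, LPDDR5-6400",    256, 6400),   # ~205 GB/s
        ("Apple M1 Max, LPDDR5-6400",    512, 6400),   # ~410 GB/s
        ("RTX 3090, GDDR6X @ 19.5 GT/s", 384, 19500),  # ~936 GB/s
    ]

    for name, width, rate in parts:
        print(f"{name}: ~{peak_gbps(width, rate):.0f} GB/s")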


Would the people who were buying the baseline 8GB model (presumably just for general computing/office work) care about the GPU being slightly slower, though?

I bet the extreme lag when you run out of memory (an Electron app or two, several browser tabs, and something like Excel open at once) is far more noticeable.

Hardly anyone uses Macs for gaming these days, and almost anyone doing GPU-intensive work would need more than 16GB anyway.


This has been the approach since the M1s.

See: https://www.theregister.com/2020/11/19/apple_m1_high_bandwid...

> The SoC has access to 16GB of unified memory. This uses 4266 MT/s LPDDR4X SDRAM (synchronous DRAM) and is mounted with the SoC using a system-in-package (SiP) design. A SoC is built from a single semiconductor die whereas a SiP connects two or more semiconductor dies.



Source for what? Parallel RAM interfaces have strict timing and electrical requirements. Classic DDR sockets buy modularity at the cost of peak bandwidth and bus width: the wider your bus, the more traces you have to run in parallel from the socket to the compute complex, and routing and matching those gets harder and harder. There is a good reason you don't see sockets for HBM or GDDR.

The LPCAMM solutions mentioned upthread resolve some of this by making the problem more "three dimensional", from what I can tell. They shorten the traces by making the pinout more "square" (as opposed to thin and rectangular) and by sitting closer to the dies they connect to. That lets you cram swappable memory into the same form factor while keeping the same clock speeds, size, and bus width, and without as many of the design headaches that come from long socket traces.

In Apple's case, they connect their GPU to the same pool of memory that their CPU uses. This is a key piece of the puzzle for their design, because even if the CPU doesn't need 200GB/s of bandwidth, GPUs are a very different story: if you want them to do work, you have to keep them fed, and that takes a lot of memory bandwidth. Note that Samsung's LPCAMM solution is only 128 bits wide and reported at around 120GB/s. Apple has gone as high as 1024-bit buses with hundreds of GB/s of bandwidth; the M1 Max was released years ago and does 400GB/s. LPCAMM is still useful and a good improvement over the status quo, of course, but I don't think you're going to see 256-bit or 512-bit versions anytime soon.
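
Quick sanity check on those two figures, assuming Samsung's announced LPDDR5X-7500 parts for LPCAMM and the commonly reported 512-bit LPDDR5-6400 configuration for the M1 Max (both configurations are assumptions on my part):

    # Bandwidth check for the two figures above; configurations are assumed from public reporting.
    def peak_gbps(width_bits: int, rate_mts: int) -> float:
        return width_bits / 8 * rate_mts / 1000

    print(f"LPCAMM, 128-bit LPDDR5X-7500: ~{peak_gbps(128, 7500):.0f} GB/s")  # ~120 GB/s
    print(f"M1 Max, 512-bit LPDDR5-6400:  ~{peak_gbps(512, 6400):.0f} GB/s")  # ~410 GB/s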

And if your problem can be parallelized, then the wider your bus, the lower your clock speeds can go, which means lower power at the same level of performance. This same dynamic is how an A100 (5120-bit HBM2 bus) can smoke a 3090 (384-bit GDDR6X) despite far lower clock speeds and power usage.
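
To put numbers on that trade-off: for a fixed bandwidth target, the required per-pin transfer rate falls linearly as the bus gets wider, which is where the power and signal-integrity headroom comes from. The 400GB/s target and the widths below are just illustrative picks:

    # Per-pin transfer rate needed to hit a fixed bandwidth target at various bus widths.
    # rate (MT/s) = target (GB/s) * 1000 / (width_bits / 8)
    TARGET_GBPS = 400  # roughly M1 Max-class bandwidth, chosen for illustration

    for width_bits in (128, 384, 512, 1024, 5120):
        rate_mts = TARGET_GBPS * 1000 / (width_bits / 8)
        print(f"{width_bits:>5}-bit bus needs ~{rate_mts:,.0f} MT/s per pin for {TARGET_GBPS} GB/s")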

There is no magic secret or trick. You will always get better performance, less signal noise, and lower power by integrating these components directly. It's a matter of whether that makes sense given the rest of your design decisions -- like whether your GPU shares the memory pool or not.

There are alternative memory solutions, like IBM using serial interfaces (OMI) in the Power10 series to disaggregate RAM and drive per-pin clock speeds higher, which lets you kind of "socket-ify" memory at GDDR-like bandwidth. But these are mostly unobtainium, and nobody is doing them in consumer hardware.



