
Because the LLM craze has rendered last-gen Tensor accelerators from NVIDIA (& others) useless for all those FP64 HPC workloads. From the article:

> The Hopper H200 is 47.9 gigaflops per watt at FP64 (33.5 teraflops divided by 700 watts), and the Blackwell B200 is rated at 33.3 gigaflops per watt (40 teraflops divided by 1,200 watts). The Blackwell B300 has FP64 severely deprecated at 1.25 teraflops and burns 1,400 watts, which is 0.89 gigaflops per watt. (The B300 is really aimed at low precision AI inference.)
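The efficiency figures in the quote are just peak FP64 TFLOPS divided by board power. A quick sketch to reproduce them (numbers taken from the quote above):

```python
def gflops_per_watt(tflops, watts):
    """Convert peak FP64 TFLOPS and board power (W) into GFLOPS per watt."""
    return tflops * 1000.0 / watts

print(round(gflops_per_watt(33.5, 700), 1))    # H200:  47.9
print(round(gflops_per_watt(40.0, 1200), 1))   # B200:  33.3
print(round(gflops_per_watt(1.25, 1400), 2))   # B300:  0.89
```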



Do cards with intentionally handicapped FP64 actually use anywhere near their TDP when doing FP64? It's my understanding that FP64 performance is limited at the hardware level--whether by fusing off the extra circuits, or omitting them from the die entirely--in order to prevent aftermarket unlocks. So I would be quite surprised if the card could draw that much power when it's intentionally using only a small fraction of the silicon.


It's really to save die space for other functions; AFAIU there's no fusing to lock the feature away or anything like that.


I'm finding conflicting info on this. It seems to come down to the specific GPU/core/microarchitecture. In some cases, the "missing" FP64 units do physically exist on the die but have been disabled (likely some were defective in manufacturing anyway), and this disabling can't be undone with custom firmware AFAIK (though I believe modern NVIDIA cards will only load NVIDIA-signed firmware anyway). Then there are dies that don't include the "missing" FP64 units at all, so there's nothing to disable (though manufacturing defects may still lead to other components being disabled for market segmentation and improved yields). This also seems to be changing over time: including lots of FP64 units and disabling them on consumer cards seems to have been more common in the past.

Nevertheless, my point is more that if FP64 performance is poor on purpose, then you're probably not using anywhere near the card's TDP to do FP64 calculations, so FLOPS/watt(TDP) is misleading.


In general: consumer cards with very bad FP64 performance have it fused off for product segmentation reasons, datacenter GPUs with bad FP64 performance have it removed from the chip layout to specialize for low precision. In either case, the main concern shouldn't be FLOPS/W but the fact that you're paying for so much silicon that doesn't do anything useful for HPC.


This theory only makes sense if consumer cards are sharing dies with enterprise/datacenter cards. If the consumer card SKUs are on their own dies, they're not going to etch something into silicon only to then fuse it off after the fact.

Regardless, there are "tricks" you can use to extend the precision of hardware floating point: using a pair of e.g. FP32 numbers to implement something that's "almost" an FP64. Well known among numerics practitioners as double-float (or float-float) arithmetic.
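The building block of that pair-of-floats trick is an error-free transformation like Knuth's TwoSum, which captures the rounding error of an addition in a second value. A minimal sketch in Python (Python floats are IEEE FP64, used here as a stand-in for the FP32 pairs you'd actually use on a GPU):

```python
def two_sum(a, b):
    """Knuth's TwoSum: returns (s, e) such that s + e == a + b exactly,
    where s is the rounded floating-point sum and e is the rounding error."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

# 2**-60 is far below FP64's precision at 1.0, so the naive sum loses it
# entirely -- but the (hi, lo) pair preserves it.
hi, lo = two_sum(1.0, 2.0**-60)
print(hi)  # 1.0 (the small term vanished from the rounded sum)
print(lo)  # 2**-60, the lost part, recovered exactly
```

Chaining these transformations through additions and multiplications is how "double-float" libraries get roughly twice the native significand width out of hardware that only has the narrower format.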


Until recently, consumer, workstation, and datacenter GPUs would all share a single core design that was instantiated in varying quantities per die to create a product stack. The largest die would often have little to no presence in the consumer market, but fundamentally it was made from the same building blocks. Now, having an entirely separate or at least heavily specialized microarchitecture for data center parts is common (because the extra design costs are worth it), but most workstation cards are still using the same silicon as consumer cards with different binning and feature fusing.


Consumer cards don't share dies with datacenter cards, but they do share dies with workstation cards (the former Quadro line); e.g. the GB202 die is used by both the RTX PRO 5000/6000 Blackwell and the RTX 5090.


I know some consumer cards have artificially limited FP64, but the AI-focused datacenter cards have physically fewer FP64 units. Recently, the GB300 removed almost all of them, to the point that a GB300 actually has lower FP64 throughput than the 9-year-old P100. FP32 is the highest precision used during training, so it makes sense.


A 53×53 bit multiplier is more than 4× the size of a 24×24 bit multiplier.
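A back-of-envelope check of that claim, assuming the area of a naive array multiplier scales roughly with the product of the operand widths:

```python
# FP64 has a 53-bit significand (52 stored bits + 1 implicit); FP32 has 24.
fp64_cells = 53 * 53   # 2809 partial-product cells
fp32_cells = 24 * 24   # 576
print(fp64_cells / fp32_cells)   # ~4.88, i.e. "more than 4x"
```

Real multipliers use Booth encoding and compressor trees rather than a plain array, but the roughly quadratic growth in area with significand width is the point.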



