
Because the LLM craze has rendered last-gen Tensor accelerators from NVIDIA (& others) useless for all those FP64 HPC workloads. From the article:

> The Hopper H200 is 47.9 gigaflops per watt at FP64 (33.5 teraflops divided by 700 watts), and the Blackwell B200 is rated at 33.3 gigaflops per watt (40 teraflops divided by 1,200 watts). The Blackwell B300 has FP64 severely deprecated at 1.25 teraflops and burns 1,400 watts, which is 0.89 gigaflops per watt. (The B300 is really aimed at low precision AI inference.)
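The efficiency figures in the quote are just peak FP64 TFLOPS divided by board power. A quick sketch to reproduce them (numbers taken from the quote above):

```python
def gflops_per_watt(tflops, watts):
    """Convert peak FP64 TFLOPS and board power (W) into GFLOPS per watt."""
    return tflops * 1000.0 / watts

print(round(gflops_per_watt(33.5, 700), 1))    # H200:  47.9
print(round(gflops_per_watt(40.0, 1200), 1))   # B200:  33.3
print(round(gflops_per_watt(1.25, 1400), 2))   # B300:  0.89
```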



Do cards with intentionally handicapped FP64 actually use anywhere near their TDP when doing FP64? It's my understanding that FP64 performance is limited at the hardware level--whether by fusing off the extra circuits, or omitting them from the die entirely--in order to prevent aftermarket unlocks. So I would be quite surprised if the card could draw that much power when it's intentionally using only a small fraction of the silicon.


It's really to save die space for other functions; AFAIU there's no fusing to lock the feature away or anything like that.


I'm finding conflicting info on this. It seems to come down to the specific GPU/core/microarchitecture. In some cases, the "missing" FP64 units do physically exist on the die but have been disabled (likely some were defective in manufacturing anyway), and this disabling can't be undone with custom firmware AFAIK (though I believe modern NVIDIA cards will only load NVIDIA-signed firmware anyway). Then there are dies that don't include the "missing" FP64 units at all, so there's nothing to disable (though manufacturing defects may still lead to other components being disabled for market segmentation and improved yields). This also seems to be changing over time: including lots of FP64 units and disabling them on consumer cards seems to have been more common in the past.

Nevertheless, my point is more that if FP64 performance is poor on purpose, then you're probably not using anywhere near the card's TDP to do FP64 calculations, so FLOPS/watt(TDP) is misleading.


In general: consumer cards with very bad FP64 performance have it fused off for product segmentation reasons, datacenter GPUs with bad FP64 performance have it removed from the chip layout to specialize for low precision. In either case, the main concern shouldn't be FLOPS/W but the fact that you're paying for so much silicon that doesn't do anything useful for HPC.


This theory only makes sense if consumer cards are sharing dies with enterprise/datacenter cards. If the consumer card SKUs are on their own dies, they're not going to etch something into silicon only to then fuse it off after the fact.

Regardless, there are "tricks" you can use to extend the precision of hardware floating point: using a pair of e.g. FP32 numbers to implement something that's "almost" an FP64. Well known among numerics practitioners as double-float (or float-float) arithmetic.
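The building block of that pair-of-floats trick is an error-free transformation like Knuth's TwoSum, which captures the rounding error of an addition in a second value. A minimal sketch in Python (Python floats are IEEE FP64, used here as a stand-in for the FP32 pairs you'd actually use on a GPU):

```python
def two_sum(a, b):
    """Knuth's TwoSum: returns (s, e) such that s + e == a + b exactly,
    where s is the rounded floating-point sum and e is the rounding error."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

# 2**-60 is far below FP64's precision at 1.0, so the naive sum loses it
# entirely -- but the (hi, lo) pair preserves it.
hi, lo = two_sum(1.0, 2.0**-60)
print(hi)  # 1.0 (the small term vanished from the rounded sum)
print(lo)  # 2**-60, the lost part, recovered exactly
```

Chaining these transformations through additions and multiplications is how "double-float" libraries get roughly twice the native significand width out of hardware that only has the narrower format.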


Until recently, consumer, workstation, and datacenter GPUs would all share a single core design that was instantiated in varying quantities per die to create a product stack. The largest die would often have little to no presence in the consumer market, but fundamentally it was made from the same building blocks. Now, having an entirely separate or at least heavily specialized microarchitecture for data center parts is common (because the extra design costs are worth it), but most workstation cards are still using the same silicon as consumer cards with different binning and feature fusing.


Consumer cards don't share dies with datacenter cards, but they do share dies with workstation cards (the former Quadro line); e.g. the GB202 die is used by both the RTX PRO 5000/6000 Blackwell and the RTX 5090.


I know some consumer cards have artificially limited FP64, but the AI-focused datacenter cards have physically fewer FP64 units. Recently, the GB300 removed almost all of them, to the point that a GB300 actually has lower FP64 throughput than the 9-year-old P100. FP32 is the highest precision used during training, so it makes sense.


A 53×53 bit multiplier is more than 4× the size of a 24×24 bit multiplier.
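A back-of-envelope check of that claim, assuming the area of a naive array multiplier scales roughly with the product of the operand widths:

```python
# FP64 has a 53-bit significand (52 stored bits + 1 implicit); FP32 has 24.
fp64_cells = 53 * 53   # 2809 partial-product cells
fp32_cells = 24 * 24   # 576
print(fp64_cells / fp32_cells)   # ~4.88, i.e. "more than 4x"
```

Real multipliers use Booth encoding and compressor trees rather than a plain array, but the roughly quadratic growth in area with significand width is the point.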



