Since a major part of the article covers costs, I am going to start there.
I don't think it is possible to trust DeepSeek, as they haven't been honest about theirs.
DeepSeek claimed "their total training costs amounted to just $5.576 million".
SemiAnalysis: "Our analysis shows that the total server CapEx for DeepSeek is ~$1.6B, with a considerable cost of $944M associated with operating such clusters. Similarly, all AI Labs and Hyperscalers have many more GPUs for various tasks including research and training than they commit to an individual training run due to centralization of resources being a challenge. X.AI is unique as an AI lab with all their GPUs in 1 location."
SemiAnalysis: "We believe the pre-training number is nowhere [near] the actual amount spent on the model. We are confident their hardware spend is well higher than $500M over the company history. To develop new architecture innovations, during the model development, there is a considerable spend on testing new ideas, new architecture ideas, and ablations. Multi-Head Latent Attention, a key innovation of DeepSeek, took several months to develop and cost a whole team of manhours and GPU hours.
"The $6M cost in the paper is attributed to just the GPU cost of the pre-training run, which is only a portion of the total cost of the model. Excluded are important pieces of the puzzle like R&D and TCO of the hardware itself. For reference, Claude 3.5 Sonnet cost $10s of millions to train, and if that was the total cost Anthropic needed, then they would not raise billions from Google and tens of billions from Amazon. It’s because they have to experiment, come up with new architectures, gather and clean data, pay employees, and much more."
The NIST report doesn't engage with training costs, or even token costs. It's concerned with the cost the end user pays to complete a task. Actually, their discussion of cost is interesting enough that I'll quote it in full.
> Users care both about model performance and the expense of using models. There are multiple different types of costs and prices involved in model creation and usage:
> • Training cost: the amount spent by an AI company on compute, labor, and other inputs to create a new model.
> • Inference serving cost: the amount spent by an AI company on datacenters and compute to make a model available to end users.
> • Token price: the amount paid by end users on a per-token basis.
> • End-to-end expense for end users: the amount paid by end users to use a model to complete a task.
> End users are ultimately most affected by the last of these: end-to-end expenses. End-to-end expenses are more relevant than token prices because the number of tokens required to complete a task varies by model. For example, model A might charge half as much per token as model B does but use four times the number of tokens to complete an important piece of work, thus ending up twice as expensive end-to-end.
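To make the arithmetic in that last example concrete, here is a minimal sketch. The prices and token counts are hypothetical numbers picked only to match the "half the price, four times the tokens" scenario; they are not figures from the NIST report or from any real model.

```python
def end_to_end_cost(price_per_million_tokens: float, tokens_used: int) -> float:
    """Dollar cost of finishing one task: token price times tokens consumed."""
    return price_per_million_tokens * tokens_used / 1_000_000


# Hypothetical model B: the baseline.
price_b = 10.0        # dollars per million tokens (assumed)
tokens_b = 500_000    # tokens needed to complete the task (assumed)

# Hypothetical model A: half the token price, four times the tokens.
price_a = price_b / 2
tokens_a = tokens_b * 4

cost_a = end_to_end_cost(price_a, tokens_a)
cost_b = end_to_end_cost(price_b, tokens_b)

print(f"Model A: ${cost_a:.2f} per task")
print(f"Model B: ${cost_b:.2f} per task")
print(f"A costs {cost_a / cost_b:.1f}x as much as B end-to-end")
```

The ratio comes out to 0.5 × 4 = 2 no matter which baseline numbers you plug in, which is the point of the quoted passage: the per-token price alone doesn't tell you which model is cheaper to use.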
This might be a dumb question, but like... why does it matter? Are other companies reporting training run costs that include amortized equipment, labor, research, and so on? If so, then I get it: DeepSeek is inviting an apples-and-oranges comparison. If not, then these gotcha articles feel like pointless "well ackshually" criticisms, akin to complaining about the cost of a fishing trip because the captain didn't include the price of their boat.
Source: https://semianalysis.com/2025/01/31/deepseek-debates/