> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.
Dumb nit, but why not put your own press release through your model to catch basic things like missing quote marks? Reminds me of that time OAI released wildly inaccurate copy/pasted bar charts.
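A check like that is trivial to wire up these days. A minimal sketch of the idea in Python, assuming the standard OpenAI SDK; the model name and prompt are illustrative assumptions, not whatever OpenAI actually runs internally:

```python
# Minimal pre-publish copy check: run the text through a model and ask it
# to flag only mechanical errors. Model name and prompt are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def proofread(copy: str) -> str:
    """Ask the model to list typos and unbalanced punctuation in the copy."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable model would do
        messages=[
            {"role": "system",
             "content": "You are a copy editor. List only mechanical errors: "
                        "typos, unbalanced quotes/brackets, broken numbers. "
                        "Reply OK if there are none."},
            {"role": "user", "content": copy},
        ],
    )
    return response.choices[0].message.content

# The sentence from the press release, unbalanced quote and all:
print(proofread('GPT-5.2 manages what the model "knows" and "remembers to improve accuracy.'))
```

The hard part isn't the plumbing, it's keeping the output to true positives.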
It does seem to raise fair questions about either the utility of these tools or adoption inertia. If not even OpenAI feels compelled to integrate this kind of model check into their pipeline, what does that say about the business world at large? Is it too onerous to set up, too hard to get only true-positive corrections, or too low-value for the effort?
Businesses do whatever’s cheap. AI labs will continue making their models smarter, more persuasive. Maybe the SWE profession will thrive/transform/get massacred. We don’t know.
If the claims in the abstract are true, then this is legitimately revolutionary. I don’t believe it. There are probably some major constraints/caveats that keep these results from generalizing. I’ll read through the paper carefully this time instead of just skimming it, and come back with thoughts after I’ve digested it.
What's not to believe? Qwerky-32b has already done something similar as a finetune of QwQ-32b, but without using a traditional attention architecture.
And hybrid models aren't new; an MLA-based hybrid model is basically just Deepseek V3.2 in a nutshell. Note that Deepseek V3.2 (and V3.1, R1, and V3... and V2, actually) all use MLA. Deepseek V3.2 is what adds the sparse attention stuff.
Actually, since Deepseek V3.1 and Deepseek V3.2 are just post-training on top of the original Deepseek V3 pretrain run, I'd say this paper is basically doing exactly what Deepseek V3.2 did in terms of efficiency.
DeepSeek-V3.2 is a sparse attention architecture, while Zebra-Llama is a hybrid attention/SSM architecture. The outcome might be similar in some ways (close to linear complexity) but I think they are otherwise quite different.
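For anyone who hasn't internalized the complexity argument here, a toy sketch (not from either paper; all names are made up for illustration) of why the hybrid/SSM side gets to claim near-linear scaling: full attention does O(n²) work over pairwise token scores, while an SSM-style recurrence carries a single state forward in O(n).

```python
# Toy contrast between quadratic attention and a linear SSM-style scan.
# Illustrative only; this is not Zebra-Llama's or DeepSeek's actual code.
import numpy as np

def full_attention(x: np.ndarray) -> np.ndarray:
    """O(n^2) in sequence length: every token scores against every token."""
    scores = x @ x.T / np.sqrt(x.shape[1])          # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x

def ssm_scan(x: np.ndarray, decay: float = 0.9) -> np.ndarray:
    """O(n) in sequence length: one recurrent state, constant work per token."""
    state = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t, token in enumerate(x):
        state = decay * state + token               # simplified diagonal SSM
        out[t] = state
    return out

x = np.random.randn(1024, 64)      # n=1024 tokens, d=64 dims
full_attention(x); ssm_scan(x)     # same output shape, very different cost curve
```

Sparse attention gets to a similar cost curve by pruning the (n, n) score matrix instead of replacing it, which is why the outcomes look similar even though the mechanisms are quite different.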
His specific thesis is that pods fundamentally clean worse than powder because they're inherently single-stage releases of detergent in machines designed for two-stage releases. Despite this, he still explicitly says that pods have their uses. So I'm unclear on how his goal is "proving that everyone is wrong." Did we watch different videos?
Out of the 5 machines I've used at different apartments, none had a separate pre-wash dispenser. And I've saved the manual for my current one; it says nothing about adding extra detergent in with the dishes. And all of them washed just fine with powder, without any additional mumbo-jumbo.
I have a dishwasher that is loaded with a cartridge holding 400g of powder, an ideal setup for dispensing detergent at will. Yet no matter what cycle I use, it dispenses only during the main wash cycle.
I've also had machines from 5 different manufacturers in the past. None of them had mechanisms for two releases or pre-wash compartments.
> I've also had machines from 5 different manufacturers in the past. None of them had mechanisms for two releases or pre-wash compartments.
Did you check the manual?
I think in a previous video he mentioned that for machines like that, the manual said to add the prewash powder directly into the machine.
They all washed dishes just fine without any prewash powder added. Somebody here even quoted a Bosch manual saying there's no need for prewash powder. Most of the time I use a cycle that doesn't even have a prewash.
I'm not sure I understand your statement. Are you implying that once an LLM can do something, "it" is not intelligent anymore? ("it" being the model, the capability, or both?)
I would bet that it's far lower now. Inference is expensive, but we've made extraordinary efficiency gains through techniques like distillation. That said, GPT-5 is a reasoning model, and those are notorious for high token burn. So who knows, it could be a wash. But the selective pressure to optimize for scale/growth/revenue/independence from MSFT/etc. makes me think OpenAI is chasing those watt-hours pretty doggedly. So 0.34 Wh is probably high...
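For scale, here's what the 0.34 Wh figure implies if you take it at face value. The queries-per-day number below is an assumed order of magnitude for illustration, not an OpenAI disclosure:

```python
# Back-of-envelope on the 0.34 Wh/query figure. QUERIES_PER_DAY is an
# assumption for illustration, not a disclosed number.
WH_PER_QUERY = 0.34          # the claimed average, in watt-hours
QUERIES_PER_DAY = 2.5e9      # assumed order of magnitude

mwh_per_day = WH_PER_QUERY * QUERIES_PER_DAY / 1e6   # Wh -> MWh
print(f"{mwh_per_day:,.0f} MWh/day")                 # 850 MWh/day
print(f"~{mwh_per_day * 365 / 1e3:,.0f} GWh/year")   # ~310 GWh/year
```

Even if 0.34 is high by a factor of a few, the per-query cost stays tiny; it's the query volume that makes it add up.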
a) training is where the bulk of an AI system's energy usage goes (based on a report released by Mistral).
b) video generation is very likely a few orders of magnitude more expensive than text generation.
That said, I still believe that data centres in general - including AI ones - don't consume a significant amount of energy compared with everything else we do, especially heating and cooling and transport.
Pre-LLM data centres consume about 1% of the world's electricity. AI data centres may bump that up to 2%.
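A quick sanity check on what those percentages mean in absolute terms, assuming world electricity generation of roughly 30,000 TWh/year (a round number of the right order of magnitude for the 2020s, not a precise statistic):

```python
# Converting the 1% -> 2% claim into absolute terms. The world generation
# figure is an assumed round number, not a precise statistic.
WORLD_TWH_PER_YEAR = 30_000

pre_llm = 0.01 * WORLD_TWH_PER_YEAR   # ~300 TWh/year for pre-LLM data centres
with_ai = 0.02 * WORLD_TWH_PER_YEAR   # ~600 TWh/year if AI doubles that
print(f"{pre_llm:.0f} TWh -> {with_ai:.0f} TWh per year")
```

Big in absolute terms, but still small next to heating, cooling, and transport, which is the point above.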
You gotta start thinking about the energy used to mine and refine the raw materials used to make the chips and GPUs. Then take into account the infrastructure and data centers.
This might be a dumb question but like...why does it matter? Are other companies reporting training run costs including amortized equipment/labor/research/etc expenditures? If so, then I get it. DeepSeek is inviting an apples-and-oranges comparison. If not, then these gotcha articles feel like pointless "well ackshually" criticisms. Akin to complaining about the cost of a fishing trip because the captain didn't include the price of their boat.