
> Performance of BDH-GPU and GPTXL versus model size on a translation task. [...] On the other hand, GPTXL [...] required Dropout [...] The model architecture follows GPT2

I love it when a new architecture comes out, but come on, it's 2025 — can we please stop comparing fancy new architectures to the antiquated GPT2? It makes the comparison practically useless. Please pick something more modern! Even the at-this-point ubiquitous Llama would be a lot better. I don't want to have to spend days of my time running my own benchmarks to see how it actually compares to a modern transformer (and realistically, I've been burned so many times by now that I've just stopped bothering).

Modern LLMs are very similar to GPT2, but the accumulated architectural tweaks do matter and can make a big difference. For example, take a look at the NanoGPT speedrun[1] and see how many training speedups they got just by tweaking the architecture.
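To make the point concrete: one of the standard post-GPT2 tweaks (used by Llama and picked up in the NanoGPT speedrun lineage) is replacing LayerNorm with RMSNorm, which drops the mean-centering and the bias term. A minimal numpy sketch of the difference (gain/bias parameters omitted for brevity; this is an illustration, not anyone's exact implementation):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # GPT2-style LayerNorm: center by the mean, then divide by the std.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # Llama-style RMSNorm: no mean subtraction, just rescale by the
    # root-mean-square of the activations. Cheaper, and in practice
    # trains just as well or better at scale.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.random.randn(4, 8)
print(layer_norm(x).shape, rms_norm(x).shape)
```

Individually these tweaks (RMSNorm, rotary embeddings, SwiGLU MLPs, etc.) look minor, but they compound — which is exactly why a GPT2 baseline flatters any new architecture.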

Honestly, everyone who publishes a paper in this space should read [2]. The post talks about optimizers, but the point applies just as much to new architectures. Here's the relevant quote:

> With billions of dollars being spent on neural network training by an industry hungry for ways to reduce that cost, we can infer that the fault lies with the research community rather than the potential adopters. That is, something is going wrong with the research. Upon close inspection of individual papers, one finds that the most common culprit is bad baselines [...]

> I would like to note that the publication of new methods which claim huge improvements but fail to replicate / live up to the hype is not a victimless crime, because it wastes the time, money, and morale of a large number of individual researchers and small labs who run and are disappointed by failed attempts to replicate and build on such methods every day.

Sure, a brand new architecture is most likely not going to compare favorably to a state-of-the-art transformer. That is fine! But a modern baseline at least makes the comparison actually useful.

[1] -- https://github.com/KellerJordan/modded-nanogpt

[2] -- https://kellerjordan.github.io/posts/muon/#discussion-solvin...



How would you actually get funding for that, if not by demonstrating that it works on smaller models first?



