I haven't read through the entire thing yet, but the long abstract, the way the acronym BDH is introduced (what does the B stand for?), and the very "flowery" name (neither "dragon" nor "hatchling" appears again past page 2) are rather off-putting.
- It seems strange to use the term "scale-free" and then defer its definition until halfway through the paper (in fact, the term is mentioned 14 times before said definition and only 3 times after)
- This might just be CS people doing CS things, but the notation in the paper is awful: Claims/Observations end with a QED-symbol (for example on pages 29 and 30) but without a proof
- They make strong claims about performance and scaling ("It exhibits Transformer-like scaling laws") but the only (I think?) benchmark is a translation task comparison with <1B models, which is ~2 orders of magnitude smaller than SOTA
Author comment: as a fairly common convention, QED immediately after a particular statement means that the statement should be considered proven. Depending on the text, this may be because the statement (an Observation) is self-explanatory, because the discussion leading up to the statement is sufficient, or because the final statement of a Theorem follows as a direct corollary of Lemmas previously proven in the text.
I could agree with that, but the example on p29 (Claim 6) ends with QED, and only then does the proof follow. I realize I'm nitpicking form here, but still.