mbowcut2's comments

It makes me wonder about the gaps in evaluating LLMs by benchmarks. There's almost certainly some overfitting happening, which could degrade other use cases. "In practice" evaluation is what inspired the Chatbot Arena, right? But then people realized that Chatbot Arena over-prioritizes formatting, and maybe sycophancy(?). Makes you wonder what the best evaluation would be. We probably need lots more task-specific models; that approach has seemed fruitful for coding.


The best benchmark is one that you build for your use-case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use-cases but if you have something specific you're optimizing for there's probably a more obscure model that just does a better job.


If you and others have any insights to share on structuring that benchmark, I'm all ears.

There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.

The answer may be that it's so bespoke you have to handroll it every time, but my gut says there's a set of best practices that are generally applicable.


Generally, the easiest:

1. Sample a set of prompts / answers from historical usage.

2. Run that through various frontier models again and if they don't agree on some answers, hand-pick what you're looking for.

3. Test different models using OpenRouter and score each along cost / speed / accuracy dimensions against your test set.

4. Analyze the results and pick the best, then prompt-optimize to make it even better. Repeat as needed.
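
Roughly what step 3 looks like in code - an untested sketch that assumes OpenRouter's OpenAI-compatible chat endpoint; the model list, test-set format, and crude exact-match scorer are placeholders you'd swap for your own:

    # Run a saved test set against several models and tally accuracy / latency.
    import json, os, time, requests

    API_URL = "https://openrouter.ai/api/v1/chat/completions"
    HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
    MODELS = ["openai/gpt-4o-mini", "qwen/qwen-2.5-72b-instruct"]  # placeholder slugs

    test_set = json.load(open("testset.json"))  # [{"prompt": ..., "expected": ...}, ...]

    for model in MODELS:
        correct, latency = 0, 0.0
        for case in test_set:
            start = time.time()
            resp = requests.post(API_URL, headers=HEADERS, json={
                "model": model,
                "messages": [{"role": "user", "content": case["prompt"]}],
            }).json()
            latency += time.time() - start
            answer = resp["choices"][0]["message"]["content"]
            correct += int(case["expected"].strip().lower() in answer.lower())  # crude scorer
        print(f"{model}: {correct}/{len(test_set)} correct, {latency/len(test_set):.2f}s avg")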


How do you find and decide which obscure models to test? Do you manually review the model card for each new model on Hugging Face? Is there a better resource?


Just grab the top ~30 models on OpenRouter[1] and test them all. If that's too expensive make a sample 'screening' benchmark that's just a few of the hardest problems to see if it's even worth the full benchmark.

1. https://openrouter.ai/models?order=top-weekly&fmt=table


Thank you! I'll see about building a test suite.

Do you compare models' output subjectively, manually? Or do you have some objective measures? My use case would be to test diagnostic information summaries - the output is free text, not structured. The only way I can think to automate that would be with another LLM.

Advice welcome!


Yeah - things are easy when you can objectively score an output; otherwise, as you said, you'll probably need another LLM to score it. For summaries you can try to make that somewhat more objective, e.g. length and "8/10 key points are covered in this summary."

This is how some real training methods work (like Group Relative Policy Optimization), so it's a legitimate approach.
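
A sketch of how I'd automate the "key points covered" rubric with a judge model (untested; the judge model and prompt wording are placeholders, and it assumes the same OpenAI-compatible OpenRouter endpoint as before):

    import os, re, requests

    def judge_summary(summary: str, key_points: list[str]) -> float:
        prompt = ("Reference key points:\n"
                  + "\n".join(f"- {p}" for p in key_points)
                  + f"\n\nCandidate summary:\n{summary}\n\n"
                  "How many of the key points does the summary cover? Reply with one integer.")
        resp = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json={"model": "anthropic/claude-3.5-sonnet",  # placeholder judge model
                  "messages": [{"role": "user", "content": prompt}]},
        ).json()
        text = resp["choices"][0]["message"]["content"]
        covered = int(re.search(r"\d+", text).group())  # assumes the judge complies
        return covered / len(key_points)                # fraction of key points covered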


Thank you. I will google Group Relative Policy Optimization to learn about that and the other training methods. If you have any resources handy that I should be reading, that would be appreciated as well. Have a great weekend.


Nothing off the top of my head! If you find anything good let me know. GRPO is a training technique, so it's likely not exactly what you'd do for benchmarking, but it's interesting to read about anyway. Glad I could help.


I don’t think benchmark overfitting is as common as people think. Benchmark scores are highly correlated with the subjective “intelligence” of the model. So is pretraining loss.

The only exception I can think of is models trained on synthetic data like Phi.


If the models from the big US labs are being overfit to benchmarks, then we also need to account for HN commenters overfitting positive evaluations to Chinese or European models based on their political biases (US big tech = default bad, anything European = default good).

Also, we should be aware of people cynically playing into that bias to try to advertise their app, like OP who has managed to spam a link in the first line of a top comment on this popular front page article by telling the audience exactly what they want to hear ;)


Americans have an opposing bias via the phenomenon of "safe edgy", where for obvious reasons they're uncomfortable with being biased towards anyone who looks like a US minority, and redirect all that energy towards being racist to the French. So it's all balanced.


Seems like the less sexy headline is just something about the sample size needed for LLM fact encoding. That's honestly a more interesting angle to me: how many instances of a given fact need to be in the training data for the LLM to properly encode it? Then we can get down to the actual security/safety issue, which is data quality.


I'm not surprised. People really thought the models just kept getting better and better?


The models are getting better and better.


That's expected. No one will release a worse model.


Not a cheaper one, or better in some ways, or lower latency, etc?


They do that too but right now it is an arms race as well.


Maybe. How would I know?


...even if the agent did "cheat", I think that having the capacity to figure out that it was being evaluated, find the repo containing the logic of that evaluation, and find the expected solution to the problem it faced... is "better" than anything that the models were able to do a couple years ago.


it looks like the 2nd and 3rd bar never got updated from the dummy data placeholders lol.


It's not a new problem (for individuals), though perhaps at an unprecedented scale (so, maybe a new problem for civilization). I'm sure there were blacksmiths who felt they had lost their meaning when they were replaced by industrial manufacturing.


I've had similar experiences with vanilla ChatGPT as a DM but I bet with clever prompt engineering and context window management you could solve or at least dramatically improve the experience. For example, you could have the model execute a planning step before your session in which it generates a plot outline, character list, story tree, etc. which could then be used for reference during the game session.

One problem that would probably still linger is model agreeableness, i.e. despite preparation, models have a tendency to say yes to whatever you ask for, and everybody knows a good DM needs to know when to say no.
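
Something like this is what I have in mind - a toy sketch with the OpenAI Python SDK (the model name and prompts are made up), where the planning output gets pinned into the system prompt for every turn:

    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4o"  # placeholder

    # One-time planning step: generate notes the "DM" will be held to.
    plan = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
            "Prepare a one-shot D&D session: plot outline, NPC list, and a branching "
            "story tree with 3-4 decision points. Be terse; these are your private notes."}],
    ).choices[0].message.content

    history = [{"role": "system", "content":
                "You are the DM. Stick to these prepared notes, and say no to player "
                "actions that contradict them:\n" + plan}]

    def dm_turn(player_input: str) -> str:
        history.append({"role": "user", "content": player_input})
        reply = client.chat.completions.create(model=MODEL, messages=history)
        text = reply.choices[0].message.content
        history.append({"role": "assistant", "content": text})
        return text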


The problem with embeddings is that they're basically inscrutable to anything but the model itself. It's true that they must encode the semantic meaning of the input sequence, but the learning process compresses it to the point that only the model's learned decoder head knows what to do with it. Anthropic has developed interpretable internal features for Sonnet 3 [1], but from what I understand that requires somewhat expensive parallel training of a network whose sole purpose is to attempt to disentangle LLM hidden-layer activations.

[1] https://transformer-circuits.pub/2024/scaling-monosemanticit...
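
As I understand it, the core of that parallel network is a sparse autoencoder over hidden activations - something like this toy sketch, where the dimensions and sparsity coefficient are made-up placeholders:

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model=4096, d_features=32768):  # placeholder sizes
            super().__init__()
            self.encoder = nn.Linear(d_model, d_features)
            self.decoder = nn.Linear(d_features, d_model)

        def forward(self, h):
            f = torch.relu(self.encoder(h))  # overcomplete, hopefully-interpretable features
            return self.decoder(f), f        # reconstruction plus feature activations

    sae = SparseAutoencoder()
    h = torch.randn(8, 4096)                 # stand-in for residual-stream activations
    h_hat, f = sae(h)
    loss = ((h_hat - h) ** 2).mean() + 1e-3 * f.abs().mean()  # reconstruction + L1 sparsity
    loss.backward()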


Very much agree re: inscrutability. It gets even more complicated when you add the LLM-specific concept of rotary positional embeddings to the mix. In my experience, it's been exceptionally hard to communicate that concept to even technical folks that may understand (at a high level) the concept of semantic similarity via something like cosine distance.

I've come up with so many failed analogies at this point, I lost count (the concept of fast and slow clocks to represent the positional index / angular rotation has been the closest I've come so far).
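
For what it's worth, the clock analogy maps pretty directly onto the math: each pair of dimensions is a hand spinning at its own frequency as the position index advances. A tiny numpy sketch (toy sizes, not any particular model's parameters):

    import numpy as np

    def rope(x, position, base=10000.0):
        d = x.shape[-1]
        out = x.copy()
        for i in range(0, d, 2):                  # each dim pair is one "clock hand"
            theta = position / (base ** (i / d))  # low i spins fast, high i spins slow
            cos, sin = np.cos(theta), np.sin(theta)
            out[i] = x[i] * cos - x[i + 1] * sin
            out[i + 1] = x[i] * sin + x[i + 1] * cos
        return out

    q = np.ones(8)
    # Dot products depend only on the relative offset between positions,
    # which is the whole point: (3, 5) and (103, 105) give the same score.
    print(rope(q, 3) @ rope(q, 5), rope(q, 103) @ rope(q, 105))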


I've read that "No Position Embedding" seems to be better for long-term context anyway, so it's probably not something essential to explain.


Do you have a citation for the paper on that? IME, that's not really something you see used in practice, at least not after 2022 or so. Without some form of positional adjustment, transformer-based LLMs have no way to differentiate between "The dog bit the man." and "The man bit the dog." given the token ids are nearly identical. You just end up back in the bag-of-words problem space. The self-attention mechanism is permutation-invariant, so as long as it remains true that the attention scores are computed as an unordered set, you need some way to model the sequence.
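
(Toy demonstration of that permutation point, with random placeholder weights: shuffle the tokens and the pooled representation comes out the same up to float noise, so word order is invisible without a positional signal.)

    import torch

    torch.manual_seed(0)
    d = 16
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
    tokens = torch.randn(5, d)            # "the dog bit the man" as 5 token vectors
    perm = torch.tensor([0, 4, 2, 3, 1])  # "the man bit the dog"

    def attn(x):
        scores = torch.softmax((x @ Wq) @ (x @ Wk).T / d ** 0.5, dim=-1)
        return scores @ (x @ Wv)

    out_a = attn(tokens).sum(dim=0)        # pooled representation, original order
    out_b = attn(tokens[perm]).sum(dim=0)  # pooled representation, permuted order
    print(torch.allclose(out_a, out_b))    # True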

Long context is almost always some form of RoPE in practice (often YaRN these days). We can't confirm this with the closed-source frontier models, but given that all the long context models in the open weight domain are absolutely encoding positional data, coupled with the fact that the majority of recent and past literature corroborates its use, we can be reasonably sure they're using some form of it there as well.

EDIT: there is a recent paper that addresses the sequence-modeling problem in another way, but it's somewhat orthogonal to the above since they're changing the tokenization method entirely: https://arxiv.org/abs/2507.07955


The paper showing that dropping positional encoding entirely is feasible is https://arxiv.org/pdf/2305.19466 . But I was misremembering as to its long context performance, Llama 4 does use NoPE but it's still interleaved with RoPE layers. Just an armchair commenter though, so I may well be wrong.

My intuition for NoPE was that the presence of the causal mask provides enough of a signal to implicitly distinguish token position. If you imagine the flow of information in the transformer network, tokens later on in the sequence "absorb" information from the hidden states of previous tokens, so in this sense you can imagine information flowing "down (depth) and to the right (token position)", and you could imagine the network learning a scheme to somehow use this property to encode position.


Ah, didn't realize you were referring to NoPE explicitly. And yeah, the intuitions gained from that paper are pretty much what I alluded to above: you don't get away with never modeling the positional data; the question is how you model it effectively and from where you derive that signal.

NoPE never really took off more broadly in modern architecture implementations. We haven't seen anyone successfully reproduce the proposed solution to the long context problem presented in the paper (tuning the scaling factor in the attention softmax).

There was a recent paper back in December [1] that talked about the idea of positional information arising from the similarity of nearby embeddings. It's again in that common research bucket of "never reproduced", but interesting. It does sound similar in spirit, though, to the NoPE idea you mentioned of the causal mask providing some amount of position signal, i.e. we don't necessarily need to adjust the embeddings explicitly for the same information to be learned (TBD on whether that proves out long term).

This all goes back to my original comment, though, that communicating this idea to AI/ML neophytes is challenging. I don't think skipping the concept of positional information actually makes these systems easier to comprehend, since it's critically important to how we model language, but it's also really complicated to explain in terms of implementation.

[1] https://arxiv.org/abs/2501.00073


I found decent results using multiclass spectral clustering to query embedding spaces.

https://ieeexplore.ieee.org/document/10500152

https://ieeexplore.ieee.org/document/10971523
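
Not exactly what those papers do, but for anyone curious what "spectral clustering over an embedding space" looks like mechanically, here's a generic sklearn sketch (the embedding matrix and cluster count are placeholders):

    import numpy as np
    from sklearn.cluster import SpectralClustering
    from sklearn.metrics.pairwise import cosine_similarity

    doc_embeddings = np.random.randn(200, 384)   # stand-in for real document embeddings
    labels = SpectralClustering(n_clusters=8, affinity="nearest_neighbors",
                                random_state=0).fit_predict(doc_embeddings)

    def query(q_emb, top_k=5):
        sims = cosine_similarity(q_emb[None, :], doc_embeddings)[0]
        cluster = labels[sims.argmax()]          # cluster of the single closest document
        members = np.where(labels == cluster)[0]
        return members[np.argsort(-sims[members])][:top_k]  # best matches in that cluster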


This is exactly the challenge. When embeddings were first popularized by word2vec, they were interpretable because the word2vec model was revealed to be a batched matrix factorization [1].

LLM embeddings are so abstract and far removed from any human-interpretable or statistical corollary that even as the embeddings contain more information, that information becomes less accessible to humans.

[1] https://papers.nips.cc/paper_files/paper/2014/hash/b78666971...
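
The Levy & Goldberg result [1] is that skip-gram with negative sampling implicitly factorizes a shifted PMI matrix, which is why you can get comparable (and fairly interpretable) vectors the explicit way: build a PPMI matrix and take a truncated SVD. Toy sketch with fake counts:

    import numpy as np

    cooc = np.random.poisson(1.0, (1000, 1000)).astype(float)  # stand-in word-context counts
    total = cooc.sum()
    p_w = cooc.sum(axis=1, keepdims=True) / total
    p_c = cooc.sum(axis=0, keepdims=True) / total
    ppmi = np.maximum(np.log((cooc / total + 1e-12) / (p_w * p_c)), 0.0)  # positive PMI

    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
    word_vectors = U[:, :100] * np.sqrt(S[:100])  # 100-dim embeddings, rows ~ words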


> learned decoder head

That's a really interesting three-word noun-phrase. Is it a term-of-art, or a personal analogy?


Can't you decode the embeddings to tokens for debugging?


You can but this is lossy (as it drops context; it’s a dimensionality reduction from 512 or 1024 to a few bytes) and non-reconvertible.


I mean, that's true for all DL layers, but we talk about convolutions and the like often enough. Embeddings are relatively new, but there's not a lot of discussion of how crazy they are, especially given that they're the real star of the LLM, with transformers a close second imo.


You can search the closest matching words or expressions in a dictionary. It is trivial to understand where an embedding points to.
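
Concretely, that's just a nearest-neighbor search against the model's own token-embedding matrix - e.g. with GPT-2 as a stand-in (untested sketch):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    E = model.get_input_embeddings().weight       # (vocab_size, d_model)

    def nearest_tokens(vec, k=5):
        sims = torch.nn.functional.cosine_similarity(vec[None, :], E, dim=-1)
        return [tok.decode([int(i)]) for i in sims.topk(k).indices]

    print(nearest_tokens(E[tok.encode(" king")[0]]))  # tokens closest to " king"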


Can you do that in the middle of the layers? And if you do, would that word be that meaningful to the final output? Genuinely curious.


You can, and there has been some interesting work done with it. The technique is called LogitLens, and basically you pass intermediate embeddings through the LMHead to get logits corresponding to tokens. In this paper they use it to investigate whether LLMs have a language bias, i.e. does GPT "think" in English? https://arxiv.org/pdf/2408.10811

One problem with this technique is that the model wasn't trained with intermediate layers being mapped to logits in the first place, so it's not clear why the LMHead should be able to map them to anything sensible. But alas, like everything in DL research, they threw science at the wall and a bit of it stuck.
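
If anyone wants to try it, the whole trick fits in a few lines with GPT-2 as a stand-in (rough sketch; note the final hidden state already has the last layer norm applied, so re-applying ln_f there is slightly redundant):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)

    for layer, h in enumerate(out.hidden_states):
        # Project the last position's hidden state through ln_f + LM head.
        logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
        print(layer, repr(tok.decode([int(logits.argmax())])))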


LLMs are better at LaTeX than humans. ChatGPT often writes LaTeX responses.


Yeah, it's honestly one of the things they're best at!

I've been working on implementing some E&M simulations with Claude Code and it's so-so on the C++ and TERRIBLE at the actual math (multiplying a couple 6x6 matrix differential operators is beyond it).

But I can dash off some notes and tell Claude to TeXify and the output is great.


I think I agree with you. My only rebuttal would be that it's this kind of thinking that's kept the leading players from trying other architectures in the first place. As far as I know, SOTA for SSMs just doesn't suggest potential upsides significant enough to warrant the R&D, not compared to the tried-and-true established LLM methods. The decision might be something like: "pay X to train a competitive LLM" vs. "pay 2X to MAYBE train a competitive SSM".


I read this as "pirate space industry" and got real excited.

