
It is, of course, not going to produce a “child” model that more accurately predicts the underlying true distribution the “parent” model was trying to capture. That is, it will not add anything new.

This is immediately obvious if you look at it through a statistical-learning lens and not the mystical crystal ball through which many view NNs.



This is not obvious to me! For example, if you locked me in a room with no information inputs, over time I may still become more intelligent by your measures. Through play and reflection I can prune, reconcile and generate. I need compute to do this, but not necessarily more knowledge.


Again, this isn't how distillation works. Your task as the distilled model is to copy the teacher's mistakes, and you will be penalized for pruning, reconciling, and generating.

"Play and reflection" is something else, which isn't distillation.


The initial claim was that distillation can never be used to create a model B that's smarter than model A, because B only has access to A's knowledge. The argument you're responding to was that play and reflection can produce improvements without any additional knowledge, so distillation can work as a starting point for creating a model B that is smarter than model A, with no new data except model A's outputs and then model B's outputs. That refutes the initial claim. It doesn't matter whether distillation alone is enough if it can be made enough with a few extra steps afterward.


You’ve subtly confused “less accurate” and “smarter” in your argument. In other words, you’ve replaced the benchmark of representing the base data with a benchmark of reasoning score.

Then, you’ve asserted that was the original claim.

Sneaky! But that’s how “arguments” on HN are “won”.


No, I didn't confuse the two. There is not a formal definition of "smart", but if you're claiming that factual accuracy is unrelated to it, I can't imagine that that's in good faith.


To push the state of the art, LLMs are no longer trying to just reproduce the distribution of online text as a whole; they are targeting a different, “high quality” distribution, whatever that means in your domain. So it is possible that this process matches a “better” distribution for some tasks by removing erroneous information or sampling “better” outputs more frequently.


While that is theoretically true, it misses everything interesting (kind of like the No Free Lunch Theorem, or the VC dimension for neural nets). The key is that the parent model may have been trained on a dubious objective like predicting the next word of randomly sampled internet text - not because this is the objective we want, but because this is the only way to get a trillion training points.

Given this, there’s no reason why it couldn't even be easy to produce a child model from (filtered) parent output that exceeds the parent model on a different, more meaningful objective, like being a useful chatbot. There's no reason why this would have to be limited to domains with verifiable answers, either.


The latest models create information from base models by randomly generating candidate responses and then pruning the bad ones with an evaluation function. The good responses improve the model.

It is not distillation. It's like how you can arrive at new knowledge by reflecting on existing knowledge.
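
In code, that loop is something like the following sketch; `model.generate` and the `evaluate` function are hypothetical placeholders here, not any real API:

    def collect_improvement_data(model, prompts, evaluate, n_candidates=8, threshold=0.9):
        # Sample several candidates per prompt and keep only the ones the
        # evaluation function accepts; the survivors become fine-tuning data.
        accepted = []
        for prompt in prompts:
            candidates = [model.generate(prompt) for _ in range(n_candidates)]
            # The filter is where new information enters the training signal:
            # the kept responses are better than the model's average output.
            accepted.extend((prompt, c) for c in candidates if evaluate(prompt, c) >= threshold)
        return accepted  # fine-tune on these (prompt, response) pairs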


Fine-tuning an LLM on the output of another LLM is exactly how DeepSeek made its progress. They got around the problem you describe by doing this in a domain that can be checked for correctness relatively easily, so candidate fine-tuning data could be automatically filtered out if it was wrong.
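
A rough sketch of that automatic filter, assuming a set of problems with known final answers; `generate` and `extract_answer` are hypothetical helpers, not DeepSeek's actual pipeline:

    def build_finetune_set(problems, generate, extract_answer, samples_per_problem=16):
        kept = []
        for prob in problems:
            for _ in range(samples_per_problem):
                solution = generate(prob["question"])
                # Verifiable domain: keep a sampled solution only if its final
                # answer matches the known ground truth.
                if extract_answer(solution) == prob["answer"]:
                    kept.append((prob["question"], solution))
        return kept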


> It is, of course, not going to produce a “child” model that more accurately predicts the underlying true distribution the “parent” model was trying to capture. That is, it will not add anything new.

Unfiltered? Sure. With human curation of the generated data it certainly can. (Even automated curation can do this, though it's more obvious that human curation can.)

I mean, I can randomly generate fact claims about addition, and if I curate which ones go into a training set, I can train a model that reflects integer addition much more accurately than the random process that generated the pre-curation input data.

Without curation, as I already said, the best you get is a distillation of the source model, which is very unlikely to be more accurate.
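
A toy version of the addition example, with a deliberately noisy generator; the point is that the curated set is correct even though the process that produced it isn't:

    import random

    def noisy_addition_claim():
        a, b = random.randint(0, 99), random.randint(0, 99)
        error = random.choice([0, 0, 0, random.randint(1, 5)])  # wrong roughly 25% of the time
        return a, b, a + b + error

    claims = [noisy_addition_claim() for _ in range(10_000)]
    curated = [(a, b, c) for a, b, c in claims if a + b == c]  # the curation step
    # A model trained on `curated` only ever sees correct sums, so it can end up
    # more accurate than the random process that generated the raw claims.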


No one knows whether the pigeonhole principle applies so absolutely that it excludes the ability to generalize outside of a training set.

That is the existential, $1T question.


No no no, you don’t understand, the models will magically overcome these issues, somehow get 100x better, and do real AGI! Any day now! It’ll work because LLMs are basically magic!

Also, can I have some money to build more data centres pls?



