That's news to me, and I thought I had a good layman's understanding of it. How does it work then?


All user facing LLMs go through Reinforcement Learning. Contrary to popular belief, RL's _primary_ purpose isn't to "align" them to make them "safe." It's to make them actually usable.

LLMs that haven't gone through RL are useless to users. They are very unreliable, and will frequently go off the rails spewing garbage, going into repetition loops, etc.

RL involves training the models on entire responses, not with a token-by-token loss (1). This makes them orders of magnitude more reliable (2). It forces them to consider what they're going to write. The obvious conclusion is that they plan (3). That's why the myth that LLMs are strictly next-token prediction machines is so unhelpful and so poisonous to the discussion.

The models still _generate_ the response token by token, but they don't pick tokens based on what maximizes probability at each individual step. Rather, they learn to pick tokens that maximize the probability of the _entire response_.

(1) Slight nuance: all RL schemes for LLMs still have to break the reward down into token-by-token losses, but those losses are based on a "whole response" reward or some combination of rewards (see the rough sketch below these footnotes).

(2) Raw LLMs go haywire roughly 1 in 10 times, depending on the context. Some tasks make them go haywire almost every time, others are more reliable. RL'd LLMs fail on the order of 1 in 10,000 responses or better.

(3) It's _possible_ that they don't learn to plan through this scheme. There are alternative solutions that don't involve planning ahead. So Anthropic's research here is very important and useful.
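
To make footnote (1) concrete, here is a rough REINFORCE-style sketch in PyTorch (toy shapes, invented names, purely illustrative; real pipelines such as PPO or GRPO add baselines, clipping, and KL penalties). The point is that one scalar reward for the whole sampled response gets spread across the per-token log-prob losses:

    import torch.nn.functional as F

    def reinforce_loss(logits, sampled_ids, response_mask, reward):
        """One scalar reward for the whole response, spread over its tokens.

        logits:        (seq_len, vocab) model outputs at each position
        sampled_ids:   (seq_len,) tokens the model actually sampled
        response_mask: (seq_len,) 1.0 for response tokens, 0.0 for prompt tokens
        reward:        float, score for the *entire* response
        """
        log_probs = F.log_softmax(logits, dim=-1)
        token_log_probs = log_probs.gather(1, sampled_ids.unsqueeze(1)).squeeze(1)
        # Every response token shares the same sequence-level reward signal.
        return -(reward * token_log_probs * response_mask).sum() / response_mask.sum()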

P.S. I should point out that many researchers get this wrong too, or at least haven't fully internalized it. This failure to truly understand the purpose of RL is why models like Qwen, Deepseek, Mistral, etc. are all so unreliable and unusable by real companies compared to OpenAI's, Google's, and Anthropic's models.

This understanding that even the most basic RL takes LLMs from useless to useful then leads to the obvious conclusion: what if we used more complicated RL? And guess what, more complicated RL led to reasoning models. Hmm, I wonder what the next step is?


> All user facing LLMs go through Reinforcement Learning. Contrary to popular belief, RL's _primary_ purpose isn't to "align" them to make them "safe." It's to make them actually usable.

Are you claiming that non-myopic token prediction emerges solely from RL, and that if Anthropic does this analysis on Claude before RL training (or if one examines other models where no RLHF was done, such as old GPT-2 checkpoints), none of these advanced prediction mechanisms will exist?


Another important aspect of the RL process is that it's fine-tuning with some feedback on the quality of the data: a 'raw' LLM has been trained on a lot of very low-quality data, and it has an incentive to predict that accurately as well, because there's no way to effectively rate a copy of most of the text on the internet. So there are a lot of biases in the model which basically mean it will include low-quality predictions in a given 'next token' estimate, because if it didn't, it would get penalised when fed the low-quality data during training.

With RLHF it gets a signal during training for whether the next token it's trying to predict is part of a 'good' response or a 'bad' response, so it can learn to suppress features it learned in the first part of the process which are not useful.

(You see the same with image generators: they've been trained on a bunch of very nice-looking art and photos, but they've also been trained on triply-compressed, badly cropped memes and terrible MS Paint art. You need a plan for getting the model to output the former and not the latter if you want it to be useful.)


No, it probably exists in the raw LLM, and RL both significantly strengthens it and extends its range, such that it comes to dominate the model's behavior, making the model several orders of magnitude more reliable in common usage. Kind of like how "reasoning" exists in a weak, short-range way in non-reasoning models: with RL that encourages reasoning, that machinery gets brought to the forefront and becomes more complex and capable.


So why did you feel the need to post that next-token prediction is not the reason this behavior emerges?


> The models still _generate_ the response token by token, but they don't pick tokens based on what maximizes probability at each individual step.

This is also not how base training works. In base training the loss is computed given a context, which can be gigantic. It's never about just the previous token; it's about a whole response in context. The context could be an entire poem, a play, a worked solution to a programming problem, etc. So you would expect to see the same type of (apparent) higher-level planning from base-trained models, and indeed you do, and you can easily verify this by downloading a base model from HF or similar and prompting it to complete a poem.
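
To be concrete, the base-training objective already looks something like this (a toy PyTorch sketch, not any particular codebase): the cross-entropy is computed at every position of the full sequence at once, so the context the loss "sees" can be an entire poem or program.

    import torch.nn.functional as F

    def next_token_loss(logits, token_ids):
        """Standard causal-LM loss: predict token t+1 from everything up to t.

        logits:    (seq_len, vocab) outputs from running the model over the full context
        token_ids: (seq_len,) the actual text (poem, play, program, ...)
        """
        # Position i predicts token i+1; the loss covers the entire sequence at once.
        return F.cross_entropy(logits[:-1], token_ids[1:])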

The key differences between base and agentic models are 1) the latter behave like agents, and 2) the latter hallucinate less. But that isn't about planning (you still need planning to hallucinate something). It's more to do with post-base training specifically providing positive rewards for things which aren't hallucinations. Changing the way the reward function is computed during RL doesn't produce planning; it simply inclines the model to produce responses that are more like the RL targets.

Karpathy has a good intro video on this. https://www.youtube.com/watch?v=7xTGNNLPyMI

In general the nitpicking seems weird. Yes, on a mechanical level, using a model is still about "given this context, what is the next token". No, that doesn't mean that they don't plan, or have higher-level views of the overall structure of their response, or whatever.


This is a super helpful breakdown and really helps me understand how the RL step is different from the initial training step. I didn't realize the reward was delayed until the end of the response for the RL step. Having the reward for this step depend on a coherent thought rather than a coherent word now seems like an obvious and critical part of how this works.


That post is describing SFT, not RL. RL works using preferences/ratings/verifications, not entire input/output pairs.
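
For illustration only (a generic Bradley-Terry-style sketch, not any particular lab's recipe): the preferences typically enter through a pairwise reward-model loss, and the resulting reward model then scores whole responses during RL.

    import torch.nn.functional as F

    def preference_loss(reward_chosen, reward_rejected):
        """Train a reward model so the preferred response scores higher.

        reward_chosen / reward_rejected: (batch,) scalar scores the reward model
        assigned to the human-preferred and dispreferred responses.
        """
        # -log sigmoid(r_chosen - r_rejected), the standard pairwise objective.
        return -F.logsigmoid(reward_chosen - reward_rejected).mean()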


This is fine-tuning to make a well-behaved chatbot or something. To make an LLM you just need to predict the next token, or any masked token. Conceptually, if you had a vast enough high-quality dataset and a large enough model, you wouldn't need fine-tuning for this.

A model which predicts one token at a time can represent anything a model that does a full sequence at a time can. It "knows" what it will output in the future because it is just a probability distribution to begin with. It already knows everything it will ever output to any prompt, in a sense.
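
In that sense, the per-token conditionals already pin down a probability for every complete response via the chain rule. A toy sketch, where model_log_prob is a made-up stand-in for whatever next-token predictor you have:

    def sequence_log_prob(model_log_prob, prompt, tokens):
        """Chain rule: log p(whole response | prompt) is just the sum of the
        per-token conditional log-probs the next-token model already defines."""
        total, context = 0.0, list(prompt)
        for tok in tokens:
            total += model_log_prob(context, tok)  # log p(tok | context so far)
            context.append(tok)
        return total  # exp(total) is the probability of the entire response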


I don’t think this is quite accurate. LLMs undergo supervised fine-tuning, which is still next-token prediction. And that is the step that makes them usable as chatbots. The step after that, preference tuning via RL, is optional but does make the models better. (Deepseek-R1 type models are different because the reinforcement learning does heavier lifting, so to speak.)
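
For concreteness, SFT is the same next-token cross-entropy as pre-training, just run on curated chat transcripts and usually with the loss masked to the assistant's tokens. A toy PyTorch sketch with invented names:

    import torch.nn.functional as F

    def sft_loss(logits, token_ids, response_mask):
        """Next-token prediction on a chat transcript, scored only on the reply.

        logits:        (seq_len, vocab)
        token_ids:     (seq_len,) prompt + assistant response
        response_mask: (seq_len,) 1.0 where the token belongs to the assistant's reply
        """
        per_token = F.cross_entropy(logits[:-1], token_ids[1:], reduction="none")
        mask = response_mask[1:]
        return (per_token * mask).sum() / mask.sum()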


Supervised finetuning is only a seed for RL, nothing more. Models that receive supervised finetuning before RL perform better than those that don't, but it is not strictly speaking necessary. Crucially, SFT does not improve the model's reliability.


I think you’re referring to the Deepseek-R1 branch of reasoning models, where a small amount of SFT reasoning traces is used as a seed. But for non-“reasoning” models, SFT is very important and definitely imparts enhanced capabilities and reliability.


Is there an equivalent of LORA using RL instead of supervised fine tuning? In other words, if RL is so important, is there some way for me as an end user to improve a SOTA model with RL using my own data (i.e. without access to the resources needed to train an LLM from scratch) ?


LORA can be used in RL; it's indifferent to the training scheme. LORA is just a way of lowering the number of trainable parameters.
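
For illustration, a minimal sketch of the LoRA idea itself (a frozen base layer plus a trainable low-rank update); whether the gradients come from a supervised loss or an RL objective doesn't change this part:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base weight plus a trainable low-rank update B @ A."""

        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # the original weights stay frozen
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # Only A and B (a tiny fraction of the parameters) receive gradients,
            # regardless of which training scheme produced the loss.
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)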


When being trained via reinforcement learning, is the model architecture the same then? Like, you first train the LLM as a next-token predictor with a certain model architecture and it ends up with certain weights. Then you apply RL to that same model, which modifies the weights in such a way as to consider whole responses?


The model architecture is the same during RL but the training algorithm is substantially different.


Oooh, so the pre-training is token-by-token but the RL step rewards the answer based on the full text. Wow! I knew that but never really appreciated the significance of it. Thanks for pointing that out.


As a note: in human learning, and to a degree animal learning, the unit of behavior that is reinforced depends on the contingencies. An interesting example: a pigeon might be trained to respond on a 3x3 grid (9 choices) differently than it did the last time in order to get reinforcement. At first the response learned is "do something different from last time", but as the required history gets too long, the memory capacity is exceeded, and guess what: the animal learns to respond randomly, eventually maximizing its reward.


Wasn't Deepseek also big on RL or was that only for logical reasoning?


> RL involves training the models on entire responses, not with a token-by-token loss... The obvious conclusion is that they plan.

It is worth pointing out the "Jailbreak" example at the bottom of TFA: according to their figure, it starts to say "To make a", not realizing there's anything wrong; only when it actually outputs "bomb" does the "Oh wait, I'm not supposed to be telling people how to make bombs" circuitry wake up. But at that point, it's in the grip of its "You must speak in grammatically correct, coherent sentences" circuitry and can't stop; so it finishes its first sentence in a coherent manner, then refuses to give any more information.

So while it sometimes does seem to be thinking ahead (e.g., the rabbit example), there are times it's clearly not thinking very far ahead.


I feel this is similar to how humans talk. I never consciously think about the words I choose. They just are spouted off based on some loose relation to what I am thinking about at a given time. Sometimes the process fails, and I say the wrong thing. I quickly backtrack and switch to a slower "rate of fire".


> LLMs that haven't gone through RL are useless to users. They are very unreliable, and will frequently go off the rails spewing garbage, going into repetition loops, etc... RL involves training the models on entire responses, not with a token-by-token loss (1).

Yes. For those who want a visual explanation, I have a video where I walk through this process including what some of the training examples look like: https://www.youtube.com/watch?v=DE6WpzsSvgU&t=320s


This was fascinating, thank you.


first footnote: ok ok they're trained token by token, BUT


First rule of understanding: you can never understand that which you don't want to understand.

That's why lying is so destructive to both our own development and that of our societies. It doesn't matter whether it's intentional or unintentional, it poisons the infoscape either accidentally or deliberately, but poison is poison.

And lies to oneself are the most insidious lies of all.



