This is fine-tuning to make a well-behaved chatbot or something. To make an LLM you just need to predict the next token (or any masked token). Conceptually, if you had a vast enough high-quality dataset and a large enough model, you wouldn't need fine-tuning for this.
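For concreteness, here's a minimal sketch of what "just predict the next token" means as a training objective. The toy stand-in model (an embedding plus a linear head) is my own assumption for illustration, not any particular LLM's code; a real model would put a transformer stack in between, but the loss is the same:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, d_model = 1000, 16, 64

# Stand-in "model": embedding + linear head. A real LLM inserts a transformer
# stack between these; the next-token objective below is unchanged.
embed = torch.nn.Embedding(vocab_size, d_model)
head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, seq_len))  # one example sequence
logits = head(embed(tokens))                         # (1, seq_len, vocab_size)

# Next-token prediction: the prediction at position i is scored against token i+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..n-2
    tokens[:, 1:].reshape(-1),               # targets are positions 1..n-1
)
print(loss.item())
```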
A model that predicts one token at a time can represent anything a model that emits a full sequence at once can. In a sense it already "knows" what it will output in the future, because it is just a probability distribution to begin with: chaining the one-token-at-a-time conditionals together defines a distribution over every full continuation of every prompt, so everything it will ever output is already implicit in those conditionals.
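A toy sketch of that point, with a made-up `next_token_probs` function standing in for the model (purely hypothetical, any prefix-to-distribution function works): sampling token by token and multiplying the per-token probabilities is exactly sampling from one fixed distribution over whole sequences.

```python
import math, random

vocab = ["a", "b", "<eos>"]

def next_token_probs(prefix):
    # Hypothetical stand-in for a trained model: maps a prefix to a
    # probability distribution over the vocabulary.
    counts = [len(prefix) + 1, 2, 1]
    total = sum(counts)
    return [c / total for c in counts]

def sample_sequence(max_len=10):
    # p(x_1..x_n) = p(x_1) * p(x_2 | x_1) * ... * p(x_n | x_1..x_{n-1}),
    # accumulated here as a log-probability while we sample token by token.
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        token = random.choices(vocab, weights=probs)[0]
        log_prob += math.log(probs[vocab.index(token)])
        prefix.append(token)
        if token == "<eos>":
            break
    return prefix, log_prob

print(sample_sequence())
```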