I see many people get confused by this due to the widely spread (and false) "stochastic parrot" theme. But these models are much more than mere sentence-repeaters. In a way, the model is not learning that after A comes B. It could, and with little training data it probably would. But with enough data, sentence completion based purely on memorized sentences no longer works, because memorization would saturate the parameters. So to retain and improve accuracy during training, the model has to come up with a compression that essentially forms a model of the real world, or at least of the world that the training corpus describes [1].

In that sense, it no longer "knows" that B comes after A (except via the input context), but it has learned that there is a special relation between A and B. It can then apply this learned logic to new concepts that appear for the first time in the context during inference. With all that happening internally, it only has to morph this internal state back into natural language output. With billions of parameters and dozens of layers, there is more than enough computational room for this to happen. In fact, recent work has shown that even small models can get pretty good at logic if you get the training data right.
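The paper in [1] tests this with "probing": train a small classifier on the model's hidden states and see whether latent world state (an Othello board, in their case) can be read out. Here is a toy sketch of that idea; the choice of GPT-2, the made-up labels, and the logistic-regression probe are my own illustrative assumptions, not the paper's exact setup:

```python
# Minimal probing sketch: check whether a simple property of the input is
# linearly decodable from a model's hidden states. Illustrative only.
import torch
from transformers import GPT2Model, GPT2Tokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Toy probing task (hypothetical labels): does the sentence mention a capital city?
sentences = [
    ("Paris is the capital of France.", 1),
    ("Berlin is the capital of Germany.", 1),
    ("Tokyo is the capital of Japan.", 1),
    ("The cat slept on the warm windowsill.", 0),
    ("She poured coffee before the meeting.", 0),
    ("The river froze early that winter.", 0),
]

features, labels = [], []
with torch.no_grad():
    for text, label in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # Hidden state of the final token from a middle layer.
        hidden = outputs.hidden_states[6][0, -1]
        features.append(hidden.numpy())
        labels.append(label)

# If a linear probe can separate the classes, that information is present
# in the representation -- the core idea behind the "emergent world
# representation" argument. (With six samples this is only a demo, of course.)
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe training accuracy:", probe.score(features, labels))
```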
[1] https://arxiv.org/abs/2210.13382