We know quite well how it does it. It's applying extrapolation to its lossily compressed representation. It's not magic, and the HN crowd of technically proficient folks, especially, should stop treating it as such.
That is not a useful explanation. "Applying extrapolation to its lossily compressed representation" is pretty much the definition of understanding something. The details and interpretation of the representation are what is interesting and unknown.
We can use n-gram frequency data from a text to generate sentences, and some of them will be pretty good and will fool a few people into believing that some solid language processing is going on.
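For concreteness, here is a minimal sketch of that kind of generator, counting bigram frequencies over a tiny made-up corpus and sampling each next word in proportion to how often it followed the previous one (the corpus and names are just for illustration):

    import random
    from collections import defaultdict, Counter

    # Tiny made-up corpus; a real one would be much larger.
    corpus = "the cat sat on the mat and the dog sat on the rug".split()

    # Count how often each word follows each other word (bigram counts).
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def generate(start, length=8):
        word, out = start, [start]
        for _ in range(length):
            options = follows.get(word)
            if not options:
                break
            # Sample the next word weighted by observed frequency.
            word = random.choices(list(options), weights=list(options.values()))[0]
            out.append(word)
        return " ".join(out)

    print(generate("the"))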
LLM AI is different in that it does produce helpful results, not only entertaining prose.
It is practical for users today to replace most uses of web search with a query to an LLM.
The way the token prediction operates, it surfaces facts and renders them in grammatically correct language.
Which is amazing given that, when the thing is generating a response that will be, say, 500 tokens long and has produced 200 of them, it has no idea what the remaining 300 will be. Yet it has committed to those 200, and often the whole thing will make sense once the remaining 300 arrive.
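Roughly, the generation loop works like the toy sketch below (the dummy scoring function is made up and no real model or API is implied): it conditions only on the tokens produced so far, samples one next token, and never goes back to revise earlier output.

    import random

    def dummy_model(tokens):
        # Stand-in for a real next-token predictor: it scores a fixed toy
        # vocabulary based only on the prefix seen so far.
        vocab = ["often", "the", "whole", "thing", "makes", "sense", "."]
        return {w: 1.0 + ((len(tokens) + i) % 3) for i, w in enumerate(vocab)}

    def generate(model, prompt, max_new_tokens=10):
        tokens = list(prompt)
        for _ in range(max_new_tokens):
            scores = model(tokens)                  # scores for the next token only
            words, weights = zip(*scores.items())
            tokens.append(random.choices(words, weights=weights)[0])  # committed for good
        return tokens

    print(" ".join(generate(dummy_model, ["when"])))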
The research posted demonstrates the opposite of that within the scope of sequence lengths they studied. The model has future tokens strongly represented well in advance.