The mystery is how the data is encoded in the parameters and why LLM performance scales so well with parameter count. The key seems to be almost orthogonal vectors, which let neural networks store far more features than they have dimensions: roughly 2^(cn) such vectors can fit in an n-dimensional space, for some constant c. Because almost orthogonal vectors have very small dot products, they interfere with each other only minimally, allowing many concepts to coexist with limited crosstalk. This is what enables superposition.
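A quick way to see this numerically is with a minimal sketch (NumPy, with the dimension n and the vector count chosen purely for illustration): random unit vectors in a high-dimensional space turn out to be nearly orthogonal to one another, so far more than n of them can coexist with only small pairwise dot products.

```python
import numpy as np

# Illustrative sketch: random unit vectors in n dimensions are nearly
# orthogonal, so many more than n directions can coexist with only
# small pairwise interference -- the geometric basis of superposition.
rng = np.random.default_rng(0)

n = 512              # dimensionality (chosen for illustration)
num_vectors = 4096   # far more vectors than dimensions

# Sample random vectors and normalize each to unit length.
vectors = rng.standard_normal((num_vectors, n))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Pairwise dot products; exclude the diagonal (self dot products are 1).
dots = vectors @ vectors.T
off_diag = dots[~np.eye(num_vectors, dtype=bool)]

print(f"{num_vectors} vectors in {n} dimensions")
print(f"max |dot product| between distinct vectors: {np.abs(off_diag).max():.3f}")
print(f"mean |dot product| between distinct vectors: {np.abs(off_diag).mean():.3f}")
```

Even with eight times as many vectors as dimensions, the typical dot product between distinct vectors stays small (on the order of 1/sqrt(n)), which is exactly the "limited crosstalk" property described above.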