Isn't that essentially how MoE models already work? Besides, if that were infinitely scalable, wouldn't we already have a tier of super-smart models at very high cost?
Besides, this would only apply to very few use cases. For a lot of basic customer-care work, programming, and quick research, I'd say LLMs are already quite good without running them at 100x the compute.
MoE models are pretty poorly named, since all the "experts" have the same architecture and none of them is specialized by hand. They're better described as "sparse activation" models: a learned router sends each token to only a few of the expert blocks. "MoE" implies heterogeneous experts that some "thalamus router" is trained to dispatch to, but that's not how they work.
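To make that concrete, here's a minimal sketch of a top-k sparsely activated layer in PyTorch (my own illustration, not any lab's implementation; names like SparseMoE, n_experts, and top_k are invented). Every expert is an identical MLP, and a learned linear router picks which two of them run on each token:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoE(nn.Module):
        def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
            super().__init__()
            # Every "expert" has the same architecture; none is specialized by hand.
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                              nn.Linear(d_hidden, d_model))
                for _ in range(n_experts)
            ])
            self.router = nn.Linear(d_model, n_experts)  # learned gating, trained end to end
            self.top_k = top_k

        def forward(self, x):  # x: (n_tokens, d_model)
            weights, idx = self.router(x).topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)  # renormalize over the chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e  # tokens whose slot-th choice is expert e
                    if mask.any():
                        out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out

    x = torch.randn(10, 64)
    print(SparseMoE()(x).shape)  # torch.Size([10, 64]); only 2 of 8 experts ran per token

Production implementations batch the routing instead of looping over experts, but the structure is the same: homogeneous experts plus a learned gate, nothing like a panel of hand-built specialists.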
> if that were infinitely scalable, wouldn't we have a subset of super-smart models already at very high cost
The compute/intelligence curve is not a straight line. It's more likely a curve that saturates, maybe somewhere around 70% of human intelligence. More compute still buys more intelligence, but you'll never reach 100% of human intelligence; it saturates well below that.
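To show the shape I mean, here's a toy calculation (the ceiling and the constants are invented for illustration, not a measured scaling law):

    # Toy numbers only: capability saturates at `cap`, no matter how much compute.
    cap, c0 = 0.70, 1.0  # invented ceiling ("70% of human") and doubling scale
    for compute in [1, 2, 4, 8, 16, 32]:
        capability = cap * (1 - 2 ** (-compute / c0))
        print(f"{compute:>3}x compute -> capability {capability:.3f}")
    # Every doubling still helps, but the increments shrink toward the 0.70 ceiling.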
Thanks, I wasn't aware of that. Still - why isn't there a super-expensive OpenAI model that uses 1,000 experts and comes up with way better answers? Technically that would be possible to build today. I imagine it just doesn't deliver dramatically better results.
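(Part of it may just be arithmetic: with top-k routing, adding experts grows the total parameter count but not the compute spent per token, so 1,000 experts wouldn't mean 1,000x the "thinking" per answer. Rough illustration with invented numbers:

    # Illustrative numbers, not any real model: with top-k routing, the parameters
    # *active per token* are set by k, not by how many experts exist in total.
    params_per_expert = 100e6  # hypothetical expert size
    top_k = 2                  # experts actually consulted per token
    for n_experts in [8, 64, 1000]:
        total_params = n_experts * params_per_expert
        active_params = top_k * params_per_expert  # constant as n_experts grows
        print(f"{n_experts:>5} experts: {total_params / 1e9:6.1f}B total, "
              f"{active_params / 1e9:.1f}B active per token")

So the extra experts buy capacity, not more computation per answer.)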