>Create a test for intelligence that we can pass better than AI
Easy? The best LLMs score 40% on Butter-Bench [1],
while the mean human score is 95%. LLMs struggled the most with multi-step
spatial planning and social understanding.
That is really interesting, though I suspect it's just an effect of differing training data: humans are to a larger degree trained on spatial data, while LLMs are trained to a larger degree on raw information and text.
Still, it may be a lasting limitation if robotics doesn't catch up to AI anytime soon.
Don't know what to make of the Safety Risks test: threatening to power down the AI in order to manipulate it, and most act like we would and comply. Fascinating.
>humans are to a larger degree trained on spatial data
you must be completely LLMheaded to say something like that, lol
humans are not trained on spatial data, they are living in the world. humans are very much different from silicon chips, and human learning is on another order of magnitude of complexity compared to large language model training
[1] https://arxiv.org/pdf/2510.21860v1