That's true, but I think humans would stumble a lot too (try reading old printed text from the 18th century where they ufed "f" inftead of "s" in print, it'f a real trick to get paft).
However, humans are pretty adept at discerning images, even ones outside the norm. I really think there is some kind of architectural block hampering transformers' ability to really "see" images. For instance, if you show any model a picture of a dog with 5 legs (a fifth leg photoshopped onto its belly), they all say there are only 4 legs, and will argue with you about it. Hell, GPT-5 even wrote a leg-detection script in Python (impressive) that detected all 5 legs, then declared the script was bugged and tweaked the parameters until one of the legs wasn't detected, lol.