Both the text-to-speech and the speech-to-text models launched here suffer from reliability issues due to combining instructions and data in the same stream of tokens.
Thanks for the write up. I've been writing assembly lately, so as soon as I read your comment, I thought "hmm reminds me of section .text and section .data".
I'm not yet sure how much of a problem this is for real-world applications. I wrote a few notes on this here: https://simonwillison.net/2025/Mar/20/new-openai-audio-model...