I did a back-of-the-envelope calculation: OpenAI has roughly 100M monthly active users; assume 10K tokens of usage per user per month ($20 would pay for about 600K tokens on the API), and they generate about 1T tokens of chat logs per month.
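A minimal sketch of that arithmetic (the user count and per-user token figure are the assumptions above, not reported numbers):

```python
# Back-of-the-envelope: chat tokens generated across the user base per month.
monthly_active_users = 100e6        # assumed ~100M MAU
tokens_per_user_per_month = 10e3    # assumed ~10K tokens per user per month

monthly_tokens = monthly_active_users * tokens_per_user_per_month
print(f"{monthly_tokens:.2e} tokens/month")  # 1.00e+12, i.e. ~1T tokens/month
```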
This dataset would be focused on human interests (in-domain for users) and contain AI errors (in-domain for the model). It's LLM output with humans in the loop and tools - code execution, search, APIs. So it is a good basis for the next dataset. I think OpenAI has amassed about as much chat-log text as the organic data reportedly used for GPT-4, which was rumoured to be 13T tokens.
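If the monthly rate above holds, a year of chat logs lands in the same ballpark as that rumoured corpus (a rough sketch; the 13T figure is just the rumour cited above):

```python
# Rough comparison: a year of chat logs vs. the rumoured GPT-4 pretraining data.
monthly_tokens = 1e12        # ~1T tokens/month from the estimate above
rumored_gpt4_tokens = 13e12  # rumoured ~13T organic tokens used for GPT-4

yearly_tokens = 12 * monthly_tokens
print(f"~{yearly_tokens / 1e12:.0f}T tokens/year "
      f"({yearly_tokens / rumored_gpt4_tokens:.0%} of the rumoured GPT-4 corpus)")
# ~12T tokens/year (92% of the rumoured GPT-4 corpus)
```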
It's surprising how much synthetic data can be generated per year. And OpenAI can do this with humans in the loop essentially for free, since the paying users subsidize everyone else. We then benefit 6-12 months later, when open-source models trained on data exfiltrated from OpenAI's models catch up.