Yeah they absolutely do not use the pile.

talldayo · on April 16, 2024

GPT-Neo and Llama were both trained on The Pile, and both of those were fairly influential releases. That's not to say they don't also use other resources, but I see no reason not to use The Pile; it's enormous.

It's also not everything there is, but for public preservation purposes I think the current archives are fine. If Google or Meta turn out to have been secretly stockpiling old training data without our knowledge, I'm not exactly sure what "we" would lose.