
Yeah, they absolutely do not use The Pile.


GPT-Neo was trained on The Pile, and LLaMA's training mix included The Pile's Books3 subset; both were fairly influential releases. That's not to say they don't also use other resources, but I see no reason not to use The Pile; it's enormous.

It's also not everything there is, but for public preservation purposes I think the current archives are fine. If Google or Meta turn out to have been secretly stockpiling old training data without our knowledge, I'm not exactly sure what "we" would lose.



