Hacker News

> in the end intelligence arises. It didn't need magical ingredients.

That's the current prevailing hypothesis, but we don't yet understand the phenomenon of intelligence well enough to definitively rule out any magical ingredients: unknown variables or characteristics of the system, inputs, or data that made it possible for intelligence to emerge.

This proposed snapshot of the web, before it gets further "contaminated" by synthetic AI/LLM-generated data, might prove to be valuable or it might not. The premise could be wrong. Maybe we learn that there's nothing fundamentally special about human-generated data, compared to synthetic data derived from it.

It seems worthwhile to consider, though, in case it turns out that there is some as-yet-unknown quality of the more or less "pure" human data. In the metaphor of low-background steel, we could be entering a period of unregulated nuclear testing without being fully aware of the consequences.



I don't buy this at all. AI-generated data is a real part of the environment now. The thing to modify is the loss function, not the training data. You need to be able to evaluate text on the internet, and so do models.

This idea of contamination by AI versus pristine human data isn't persuasive to me at all. It feels like a continuation of the mistaken idea that LLMs are mere parrots.



