

sailingparrot 10 months ago | on: OpenAI says it has evidence DeepSeek used its mode...

> First of all, training off of data generated by another AI is generally a bad idea because you'll end up with a strictly less accurate model (usually).

That is not true at all. We have known how to solve this for at least 2 years now. All the latest state of the art models depend heavily on training on synthetic data.

bjourne 10 months ago

https://www.nature.com/articles/s41586-024-07566-y

sailingparrot 10 months ago

Key point from your linked paper:

> We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models

No one is training on indiscriminate synthetic data. It's very much discriminated, but still synthetic.
