People say this, but when it comes to AI models, the training data is not owned ...

sheepdestroyer · 2025-01-29T16:01:29 1738166489

They could easily list the data used, though. These datasets are mostly known and floating around. When they construct their own, they could provide instructions for replication too.

coliveira · 2025-01-29T16:03:41 1738166621

They could, but even if they gave this list, the detractors would still say it is not open source.

rvnx · 2025-01-29T16:24:17 1738167857

Yes, and as a bonus they may get sued, which in the long term makes free / offline models non-viable.

It would be so much better if all models were trained with LibGen.

Timon3 · 2025-01-29T17:57:38 1738173458

Isn't this the same situation that any codebase faces when one thinks about open sourcing it? I can't legally open source the code I don't own.