This proposal gets made pretty frequently in one form or another, and (at least on HN) seems to usually get struck down on this or that procedural ground.
But as the various regulatory and judicial and legislative processes grind through different parts of the modern intellectual property issue made so abundantly legible by the AI training data gold rush, it seems ever more clear that one way or another, we’re going to get a new social contract on IP.
Leaving aside for a moment the thicket of laws, precedents, jurisdictions, and regulatory inertia: we can vote with our feet as both customers and contributors for common sense now.
So how about the following compromise: promote innovation by liberalizing the posture around training on roughly “the commons”, but insist that the resulting weights are likewise available to the public. Why do I have to take someone’s word for it that they’ve got a result around superposition or whatever in mech interp? I’d like to see it work, given that it’s everyone’s data pushing those weights.
I speak only for myself but plenty of people seem to agree: I don’t mind big companies training on generally available data, I mind the IP-laundering. Compete on cost, compete on value-added software stacks, compete on vertical integration. There is lots of money to be made building a better mousetrap in terms of code and infrastructure and product innovation.
Conduct the research in the open. None of this would be possible without an ocean of research and data subsidized in whole or in part by the public. Asserting any form of ownership over the result might end up being legal, but it will never be ethical.
Meta isn’t perfect on this stuff, but they’re by far the actor doing the most to pull the conversation in that direction. Let’s encourage them to keep pushing the pace on stuff like LLaMA 3.
> But as the various regulatory and judicial and legislative processes grind through different parts of the modern intellectual property issue made so abundantly legible by the AI training data gold rush, it seems ever more clear that one way or another, we’re going to get a new social contract on IP.
What do you mean by that? As far as I'm aware, for ANYTHING you publish, whether on the internet or not, if there isn't a copyright notice you should assume "all rights reserved".
Yes, that's correct. Under the Berne Convention, copyright for a work and any derivatives is held by the author unless they explicitly disclaim it or another legal provision applies (e.g. fair use for teaching or parody).
However, does an LLM count as a derivative work or a transformative one? That's something for the lawyers to answer.
Artificial intelligence currently has no concept of responsibility (not legal, not ethical), and the law will never pose an existential threat to it. The only way I can think of to create that accountability, as of right now, is that every single product touched by AI must have a human who is legally responsible for it.
This has an easy answer — it’s just not the one that people who desperately want to use LLMs for copyright washing want to hear.
The test for what constitutes a derivative work has not changed; it’s the same whether a single human author produced something, or a team of humans, or an LLM. It will be up to a court to decide whether a work is similar enough to be considered derivative.
If an LLM spits out a verbatim copy, that’s obviously infringement. But if the LLM spits out something similar? Well if the LLM spits out something like George Harrison’s My Sweet Lord [1], a court may well decide that it’s derivative of He’s So Fine. Especially if the LLM “subconsciously” “knew” about He’s So Fine because it was part of the training corpus.
What are the opinions on [0]? What's the scene for language rather than image?
Also, what are the opinions on making generative AI (that doesn't ask permission from creators) public domain? Should donation money that exceeds the cost of hosting the work, paid to people/groups "creating" with AI, be a violation of the license? Are you allowed to run the models on hardware made by for-profit entities, like Nvidia?
These frontier models have been shown to:
- regurgitate entire passages word for word, until that behavior is publicized and quickly RLHF'd away
- rip GitHub repos almost entirely (some new Sonnet 3.5 demos Anthropic employees were bragging about on Twitter were basically 1:1 with a person's public repo)
It seems clear to me that not only can copyrighted work be retained and returned in near entirety by the architectures that undergird current frontier models, but also that the engineers working on these models will readily mistake a model regurgitating work for "creating novel work".
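One crude way to tell regurgitation from novel output is an n-gram overlap check against a suspected source. Here's a minimal sketch, assuming word-level n-grams and an arbitrary threshold; it's an illustration, not anyone's actual evaluation pipeline:

```python
def ngrams(text, n=8):
    # Word-level n-grams; a real pipeline would normalize punctuation too.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output, source, n=8):
    # Fraction of the output's n-grams that appear verbatim in the source.
    out_grams = ngrams(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & ngrams(source, n)) / len(out_grams)

# Hypothetical example: a completion that differs from its source by one word.
completion = "the quick brown fox jumps over the lazy dog and runs into the woods"
suspected  = "the quick brown fox jumps over the lazy dog and runs into the night"
ratio = overlap_ratio(completion, suspected, n=5)
print(f"{ratio:.0%} of the output's 5-grams appear verbatim in the source")
```

A high ratio doesn't prove infringement any more than a low one proves novelty, but it at least makes "returned in near entirety" measurable.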
LLMs are not a general public benefit. Artists whose work is trained upon by text-to-image models aren't made any more whole just because Meta has to share its weights—it just means it's even cheaper for the folks impersonating them or effortlessly ripping off their style to keep doing so.
Meta really does not need to be subsidized when they have so many resources at hand—if LLMs are really hard to train without that much data, then perhaps that's a flaw with the approach instead of something the world has to accommodate.
> I don’t mind big companies training on generally available data, I mind the IP-laundering
Large platforms saw AI and instantly closed their platforms, making it hard or impossible for external actors to mine that "generally available data", hurting their own users and the open web in the process, and then they mined the data themselves.
As long as a "user" can access those platforms, that data can and will be mined. The people working on such solutions just don't publish them publicly until the debate is settled. If I can view information on any website, authenticated or not, then I can build a bot that will do the same. This does not mean it's being done for nefarious reasons. Simply automating the process of bringing what I consider valuable information to me is enough motivation to do it. In my case, the only profit I make is the time saved not manually clicking around to access the data I read over morning coffee.
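To make that concrete, here's a minimal sketch of that kind of personal automation; the URL and CSS selector are placeholders, and requests/BeautifulSoup are just one common choice of tools:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical page: anything a human can read in a browser, a script can
# fetch the same way (adding cookies or auth headers if the site needs them).
URL = "https://example.com/news"

def fetch_headlines(url):
    resp = requests.get(url, headers={"User-Agent": "morning-coffee-bot"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # The selector depends entirely on the site's markup; "h2 a" is a guess.
    return [a.get_text(strip=True) for a in soup.select("h2 a")]

if __name__ == "__main__":
    for headline in fetch_headlines(URL):
        print(headline)
```

Rate limits, JavaScript rendering, and bot detection raise the engineering cost, but none of them change the underlying fact: rendered for humans means scrapable.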
The internet routes around censorship. It's impossible to hide information as long as it's meant to be accessed by a human. If companies want to spend engineering hours building locks, then that's their waste.
Many businesses will fail by wasting time and money creating locks that can and will be circumvented.
I agree that a new social contract is inevitable because the only way to prevent data from being mined is to not produce it to begin with. Period. This I know.
When I was young I used to upload fan art of Naruto to DeviantArt. My badly scanned drawings sucked. Everyone else's sucked. It was cool.
Today DeviantArt has its own AI, which it promotes over its own users' work. I've read some threads by artists discussing where to go next, between DA, Instagram, ArtStation, and several other new and likely not much better platforms. One comment that struck me was someone saying it was just not worth it, and that their time was better spent networking offline at a gallery.
AI art might actually kill online art communities.
We've taken the Internet for granted as grandma and grandpa joined it. Tomorrow people may just get sick of all these algorithms, let go of their smartphones, and go touch some grass. Then every website is just going to be AI bots regurgitating each other's content ad nauseam.
Humans are on the web because of the reach. If AI-generated content steals all the reach, why would anyone post anything on publicly accessible venues instead of just using private ones?
"A new social contract is inevitable because the only way to prevent data from being mined is to not produce it to begin with." But is it though? You are assuming that "not produce it to begin with" is impossible. I'm afraid it's not impossible and the web experiment is at real danger. Maybe not immediately, but will it survive another 20 years in this environment?
If I post something publicly, I'm fine with everyone being able to use it for AI and other data mining. But I'm not OK with only a single company benefiting. And definitely not with it selling my public data. I'm looking at you, Reddit.
The proposal in the article, however, is not about "the commons", it's about content that the users themselves produced, and then they voluntarily gave permission to Meta to use.
Or are you saying that if I produce some type of material, I shouldn't be able to license it for someone else to use freely?
Voluntarily gave permission? Meta never asked anyone for permission to use their content as AI training data! They just opted in every single user via an update of their privacy policy. In order to opt out, you have to discover that this is happening, find the help page with the protest form, write some prose about why this negatively affects you, and hope they acknowledge it.
I have done this, and my request has not even been answered.
Reforming the social contract around IP became infinitely difficult the moment normal people started calling it "intellectual property", forming a tangled mess of legal, moral, and ethical ideas in most people's minds.
Your compromise is exactly the situation I desire but seems untenable to most people.
> I speak only for myself but plenty of people seem to agree: I don’t mind big companies training on generally available data, I mind the IP-laundering.
This removes the big scary emotional part of the debate. Without this, it's weakened quite a bit.
A lot of that "commons" was the work of people who gave absolutely no permission for their data to be used like this, that is, for building a machine intended to compete with them. My hope is we collectively tell these freeloading assholes to pound sand. Public does not mean limitless license.
> So how about the following compromise: promote innovation by liberalizing the posture around training on roughly “the commons”, but insist that the resulting weights are likewise available to the public.
How much would you personally invest in a startup which would spend billions of dollars on a compute cluster only to release the weights publicly after the training is complete?
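For a sense of scale, here's a back-of-envelope using the common C ≈ 6·N·D approximation for training FLOPs; every constant below is an illustrative assumption, not any lab's actual figure:

```python
# Rough training-cost estimate via C ~ 6 * N * D (all numbers are assumptions).
params = 70e9     # 70B-parameter model
tokens = 15e12    # 15T training tokens
flops = 6 * params * tokens                     # ~6.3e24 FLOPs

gpu_flops = 1e15  # peak throughput of a high-end accelerator, FLOP/s
mfu = 0.4         # assumed utilization of that peak
gpu_hours = flops / (gpu_flops * mfu) / 3600    # ~4.4M GPU-hours

cost_per_gpu_hour = 2.0   # assumed $/GPU-hour at scale
print(f"~{gpu_hours / 1e6:.1f}M GPU-hours, ~${gpu_hours * cost_per_gpu_hour / 1e6:.0f}M")
```

Even this conservative sketch puts a single run in the tens of millions of dollars, and the billions sit in the cluster itself plus the many runs it takes to get one good model.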
I'm gladly investing a portion of my wages in such efforts, alongside e.g. education, healthcare, childcare, and infrastructure. And I don't even want any monetary ROI from my investment!
So your plan is to regulate it into complete commercial unviability, where the only source of funding is government bureaucrats? How often does this strategy pay off?
It only exists because it's commercially viable to print and sell books. Also, how much capital do a single library and a single frontier LLM require?
He's making a good point. $100M was invested in Stability. They are out of money with no clear path to product-market fit after giving their model away for free. I'm personally ecstatic that they failed, but it is worth asking about the value this technology is producing compared to the cost.