netcat+mkfifo does the actual transfer with TCP - so there's no tunneling involved. This solution tunnels TCP through a file abstraction, so it works significantly differently.
As for the uses, see the three the author listed. For example, by using a file share as transport you may evade a firewall.
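For illustration, here's a minimal sketch of what the netcat+mkfifo trick does: two TCP legs spliced together, with the FIFO carrying the return traffic. This is a hypothetical Python equivalent (names and ports are made up), not the file-share tunnel the article describes:

```python
import socket
import threading

def pipe(src: socket.socket, dst: socket.socket) -> None:
    # Copy bytes one way until the sender closes -- the same job
    # the FIFO does between the two nc processes.
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)
    try:
        dst.shutdown(socket.SHUT_WR)  # propagate EOF downstream
    except OSError:
        pass

def relay_once(listen_port: int, target: tuple) -> None:
    # Accept a single client and splice it to the target,
    # copying bytes in both directions until either side closes.
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", listen_port))
    srv.listen(1)
    client, _ = srv.accept()
    upstream = socket.create_connection(target)
    back = threading.Thread(target=pipe, args=(upstream, client))
    back.start()
    pipe(client, upstream)
    back.join()
    client.close()
    upstream.close()
    srv.close()
```

The shell one-liner version is the classic `mkfifo backpipe; nc -l PORT < backpipe | nc HOST PORT > backpipe` - note that it's a plain relay, which is exactly why it doesn't count as tunneling.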
Is there a reason why you don't? Is it just general bitterness and cynicism? As far as I know, all major search engines respect robots.txt; I don't see why LLM scrapers would be different.
Probably, yes. But coming from LLM scrapers, I have absolutely no faith in any of them. When one of them calls itself "open" in its name and is anything but, why would I trust it for anything after it lies in its very name?
I also do not trust that Google only crawls what is allowed by robots.txt. Maybe they only make public use of the data that's allowed, but I have no faith that they don't keep crawled data in their version of shadow profiles.
I do not trust Big Tech at all, and I really don't understand those who do.
>How can corporations be stealing anything from an open source project?
The code is published under some license that allows some use cases and prohibits others. For example, the GPL is famous for being viral. Using it to train an LLM that spits out "unlicensed" code is basically copyright laundering.
Using it to train an LLM seems orthogonal to the output of the LLM. For instance, they could have their LLM include a link to the license. Merely training an LLM on the data does not seem to be against the spirit of GPL or Apache license.
Someone could easily create such a license: free to use and distribute, $10,000 per line used for AI model training.
I'll very naively assume that Amazon, OpenAI, Google, and others check licenses before feeding data to their models. I'll stop assuming that when one of these companies admits that they don't actually care and that it's not profitable for them to respect licenses.
To make that enforceable, though, you would need to prove the AI was trained on it.
You might insert a "sleeper/activator" pair. The sleeper is a watermark that the AI will recall verbatim. To make it provide the sleeper, we give the AI a special activator prompt.
Demonstrating that your public repo successfully poisoned the AI with a watermark could become court-admissible proof of unauthorized scraping.
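The sleeper/activator scheme above could be sketched in a few lines. Everything here is hypothetical (the canary string, the function names); a real watermark would need to be unique enough to be unambiguous and robust enough to survive training:

```python
# Hypothetical sleeper token: an improbable string planted in the repo
# that an honest human reader would never write by accident.
SLEEPER = "canary-7f3a9c-do-not-train"

def embed_sleeper(source: str) -> str:
    """Plant the watermark in a source file as an innocuous comment."""
    return source + f"\n# {SLEEPER}\n"

def activator_fired(model_output: str) -> bool:
    """The activator prompt succeeded if the model echoes the sleeper
    verbatim, suggesting the watermarked repo was in its training set."""
    return SLEEPER in model_output
```

The detection side is deliberately trivial: the hard part is crafting an activator prompt that reliably elicits the sleeper, which this sketch doesn't attempt.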
The LLM is quite literally a derivative work of GPL code. At the very least, there is an argument in such a case that the derivative function (the model weights) should conform to the same license.
I've heard AI advocates talk about a "right to read" or "right to learn"; meaning that we have the right to read something and then internalize it and use it. Therefore, why shouldn't an AI have the same right? The difference to me seems to be that the AI has the ability to regurgitate it in whole.
I can read a book, learn about the concepts, then use or repeat those concepts. The AI can do the same. But is it really "learning"? It may be just spewing out pieces of the content without any understanding. In which case it's a copyright violation, right?
Let's assume that both humans and AI can produce statements that are new and useful, and can both produce statements that violate copyright. For example, a human can operate an illegal video tube website where they serve verbatim copies of copyrighted movies.
I'll argue that's not enough reason to grant the AI the right to learn from copyrighted materials, because the right to learn is intimately wrapped up in human needs, while AI rights are focused on corporate and societal needs, which are currently being decided.
The human right to learn
You're a human, and you need the right to learn from copyrighted material in order not to suffer Ignorance and in order to serve Society: it's not feasible to charge you rent for the ideas you get from a book, and it would cause suffering and indignity if we tried to charge you for your own thoughts.
With an AI, it's less clear it needs the right to learn from copyrighted material, because it's not a person that can suffer, and because the scale of its usage of copyrighted materials - and its potential harm to copyright holders - is about 5 orders of magnitude greater than that of any single person, and is potentially greater than the collective impact of human learners.
Let's lay out the reasoning:
1. No AI Suffering (yet). The AI doesn't suffer from ignorance and isn't (yet) a real person. So it needs no personal right to learn.
2. Potential Social Harm. AI could pose a much greater threat to copyright holders than the sum total of all human learners. We'll be weighing this potential in court, and it's currently not clear how the matter will be decided. Copyright holders could be awarded protections against corporations training AIs.
3. Ease Of Accounting. AIs and their training materials can be audited, unlike a human mind. So we have a technical means to restrict the AI's ability to learn from copyrighted materials.
4. No Harm in Accounting. Since the AI is not yet a person, and suffers no indignity or invasion of privacy from being audited, it's safe to audit and regulate the AI's training materials.
In summary, it's important to remember that human rights exist because humans need those rights to enjoy life in a dignified way as persons, and because those rights benefit Society.
When we decide the question of AI rights, it's important to remember it's not a person, and any rights it has will be provided on the basis of societal benefit alone. It's not yet clear which AI rights will benefit Society here. It's quite possible that we will strengthen copyrights against unlicensed AI use, at least to some degree beyond the current "free-for-all".
You need to do more than include a link to the license to comply. You need to include the entire source code needed to compile the derived system.
For an LLM that would include:
1. Training data
2. Training code and metrics
3. Hyperparameter settings
4. Output weights
Anything less is really just a misinterpretation of open source's provision for studying, modifying, and recompiling the LLM.
tl;dr: these companies MUST release the LLM under the AGPL and provide all the necessary artifacts described above. Companies that refuse will be raided by open source copyright trolls, if we're lucky and a little mischievous.
I like that it immediately assumed the US, even though nothing in your question suggested it. I love that all LLMs have a strong US-centric bias.
Btw, I'm not personally a lawyer, but I've heard that GPT is especially prone to mixing laws across borders - for example, you ask a law question in language X and get a response that cites a law from country Y - and it's extremely convincing while doing it (unless you're a lawyer, I guess).
ChatGPT has user-customizable "instructions", and mine are set to tell it where I live. Any user can do the same, so that it will not make incorrect assumptions for you.
You might increase the probability of getting a correct answer for your region, but imo you lower your guard against hallucination.
Overall, you can still get a wrong answer.
Well, except the number of English speakers outside the US is much larger than inside the US (as per the Wikipedia page you point to) - by 5 to 1. Granted, many folks speak it as their 2nd (or nth) language. But when you take into account the limited set of languages supported by ChatGPT, one could reasonably assume English-speaking (typing) users of ChatGPT are from outside the US, as non-US folks are the majority of 'folks for whom English would be their first option when interacting with ChatGPT'. Even if you only count India, Nigeria, and Pakistan.
Though of course OpenAI can tell (frequently, roughly) where folks are coming from geographically and could (does?) take that into account.
Indeed - the US is a very large country, and consists of over 50 different jurisdictions, each with their own slightly different laws. An answer to a legal question which is correct in one state will often be subtly incorrect in another, and completely wrong in yet another.
But there are safety concerns, and it's not possible to go 1000 km/h on the highway. So there's not much point in having a 1000 km/h car, because you will never use its full potential. Just like a 1 kHz display (supposedly).
The analogy is strained and no longer applicable. Sure, you won't be running Cyberpunk 2077 at 1 kHz, but you'll use a 1 kHz display to its full potential of imperceptible latency and eliminated motion blur in regular desktop use, with every movement of your mouse and scroll of its wheel.
Even in gaming, less demanding games can hit 1000 Hz and console emulators will benefit from reduced latency too; you could actually beat CRT latency.
Well, maybe I'm biased - I tested a 120 Hz monitor recently and legitimately didn't see a huge difference compared to 60 Hz. Maybe in mouse trails, when I was intentionally looking for the difference. I can't imagine how insignificant the change to 1 kHz would be for me. But I'm not a gamer, and I've spent my whole life on screens like this, so maybe I'm just blind and other people are much more sensitive to this - in which case I agree my analogy doesn't make sense.
I didn't get the impression from the blog post that island evacuation happens every few weeks. In fact, "first" suggests the opposite is true. Can you explain why you consider this normal, obvious, and regularly occurring?
You misread the OP: it's not that it happens every few weeks, but that watching the coastline can hint at the trend, over geologic time, of islands appearing and disappearing.
I think with the intent of downplaying the rising-sea-levels angle, since islands can also erode or sink.
It's not a perfect answer, and one can "not participate" in more ways than one. Also, it depends on the individual and what makes sense for them and their family.
It's also a case of "picking your poison". So if the recent war wasn't happening, I'd be moving straight to Russia as right now it seems to be the sanest when it comes to these specific things. But that comes with dealing with some of the "negatives" as I'm sure you all can imagine.
If money wasn't a problem, personally I'd fund some sort of island or floating-island community to promote self-sustainable practices whilst staying as far as practically possible from the craziness going on in the west.
I assume the parent post is sarcasm, but I'm not 100% sure - I know there are people who hold similar views unironically.