
> “Pre-training as we know it will unquestionably end,” Sutskever said onstage.

> During his NeurIPS talk, Sutskever said that, while he believes existing data can still take AI development farther, the industry is tapping out on new data to train on. This dynamic will, he said, eventually force a shift away from the way models are trained today. He compared the situation to fossil fuels: just as oil is a finite resource, the internet contains a finite amount of human-generated content.

> “We’ve achieved peak data and there’ll be no more,” according to Sutskever. “We have to deal with the data that we have. There’s only one internet.”

What will replace Internet data for training? Curated synthetic datasets?

There are massive proprietary datasets out there which people avoid using for training due to copyright concerns. But if you actually own one of those datasets, that resolves a lot of the legal issues with training on it.

For example, Getty has a massive image library. Training on it would risk Getty suing you. But what if Getty decides to use it to train their own AI? Similarly, what if News Corp decides to train an AI using its publishing assets (Wall Street Journal, HarperCollins, etc)?



> What will replace Internet data for training? Curated synthetic datasets?

My take is that the access Meta, Google, etc. have to extra data has reduced the amount of research into synthetic data, because they've had such a surplus of real data relative to everyone else.

For example, when I trained object detectors (quite out of date now) I used Blender 3D models, scripts to adjust parameters, and existing ML models to infer camera calibration and overlay orientation. This works amazingly well for subsequently identifying the real analogue of the object, and I know of people doing vehicle training in similar ways using game engines.

There were several surprising tactical details to all this that push the accuracy up dramatically and that you don't see widely discussed, like making sure that properties which are not relevant are properly randomized in the training set, such as the surface texture of the 3D models (e.g. putting random fractal patterns on the object during training makes the detector much more robust to real-world disturbance).
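Roughly the kind of render loop I mean, as a sketch in Blender's Python API (bpy). Object and node names are placeholders, it assumes a node-based material with a Principled BSDF, and it just randomizes the base colour as a stand-in for the fractal textures mentioned above:

    import random
    import bpy

    obj = bpy.data.objects["TargetObject"]   # the model you want to detect (placeholder name)
    cam = bpy.data.objects["Camera"]
    bsdf = obj.active_material.node_tree.nodes["Principled BSDF"]

    for i in range(1000):
        # Randomize the nuisance factors the detector should learn to ignore:
        # surface colour, object pose, camera position.
        bsdf.inputs["Base Color"].default_value = (
            random.random(), random.random(), random.random(), 1.0)
        obj.rotation_euler = (random.uniform(0, 6.283),
                              random.uniform(0, 6.283),
                              random.uniform(0, 6.283))
        cam.location = (random.uniform(-2, 2),
                        random.uniform(-4, -2),
                        random.uniform(0.5, 2.0))

        bpy.context.scene.render.filepath = f"//renders/frame_{i:04d}.png"
        bpy.ops.render.render(write_still=True)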


> What will replace Internet data for training? Curated synthetic datasets?

Perhaps a different take on this: if I wanted to train a "state law" LLM that is exceedingly good at interpreting state law, what are the obstacles to downloading all the law and regulation material for a given state and training an LLM on it until it lands in the 95th percentile of all law trainees and lawyers?

My point is that in that case we already don't need "the Internet". We just need a sufficiently large, curated, domain-specific dataset, and the results we can already get are scary. The "state law" LLM was just an example, but the same logic applies to basically any other domain: want a domain-specific (LLM) expert? Train it.


That's kind of going in a different direction. The big picture is that LLMs have, until this point, gotten better and better from larger datasets alone. See "The Bitter Lesson". But now we're running out of datasets, so the only way we know of to keep improving models' reasoning abilities and everything else is coming to an end.

You're talking about fine-tuning, which yes is a technique that's being used and explored in different domains, but my understanding is that it's not a very good way for models to acquire knowledge. Instead, larger context windows and RAG work better for something like case law. Fine-tuning works for things like giving models a certain "voice" in how they produce text, and general alignment things.

At least that's my understanding as an interested but not totally involved follower of this stuff.
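As a rough sketch of what the RAG route looks like in practice: retrieve the most relevant cases and stuff them into the prompt. TF-IDF stands in for a real embedding model to keep it self-contained, and ask_llm is a placeholder for whatever model API you'd actually call.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("swap in a call to your LLM of choice")

    cases = ["... full text of decision 1 ...",
             "... full text of decision 2 ..."]   # your case-law corpus (placeholder)

    vectorizer = TfidfVectorizer()
    case_vectors = vectorizer.fit_transform(cases)

    def answer(question: str, k: int = 3) -> str:
        # Retrieve the k most similar cases and pass them to the model as context.
        scores = cosine_similarity(vectorizer.transform([question]), case_vectors)[0]
        top = scores.argsort()[::-1][:k]
        context = "\n\n".join(cases[i] for i in top)
        return ask_llm(f"Using only the cases below, answer the question.\n\n"
                       f"{context}\n\nQuestion: {question}")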


A human being doesn’t need to read the entire internet to pass the state bar.

Seems to me that we need new ideas?


You need context for the dry statutes.

Sure, you download all the legal arguments and hope that putting all this on top of a general LLM, which has enough context to deal with the usual human, American, contemporary stuff, is enough.

The argument is that it's not really enough for the next jump (as that would need "exponentially" more data), as far as I understand.


I don't understand the limitation. E.g. how much data do you need to train a "state law" specific LLM that doesn't know anything but that?

Such an LLM doesn't need to have 400B parameters, since it's not a general-knowledge LLM, though perhaps I'm wrong on this(?). So my point is rather that it may very well be, let's say, a 30B-parameter LLM, which in turn means that we might have just enough data to train it. Larger contexts in smaller models are a solved problem.


> how much data do you need to train a "state law" specific LLM that doesn't know anything but that?

Law doesn’t exist in a vacuum. You can’t have a useful LLM for state law that doesn’t have an exceptional grounding in real-world objects and mechanics.

You could force a bright young child to memorize a large text, but without a strong general model of the world, they’re just regurgitating words rather than being able to reason about them.


Counter-argument: code does not exist in a vacuum, yet we have small and mid-sized LLMs that can already output reasonable code.


I'm going to push back on "produce reasonable code".

I've seen reasonable code written by AI, and also code that looks reasonable but contains bugs and logic errors that can be found if you're an expert in that type of code.

In other words, I don't think we can rely solely on AI to write code.


I've seen a lot of code written by humans that "looks reasonable but contains bugs and logic errors that can be found if you're an expert in that type of code".


What does that have to do with the state of the AI?


Generally they’ve been distilled from much larger models, but also, code is a much smaller domain than the law.


Code is both much smaller as a domain and less prone to the chaos of human interpretation. There are many factors that go into why a given civil or criminal case in court turns out how it does, and often the biggest one is not "was it legal". Giving a computer access to the full written history of cases doesn't give you any of the context of why those cases turned out. A judge or jury isn't going to include in the written record that they just really didn't like one of the lawyers. Or that the case settled because one of the parties just couldn't afford to keep going. Or that one party or the other destroyed/withheld evidence.

Generally speaking, your compiler won't just decide not to work as expected. Tons of legal decisions don't actually follow the law as written. Or even the precedent set by other courts. And that's even assuming the law and precedent are remotely clear in the first place.


A model that's trained on legal decisions can still be used to explore these questions, though. The model may end up being uncertain about which way the case will go, or even more strikingly, it may be confident about the outcome of a case that then is decided differently, and you can try and figure out what's going on with such cases.
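A crude sketch of what that exploration could look like: fit a simple bag-of-words classifier on decided cases, then flag the cases where the model is confident but the court went the other way. The corpus and labels below are tiny placeholders; a real attempt would need thousands of decisions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    train_texts = [
        "plaintiff alleges breach of contract, defendant failed to deliver goods ...",
        "defendant moves to dismiss for lack of standing ...",
        "employee claims wrongful termination after reporting safety violations ...",
        "landlord seeks eviction for repeated non-payment of rent ...",
    ]                                 # placeholder case texts
    train_outcomes = [1, 0, 1, 0]     # 1 = claimant prevailed, 0 = did not (placeholder labels)

    vec = TfidfVectorizer().fit(train_texts)
    clf = LogisticRegression().fit(vec.transform(train_texts), train_outcomes)

    # Held-out decided cases: compare the model's confidence with the actual outcome.
    held_out_texts = ["tenant withheld rent citing uninhabitable conditions ..."]
    held_out_outcomes = [1]

    probs = clf.predict_proba(vec.transform(held_out_texts))[:, 1]
    for text, p, actual in zip(held_out_texts, probs, held_out_outcomes):
        if (p > 0.8 and actual == 0) or (p < 0.2 and actual == 1):
            print("Confident but wrong -- worth digging into:", text[:60], p)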


But what value does that have? The difference between an armchair lawyer and a real, actual lawyer is knowing when something that is legal/illegal is nonetheless unlikely to be seen that way in a court, or brought to a favorable verdict. It's knowing which cases you can actually win, and how much it'll cost, and why.

Most of that is not in the scope of what an LLM could be trained on, or even what an LLM would be good at. What you'd be training in that case is an opinion columnist or a Twitter poster, not an actual lawyer.


The point is not to replace all of the lawyers or programmers, but rather that we will no longer need so many of them, since a lot of their expertise is becoming a commodity. This is a fact, and there have been many, many examples of it.

A friend of mine who has no training in SQL, or computer science at all, is now all of a sudden able to crunch through complex SQL queries because of the help he gets from LLMs. He, or more specifically his company, no longer needs to hire an external SQL expert since he can manage it himself. He will probably not write perfect SQL, but it's going to be more than good enough, and that's all that actually matters.

The same thing happened, at a much smaller scale, with Google Translate. Ten years ago we weren't able to read foreign-language content. Today? It's not even a click away, because Chrome does it for you automatically, so reading any website we wish has become a commodity.

So history has already shown us that "real translators", "real SQL experts", and "real XY experts" have already been replaced by their "armchair" alternatives.


But that ignores that the stakes of law are high enough that you often cannot afford to be wrong.

30 years ago, the alternative to Google Translate was buying a translation dictionary or hiring a professional, neither of which is something you'd do for material you didn't care much about. Yes, I can go look at a site/article that's in a language I don't speak, get it translated, and generally get the idea of what it's saying. If I'm just trying to look at a restaurant's menu in another language, I'm probably fine. I probably wouldn't trust it if I had serious food allergies, or was trying to translate what I could legally take through customs. If you're having a business meeting about something, you're probably still hiring a real human translator.

Yes, stuff has become commodity-level, but that just broadens who can use it, assuming they can afford for it to be wrong, and for them to have no recourse if it is. Google Translate won't pay your hospital bills if you rely on it to know there aren't allergens in your food and it mistranslated things. ChatGPT won't do the overtime to fix the DB if it gives you a SQL command that accidentally truncates the entire Dev environment.

Almost everything around law in most countries doesn't have a "casual usage" where you can afford to be wrong. Even the most casual stuff you might go to a lawyer about, such as setting up a will, is still something where trying to just do it yourself can create a huge legal mess. I've known friends whose relatives "did their own research" and wrote their own wills, and when they died, most of their estate's value was consumed by the legal issues of trying to resolve it.

As I said before: a legal LLM may be fine for writing opinion pieces or informing arguments on the internet, but messing up even basic stuff about the law can be insanely costly if it ends up mattering, and most people won't know what will end up mattering. Lawyers bill hundreds an hour, and bailing you out of an LLM-deluded mess of your own making could easily take tens of hours.


Deploying buggy code into data center production can easily cost millions of dollars, and yet one of the primary uses of LLMs today is exactly software engineering. Accountability exists in every domain, so that argument doesn't make law any different from anything else. You will still have an actual human signing off on the legal interpretation or the code pull request. It will just happen that we won't need 10 people for that job, but 1. And at this point I believe this is inevitable.


Legal reasoning involves applying facts to the law, and it needs knowledge of the world. The expertise of a professional is in picking the right/winning path based on their study of the law, the facts, and their real-world training. The money is in codifying that to teach models to do the same.


> code is a much smaller domain than the law

I agree, but I'd add that code as a domain is far vaster than any AI can currently handle.

AIs do well on mainstream languages for which lots of open-source code examples are available.

I doubt they'd do so well on some obscure proprietary legacy language. For example, large chunks of the IBM i minicomputer operating system (formerly known as OS/400) are still written in two proprietary PL/I dialects, PL/MI and PL/MP. Both languages are proprietary: the compiler, the documentation, and the code bases are all IBM confidential, and nobody outside of IBM gets to see them (except maybe under an NDA if you pay $$$$). I wonder how well an AI would do on that code base? I think it would have little hope unless IBM specifically fine-tuned an AI for those languages based on their internal documentation and code.


> unless IBM specifically fine-tuned an AI for those languages based on their internal documentation and code.

Why do you think this isn't already the case, or won't be in the near future? Because that's exactly what I believe is going to happen, given the current state and pace of advancement of LLMs. There's certainly a large incentive for IBM to do so.
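For a sense of what such a fine-tune could look like with today's open tooling (Hugging Face transformers/peft/datasets), here is a rough sketch. Nothing here reflects IBM's actual setup; the base model name and the "plmp_sources/*.txt" corpus path are invented placeholders.

    from datasets import load_dataset
    from peft import LoraConfig, TaskType, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    base = "some-open-code-model"                    # placeholder base checkpoint
    tokenizer = AutoTokenizer.from_pretrained(base)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token    # some tokenizers lack a pad token

    model = AutoModelForCausalLM.from_pretrained(base)
    # LoRA keeps the fine-tune cheap: only small adapter matrices get trained.
    # "all-linear" is a coarse choice; a real run would target the attention projections.
    model = get_peft_model(model, LoraConfig(task_type=TaskType.CAUSAL_LM,
                                             r=16, lora_alpha=32,
                                             target_modules="all-linear"))

    # Hypothetical internal corpus: PL/MP sources plus documentation, as plain text.
    corpus = load_dataset("text", data_files={"train": "plmp_sources/*.txt"})
    corpus = corpus.map(lambda b: tokenizer(b["text"], truncation=True, max_length=2048),
                        batched=True, remove_columns=["text"])

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="plmp-lora",
                               per_device_train_batch_size=1, num_train_epochs=1),
        train_dataset=corpus["train"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()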


> code is a much smaller domain than the law.

The law of an average EU country fits in several hundred, let's say even a few thousand, pages of text. A specification. Very well known. Low frequency of updates. But code? The opposite in every respect, so I'm not sure I can agree with this point at all.


Right, but you're missing the point here that interpreting the law requires someone with a law degree, and all the real-world context that they have, and all the subtle knowledge about what things mean.

The Bible is also a short and well-known text, but if I want to answer religious questions for observant Christians, I can't just train a model on that alone. You need deep real-world context to understand that "my buddy made SWE II and I'm only SWE I and it's eating me up" is about the biblical notion of covetousness.


And then I guess you're also missing the point that interpreting and writing code also requires an expert, and that in that respect it is no different from law. I could argue that engineering is more complex than interpreting law, but that's not the point here. Subtlety and ambiguity are large in both domains. So I don't see the point you're trying to make. We can agree to disagree, I guess.


For a “legal LLM” you need three things: general IQ / common sense at a substantially higher level than current, understanding of the specific rules, and hallucination-free recall of the relevant legal facts/cases.

I think it’s reasonable to assume you can get the latter two with a small corpus IF you have an IQ-150 AGI. Empirically, the currently known method for increasing IQ is to make the model bigger.

Part of what you’re getting at is possible, though: once you have the big model, you can distill it down to a smaller number of parameters without losing much capability in your chosen narrow domain. So it forgets physics and sports but remains good at law. That doesn’t help you with pushing the capability frontier, though.
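A minimal sketch of that distillation step in PyTorch; teacher, student, and law_batches stand for a large frontier model, a smaller model, and a narrow-domain (e.g. legal text) token stream, all placeholders rather than a real training setup.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # Student matches the teacher's temperature-softened next-token distribution.
        # The usual T*T factor keeps gradient magnitudes comparable across temperatures.
        soft_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_student = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)

    def distill(teacher, student, law_batches, lr=1e-4):
        teacher.eval()
        optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
        for tokens in law_batches:              # only the narrow domain's data
            with torch.no_grad():
                teacher_logits = teacher(tokens)
            loss = distillation_loss(student(tokens), teacher_logits)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()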


And then your Juris R. Genius gets a new case about two Red Sox fans getting into a fight and, without missing a beat, starts blabbering about how overdosing on red pigment from their undergarments caused their rage!


Yeah, I think for the highest-value activities (e.g. legal advice) you'd expect to run the full frontier model.

But maybe you want to run a smaller one locally on your iPhone for privacy and accept the capability loss.


The problem remains the size of the dataset. You aren't going to get large enough datasets in these specific domains.


The big frontier models already have all laws, regulations, and cases memorized/trained on, given that they are public. The real advancement is in experts codifying their expertise/reasoning for models to learn from. Legal is no different from other fields in this.


So, fine-tuning the model to the law of a specific country. Or fine-tuning the model to the problem space of a specific codebase/product. You hire 10 law experts instead of 100. Or you hire 10 programmers instead of 100. Expertise is becoming a commodity, I'm afraid.


I think we're not close to running out of training data. It's just that we want the knowledge in those texts, but not necessarily their behavior. LLMs are very bad at recalling popular memes (known to any seasoned netizen) if they got no press coverage. Maybe training on 4chan isn't so pointless if you could make the model memorize it, but not imitate it.

Also, what about movie scripts and song lyrics? Transcripts of well known YouTube videos? Hell, television programs even.


All the publicly accessible sources you mentioned have already been scraped or licensed to avoid legal issues. This is why it’s often said, “there’s no public data left to train on.”

For evidence of this, consider observing non-English-speaking young children (ages 2–6) using ChatGPT’s voice mode. The multimodal model frequently interprets a significant portion of their speech as “thank you for watching my video,” reflecting the child-speech patterns it learned from YouTube videos.


We've run out of training data that definitely did not contain LLM outputs.


What about non-text modalities - image and video, specifically?


Video is probably still fine, but images sourced from the internet now contain a massive amount of AI slop.

It seems, for example, that many newsletters, blogs, etc. resort to using AI-generated images to give some color to their writing (which is something I too intended to do, before realizing how much it annoys me).


Humans don't need trillions of tokens to reason, or to know what they know. While a certain part of that comes from evolution, I think we have already matched the part that came from evolution by using internet data: basic language skills, basic world modelling. Current pretraining takes a lot more data than a human would need, and you don't need to look at every Getty image to be able to draw a picture; neither would a self-aware/self-improving model (whatever that means).

To reach expert level in any field, just training on next-token prediction over internet data, or any data, is not the solution.


> Humans don't need trillions of tokens

I wonder about that. We can fine-tune on calculus with far fewer tokens, but I'd be interested in some calculation of how many tokens evolution provides us (it's not about the DNA itself, but all the other things that were explored and discarded and are now out of reach), and also the sheer amount of physics learnt by a baby crawling around and putting everything in its mouth.


Yes, as I said in the last comment: with current training techniques, one internet's worth of data is enough to give models what evolution gives us. For further progress, I believe we will need different techniques that make the model self-aware about its own knowledge.

Also, I believe a person who is blind and paralyzed for life could still attain knowledge if educated well enough. (Can't find any study on this, tbh.)


Yeah, blind and paralysed from birth: I'm doubtful that hearing alone would give you the physics training. Although if it can be done, then it means the evolutionary pre-training is even more impressive.


> Humans don't need trillions of tokens to reason, or to know what they know.

It seems to me that by the time we’re 5-6 we’ve likely already been exposed to trillions of tokens. Just think of how many hours of visual and audio input have reached your brain by that point. We also have constant input from senses like touch and proprioception that help shape our understanding of the world around us.

I think there are plenty more tokens freely available out in the world. We just haven’t figured out how to have machines capture them yet.
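A very rough back-of-envelope with made-up but plausible numbers (about 12 waking hours a day, vision treated as ~30 frames per second at ~256 tokens per frame):

    seconds_awake = 6 * 365 * 12 * 3600       # first six years, ~9.5e7 seconds
    visual_tokens = seconds_awake * 30 * 256  # ~7e11, i.e. getting on for a trillion
    print(f"{visual_tokens:.1e} visual tokens, before audio/touch/proprioception")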


The ones that stand out to me are industries like pharmaceuticals and energy exploration, where the data silos are the point of their (assumed) competitive advantages. Why even the playing field by opening up those datasets when keeping them closed locks in potential discoveries? Open data is the basis of the Internet. But whole industries are based on keeping discoveries closely guarded for decades.


Synthetic datasets are useless (other than for very specific purposes, such as enforcing known strong priors, and even then it's way better to do it directly by changing the architecture). You're better off spending that compute by making multiple passes over the data you do have.


This is contrary to what the big AI labs have found. Synthetic data is the new game in town.


Apparently Ilya is saying in this talk that it doesn’t work.


Ilya could be wrong. I don’t think the question is decided yet in general. We already know that in lots of fields fake data can be used in ways that are as useful as, or even more useful than, the real thing [1], but in my understanding that tends to be in situations where we have an objective function that is unambiguous and known beforehand. Meta has some very impressive work on synthetic data for training, and my (uninformed) read was that it is the state of the art in e.g. voice recognition at the moment. [2]

[1] E.g. Sobol sequences in a Monte Carlo simulation instead of real random numbers. They allow better coverage of the simulation space from fewer paths. https://www.sciencedirect.com/science/article/abs/pii/004155...

[2] A good overview seems to be https://arxiv.org/html/2404.07503
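A quick illustration of the point in [1]: estimating pi by Monte Carlo with plain pseudo-random points versus a Sobol low-discrepancy sequence (scipy.stats.qmc), with the same point budget; the quasi-random estimate is typically noticeably closer.

    import numpy as np
    from scipy.stats import qmc

    n = 2 ** 12
    pseudo = np.random.default_rng(0).random((n, 2))
    sobol = qmc.Sobol(d=2, scramble=True, seed=0).random(n)

    def pi_estimate(pts):
        # Fraction of points inside the quarter unit circle, times 4.
        return 4 * np.mean(pts[:, 0] ** 2 + pts[:, 1] ** 2 <= 1.0)

    print("pseudo-random:", pi_estimate(pseudo))
    print("Sobol        :", pi_estimate(sobol))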


Most priors are not encodable as architecture though, or only partially.


I think this will be the one thing that causes Google to revive its plan to scan all books in existence. They had started it, built the machines to do it, and were making good progress... until copyright hit them. BUT if they're not making the full text publicly accessible, and are "only" training AI on it, who knows whether that would still be a problem. It's definitely a vast treasure trove of information, often with citations, and (presumably) hyper-linkable sources.


I wonder if we will see (or are already seeing) the XR/smart glasses space heat up. Eventually it seems like a great way to generate and hoover up massive amounts of fresh training data.


Robots can acquire data on their own (hopefully not via human dissection)


This is indeed what I thought he was saying: AI needs to learn dynamically; just training on static datasets is no longer enough to advance. So continuous learning is the future, and the best source of data for continuous learning is people themselves. I don't know what form that might take: instrumenting lots of people with sensors? Robots interacting with people? Self-driving cars learning from human drivers (already happening)? Ingesting all video from surveillance cameras? Whatever form the input data takes, continuous learning would be an advance in high-level AI. There's been work on it over the decades; I'm not sure how that work might relate to recent LLM work.


Perhaps these models will be more personalized and there will be more personal data collection.

I am currently building a platform for heavy personal data collection, including a keylogger, logging of mouse positions and window focus, screenshots à la Recall, open browser tabs, and much more. My goal is to gather data now that may become useful later. It's mainly for personal use, but I'd be surprised if e.g. iPhones weren't headed in the same direction.


> There are massive proprietary datasets out there which people avoid using for training due to copyright concerns.

The main legal concern is their unwillingness to pay to access these datasets.


Yup, there's also a huge amount of copyright-free, public domain content on the Internet which just has to be transcribed, and would provide plenty of valuable training for an LLM on all sorts of varied language use. (Then you could use RAG over some trusted set of data to provide the bare "facts" that the LLM is supposed to be talking about.) But guess what: writing down that content accurately from scans costs money (and no, existing OCR is nowhere near good enough), so the job is left to purely volunteer efforts.


Not sure if this was a good example. Getty already licenses its images to Nvidia.

And they already have a generative image service... I believe it's powered by an Nvidia model.


I always suspected that bots on Reddit were used to gain karma and then eventually sell the account, but maybe they're also being used for some kind of RLHF.


> What will replace Internet data for training?

It means unlimited scaling with Transformer LLMs is over. They need a new architecture that scales better. Internet data respawns when you click [New Game...]; the oil analogy is an analogy and not a fact. But anyway, the total amount available in a single playthrough is finite, so combustion efficiency matters.


> just as oil is a finite resource, the internet contains a finite amount of human-generated content.

I guess now they’re being explicit about the blatantly extractive nature of these businesses and their models.


> What will replace Internet data for training? Curated synthetic datasets?

Enter Neuralink


Really not sure what you mean by this, could you explain?


AI can just suck up the contents of people's brains for training data.


You need to go back to Twitter with low-quality posts like this.


Yeah, people will go crazy for GPT-o2 trained on the readings of sensors "barely embedded" in the brains of tortured monkeys, for sure.

EDIT: This comment may have been a bit too sassy. I get the thought behind the original comment, but I personally question the direction and premise of the Neuralink project, and I know I am not alone in that regard. That being said, taking a step back, there are for sure plenty of rich sources of non-text multimodal data.



