"open source" means there should be a script that downloads all the training materials and then spins up a pipeline that trains end to end.
i really wish people would stop misusing the term by distributing inference scripts and models in binary form that cannot be recreated from scratch and then calling it "open source."
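to make that concrete, here's a minimal sketch of what such a release would ship (every module, path, and filename below is hypothetical; it's just illustrating the shape of an end-to-end pipeline, not any real release):

```python
# hypothetical reproduce.py -- the bar an actual open-source model release
# would have to clear: every step runnable from published materials.
from pipeline import download, preprocess, train, evaluate  # hypothetical modules

def main() -> None:
    # 1. fetch the exact corpus the published weights were trained on
    corpus = download("https://example.org/training-corpus-manifest.json")

    # 2. apply the same cleaning/tokenization the authors used
    dataset = preprocess(corpus, tokenizer="published-tokenizer.json")

    # 3. run the documented training recipe end to end
    model = train(dataset, config="published-training-config.yaml")

    # 4. confirm the result matches the released weights (within tolerance,
    #    since training is not bit-exact reproducible)
    evaluate(model, reference="released-weights.safetensors")

if __name__ == "__main__":
    main()
```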
They'd have to publish or link to the training data, which is full of copyrighted material. So yeah, calling it open source is weird; calling it warez would be appropriate.
I'd agree, but that's beyond hopelessly idealistic. That sort of approach only helps your competition, who will use it to build a closed product, and doesn't give anything of worth to people who want to actually use the model, because they have no means to train it. Hell, most people can barely scrape up enough hardware to even run inference.
Reproducing models also isn't very ecological when it comes down to it: do we really all need to redo training that takes absurd amounts of power just to prove that it works? At least change the dataset to try to get a better result and provide another data point, but most people don't have the know-how for that anyway.
Funnily enough, Nvidia does try this approach sometimes: they publish cool results with no model in the hope of getting people to buy their rented compute and their latest training-platform-as-a-service...
> I'd agree, but that's beyond hopelessly idealistic. That sort of approach only helps your competition, who will use it to build a closed product
That same argument can be applied to open-source (non-model) software, and is about as true there. It comes down to the business model. If anything, creating a closed-source copy of a piece of FOSS software is easier than copying an AI model, since running a compiler doesn't cost millions of dollars.
it still doesn't sit right. sure, it's different in terms of mutability from, say, compiled software programs, but it still isn't end-to-end reproducible and available for inspection.
these words had meaning long before "model land" became a thing. overloading them is just confusing for everyone.
It's not confusing. No one is really confused except the people upset that the meaning is different in a different context.
On top of that, in many cases a company/group/whoever can't even reproduce the model themselves. There are lots of sources of non-determinism even if folks are doing things in a very buttoned-up manner. And when you are training on trillions of tokens, you are likely training on some awful-sounding stuff - "Facebook trained Llama 4 on Nazi propaganda!" is not a headline they want to see published.
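To illustrate how hard bit-exact reproduction is, here's a minimal sketch of the knobs you'd have to pin down in a PyTorch training run just to chase determinism on a single machine (the seed value is arbitrary); even with all of them set, multi-GPU reductions and kernel scheduling can still make two runs diverge:

```python
import os
import random

import numpy as np
import torch

# Required by cuBLAS for deterministic matmuls on CUDA >= 10.2;
# must be set before any CUDA work happens.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

def seed_everything(seed: int = 1234) -> None:
    """Pin every RNG a typical training loop touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # CPU RNG
    torch.cuda.manual_seed_all(seed)  # all GPU RNGs

seed_everything()

# Error out on any op that has no deterministic kernel, and stop cuDNN
# from benchmarking (which otherwise picks algorithms at runtime).
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

# Even after all of this, floating-point reduction order in multi-GPU
# all-reduces, dataloader worker scheduling, and hardware differences
# can still change the final weights between runs.
```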
i disagree. words matter. the whole point of open source is that anyone can look and see exactly how the sausage is made. that is the point. that is why the word "open" is used.
...and sure, compiling gcc is nondeterministic too, but i can still inspect the complete source it is built from, because it is open source, which means all of the source materials are available for inspection.
The point of open source in software is as you say. It's just not the same thing though. Using words and phrases differently in different fields is common.
I agree that they should say "open weight" instead of "open source" when that's what they mean, but it might take some time for people to understand that it's not the same thing exactly and we should allow some slack for that.
no. truly open source models are wonderful and remarkable things that genuinely move the needle in education, understanding, distributed collaboration, and the advancement of the state of the art. redefining the terminology reduces the incentive to strive for the wonderful goal they represent.
There is a big difference between open source for something like the Linux kernel or gcc, where anyone with a home PC can build it, and any non-trivial LLM, where training takes cloud compute and costs a lot. No hobbyist or educational institution is going to pay for million-dollar training runs, probably not even thousand-dollar ones.
"too big to share." nope. sharing the finished soup base, even if well suited for inclusion in other recipes, is still different from sharing the complete recipe. sharing the complete recipe encourages innovation in soup bases, including bringing the cost down for making them from scratch.
There is an enormous amount of information in the public domain about building models. In fact, once you get into the weeds you'll realize there is too much, and in many cases (not all, but many) the very specific way something was done, or which framework they used, or what hardware configuration they had, was just a function of what they had on hand or had experience with. One could spend a lifetime just trying to reproduce OLMo's work or a lot of the Hugging Face stuff...
You can fine-tune without the original training data, which for a large LLM is typically going to mean using LoRA: keeping the original weights frozen and training small, separate fine-tuning weights on top.
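As a concrete sketch, this is roughly what that looks like with Hugging Face's peft library (the model name and hyperparameters here are illustrative, not from this thread): the base weights stay frozen and only the small low-rank adapter matrices are trained.

```python
# pip install transformers peft
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any causal LM checkpoint works; this name is just an example.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora = LoraConfig(
    r=8,             # rank of the low-rank update matrices
    lora_alpha=16,   # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)

# Base weights are frozen; only the adapter weights (typically well under
# 1% of the total parameters) receive gradients during fine-tuning.
model.print_trainable_parameters()
```

The trained adapter can then be shipped as a small separate file and loaded alongside (or merged into) the original weights, which is why open weights alone are enough for this kind of customization.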
Yeah, but "open weights" never seems to have taken off as a better description, and even if you did have the training data + recipe, the compute cost makes training it yourself totally impractical.
The architecture of these models is no secret; what's withheld is the training data (including for post-training) and the training recipe. So a more practical push might be for models trained only on public training data, which the community could share and potentially contribute to.