From the abstract I get the feeling these techniques are useful when you don’t have access to the corpus, e.g. when you download some open source weights but the corpus itself is secret. Otherwise I don’t understand why you wouldn’t just compute a histogram over the tokens in (a statistical sample of) the corpus.
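To be concrete about what I mean by that baseline, here's a rough sketch (assuming a Hugging Face tokenizer; the "gpt2" model name and the corpus_sample_texts variable are just placeholders):

```python
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative; use the model's own tokenizer

def rare_tokens(texts, threshold=0):
    """Count how often each vocab id actually occurs in a corpus sample and
    return the (id, token) pairs seen at most `threshold` times."""
    counts = Counter()
    for text in texts:
        counts.update(tokenizer.encode(text))
    return [
        (tok_id, tokenizer.convert_ids_to_tokens(tok_id))
        for tok_id in range(len(tokenizer))
        if counts[tok_id] <= threshold
    ]

# suspicious = rare_tokens(corpus_sample_texts)  # placeholder: your statistical sample
```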
The paper mentions some reasons why these quick-fix ideas are not as simple as they sound. For example, many rare tokens are “intermediate” merges inside the BPE algorithm: shorter prefixes of longer words. The long word is common, but its earlier, intermediate merge rarely appears on its own.
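Here's a crude way to see that effect for yourself (a sketch only, assuming the same kind of GPT-2 style tokenizer as in the sibling comment; only_intermediate is a made-up heuristic, not anything from the paper):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative choice
vocab = tokenizer.get_vocab()                      # token string -> id

def only_intermediate(token: str) -> bool:
    """True if `token` is a strict prefix of longer vocab entries but never
    survives as a standalone token when those entries' surface text is
    re-encoded, i.e. it only exists as an intermediate step of the merges."""
    tok_id = vocab[token]
    longer = [t for t in vocab if t != token and t.startswith(token)]
    if not longer:
        return False
    return all(
        tok_id not in tokenizer.encode(tokenizer.convert_tokens_to_string([t]))
        for t in longer
    )
```

A naive frequency histogram over the corpus would flag such tokens as "rare" even though the longer words they get merged into are perfectly common.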
Are there any specific reasons for using BPE rather than Unigram in LLMs? I've been trying to understand the impact of the tokenization algorithm, and Unigram has been reported to be a better alternative (e.g., Byte Pair Encoding is Suboptimal for Language Model Pretraining: https://arxiv.org/abs/2004.03720). I understand that the Unigram training process should eliminate under-trained tokens if it is trained on the same data as the LLM itself.
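(For anyone who wants to poke at the difference themselves, SentencePiece exposes both algorithms behind the same interface; a rough sketch below, where the corpus path, vocab size, and test sentence are all placeholders.)

```python
import sentencepiece as spm

# Train both tokenizers on the same text you'd pretrain the LLM on.
for algo in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus_sample.txt",   # placeholder path
        model_prefix=f"tok_{algo}",
        vocab_size=32000,            # placeholder size
        model_type=algo,
    )

bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")
print(bpe.encode("some held-out sentence", out_type=str))
print(uni.encode("some held-out sentence", out_type=str))
```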
This is oxymoronic; the corpus is the "source". Yet this usage of "open source" is widespread. Maybe we should start calling such models by their rightful name, "freeware".
Freeware versus open source is a good point. But freeware typically can't be modified by the recipient, whereas downloadable models and open source code can. So I think there's still a need for a different term, neither open source nor freeware...
I would argue that the kind of modification you can do to a big blob of weights is more akin to fiddling with a binary in a hex editor than to modifying source code. It is not the "preferred form" for the source, and you cannot cleanly and easily do things like modify its "alignment" - that is why people speak of "jailbreaking" these models. So I still think "freeware" works as a term.
No, the corpus is not the source. It's data. So we can have concepts of open models, open source, and open data. Any combination of these can be chosen independently.
(Open data and open model but not open source is a bit weird, but not unthinkable: there may be unreleased training tricks or specialized infrastructure that make releasing the source code hard or undesirable.)