The paper mentions some reasons why these quick-fix ideas are not as simple as they sound. For example, many rare tokens are “intermediate” merges inside the BPE algorithm: shorter prefixes of longer words. The long word is common, but its earlier, intermediate merge rarely appears on its own.
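To make that concrete, here is a small toy illustration (mine, not from the paper): a character-level BPE trained on a made-up corpus where one long word dominates. The learned vocabulary ends up containing prefixes like "tok" or "tokeniz" that were needed as stepping stones toward the full word but almost never survive in the final segmentation.

```python
# Toy BPE sketch: shows that many vocab entries are "intermediate" merges
# that rarely (or never) appear on their own in the final tokenization.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Made-up corpus: the long word is frequent, so BPE builds it up merge by merge.
corpus = ["tokenization"] * 50 + ["token", "tokens", "data", "model"] * 5
words = Counter(tuple(w) for w in corpus)

vocab = {ch for w in words for ch in w}
for _ in range(12):  # learn 12 merges
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab.add(best[0] + best[1])
    words = merge_pair(words, best)

# How often does each vocab entry actually survive in the final segmentation?
surface = Counter()
for word, freq in words.items():
    for sym in word:
        surface[sym] += freq

never_seen = sorted(t for t in vocab if len(t) > 1 and surface[t] == 0)
print("merges that never appear on their own:", never_seen)
```

Running this prints the stepping-stone prefixes of "tokenization": they are in the vocabulary, but the corpus itself gives the model essentially no occurrences of them as standalone tokens.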
Are there any specific reasons for using BPE rather than Unigram in LLMs? I've been trying to understand the impact of the tokenization algorithm, and Unigram has been reported to be a better alternative (e.g., "Byte Pair Encoding is Suboptimal for Language Model Pretraining": https://arxiv.org/abs/2004.03720). My understanding is that the Unigram training process should eliminate under-trained tokens if the tokenizer is trained on the same data as the LLM itself.
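To be concrete about what I mean: here is a rough sketch (my own, with a placeholder corpus file and vocab size) that trains both tokenizer types on the same text with sentencepiece, then re-encodes that text and counts how many vocabulary pieces never surface. That count is only a crude proxy for under-trained tokens (special pieces like <unk> will always show up as unused), but it lets you compare the two algorithms on equal footing.

```python
# Hedged comparison sketch: BPE vs. Unigram trained on the same corpus.
# "corpus.txt" and vocab_size=8000 are placeholder assumptions.
import sentencepiece as spm
from collections import Counter

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,
        model_type=model_type,
    )
    sp = spm.SentencePieceProcessor(model_file=f"tok_{model_type}.model")

    counts = Counter()
    with open("corpus.txt", encoding="utf-8") as f:
        for line in f:
            counts.update(sp.encode(line, out_type=str))

    vocab = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
    unused = [p for p in vocab if counts[p] == 0]
    print(f"{model_type}: {len(unused)} / {len(vocab)} pieces never appear "
          f"when re-encoding the training corpus")
```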