
The type of prompt that asks it to invent a new language (e.g. "use only these letters") always fails. I wonder if it has to do with it being a "language" model?


Most large language models these days are trained on "tokens" instead of characters. A token usually consists of multiple characters, which makes character-level tasks extremely difficult to learn. So why use tokens instead of characters in the first place? Because each generated token covers several characters at once, which makes training and text generation cheaper.

OpenAI has this website where you can see how text is decomposed into tokens: https://platform.openai.com/tokenizer
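If you'd rather poke at it programmatically, something like this works - a minimal sketch assuming the open-source tiktoken library (cl100k_base is just one of its publicly listed encodings; other models use different vocabularies):

    import tiktoken

    # cl100k_base is one of the public OpenAI encodings; others exist.
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("Tokenization splits text into subword units.")
    print(ids)                             # a list of integer token ids
    print([enc.decode([i]) for i in ids])  # the text piece behind each id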


How is the set of tokens selected for various LLMs?

My intuition tells me there are important symbolic patterns in different layers of tokens. If they are automatically generated, I'd bet there are interesting insights to be gleaned from the tokenizer itself.


They are automatically generated. The algorithms have a bunch of tricks, but essentially they merge the most frequent token pairs until a desired fixed vocabulary size is reached.

So, for example (looking at GPT-3 tokenizations - you can test them at https://platform.openai.com/tokenizer), "517" is a single token, but "917" is two tokens; and there's no explicit link whatsoever between the token "517" and the tokens "5" and "17" other than what can be learned from data. This works well enough for almost all tasks, but fails in edge cases, like when someone makes up a toy challenge asking how many fives appear in a large number.
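You can check this kind of thing locally, too. A rough sketch assuming tiktoken's r50k_base encoding (the GPT-2/GPT-3-era vocabulary) - the exact splits depend on which vocabulary you load:

    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")  # GPT-2/GPT-3-era vocabulary
    for s in ["517", "917"]:
        ids = enc.encode(s)
        # prints how many tokens each string becomes and what the pieces are
        print(s, "->", len(ids), "token(s):", [enc.decode([i]) for i in ids])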


The token set (vocabulary) is usually generated by running Byte Pair Encoding (BPE) on a corpus that you think represents your training set well.

BPE starts with a vocabulary consisting of single-character tokens. Then the most frequent pair of adjacent tokens is merged into a single new token and added to the vocabulary, and all occurrences of that pair in the corpus are replaced with the merged token. This process is repeated until the vocabulary is as large as you want it to be.

https://en.m.wikipedia.org/wiki/Byte_pair_encoding
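To make the merge loop above concrete, here's a toy sketch of BPE training in Python. The whitespace pre-tokenization and the tiny corpus are made up for illustration; real implementations (SentencePiece, tiktoken's vocabularies, etc.) add many more tricks:

    from collections import Counter

    def pair_counts(words):
        # Count adjacent symbol pairs across the corpus, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge(words, pair):
        # Replace every occurrence of `pair` with the concatenated symbol.
        out = {}
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            out[tuple(merged)] = freq
        return out

    def train_bpe(corpus, vocab_size):
        # Start from single characters, then repeatedly merge the most frequent pair.
        words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
        vocab = {ch for w in words for ch in w}
        merges = []
        while len(vocab) < vocab_size:
            pairs = pair_counts(words)
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            words = merge(words, best)
            merges.append(best)
            vocab.add(best[0] + best[1])
        return vocab, merges

    vocab, merges = train_bpe("low lower lowest new newer newest", 20)
    print(sorted(vocab))
    print(merges)

Tokenizing new text afterwards is just a matter of replaying the learned merges in order.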


CJK characters are almost always split into multiple tokens per individual character. I'm not too familiar with Unicode mappings, so it's interesting that the outputs are still very coherent.
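I believe these tokenizers operate on UTF-8 bytes, and a CJK character is typically 3 bytes, so rarer characters end up split across raw byte tokens that the model learns to recombine. A quick way to look at it, again assuming tiktoken (results vary a lot by vocabulary and by how common the character is):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # one of the public OpenAI encodings
    for ch in "日本語":  # sample characters; splits differ per character and vocabulary
        ids = enc.encode(ch)
        print(ch, "->", len(ch.encode("utf-8")), "UTF-8 bytes,", len(ids), "token(s)")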



