
It's just a long-winded way of saying "tied embeddings" [1]. IIRC, GPT-2, BERT, Gemma 2, Gemma 3, some of the smaller Qwen models, and many other architectures use weight-tied input/output embeddings.

[1]: https://arxiv.org/abs/1608.05859
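For reference, here's a minimal PyTorch sketch of what weight tying looks like (the class, names, and dimensions are hypothetical, just to illustrate the technique, not taken from any of the models named above):

    import torch.nn as nn

    class TiedLM(nn.Module):
        # Hypothetical toy model: real architectures run transformer
        # blocks between the embedding and the output head.
        def __init__(self, vocab_size=32000, d_model=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
            # Tie the output projection to the input embedding:
            # both modules now share one (vocab_size, d_model) parameter
            # tensor, saving vocab_size * d_model parameters.
            self.lm_head.weight = self.embed.weight

        def forward(self, token_ids):
            h = self.embed(token_ids)    # (batch, seq, d_model)
            # ... transformer layers would run here ...
            return self.lm_head(h)       # logits: (batch, seq, vocab_size)

The shapes line up because nn.Linear stores its weight as (out_features, in_features), i.e. (vocab_size, d_model), which is exactly the shape of the embedding matrix.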
