
It's just a long-winded way of saying "tied embeddings" [1]. IIRC, GPT-2, BERT, Gemma 2, Gemma 3, some of the smaller Qwen models, and many other architectures use weight-tied input/output embeddings.

[1]: https://arxiv.org/abs/1608.05859
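For reference, here's a minimal PyTorch sketch of what weight tying looks like (the class, names, and dimensions are hypothetical, just to illustrate the technique, not taken from any of the models named above):

    import torch.nn as nn

    class TiedLM(nn.Module):
        # Hypothetical toy model: real architectures run transformer
        # blocks between the embedding and the output head.
        def __init__(self, vocab_size=32000, d_model=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
            # Tie the output projection to the input embedding:
            # both modules now share one (vocab_size, d_model) parameter
            # tensor, saving vocab_size * d_model parameters.
            self.lm_head.weight = self.embed.weight

        def forward(self, token_ids):
            h = self.embed(token_ids)    # (batch, seq, d_model)
            # ... transformer layers would run here ...
            return self.lm_head(h)       # logits: (batch, seq, vocab_size)

The shapes line up because nn.Linear stores its weight as (out_features, in_features), i.e. (vocab_size, d_model), which is exactly the shape of the embedding matrix.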
