
I'm more of a vision person and haven't looked much into NLP transformers, but is this because the attention is masked so that each query can only look at keys/values from its own past? So when we are at token #5, token #3's query cannot attend to token #4's info? And hence the previously computed attention values and activations stay the same and can be cached, because they would be the same in the new forward pass anyway?


Yep, that’s right!

If you want to be precise, there are “autoregressive transformers” and “bidirectional transformers”. Bidirectional is a lot more common in vision. In language models you do see bidirectional models like BERT, but autoregressive is dominant.
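
Here's a minimal NumPy sketch of the argument (the function and variable names are just illustrative, not from any particular library): with a causal mask, appending token #5 leaves the attention outputs at positions 1–4 unchanged, which is exactly why their keys/values can be cached instead of recomputed.

  import numpy as np

  def causal_attention(x, Wq, Wk, Wv):
      # Single-head self-attention with a causal (autoregressive) mask.
      q, k, v = x @ Wq, x @ Wk, x @ Wv
      d = q.shape[-1]
      scores = q @ k.T / np.sqrt(d)                        # (T, T)
      mask = np.triu(np.ones_like(scores), k=1).astype(bool)
      scores[mask] = -np.inf                               # query t only sees keys <= t
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)
      return weights @ v

  rng = np.random.default_rng(0)
  d_model = 8
  Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

  tokens_4 = rng.standard_normal((4, d_model))             # sequence of 4 tokens
  tokens_5 = np.vstack([tokens_4,                          # same 4 tokens plus token #5
                        rng.standard_normal((1, d_model))])

  out_4 = causal_attention(tokens_4, Wq, Wk, Wv)
  out_5 = causal_attention(tokens_5, Wq, Wk, Wv)

  # Outputs at the first 4 positions are identical, so their keys/values
  # (and everything downstream of them) can be cached rather than recomputed.
  assert np.allclose(out_4, out_5[:4])

A real KV cache would store k and v per layer and only compute the new attention row for the latest token; the assert here is just the correctness argument for doing that.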



