A good way to understand why cosine similarity is so common in NLP is to think in terms of a keyword search. A bag-of-words vector represents a document as a sparse vector of its word counts; counting how many times the words of some query appear in the document is just the dot product of the query vector with the document vector, and normalizing for length gives you cosine similarity. If you have word embedding vectors instead of discrete words, you can play the same game, except that the “match” between a query word and a document word is now the similarity of their embeddings rather than an exact 0/1. Finally, LLMs give sentence embeddings as weighted sums of contextual word vectors, so cosine similarity between sentence embeddings is all just fuzzy word counting again.
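Here is a minimal sketch of that view, assuming a toy vocabulary and random unit vectors as stand-in word embeddings (nothing here comes from a real model): the dot product of a query count vector with a document count vector is exactly the keyword-hit count, normalizing it gives cosine similarity, and averaging word vectors into a sentence embedding makes the sentence-level cosine a sum of pairwise word similarities.

```python
import numpy as np

# Toy vocabulary and documents; everything below is a made-up illustration,
# not the output of any particular model or library.
vocab = ["cat", "dog", "sat", "mat", "the", "on"]
word_index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(tokens):
    """Word-count vector over the toy vocabulary."""
    v = np.zeros(len(vocab))
    for t in tokens:
        if t in word_index:
            v[word_index[t]] += 1
    return v

doc = bag_of_words("the cat sat on the mat".split())
query = bag_of_words("cat mat".split())

# Dot product = number of occurrences of query words in the document (here 2).
hits = query @ doc

# Normalizing both vectors for length turns the raw count into cosine similarity.
cosine_bow = hits / (np.linalg.norm(query) * np.linalg.norm(doc))

# With embeddings, the exact 0/1 word match becomes a soft match: random unit
# vectors stand in for real word embeddings here.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in vocab}
emb = {w: v / np.linalg.norm(v) for w, v in emb.items()}

def sentence_embedding(tokens):
    """Unweighted mean of word vectors (a stand-in for an LLM's weighted sum)."""
    vecs = np.array([emb[t] for t in tokens if t in emb])
    return vecs.mean(axis=0)

s1 = sentence_embedding("the cat sat on the mat".split())
s2 = sentence_embedding("the dog sat on the mat".split())

# The dot product of two averaged embeddings expands into the average of all
# pairwise word-vector similarities -- "fuzzy word counting".
cosine_sent = (s1 @ s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))

print(f"keyword hits: {hits:.0f}, bag-of-words cosine: {cosine_bow:.3f}, "
      f"sentence cosine: {cosine_sent:.3f}")
```

The mean here plays the role of the weighted sum a real sentence encoder would produce; swapping in attention-style weights changes the bookkeeping but not the structure, since the dot product still distributes over the sum of word vectors.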