From the article linked from this blog post: "Enabling computers to understand language remains one of the hardest problems in artificial intelligence."
I worked on this task for a year, and it doesn't work very well: in embedding space, relatedness, synonymy, and antonymy are mixed up and require pairwise thresholding. You can probably get to 90% this way, but not to 99%. Better to use a cross-entropy approach.
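A minimal sketch of the conflation problem (the model name and word pairs here are my own illustration, not from the original comment; exact scores will vary by model):

```python
# Cosine similarity in embedding space tends to score synonyms, antonyms,
# and merely related words in the same band, so no single threshold
# separates them cleanly.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

pairs = [
    ("hot", "warm"),     # near-synonyms
    ("hot", "cold"),     # antonyms
    ("hot", "weather"),  # merely related
]
for a, b in pairs:
    ea, eb = model.encode([a, b], normalize_embeddings=True)
    print(f"{a} / {b}: cosine = {float(ea @ eb):.3f}")
```

All three pairs typically land in a similar similarity range, which is why you end up tuning thresholds per pair rather than setting one global cutoff.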
This is why modern RAG applications return the top-k results: the retriever can't reliably surface the correct snippet as a single top hit, so the hard part, deciding what is useful and what is not, is left to the LLM.
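A toy top-k retrieval sketch (same assumed model as above; the corpus and k are illustrative, not any particular RAG framework's API):

```python
# Retrieve the k nearest snippets instead of trusting rank 1 alone;
# the LLM downstream judges which of them are actually relevant.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

corpus = [
    "The capital of France is Paris.",
    "Paris is known for the Eiffel Tower.",
    "French is a Romance language.",
    "The Louvre holds the Mona Lisa.",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

query = "What is the capital of France?"
query_emb = model.encode(query, normalize_embeddings=True)

# On normalized vectors, cosine similarity is just a dot product.
scores = corpus_emb @ query_emb

k = 3
top_k = np.argsort(scores)[::-1][:k]
for i in top_k:
    print(f"{scores[i]:.3f}  {corpus[i]}")
```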