
How did you construct the embedding? Sum of individual token vectors, or something more sophisticated?


Sentence embedding models like all-MiniLM-L6-v2 [1] or bge-m3 [2].

[1] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...

[2] https://huggingface.co/BAAI/bge-m3
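
For anyone curious what that looks like in practice, here's a minimal sketch using the sentence-transformers library with the first model above (the example strings are just placeholders):

    # Minimal sketch: embed whole strings with a sentence embedding model.
    # Uses all-MiniLM-L6-v2 from [1]; bge-m3 from [2] works the same way.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    texts = [
        "How did you construct the embedding?",
        "Sum of individual token vectors, or something more sophisticated?",
    ]

    # encode() returns one fixed-size vector per input string,
    # not a sum of per-token vectors.
    embeddings = model.encode(texts, normalize_embeddings=True)
    print(embeddings.shape)  # (2, 384) for all-MiniLM-L6-v2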

In my recent project I used OpenAI's embedding model for that because of its convenient API and low cost.
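
A minimal sketch of that with the current OpenAI Python client (openai >= 1.0); the exact model name below is my assumption, the parent doesn't say which one they used:

    # Sketch only: one embedding per input string via the OpenAI API.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model name; swap in your own
        input="Entire text blob to embed",
    )
    vector = resp.data[0].embedding
    print(len(vector))  # 1536 dimensions for text-embedding-3-small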


Modern embedding models (particularly those with context windows of 2048+ tokens) let you YOLO and just plop the entire text blob in, and you still get meaningful vectors.

Formatting the input text to have a consistent schema is optional but recommended to get better comparisons between vectors.
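
To illustrate the consistent-schema point, a hedged sketch where every record is rendered through the same template before embedding; the field names and template here are made up for the example:

    # Render each record into an identical "schema" string before embedding,
    # so the resulting vectors compare like with like.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def to_schema_text(record: dict) -> str:
        # Same labels and field order for every record
        return (
            f"Title: {record['title']}\n"
            f"Tags: {', '.join(record['tags'])}\n"
            f"Body: {record['body']}"
        )

    records = [
        {"title": "Intro to embeddings", "tags": ["nlp"], "body": "Vectors for text."},
        {"title": "Vector search", "tags": ["search"], "body": "Finding nearest neighbors."},
    ]

    embeddings = model.encode(
        [to_schema_text(r) for r in records],
        normalize_embeddings=True,
    )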


Sentence embedding models are great for this type of thing.



