
I am doing a similar thing for technical documentation: basically, I want to recommend some docs at the end of each document. I wanted to use the same approach you outlined to generate labels for each document and thus easily find some “further reading” to recommend for each.

How big should my sample size be to be representative? It’s a fairly large list of docs across several products and deployment options. I wanted to pick a number of docs per product. Maybe I’ll skip steps 4/5, since once I’ve labelled everything I only need to repeat the process occasionally.
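
Roughly what I have in mind for the recommendation step, assuming each doc already has its set of labels (the doc IDs and labels below are made up, and scoring by label overlap is just one option):

    # Recommend "further reading" by overlap between label sets.
    def jaccard(a: set, b: set) -> float:
        # Share of labels the two docs have in common.
        return len(a & b) / len(a | b) if a | b else 0.0

    def further_reading(doc_id: str, docs: dict[str, set[str]], k: int = 3) -> list[str]:
        # Return up to k other docs whose labels overlap most with doc_id's.
        target = docs[doc_id]
        scored = [(jaccard(target, labels), other)
                  for other, labels in docs.items() if other != doc_id]
        scored.sort(reverse=True)
        return [other for score, other in scored if score > 0][:k]

    docs = {
        "install-k8s": {"kubernetes", "installation", "helm"},
        "upgrade-k8s": {"kubernetes", "upgrade"},
        "install-vm": {"vm", "installation"},
    }
    print(further_reading("install-k8s", docs))  # e.g. ['upgrade-k8s', 'install-vm']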



If you're just generating labels from existing documents, you don't need that many data points, but the LLM may hallucinate labels if you have too few documents relative to the number of labels you want.

For training the model downstream, the main constraint on dataset size is how many distinct labels you want for your use case. The rules of thumb are:

a) ensuring that each label has a few samples

b) ensuring you have at least N^2 data points in total for N labels, to avoid issues akin to the curse of dimensionality (both checks are sketched below)
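
A quick sanity check for both (the minimum of 5 samples per label below is just a placeholder for "a few"):

    from collections import Counter

    def check_dataset(labels: list[str], min_per_label: int = 5) -> None:
        # labels: one label string per data point.
        counts = Counter(labels)
        n_labels = len(counts)
        total = len(labels)
        # (a) every label should have at least a few samples
        sparse = [lab for lab, c in counts.items() if c < min_per_label]
        if sparse:
            print(f"Under-represented labels: {sparse}")
        # (b) at least N^2 data points in total for N labels
        if total < n_labels ** 2:
            print(f"Only {total} samples for {n_labels} labels; "
                  f"rule of thumb suggests at least {n_labels ** 2}.")

    check_dataset(["api", "api", "billing", "api", "billing", "sso"])
    # flags all three labels as under-represented, and 6 < 9 total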



