Hacker News

Does anyone know of a scraper that uses LLMs/natural language to build a deterministic, robust script that I can use to scrape the same site in the future? All of the natural language extractors I’ve seen so far need an LLM every time, but that seems unnecessary…


llm-scraper [1] does a decent job but it's still a bit fragile. The biggest problem I have is all the React CSS-in-JS libraries that use hashes in their class names, which the LLM isn't smart enough to ignore.

[1] https://github.com/mishushakov/llm-scraper
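
One way to work around the hashed class names is to strip them before building selectors (or before handing the markup to the LLM). The snippet below is only a sketch of that idea and is not part of llm-scraper. The regex patterns and the `stable_classes` and `selector_for` helpers are hypothetical, based on the assumption that generated names look like `sc-...`, `css-...`, or `jss123`. It will miss unprefixed hashes and may wrongly drop a hand-written class such as `css-grid`.

```python
import re

# Assumed patterns for machine-generated class names; illustrative, not exhaustive.
GENERATED_CLASS = re.compile(
    r"^(?:sc-[A-Za-z0-9]+"   # styled-components style prefix (assumption)
    r"|css-[A-Za-z0-9]+"     # Emotion style prefix (assumption)
    r"|jss\d+)$"             # JSS style counter (assumption)
)


def stable_classes(class_attr: str) -> list[str]:
    """Return only the class names that do not look machine-generated."""
    return [c for c in class_attr.split() if not GENERATED_CLASS.match(c)]


def selector_for(tag: str, class_attr: str) -> str:
    """Build a CSS selector from the tag plus whatever stable classes remain."""
    return tag + "".join("." + c for c in stable_classes(class_attr))
```

If no stable class survives, `selector_for` falls back to the bare tag name, so in practice you would probably combine it with attributes or text content to keep the selector specific enough.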


What have you had success doing with this? Curious to test it.


I mostly use it to aggregate event calendars for all the concert/sport/etc. venues, meetups, and clubs in my area, and to do some other scraping tasks. I host a little wrapper around llm-scraper on a DigitalOcean droplet that I call from Val.town scripts.

I only check most places once a week, so I use the LLM to do the scraping. There are a few cases where I have to scrape thousands of pages very frequently, though, so for those I use the more deterministic script it generates instead.
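
A rough sketch of how that split could be wired up, assuming you have some function that asks the LLM to write an extractor from a sample page. Everything here is hypothetical: `CachedScraper`, `looks_valid`, the `generate` callback, and the `title`/`date` fields are made-up names for illustration. The "regenerate when the output fails validation" rule is also my own assumption, not something llm-scraper does for you.

```python
from typing import Callable, Optional

Record = dict
Extractor = Callable[[str], list]  # takes HTML, returns a list of records


def looks_valid(records: list, required=("title", "date")) -> bool:
    """Cheap sanity check: at least one record, and every required field is non-empty."""
    return bool(records) and all(all(r.get(k) for k in required) for r in records)


class CachedScraper:
    """Reuse a generated extractor; only call the (expensive) generator when it breaks."""

    def __init__(self, generate: Callable[[str], Extractor]):
        self._generate = generate            # hypothetical: LLM writes an extractor
        self._extractor: Optional[Extractor] = None
        self.generations = 0                 # how many times the generator was called

    def scrape(self, html: str) -> list:
        if self._extractor is not None:
            try:
                records = self._extractor(html)
            except Exception:
                records = []                 # treat a crash as a stale extractor
            if looks_valid(records):
                return records
        # No extractor yet, or the cached one no longer fits the page: regenerate.
        self._extractor = self._generate(html)
        self.generations += 1
        return self._extractor(html)
```

The high-frequency pages only hit the cached extractor, and the LLM is called again only when the site layout changes enough that validation fails.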


Oh, I'm interested in doing something similar. Is it hard to do?


Great, thanks!


Nice! Thanks!


We’ve built one internally using browser-use to generate Playwright code.

Works ok. Not as automated as I’d like.


They are all quite bad.



