
> Scraping things that don't want to be scraped

If all else fails, no website can withstand OCR-based screen scraping. It is slow(er), but fast enough for many use cases.



Assuming that you eventually manage to load the page somehow, which in some edge cases may entail simulating mouse movements and random delays.
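For what it's worth, a rough sketch of what "simulating mouse movements and random delays" could look like. It only generates a curved, slightly jittered path and a list of pauses; actually dispatching the events is left to whatever automation tool you use. The names `mouse_path` and `random_delays`, the control-point offset of 100 pixels, and the delay range are all made up for illustration, not taken from any particular tool.

```python
import random


def mouse_path(start, end, steps=25, jitter=2.0, rng=None):
    """Return (x, y) points from start to end along a curved, jittered path."""
    rng = rng or random.Random()
    (x0, y0), (x1, y1) = start, end
    # Random control point so the path bends instead of being a straight line.
    # The +/-100 pixel offset is an arbitrary illustrative choice.
    cx = (x0 + x1) / 2 + rng.uniform(-100, 100)
    cy = (y0 + y1) / 2 + rng.uniform(-100, 100)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Quadratic Bezier interpolation between start, control point, and end.
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        # Add small noise to intermediate points only, so the endpoints stay exact.
        if 0 < i < steps:
            x += rng.uniform(-jitter, jitter)
            y += rng.uniform(-jitter, jitter)
        points.append((x, y))
    return points


def random_delays(n, low=0.005, high=0.03, rng=None):
    """Return n pause lengths in seconds, each drawn uniformly from [low, high]."""
    rng = rng or random.Random()
    return [rng.uniform(low, high) for _ in range(n)]
```

You would then walk the points, move the pointer to each one with your tool of choice, and sleep for the matching delay in between. Whether this is enough to get past any given bot check is a separate question.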


Agreed. I use the ui.vision extension to simulate native mouse movements.


Have you tried it on a page protected by a Cloudflare captcha?


It's funny, I never seem to hit these infamous Cloudflare captchas. The only impediment I encounter with Cloudflare is that they require plaintext SNI to read their blog, https://blog.cloudflare.com. Unlike on almost all other Cloudflare sites, ESNI will not work there.


I have not had to deal with that, but I have idly thought that it might be easier to pipe the audio version into Google Assistant or something, and see what it comes up with.


It seems to be no problem if you automate a real browser as opposed to a headless browser. I think they test for that.


A browser extension is probably an easier way to extract text than OCR (unless you're targeting a wide range of sites, I suppose).



