
When I write the scraper, I literally can't account for the API of every single website! BUT I can write code to parse HTML universally, so it is better to find a way to cache your website's HTML so it isn't bombarded, rather than write an API and hope companies will spend time implementing it!
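
A minimal sketch of what that can look like from the scraper's side, assuming the site sends ETag headers; the URL, user agent, and cache file name here are made up for illustration:

    # Sketch of a scraper honoring HTTP caching via conditional GET,
    # assuming the site returns ETag headers. Paths/URLs are illustrative.
    import json
    import os

    import requests

    CACHE_FILE = "etag_cache.json"  # hypothetical local cache of ETags + bodies


    def fetch_cached(url: str) -> str:
        cache = {}
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as f:
                cache = json.load(f)

        headers = {"User-Agent": "ExampleScraper/1.0"}
        entry = cache.get(url)
        if entry and entry.get("etag"):
            headers["If-None-Match"] = entry["etag"]

        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 304 and entry:
            return entry["body"]  # unchanged since last fetch; reuse cached HTML

        resp.raise_for_status()
        cache[url] = {"etag": resp.headers.get("ETag"), "body": resp.text}
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)
        return resp.text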




If you are writing a scraper, it behooves you to understand the website that you are scraping. WordPress websites, like the one the author is discussing, provide such an API out of the box. And like most WordPress features, it is hardly ever disabled or altered by website administrators.

And identifying a WordPress website from its HTML is very easy; anybody experienced in writing web scrapers has encountered it many times.
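
For example, a rough sketch of detecting WordPress from the served HTML and then pulling posts from the stock REST API (available at /wp-json/wp/v2/posts on most unmodified installs); the site URL and user agent are placeholders:

    # Sketch: detect WordPress from served HTML, then use its REST API.
    import requests


    def looks_like_wordpress(html: str) -> bool:
        # Stock installs usually leave the generator meta tag and wp-content paths in place.
        return "wp-content/" in html or 'content="WordPress' in html


    def fetch_posts(base_url: str, per_page: int = 10) -> list[dict]:
        resp = requests.get(
            f"{base_url}/wp-json/wp/v2/posts",
            params={"per_page": per_page},
            headers={"User-Agent": "ExampleScraper/1.0"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()  # each post includes title.rendered and content.rendered


    html = requests.get("https://example.com/", timeout=30).text
    if looks_like_wordpress(html):
        for post in fetch_posts("https://example.com"):
            print(post["title"]["rendered"])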


> If you are writing a scraper it behooves you to understand the website that you are scraping.

Isn't that what semantic markup is for? h1 through h6, article, nav, footer (and even microdata) all help both machines and humans understand which parts of the content to care about in a given context.

Why treat certain CMSes differently when we have HTML as a common standard format?
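
As a hedged illustration, a sketch of leaning on semantic markup instead of per-site rules, assuming the page actually uses article, nav, and footer as intended (BeautifulSoup is just one possible parser):

    # Sketch: extract the main content using semantic elements only.
    from bs4 import BeautifulSoup  # pip install beautifulsoup4


    def extract_main_text(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        # Prefer the <article> element; fall back to <main>, then <body>.
        root = soup.find("article") or soup.find("main") or soup.body
        if root is None:
            return ""
        # Strip the chrome that semantic markup already labels for us.
        for tag in root.find_all(["nav", "footer", "aside", "script", "style"]):
            tag.decompose()
        return root.get_text(separator="\n", strip=True)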


What if your target isn't any WordPress website, but any website?

It's simply not possible to carefully craft a scraper for every website on the entire internet.

Whether or not one should scrape all possible websites is a separate question. But if that is one's goal, the only practical way is to consume the HTML directly.


If you are designing a car, it behooves you to understand the driveway of your car's purchaser.

Web scrapers are typically custom-written to fit the site they are scraping. Very few motor vehicles are commissioned for a specific purchaser, and fewer still are built to that purchaser's design.

I have a hard time believing that the scrapers that are feeding data into the big AI companies are custom-written on a per-page basis.

WordPress is common enough that it's worth special-casing.

WordPress, MediaWiki, and a few other CMSes are worth implementing special support for just so scraping doesn't take so long!
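
For instance, a sketch of special-casing MediaWiki by asking its API for clean page text instead of scraping rendered HTML; this assumes the wiki enables the TextExtracts extension (Wikipedia does), and the endpoint and title are illustrative:

    # Sketch: pull plain-text page content from a MediaWiki API.
    import requests


    def mediawiki_extract(api_url: str, title: str) -> str:
        resp = requests.get(
            api_url,
            params={
                "action": "query",
                "prop": "extracts",
                "explaintext": 1,
                "titles": title,
                "format": "json",
            },
            headers={"User-Agent": "ExampleScraper/1.0"},
            timeout=30,
        )
        resp.raise_for_status()
        pages = resp.json()["query"]["pages"]
        return next(iter(pages.values())).get("extract", "")


    print(mediawiki_extract("https://en.wikipedia.org/w/api.php", "Web scraping")[:500])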


> BUT I can write how to parse HTML universally

Can you though? Because even big companies rarely manage to do so - as a concrete example, neither Apple nor Mozilla apparently has sufficient resources to produce a reader mode that can reliably find the correct content elements in arbitrary HTML pages.
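
To make the difficulty concrete, here is a sketch using one of the Readability ports (readability-lxml); even these battle-tested heuristics only return a best guess at the main content, and the URL is a placeholder:

    # Sketch: heuristic "reader mode" extraction via the readability-lxml port.
    import requests
    from readability import Document  # pip install readability-lxml

    html = requests.get("https://example.com/some-article", timeout=30).text
    doc = Document(html)
    print(doc.title())          # best-guess title
    print(doc.summary()[:500])  # best-guess main content, as cleaned HTML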


Oh, so it is my responsibility to work around YOUR preferred way of doing things, when I get zero benefit from it?

Maybe I should just find your scraper's IP range and start poisoning it with junk instead?
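
A tongue-in-cheek sketch of that idea, with a completely made-up scraper IP range and a toy Flask app standing in for a real site:

    # Sketch: serve junk to a hypothetical scraper IP range, real pages to everyone else.
    import ipaddress
    import random
    import string

    from flask import Flask, request  # pip install flask

    app = Flask(__name__)
    SCRAPER_RANGE = ipaddress.ip_network("203.0.113.0/24")  # hypothetical range


    @app.route("/")
    def index():
        addr = ipaddress.ip_address(request.remote_addr or "127.0.0.1")
        if addr in SCRAPER_RANGE:
            # Return plausible-looking garbage instead of the real page.
            junk = " ".join(
                "".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(200)
            )
            return f"<html><body><p>{junk}</p></body></html>"
        return "<html><body><p>Real content for real visitors.</p></body></html>"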


> so it is better to find a way to cache your website's HTML so you're not bombarded

Of course, scrapers should identify themselves and then respect robots.txt.
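
A minimal sketch of a well-behaved scraper that identifies itself and checks robots.txt via Python's standard urllib.robotparser; the user agent string and URLs are placeholders:

    # Sketch: identify the bot and respect robots.txt before fetching.
    from urllib import robotparser

    import requests

    USER_AGENT = "ExampleScraper/1.0 (+https://example.com/bot-info)"  # illustrative

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/some/page"
    if rp.can_fetch(USER_AGENT, url):
        html = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30).text
    else:
        print("robots.txt disallows fetching", url)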


Why is figuring out which UI elements to capture so much harder than just looking at the network activity to figure out which API calls you need?


