When I write a scraper, I can't possibly account for a different API on every single website! But I can write HTML parsing that works universally, so it's better to find a way to cache your website's HTML so you're not bombarded than to write an API and hope companies will spend time implementing it!
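From the scraper's side, the same goal (not bombarding the site) can at least be approximated by caching what you fetch. A rough sketch in Python, assuming the requests library; the cache directory and the lack of any expiry policy are purely illustrative:

```python
# Cache fetched HTML on disk so repeat scraper runs re-read the cache
# instead of hitting the site again. No expiry logic; illustrative only.
import hashlib
import pathlib
import requests

CACHE_DIR = pathlib.Path("html_cache")
CACHE_DIR.mkdir(exist_ok=True)

def fetch_cached(url: str) -> str:
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = requests.get(url, timeout=10).text
    path.write_text(html, encoding="utf-8")
    return html
```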
If you are writing a scraper, it behooves you to understand the website you are scraping. WordPress websites, like the one the author is discussing, provide such an API out of the box. And, like most WordPress features, it is hardly ever disabled or altered by website administrators.
Identifying a WordPress website from its HTML is also very easy; anybody experienced in writing web scrapers has encountered it many times.
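To make that concrete, here is a rough sketch in Python, assuming the requests library. The fingerprints and the /wp-json/wp/v2/posts endpoint are standard WordPress defaults, though any given site may have disabled or moved them:

```python
# Detect a WordPress site from its HTML and, if found, prefer the
# built-in REST API over scraping rendered pages.
import requests

def looks_like_wordpress(html: str) -> bool:
    # Common WordPress fingerprints that show up in page markup.
    markers = ("/wp-content/", "/wp-includes/", 'name="generator" content="WordPress')
    return any(marker in html for marker in markers)

def fetch_posts(base_url: str):
    html = requests.get(base_url, timeout=10).text
    if looks_like_wordpress(html):
        # WordPress exposes a REST API at /wp-json/ unless an admin disables it.
        resp = requests.get(f"{base_url}/wp-json/wp/v2/posts", timeout=10)
        if resp.ok:
            return resp.json()
    return None  # fall back to parsing the HTML directly
```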
> If you are writing a scraper it behooves you to understand the website that you are scraping.
Isn't that what semantic markup is for? Headings (h1 through h6), article, nav, and footer elements (and even microdata) all help both machines and humans understand which parts of the content to care about in a given context.
Why treat certain CMSes differently when we already have HTML as a common standard format?
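As an illustration of leaning on semantic markup rather than CMS-specific details, a minimal sketch using BeautifulSoup; the element choices are just the obvious ones, not a full readability algorithm:

```python
# Extract the main content of a page using semantic elements only.
from bs4 import BeautifulSoup

def extract_main_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop chrome that semantic markup explicitly labels as non-content.
    for tag in soup.find_all(["nav", "footer", "aside", "header"]):
        tag.decompose()
    # Prefer <article> or <main>; fall back to the whole body.
    container = soup.find("article") or soup.find("main") or soup.body
    return container.get_text(separator="\n", strip=True) if container else ""
```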
What if your target isn't any WordPress website, but any website?
It's simply not possible to carefully craft a scraper for every website on the entire internet.
Whether one should scrape all possible websites is a separate question. But if that is the goal, the only practical way is to consume the HTML directly.
Web scrapers are typically custom-written to fit the site they scrape. Very few motor vehicles are commissioned for a specific purchaser, and fewer still are built to that purchaser's design.
Can you, though? Even big companies rarely manage to do so. As a concrete example, neither Apple nor Mozilla apparently has the resources to build a reader mode that can reliably find the correct content elements in arbitrary HTML pages.