As the owner of a large website, I don't care what you think. I block by default and whitelist when I decide it's in my interest.
If you don't think this is reasonable, chances are you've never run a large website, or analyzed the logs of a large website. You'd be astonished how much robotic activity you'll receive. If left unchecked it can easily swamp legitimate traffic.
Unless you have a way for me to automatically identify "honourable" scrapers such as yourself as distinct from the thousands upon thousands of extremely dodgy scrapers from across the world, my policy shall remain.
As the user of large websites, I don't care. I'm not going to read the TOS, and I will continue to scrape what I like since it makes my life more convenient. Like OP, when blocked I'll just drive my scraping through a web browser, which is what I've done for years on various sites that never provided APIs.
"As the user of large websites I don't care". Are you sure? Do you want your OK Cupid or LinkedIn profile to be crossposted on another website without your knowledge?
Putting it behind a signup page with terms that don't allow sharing is not "making it public".
And while in the US that may "just" be treated as unauthorized access, in the EU, if you make the data public it's also a violation of the Data Protection Directive, putting you at risk of prosecution in every EU country from which you have included data.
You may be right from a risk minimisation perspective. But for a lot of data the risk in the case of exposure is low enough that it is a totally valid risk management strategy to assume that legal protections will be a sufficient deterrent to prevent enough of the most blatant abuses.
Eh, not really. The Data Protection Directive doesn’t even apply here – if the first party (OKCupid) made it available to a third party (the scraper), then the first party can be held in violation, but not the third party.
If you have control of personally identifiable data, it's likely that at least some of the EU data protection rules will apply to you regardless of how you got it.
Yes, but as you say, they apply regardless. More specifically, they apply to data that you have (and are storing), not to the act of obtaining it.
As a private individual it's not hard to comply either, for private use. If you publish it, it becomes a different story, because it's PII. And, as soon as it's in possession of a company, they need to comply with more rules about securely storing it, etc. (this isn't enforced very well, though). Private individuals can't be held to that because there's (in theory) no legal way to check it.