> But if you are automating your exact actions that happen via a browser, can this be blocked?
Yes, by checking times between actions and number of actions in a time period, and blocking atypical activity. I was IP banned from a site once for a few months, after trying to scrape it too much and hitting links on the site that were hidden from humans.
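To make that concrete, here is a minimal sketch of the kind of check a site could run: flag a client when the gap between two actions is too short, or when it performs too many actions inside a sliding time window. The class name `RateDetector` and all the thresholds are made up for illustration; real sites likely combine many more signals (such as the hidden links mentioned above).

```python
from collections import defaultdict, deque


class RateDetector:
    """Hypothetical sliding-window detector; thresholds are illustrative guesses."""

    def __init__(self, max_actions=30, window_seconds=60.0, min_gap_seconds=0.5):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.min_gap_seconds = min_gap_seconds
        # Per-client timestamps of recent actions, oldest first.
        self._history = defaultdict(deque)

    def is_atypical(self, client_id, timestamp):
        """Record one action and report whether the client now looks automated."""
        history = self._history[client_id]

        # Check 1: time between this action and the previous one.
        too_fast = bool(history) and (timestamp - history[-1]) < self.min_gap_seconds

        # Check 2: number of actions inside the sliding window.
        history.append(timestamp)
        while history and timestamp - history[0] > self.window_seconds:
            history.popleft()
        too_many = len(history) > self.max_actions

        return too_fast or too_many
```

A site would call `is_atypical(ip, time.time())` on every request and block or challenge the client when it returns `True`.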
The random wait settings specified in the post are better than nothing, but still too flimsy. You would need to put hours between requests, only request during a certain 15-hour period, take days off, and eventually you aren't scraping regularly enough to do much good.
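A rough sketch of what that pacing would look like, just to show how little throughput is left. Everything here is an assumption for illustration: the function name `next_request_time`, a 1 to 4 hour random wait, a daily 07:00 to 22:00 window that does not cross midnight, and Sunday as the day off.

```python
import random
from datetime import datetime, timedelta


def next_request_time(now, rng, min_wait_hours=1.0, max_wait_hours=4.0,
                      active_start_hour=7, active_hours=15,
                      days_off=frozenset({6})):
    """Pick the next request time (hypothetical policy, not a guarantee of anything).

    Assumes the active window starts and ends on the same calendar day.
    days_off uses datetime.weekday() numbering (Monday=0 ... Sunday=6).
    """
    if len(days_off) >= 7:
        raise ValueError("at least one weekday must be active")

    # Start with a random wait of several hours.
    candidate = now + timedelta(hours=rng.uniform(min_wait_hours, max_wait_hours))

    while True:
        window_start = candidate.replace(hour=active_start_hour, minute=0,
                                         second=0, microsecond=0)
        window_end = window_start + timedelta(hours=active_hours)

        if candidate.weekday() in days_off or candidate >= window_end:
            # Day off or past today's window: jump to the next day's window.
            candidate = window_start + timedelta(days=1, minutes=rng.uniform(0, 60))
        elif candidate < window_start:
            # Too early: wait for today's window to open, with some jitter.
            candidate = window_start + timedelta(minutes=rng.uniform(0, 60))
        else:
            return candidate
```

With these numbers you get at most a handful of requests per day, which is the point: paced this way, you aren't scraping regularly enough to do much good.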
Scraping is not an API, and I should know: I used to do it for a living. It's unreliable. It requires constant maintenance. APIs can break too, but they are meant for the sort of consumption you are trying for.
If you scrape for a living, only do it as a side job.
It really depends on the data you are scraping. My main business relies on scraping, and my data mining application has been running for over 5 years. If you have enough IP addresses available to you, it becomes almost impossible to distinguish your traffic from normal users hitting the site, and bandwidth has gotten so cheap that the overhead is very affordable.
I've noticed that most sites actually don't change that often. I deal with changes once or twice every 3 months.
"If you scrape for a living, only do it as a side job."
This is true if you are scraping the low-hanging fruit. I scrape 40+ sources (I do have access to a few APIs as well) and then have to extract the patterns/data I need and integrate them into my business model. This is all automatic now, and I only work on upgrading for speed and efficiency.
If you have to scan millions of URLs daily from one site, it's probably not going to work out. You need to figure out clever ways of getting the data and using it without breaking any laws or pissing off the site owner.
Not scraping, but banks don't even do this for their security, which I found surprising. I just finished building a Chrome extension (https://chrome.google.com/webstore/detail/uyp-free-blasts-th...) that automatically logs in to pretty much any bank or financial web site without having to type anything. The key difference from other password managers is that it can auto-fill pretty much anything.
I guess it's part password manager (it stores passwords encrypted in browser storage, not remotely) and part automation wizard :)