Anyone have tips to avoid Akamai's bot detection?
My boss is telling me to scrape a bunch of sites. Even using proxies (about 400 that I cycle through), they all get banned pretty quickly, and then they just get a page like the one included.
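For reference, the rotation I'm doing is just a plain downloader-middleware sketch along these lines (the class name and proxy addresses here are placeholders, not my real pool of ~400):

```python
import random

# Placeholder addresses standing in for my real proxy pool.
PROXY_LIST = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

class RandomProxyMiddleware:
    """Assign a random proxy from the pool to every outgoing request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # let the request continue through the middleware chain
```

So every request goes out through a different proxy, but they still all end up banned.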
I've tried telling him that this is a dumb idea and not a viable business model, but all he cares about is this idea that we can "scrape the entire web" and sell other people's stuff for a profit.
I know it's a long shot, but I'm just trying to hang on to this job until I find a new one. Anything will help.
What do you use to scrape them? Maybe experimenting with user agents would help.
I'm using Python+Scrapy. I'm using random user agents via the library here:
An interesting side effect is that getting banned on one site can sometimes cascade over to other sites. For instance, a ban on Home Depot's site yields a ban on Pizza Hut's site. Not all of the Akamai-hosted sites get banned at once, though.
At the moment I've only been scraping Home Depot's website, but as a result I'm also banned on pizzahut.com, costco.com, basspro.com, cabelas.com, staples.com, kmart.com, and usps.com, even though I haven't scraped any of those sites. Most of them I hadn't even visited before today. I only found out about my Pizza Hut and USPS bans because I wanted to order a pizza and check on a delivery I'm expecting.
When I disabled the proxy I was testing it worked again.
I've been considering using the following instead for user agents:
It seems like less work and might help avoid situations with old user agents that are getting flagged. Perhaps they're seeing that I'm using outdated user agents. I haven't been diligent about checking them, and the list I acquired is probably a couple of months old.