Headers-based anti-crawling: inspecting the request headers is one of the most common anti-crawler measures. Most websites check the User-Agent and Referer fields in the Headers. The way around it is to modify the crawler's request headers to match a normal browser's, keeping them as close to a real browser's as possible.
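As a minimal sketch of the idea, the snippet below attaches browser-like User-Agent and Referer headers to a request. The Chrome User-Agent string and the example.com URLs are placeholders, not values from any real target site; the request is only prepared, not sent, so you can inspect the headers without network traffic.

```python
import requests

# Browser-like headers; the User-Agent below mimics a desktop Chrome session,
# and https://www.example.com/ is a placeholder Referer.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Referer": "https://www.example.com/",
}

# Prepare the request (no network call) to confirm the headers are attached.
req = requests.Request(
    "GET", "https://www.example.com/page", headers=headers
).prepare()
```

To actually fetch the page you would pass the same `headers` dict to `requests.get(url, headers=headers)`.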
Below, the User-Agent pool is wrapped in a class so it can be reused directly later:
import random

import requests


class HtmlDownloader(object):
    # Pool of real browser User-Agent strings used to disguise the crawler.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    ]

    def __init__(self):
        self.url_manager = UrlManager()  # URL manager defined elsewhere in the project
        # Pick one User-Agent at random for this downloader instance.
        self.headers = {'User-Agent': random.choice(self.USER_AGENTS)}

    def downloader(self, url):
        response = requests.get(url, headers=self.headers)
        response.encoding = 'utf-8'
        # Accept status codes 200-209 as success.
        if 200 <= response.status_code < 210:
            return response
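One caveat of the class above: the User-Agent is chosen once in `__init__`, so every request from one instance carries the same header. If you want each request to look like a different browser, you can build a fresh headers dict per call. The helper name `random_headers` below is my own, not from the original code, and the pool is trimmed to two entries for brevity:

```python
import random

# Two User-Agent strings from the pool above, kept short for illustration.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
]


def random_headers():
    # Build a fresh headers dict on every call so successive requests
    # rotate through the pool instead of reusing one User-Agent.
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Each fetch then becomes `requests.get(url, headers=random_headers())`, so the User-Agent varies request by request.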
Welcome to follow my WeChat official account, 日常bug: at least one technical article every day, improving a little every day.
