Python web programming: bypassing anti-crawler mechanisms
Thoughts prompted by a reverse-engineering challenge:
Can the data to be decrypted simply be submitted to the relevant website from a Python script, with the same script scraping back the result?
Today's sites all add some form of anti-scraping mechanism: the first few attempts succeed, but after that the script stops receiving data, and the site adds a CAPTCHA step.
The script therefore has to be written around the specific restrictions of the site being accessed; a ready-made script cannot simply be reused.
Some background reading:
https://www.cnblogs.com/heikeboke/p/8012041.html
http://www.jb51.net/article/95728.htm
https://www.cnblogs.com/yoyoketang/p/6838596.html
Anti-scraping mechanisms and ways around them:
Method 1: Headers-based anti-crawling (the server inspects request headers such as User-Agent and Referer). The workaround is to send browser-like headers, for example via mechanize:
br.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.7 Safari/537.36')]
or as a full header dictionary for requests:
headers = {
    "Host": "www.cmd5.com",
    "Content-Length": "1832",
    "Cache-Control": "max-age=0",
    "Origin": "http://www.cmd5.com",
    "Upgrade-Insecure-Requests": "1",
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.7 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Referer": "http://www.cmd5.com/",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Cookie": "FirstVisit=2017/10/14 16:49:39; ASP.NET_SessionId=4240erfxxgel3450n4dgddej; comefrom=https://www.baidu.com/link?url=_iyok742ki838ontfqnni8s-yikrus241ocxk3cplqo&wd=&eqid=ed2c528f0003fd1a000000055b18de2e; Hm_lvt_0b7ba6c81309fff7ce4498ec7b107c0b=1528302253,1528328811,1528356400; Hm_lpvt_0b7ba6c81309fff7ce4498ec7b107c0b=1528356400",
    "Connection": "close"
}
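A minimal sketch, assuming python-requests, of how such a browser-like header set gets attached to a request; Content-Length is normally computed by the library, so it can be left out:

import requests

# send the request with hand-crafted browser headers so the server does not see
# the default python-requests User-Agent (header set trimmed here to the fields
# that matter most for a simple User-Agent/Referer check)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.7 Safari/537.36",
    "Referer": "http://www.cmd5.com/",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}
resp = requests.get("http://www.cmd5.com/", headers=headers)
print(resp.status_code)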
Method 2: behaviour-based anti-crawling (the same IP visits the same page many times within a short period, or the same account performs the same operation many times within a short period)
IP proxies
https://www.cnblogs.com/eric8899/p/6122759.html
https://www.cnblogs.com/hearzeus/p/5157016.html
https://www.jianshu.com/p/2577e5bcbf05
import requests

# route traffic through a proxy; icanhazip.com echoes back the IP it sees,
# so the output shows whether the proxy is actually in use
# (requests expects lowercase scheme keys in the proxies dict)
proxy = {
    'https': '117.85.105.170:808',
    'http': '117.85.105.170:808'
}
head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
    'Connection': 'keep-alive'
}
p = requests.get('http://icanhazip.com', headers=head, proxies=proxy)
print(p.text)
# Python 2: scrape a free proxy list (IP and port columns) from xicidaili
# using urllib2 and the old BeautifulSoup 3 API
import urllib2
import BeautifulSoup

User_Agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'
header = {}
header['User-Agent'] = User_Agent

url = 'http://www.xicidaili.com/nn/1'
req = urllib2.Request(url, headers=header)
res = urllib2.urlopen(req).read()

soup = BeautifulSoup.BeautifulSoup(res)
ips = soup.findAll('tr')
f = []
for x in range(1, len(ips)):        # skip the table header row
    ip = ips[x]
    tds = ip.findAll("td")
    ip_temp = tds[1].contents[0] + "\t" + tds[2].contents[0] + "\n"
    f.append(ip_temp)
# Python 2: a downloader that rotates User-Agents and free proxies, retrying on
# failure, then used to scrape jokes from qiushibaike
import requests
import random
import time
from bs4 import BeautifulSoup


class download(object):

    def __init__(self):
        # harvest a fresh proxy list (ip:port) from xicidaili at startup
        self.ip_list = []
        User_Agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'
        header = {}
        header['User-Agent'] = User_Agent
        html = requests.get("http://www.xicidaili.com/nn/1", headers=header)
        response = html.text
        soup = BeautifulSoup(response, 'lxml')
        ips = soup.findAll('tr')
        for x in range(1, len(ips)):
            ip = ips[x]
            tds = ip.findAll("td")
            ip_temp = tds[1].contents[0] + ":" + tds[2].contents[0] + "\n"
            self.ip_list.append(ip_temp)
        print self.ip_list
        # pool of desktop User-Agent strings to choose from at random
        self.user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]

    def get(self, url, timeout, proxy=None, num_retries=6):
        ua = random.choice(self.user_agent_list)   # a different User-Agent for every request
        header = {"User-Agent": ua}
        if proxy is None:
            # first try direct requests; fall back to a proxy once the retries run out
            try:
                response = requests.get(url, headers=header, timeout=timeout)
                return response
            except:
                if num_retries > 0:
                    time.sleep(10)
                    print u"Error fetching the page, retrying in 10s (%d retries left)" % num_retries
                    return self.get(url, timeout, num_retries=num_retries - 1)
                else:
                    print u"Switching to a proxy"
                    time.sleep(10)
                    IP = "".join(str(random.choice(self.ip_list)).strip())
                    proxy = {"http": IP}
                    return self.get(url, timeout, proxy)
        else:
            # proxy mode: pick a random proxy from the harvested list for each attempt
            try:
                IP = "".join(str(random.choice(self.ip_list)).strip())
                proxy = {"http": IP}
                response = requests.get(url, headers=header, proxies=proxy, timeout=timeout)
                return response
            except:
                if num_retries > 0:
                    time.sleep(10)
                    IP = "".join(str(random.choice(self.ip_list)).strip())
                    proxy = {"http": IP}            # switch to a different proxy
                    print u"Changing proxy, retrying in 10s (%d retries left)" % num_retries
                    print u"Current proxy:", proxy
                    return self.get(url, timeout, proxy, num_retries - 1)
                else:
                    print u"Proxy keeps failing, dropping the proxy"
                    return self.get(url, 3)


request = download()


def qsbk(url):
    # fetch the page through the rotating downloader and print every joke on it
    html = request.get(url, 3)
    dz = BeautifulSoup(html.text, "html.parser").find_all("div", {"class": "content"})
    for joke in dz:
        duanzi = joke.get_text()
        print(duanzi)


if __name__ == "__main__":
    url = "http://www.qiushibaike.com/"
    qsbk(url)
Wait a random few seconds after every request before sending the next one.
Add timing control to the script, e.g. with sleep:
import time
time.sleep(5)
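A minimal sketch of the random-interval idea, assuming python-requests; the URL list and the 3-8 second bounds are only placeholders:

import random
import time
import requests

# hypothetical list of pages to fetch
urls = ["http://example.com/page/%d" % i for i in range(1, 4)]
for url in urls:
    resp = requests.get(url)
    print(resp.status_code)
    # pause a random 3-8 seconds so requests are not evenly spaced
    time.sleep(random.uniform(3, 8))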
Pick a comparable site that has no anti-scraping mechanism. This is a rather laid-back approach and arguably not a way of countering anti-scraping at all; it is handy when the script has to be finished quickly. Drawback: it sidesteps the problem instead of solving it, and a suitable alternative site may not exist.
# Crack an MD5 hash by submitting it to pmd5.com with mechanize and scraping the result
import mechanize
import re

flagmd5 = '762306AB890905CFF6887D5A75776382'


def web_md5(md5_string):
    br = mechanize.Browser()
    br.set_handle_equiv(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)          # ignore robots.txt
    br.set_handle_gzip(False)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    # pretend to be a normal Chrome browser
    br.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.7 Safari/537.36')]
    br.open('http://pmd5.com/')
    br.select_form(name="formMd5")       # the MD5 lookup form
    br.form['key'] = md5_string
    br.submit()
    page = br.response().read()
    # the recovered plaintext appears inside <em>...</em> on the result page
    pattern = "<em>.{4}</em>”!</p></div>"
    flag = re.findall(pattern, page, flags=0)
    print flag
    if flag:
        print flag[0][4:8]
    print page


web_md5(flagmd5)
Method 3: anti-crawling via dynamically rendered pages
Selenium+PhantomJS
Selenium: pip install selenium
Selenium is a web automation testing tool, originally developed for automated website testing. It is similar to macro tools such as 按键精灵: it carries out operations automatically according to the commands it is given. The difference is that Selenium drives a real browser and supports all mainstream browsers (including headless ones such as PhantomJS). Following our instructions, Selenium can make the browser load pages, fetch the content we need, take screenshots, or check whether certain actions have occurred on a site. Selenium does not ship with a browser of its own and has no browser functionality itself; it must be paired with a third-party browser. When we want it to run embedded in our code, we can use a tool called PhantomJS in place of a real browser. PhantomJS: download the installer and add it to the PATH.
PhantomJS is a WebKit-based "headless" browser: it loads a site into memory and executes the JavaScript on its pages, and because it never draws a graphical interface it runs more efficiently than a full browser. Combining Selenium with PhantomJS gives a very capable crawler that can handle JavaScript, cookies, headers, and anything else a real user's browser would do.
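A minimal sketch of the Selenium + PhantomJS combination described above; it assumes an older Selenium release in which webdriver.PhantomJS still exists and a phantomjs binary on the PATH, and the URL is only a placeholder:

from selenium import webdriver

# PhantomJS support only exists in older Selenium releases; the phantomjs
# executable must be on the PATH
driver = webdriver.PhantomJS()
driver.get("http://www.example.com/")   # placeholder URL
print(driver.title)                     # title after the page's JavaScript has run
html = driver.page_source               # fully rendered HTML, ready for BeautifulSoup etc.
driver.quit()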
Usage and reference links:
https://www.cnblogs.com/miqi1992/p/8093958.html
https://www.cnblogs.com/psv-fuyang/articles/7891871.html
https://blog.csdn.net/qq_30242609/article/details/70859891
https://jiayi.space/post/scrapy-phantomjs-seleniumdong-tai-pa-chong#fb_new_comment
https://blog.csdn.net/qq_33689414/article/details/78631009
https://www.cnblogs.com/luxiaojun/p/6144748.html
https://www.jianshu.com/p/520749be7377
https://www.jianshu.com/p/9d408e21dc3a
https://segmentfault.com/a/1190000007362337
Once you have worked through the material above, you will discover that, well, these two have already parted ways:
newer versions of Selenium no longer support PhantomJS.
https://blog.csdn.net/u010358168/article/details/79749149
https://blog.csdn.net/qq_30242609/article/details/79323963
Ways to deal with this:
Downgrade Selenium, or use a headless browser instead. Environment setup: https://blog.csdn.net/pengbin790000/article/details/76696714
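A minimal sketch of the headless-browser alternative, assuming a recent Selenium plus Chrome and a matching chromedriver on the PATH (older Selenium 3 releases use the keyword chrome_options instead of options); the URL is a placeholder:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get("http://www.example.com/")       # placeholder URL
print(driver.title)
html = driver.page_source                   # rendered HTML after JavaScript execution
driver.quit()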
Selenium tutorials
https://www.cnblogs.com/fnng/category/349036.html
Learning this properly takes a fair amount of time. Here is a video tutorial site with a fairly complete set of lessons:
Video: http://www.51zxw.net/list.aspx?cid=615
Text: http://www.51testing.com/zhuanti/selenium.html