Day 8. Most of the crawler examples online are still py2 (:з」∠) Today I found a py3 crawler and tried it on the school portal.
```python
import io
import sys
import urllib.request

web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Cookie': 'iPlanetDirectoryPro=AQIC5wM2LY4SfczGmS5S1wsHjs3f8d/vQadvCPz780+9+1o=@AAJTSQACMDI=#; JSESSIONID=0000g4W05n040WWJHMgWwYK6u41:172u4qcnp'
}

# re-wrap stdout so printed text is encoded as utf-8
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

url_mh = 'http://xxxx.xxxx.xxx.xx/index.portal'
req = urllib.request.Request(url=url_mh, headers=web_header)
resp = urllib.request.urlopen(req)
data = resp.read()
print(data.decode('utf-8'))
```

Walking through this code: `web_header` is the request header, and it carries the cookie. I used to keep the cookie in a separate cookie module; stuffing it straight into the header like this is much simpler = =(:з」∠). `sys.stdout` is re-wrapped so the page output is encoded as utf8. `url_mh` is the portal page you land on after a successful login. `req` ties the URL to the modified headers, and `urlopen(req)` sends the request (a GET here, since no data payload is attached); `read()` fetches the returned bytes, which are then decoded and printed as utf-8. Both utf-8 settings have to be right: one is the display encoding, the other decodes the response body.
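As a side note, that long Cookie header string can also be split into a dict, which is handy if you switch to `requests` and want to pass `cookies=` instead of a raw header. This is just my own sketch, and the cookie values below are shortened stand-ins, not the real session tokens:

```python
# Sketch: turn a raw Cookie header string into a dict.
# The values here are shortened stand-ins for the real tokens.
cookie_header = 'iPlanetDirectoryPro=AQIC5wM2...; JSESSIONID=0000g4W0...'

# split on '; ' between cookies, and on the FIRST '=' only,
# since the token values themselves may contain '=' characters
cookies = dict(pair.split('=', 1) for pair in cookie_header.split('; '))
print(cookies['JSESSIONID'])  # → 0000g4W0...
```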
After a lot of debugging, I can scrape the announcements now.
```python
import io
import sys
import urllib.request
import re
import requests

web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Cookie': 'iPlanetDirectoryPro=AQIC5wM2LY4SfczGmS5S1wsHjs3f8d/vQadvCPz780+9+1o=@AAJTSQACMDI=#; JSESSIONID=0000g4W05n040WWJHMgWwYK6u41:172u4qcnp'
}
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

url_mh = 'http://XXX/index.portal'
req = urllib.request.Request(url=url_mh, headers=web_header)
# resp = urllib.request.urlopen(req)
# data = resp.read()
# print(data.decode('utf8'))
resp = requests.get(url=url_mh, headers=web_header)
resp.encoding = 'utf-8'
# print(resp.text)
gonggao = re.findall('<img src="images/s.gif" alt="" /></a><a title='"(.*?)"' class="rss-title" onclick=', resp.text, re.S)
for each in gonggao:
    print(each)
```

That regex wasn't written right (the three adjacent string literals concatenate, so the pattern loses the quote marks around the title value); written like this there's no problem:
```python
import io
import sys
import urllib.request
import re
import requests

web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Cookie': 'iPlanetDirectoryPro=AQIC5wM2LY4SfczGmS5S1wsHjs3f8d/vQadvCPz780+9+1o=@AAJTSQACMDI=#; JSESSIONID=0000g4W05n040WWJHMgWwYK6u41:172u4qcnp'
}
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

url_mh = 'http://xxx.cn/index.portal'
req = urllib.request.Request(url=url_mh, headers=web_header)
# resp = urllib.request.urlopen(req)
# data = resp.read()
# print(data.decode('utf8'))
resp = requests.get(url=url_mh, headers=web_header)
resp.encoding = 'utf-8'
# print(resp.text)
# capture the text between the single quotes of the title attribute
gonggao = re.findall('<a title=\'(.*?)\' class="rss-title"', resp.text, re.S)
for each in gonggao:
    print(each)
```

It's a bit late today, so I'll leave exporting to a text file for next time.
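As a quick sanity check of the fixed pattern, and a sketch of the text export I'm putting off, here it is run against a small hand-written fragment. The HTML snippet and the announcement titles below are invented for illustration; only the regex comes from the code above:

```python
import re

# invented stand-in for the portal's announcement markup
html = ('<img src="images/s.gif" alt="" /></a>'
        '<a title=\'Exam schedule notice\' class="rss-title" onclick="go()">1</a>'
        '<img src="images/s.gif" alt="" /></a>'
        '<a title=\'Library hours update\' class="rss-title" onclick="go()">2</a>')

# same pattern as above: grab the text between the title attribute's quotes
gonggao = re.findall('<a title=\'(.*?)\' class="rss-title"', html, re.S)

# the export step deferred in the post: one announcement title per line
with open('gonggao.txt', 'w', encoding='utf-8') as f:
    for each in gonggao:
        f.write(each + '\n')

print(gonggao)  # → ['Exam schedule notice', 'Library hours update']
```

The non-greedy `(.*?)` stops at the first `' class="rss-title"` it hits, so each match stays inside one `<a>` tag even with `re.S` set.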
