I first touched web scraping last December, a good seven months ago now, and in all that time I never really understood cookies or the HTTP protocol. After all this while I've finally figured them out, and finally crawled my way in!!! I started learning HttpClient yesterday, so today let's practice by scraping my school's information portal: http://myportal.sxu.edu.cn/login.portal
The following information was captured with the Chrome browser's developer tools (shortcut F12):
1. http://myportal.sxu.edu.cn/

Request:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:JSESSIONID=0000MS7su8CHOtDnUq6dxd7YGdB:1b4e17ihg
Host:myportal.sxu.edu.cn
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36

Response:
Cache-Control:no-cache="set-cookie, set-cookie2"
Content-Language:zh-CN
Content-Length:8252
Content-Type:text/html;charset=utf-8
Date:Sun, 09 Jul 2017 09:04:57 GMT
Expires:Thu, 01 Dec 1994 16:00:00 GMT
Server:IBM_HTTP_Server
Set-Cookie:iPlanetDirectoryPro=""; Expires=Thu, 01 Dec 1994 16:00:00 GMT; Path=/; Domain=.sxu.edu.cn
Set-Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/

2. http://myportal.sxu.edu.cn/captchaGenerate.portal?s=0.5123204417293254

Request:
Accept:image/webp,image/*,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
Host:myportal.sxu.edu.cn
Referer:http://myportal.sxu.edu.cn/
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36

3. http://myportal.sxu.edu.cn/captchaValidate.portal?captcha=mb75&what=captcha&value=mb75&_=

Request:
Accept:text/javascript, text/html, application/xml, text/xml, */*
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
Host:myportal.sxu.edu.cn
Referer:http://myportal.sxu.edu.cn/
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
X-Prototype-Version:1.5.0
X-Requested-With:XMLHttpRequest

4.
http://myportal.sxu.edu.cn/userPasswordValidate.portal (POST)

Request:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:173
Content-Type:application/x-www-form-urlencoded
Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
Host:myportal.sxu.edu.cn
Origin:http://myportal.sxu.edu.cn
Referer:http://myportal.sxu.edu.cn/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36

Parameters:
Login.Token1:2014241032  // account
Login.Token2:**********  // password
goto:http://myportal.sxu.edu.cn/loginSuccess.portal
gotoOnFail:http://myportal.sxu.edu.cn/loginFailure.portal

Response:
Cache-Control:no-cache
Content-Language:zh-CN
Content-Length:83
Content-Type:text/html; charset=UTF-8
Date:Sun, 09 Jul 2017 09:12:08 GMT
Expires:Thu, 01 Dec 1994 16:00:00 GMT
Server:IBM_HTTP_Server
Set-Cookie:iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23; Path=/; Domain=.sxu.edu.cn

5. http://myportal.sxu.edu.cn/index.portal

Request:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0=@AAJTSQACMDE=#
Host:myportal.sxu.edu.cn
Referer:http://myportal.sxu.edu.cn/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36

From the capture above, the key to scraping the information portal is obtaining the following two cookies:
JSESSIONID
iPlanetDirectoryPro

JSESSIONID is obtained on the very first request to the login page, while iPlanetDirectoryPro is obtained after the request to userPasswordValidate.portal. Requesting userPasswordValidate.portal requires a JSESSIONID plus four parameters, of which:
Login.Token1:2014241032  // account
Login.Token2:**********  // password

The other two parameters are copied verbatim from the capture.
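As a small sketch of assembling those four parameters, the POST body can be URL-encoded by hand (the class and method names here are my own; the account and password values are placeholders):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class LoginForm {
    // Build the application/x-www-form-urlencoded body for userPasswordValidate.portal.
    // Parameter names are taken from the packet capture above.
    public static String build(String account, String password) throws UnsupportedEncodingException {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("Login.Token1", account);   // account number
        params.put("Login.Token2", password);  // password
        params.put("goto", "http://myportal.sxu.edu.cn/loginSuccess.portal");
        params.put("gotoOnFail", "http://myportal.sxu.edu.cn/loginFailure.portal");
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder credentials, matching the masked values in the capture
        System.out.println(build("2014241032", "**********"));
    }
}
```

With Apache HttpClient you would normally let UrlEncodedFormEntity do this encoding for you; the loop above just makes explicit what gets sent.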
From the analysis above, the pages our crawler needs to request are:
1. Request login.portal to obtain JSESSIONID
2. Request userPasswordValidate.portal to obtain iPlanetDirectoryPro
3. Scrape the data
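The three steps can be sketched end to end. This sketch uses the JDK's built-in java.net.http.HttpClient (Java 11+) with a CookieManager rather than the Apache HttpClient used in this post, but the cookie flow is identical; the class name, the isLoggedIn helper, and the credentials are illustrative assumptions of mine:

```java
import java.net.CookieManager;
import java.net.HttpCookie;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class PortalLogin {
    // Login has succeeded once the SSO cookie set by userPasswordValidate.portal is present
    public static boolean isLoggedIn(List<HttpCookie> cookies) {
        return cookies.stream().anyMatch(c ->
                "iPlanetDirectoryPro".equals(c.getName()) && !c.getValue().isEmpty());
    }

    public static void main(String[] args) {
        CookieManager jar = new CookieManager();  // stores Set-Cookie values (JSESSIONID etc.) automatically
        HttpClient client = HttpClient.newBuilder().cookieHandler(jar).build();
        try {
            // Step 1: GET login.portal -- the response's Set-Cookie carries JSESSIONID
            client.send(HttpRequest.newBuilder(URI.create("http://myportal.sxu.edu.cn/login.portal"))
                    .GET().build(), HttpResponse.BodyHandlers.discarding());

            // Step 2: POST the four form parameters -- the response's Set-Cookie carries iPlanetDirectoryPro
            String body = "Login.Token1=2014241032&Login.Token2=secret"  // placeholder credentials
                    + "&goto=http%3A%2F%2Fmyportal.sxu.edu.cn%2FloginSuccess.portal"
                    + "&gotoOnFail=http%3A%2F%2Fmyportal.sxu.edu.cn%2FloginFailure.portal";
            client.send(HttpRequest.newBuilder(URI.create("http://myportal.sxu.edu.cn/userPasswordValidate.portal"))
                            .header("Content-Type", "application/x-www-form-urlencoded")
                            .POST(HttpRequest.BodyPublishers.ofString(body)).build(),
                    HttpResponse.BodyHandlers.discarding());

            // Step 3: with both cookies in the jar, index.portal should return the logged-in page
            if (isLoggedIn(jar.getCookieStore().getCookies())) {
                HttpResponse<String> page = client.send(
                        HttpRequest.newBuilder(URI.create("http://myportal.sxu.edu.cn/index.portal"))
                                .GET().build(), HttpResponse.BodyHandlers.ofString());
                System.out.println(page.body());
            }
        } catch (Exception e) {
            System.out.println("request failed (offline?): " + e.getMessage());
        }
    }
}
```

The point of the CookieManager is that you never copy JSESSIONID or iPlanetDirectoryPro around by hand; every subsequent request automatically sends whatever the server previously set.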
The following utility class saves the captcha image to the local disk:
package utils;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;

public class ImageUtils {

    /**
     * Read an image stream into a byte[].
     * @param inStream the image input stream
     * @return the raw image bytes
     * @throws Exception
     */
    public static byte[] readImg(InputStream inStream) throws Exception {
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        // Transfer buffer
        byte[] buffer = new byte[1024];
        // Number of bytes read on each pass; -1 means the stream is exhausted
        int len = 0;
        // Read chunks from the input stream into the buffer
        while ((len = inStream.read(buffer)) != -1) {
            // Write len bytes of the buffer, starting at offset 0, to the output stream
            outStream.write(buffer, 0, len);
        }
        // Close the input stream
        inStream.close();
        // Return the accumulated bytes
        return outStream.toByteArray();
    }

    /**
     * Write the image stream imgIs to the local file imgPath.
     * @param imgPath path to save to (a relative path resolves against the project root)
     * @param imgIs the image input stream
     * @throws Exception
     */
    public static void writeImg(String imgPath, InputStream imgIs) throws Exception {
        // Get the image as binary data; raw bytes work for any image format
        byte[] data = readImg(imgIs);
        // File object the image will be saved to
        File imageFile = new File(imgPath);
        // Create the output stream
        FileOutputStream outStream = new FileOutputStream(imageFile);
        // Write the data
        outStream.write(data);
        // Close the output stream
        outStream.close();
    }
}
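To connect this to the captchaGenerate.portal request from the capture, here is a hedged sketch of downloading the captcha and saving it. It uses Files.copy for the save so the block is self-contained, but ImageUtils.writeImg above does the same job; CaptchaFetch, fileNameFor, and the ".jpg" extension are assumptions of mine:

```java
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class CaptchaFetch {
    // Derive a local file name from the captcha URL (".jpg" is a guess at the image format)
    public static String fileNameFor(String url) {
        String path = URI.create(url).getPath();              // e.g. /captchaGenerate.portal
        String base = path.substring(path.lastIndexOf('/') + 1);
        int dot = base.indexOf('.');
        return (dot > 0 ? base.substring(0, dot) : base) + ".jpg";
    }

    public static void main(String[] args) {
        // The capture shows the captcha URL takes a random "s" query parameter
        String url = "http://myportal.sxu.edu.cn/captchaGenerate.portal?s=" + Math.random();
        try {
            HttpResponse<InputStream> resp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofInputStream());
            // Save the response stream to disk (ImageUtils.writeImg would work here too)
            Files.copy(resp.body(), Paths.get(fileNameFor(url)), StandardCopyOption.REPLACE_EXISTING);
            System.out.println("saved " + fileNameFor(url));
        } catch (Exception e) {
            System.out.println("request failed (offline?): " + e.getMessage());
        }
    }
}
```

Once the image is on disk you can read the code off it by eye and feed it to captchaValidate.portal, which is exactly what the browser's XMLHttpRequest in capture item 3 does.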