Crawling the university information portal with Java (Java crawler 04)

xiaoxiao2021-02-28  87

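I first got into web crawlers last December, a full seven months ago. For most of that time I never really understood cookies or the HTTP protocol. I have finally figured them out, and finally managed to crawl my way in! I started learning HttpClient yesterday, so today I will practice with it by crawling my university's information portal: http://myportal.sxu.edu.cn/login.portal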

1. Packet capture

The following information was captured with the Chrome browser's developer tools (shortcut F12):

1. http://myportal.sxu.edu.cn/

Request:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-CN,zh;q=0.8
Connection: keep-alive
Cookie: JSESSIONID=0000MS7su8CHOtDnUq6dxd7YGdB:1b4e17ihg
Host: myportal.sxu.edu.cn
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36

Response:
Cache-Control: no-cache="set-cookie, set-cookie2"
Content-Language: zh-CN
Content-Length: 8252
Content-Type: text/html;charset=utf-8
Date: Sun, 09 Jul 2017 09:04:57 GMT
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Server: IBM_HTTP_Server
Set-Cookie: iPlanetDirectoryPro=""; Expires=Thu, 01 Dec 1994 16:00:00 GMT; Path=/; Domain=.sxu.edu.cn
Set-Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/

2. http://myportal.sxu.edu.cn/captchaGenerate.portal?s=0.5123204417293254

Request:
Accept: image/webp,image/*,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-CN,zh;q=0.8
Connection: keep-alive
Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
Host: myportal.sxu.edu.cn
Referer: http://myportal.sxu.edu.cn/
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36

3. http://myportal.sxu.edu.cn/captchaValidate.portal?captcha=mb75&what=captcha&value=mb75&_=

Request:
Accept: text/javascript, text/html, application/xml, text/xml, */*
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-CN,zh;q=0.8
Connection: keep-alive
Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
Host: myportal.sxu.edu.cn
Referer: http://myportal.sxu.edu.cn/
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
X-Prototype-Version: 1.5.0
X-Requested-With: XMLHttpRequest

4. http://myportal.sxu.edu.cn/userPasswordValidate.portal

POST request:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.8
Cache-Control: max-age=0
Connection: keep-alive
Content-Length: 173
Content-Type: application/x-www-form-urlencoded
Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg
Host: myportal.sxu.edu.cn
Origin: http://myportal.sxu.edu.cn
Referer: http://myportal.sxu.edu.cn/
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36

Parameters:
// account
Login.Token1: 2014241032
// password
Login.Token2: **********
goto: http://myportal.sxu.edu.cn/loginSuccess.portal
gotoOnFail: http://myportal.sxu.edu.cn/loginFailure.portal

Response:
Cache-Control: no-cache
Content-Language: zh-CN
Content-Length: 83
Content-Type: text/html; charset=UTF-8
Date: Sun, 09 Jul 2017 09:12:08 GMT
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Server: IBM_HTTP_Server
Set-Cookie: iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0%3D%40AAJTSQACMDE%3D%23; Path=/; Domain=.sxu.edu.cn

5. http://myportal.sxu.edu.cn/index.portal

Request:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-CN,zh;q=0.8
Connection: keep-alive
Cookie: JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0=@AAJTSQACMDE=#
Host: myportal.sxu.edu.cn
Referer: http://myportal.sxu.edu.cn/
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36
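To make the Set-Cookie lines in the capture easier to read, here is a minimal sketch that parses them with the JDK's java.net.HttpCookie. It is not part of the crawler itself, which uses HttpClient; the class name SetCookieDemo and its helper methods are made up for this illustration. It also shows that the iPlanetDirectoryPro value arrives percent-encoded in the response and matches the decoded form displayed in request 5.

```java
import java.io.UnsupportedEncodingException;
import java.net.HttpCookie;
import java.net.URLDecoder;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class SetCookieDemo {

    // Parses one Set-Cookie header value and returns the first cookie it contains.
    static HttpCookie parseSetCookie(String headerValue) {
        List<HttpCookie> cookies = HttpCookie.parse(headerValue);
        return cookies.get(0);
    }

    // Percent-decodes a cookie value, e.g. "%3D" becomes "=".
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Header values copied from responses 1 and 4 of the capture.
        HttpCookie session = parseSetCookie(
                "JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/");
        HttpCookie token = parseSetCookie(
                "iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0"
                + "%3D%40AAJTSQACMDE%3D%23; Path=/; Domain=.sxu.edu.cn");

        System.out.println(session.getName() + " (path " + session.getPath() + ")");
        System.out.println(token.getName() + " (domain " + token.getDomain() + ")");
        System.out.println("decoded token: " + decode(token.getValue()));
    }
}
```

JSESSIONID carries no Domain attribute, so it belongs to myportal.sxu.edu.cn only, while iPlanetDirectoryPro is scoped to the whole .sxu.edu.cn domain.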

2. Analysis

Judging from the capture above, the key to crawling the information portal is obtaining the following two cookies:

JSESSIONID
iPlanetDirectoryPro

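JSESSIONID is obtained on the first request for the login page, while iPlanetDirectoryPro is obtained after requesting userPasswordValidate.portal. The request to userPasswordValidate.portal must carry a JSESSIONID cookie and four parameters, of which:

// account
Login.Token1: 2014241032
// password
Login.Token2: **********

The other two parameters (goto and gotoOnFail) are copied verbatim from the capture.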
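The capture shows these four parameters travelling as an application/x-www-form-urlencoded body. As a rough sketch of what that body looks like, the following JDK-only example builds it by hand; the class LoginFormDemo and its methods are hypothetical names for this illustration, and in the crawler HttpClient does the encoding for you.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper class, for illustration only.
final class LoginFormDemo {

    // The four form fields from the capture, in the order the browser sent them.
    static Map<String, String> loginFields(String account, String password) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Login.Token1", account);
        fields.put("Login.Token2", password);
        fields.put("goto", "http://myportal.sxu.edu.cn/loginSuccess.portal");
        fields.put("gotoOnFail", "http://myportal.sxu.edu.cn/loginFailure.portal");
        return fields;
    }

    // Joins the fields as name=value pairs separated by '&', percent-encoding both sides.
    static String formEncode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(encode(field.getKey()) + "=" + encode(field.getValue()));
        }
        return body.toString();
    }

    private static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // "**********" is the masked placeholder from the capture, not a real password.
        String body = formEncode(loginFields("2014241032", "**********"));
        System.out.println(body);
        System.out.println(body.length() + " characters");
    }
}
```

With a ten-character placeholder password the encoded body comes to 173 characters, which is the same number as the Content-Length in request 4. That is consistent with the parameters being sent in the body, though it says nothing about the real password.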

From this analysis, the crawler needs to make the following requests:

1. Request login.portal to obtain JSESSIONID
2. Request userPasswordValidate.portal to obtain iPlanetDirectoryPro
3. Crawl the data
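The three steps boil down to cookie bookkeeping: store what each response sets, then replay it on the next request. Below is a small offline sketch of that idea using the JDK's java.net.CookieManager in place of HttpClient's CookieStore. No network request is made; the class CookieFlowDemo and its methods are invented for this example, and the header values are taken from the capture.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
final class CookieFlowDemo {

    // Default policy: accept cookies whose domain matches the responding host.
    private final CookieManager jar = new CookieManager();

    // Records the Set-Cookie header values of a response that came from the given URL.
    void receive(String url, String... setCookieValues) {
        Map<String, List<String>> headers =
                Collections.singletonMap("Set-Cookie", Arrays.asList(setCookieValues));
        try {
            jar.put(URI.create(url), headers);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the Cookie header value the jar would attach to a request for the given URL.
    String cookieHeaderFor(String url) {
        try {
            Map<String, List<String>> out =
                    jar.get(URI.create(url), Collections.<String, List<String>>emptyMap());
            List<String> values = out.get("Cookie");
            return values == null ? "" : String.join("; ", values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieFlowDemo flow = new CookieFlowDemo();
        // Step 1: the login page sets JSESSIONID.
        flow.receive("http://myportal.sxu.edu.cn/login.portal",
                "JSESSIONID=0000pTnkBdxH-RxSCBTh6PQ_iqs:1b4e17ihg; Path=/");
        // Step 2: a successful login sets iPlanetDirectoryPro.
        flow.receive("http://myportal.sxu.edu.cn/userPasswordValidate.portal",
                "iPlanetDirectoryPro=AQIC5wM2LY4Sfcy4g6cjY8hOCWulRTizy9EbNPUbXBB1hc0"
                + "%3D%40AAJTSQACMDE%3D%23; Path=/; Domain=.sxu.edu.cn");
        // Step 3: both cookies accompany the request for the data page.
        System.out.println(flow.cookieHeaderFor("http://myportal.sxu.edu.cn/index.portal"));
    }
}
```

HttpClient's BasicCookieStore plays the same role in the real crawler, which is why the code never has to copy cookie values around by hand.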

3. Writing the code

package info_system;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.http.Header;
import org.apache.http.HeaderElement;
import org.apache.http.HeaderElementIterator;
import org.apache.http.HeaderIterator;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.CookieStore;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.conn.ConnectionKeepAliveStrategy;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicHeaderElementIterator;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.protocol.HTTP;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

// Only needed by the commented-out captcha steps below.
import utils.ImageUtils;

public class Test {

    public static final String host = "myportal.sxu.edu.cn";
    public static final String url1 = "/login.portal";
    public static final String url2 = "/captchaGenerate.portal";
    public static final String url3 = "/captchaValidate.portal";
    public static final String url4 = "/userPasswordValidate.portal";
    public static final String url5 = "/index.portal";

    public static void main(String[] args) throws URISyntaxException, ClientProtocolException, IOException {

        // Keep connections alive for the timeout the server advertises, or 10 seconds otherwise.
        ConnectionKeepAliveStrategy myStrategy = new ConnectionKeepAliveStrategy() {
            @Override
            public long getKeepAliveDuration(HttpResponse response, HttpContext context) {
                // Honor 'keep-alive' header
                HeaderElementIterator it = new BasicHeaderElementIterator(
                        response.headerIterator(HTTP.CONN_KEEP_ALIVE));
                while (it.hasNext()) {
                    HeaderElement he = it.nextElement();
                    String param = he.getName();
                    String value = he.getValue();
                    if (value != null && param.equalsIgnoreCase("timeout")) {
                        try {
                            return Long.parseLong(value) * 1000;
                        } catch (NumberFormatException ignore) {
                        }
                    }
                }
                return 10 * 1000;
            }
        };

        // The store starts empty. HttpClient fills it from the Set-Cookie headers of each
        // response and sends the stored cookies with later requests, so no cookie needs
        // to be created by hand.
        CookieStore cookieStore = new BasicCookieStore();

        try (CloseableHttpClient httpclient = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .setKeepAliveStrategy(myStrategy)
                .build()) {

            // 1. Request the login page to obtain its cookie (JSESSIONID)
            URI uri1 = new URIBuilder()
                    .setScheme("http")
                    .setHost(host)
                    .setPath(url1)
                    .build();
            HttpGet httpGet = new HttpGet(uri1);
            ResponseHandler<Void> responseHandler = new ResponseHandler<Void>() {
                @Override
                public Void handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
                    // Print every response header for inspection
                    HeaderIterator hi = response.headerIterator();
                    while (hi.hasNext()) {
                        Header h = (Header) hi.next();
                        System.out.println(h.getName() + " --> " + h.getValue());
                    }
                    return null;
                }
            };
            httpclient.execute(httpGet, responseHandler);
            cookieStore.getCookies().forEach(e -> System.out.println(e));

            boolean b = false;

            /*
            // 2. Request the captcha image
            URI uri2 = new URIBuilder()
                    .setScheme("http")
                    .setHost(host)
                    .setPath(url2)
                    .setParameter("s", "0.5123204417293254")
                    .build();
            HttpGet httpGet2 = new HttpGet(uri2);
            do {
                ResponseHandler<Boolean> responseHandler2 = new ResponseHandler<Boolean>() {
                    @Override
                    public Boolean handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
                        try {
                            ImageUtils.writeImg("test.jpg", response.getEntity().getContent());
                            return true;
                        } catch (Exception e) {
                            return false;
                        }
                    }
                };
                b = httpclient.execute(httpGet2, responseHandler2);
            } while (!b);

            // Type in the captcha by hand:
            @SuppressWarnings("resource")
            String captcha = new java.util.Scanner(System.in).nextLine();

            // 3. Request captcha validation
            URI uri3 = new URIBuilder()
                    .setScheme("http")
                    .setHost(host)
                    .setPath(url3)
                    .setParameter("captcha", captcha)
                    .setParameter("what", "captcha")
                    .setParameter("value", captcha)
                    .setParameter("_", "")
                    .build();
            HttpGet httpGet3 = new HttpGet(uri3);
            final String error = "验证码非法";
            ResponseHandler<Boolean> responseHandler3 = new ResponseHandler<Boolean>() {
                @Override
                public Boolean handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
                    try {
                        String s = EntityUtils.toString(response.getEntity());
                        System.out.println(s);
                        if (s.equals(error)) {
                            return false;
                        }
                        return true;
                    } catch (Exception e) {
                        return false;
                    }
                }
            };
            b = httpclient.execute(httpGet3, responseHandler3);
            if (b) System.out.println("Captcha accepted");
            */

            // Pause briefly before the next request
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e1) {
                Thread.currentThread().interrupt();
            }

            // 4. Request account and password validation.
            // The capture shows these four fields sent as an application/x-www-form-urlencoded
            // body, so they go into the request entity rather than the query string.
            URI uri4 = new URIBuilder()
                    .setScheme("http")
                    .setHost(host)
                    .setPath(url4)
                    .build();
            List<NameValuePair> form = new ArrayList<>();
            // account
            form.add(new BasicNameValuePair("Login.Token1", "2014241032"));
            // password
            form.add(new BasicNameValuePair("Login.Token2", "**********"));
            form.add(new BasicNameValuePair("goto", "http://myportal.sxu.edu.cn/loginSuccess.portal"));
            form.add(new BasicNameValuePair("gotoOnFail", "http://myportal.sxu.edu.cn/loginFailure.portal"));
            HttpPost httpPost4 = new HttpPost(uri4);
            httpPost4.setEntity(new UrlEncodedFormEntity(form, StandardCharsets.UTF_8));
            ResponseHandler<Boolean> responseHandler4 = new ResponseHandler<Boolean>() {
                @Override
                public Boolean handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
                    try {
                        String s = EntityUtils.toString(response.getEntity());
                        System.out.println(s);
                        // The server's message for "user does not exist or wrong password"
                        if (s.contains("用户不存在或密码错误")) {
                            return false;
                        }
                        return true;
                    } catch (Exception e) {
                        return false;
                    }
                }
            };
            b = httpclient.execute(httpPost4, responseHandler4);
            if (b) {
                System.out.println("Login validated");
            }

            // 5. Request the home page
            URI uri5 = new URIBuilder()
                    .setScheme("http")
                    .setHost(host)
                    .setPath(url5)
                    .build();
            HttpGet httpGet5 = new HttpGet(uri5);
            ResponseHandler<Boolean> responseHandler5 = new ResponseHandler<Boolean>() {
                @Override
                public Boolean handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
                    try {
                        String s = EntityUtils.toString(response.getEntity());
                        //System.out.println(s);
                        // If the captcha field is still present, we were sent back to the login page
                        if (s.contains("<td class=\"STYLE1\">验证码:</td>")) {
                            return false;
                        }
                        return true;
                    } catch (Exception e) {
                        return false;
                    }
                }
            };
            b = httpclient.execute(httpGet5, responseHandler5);
            if (b) {
                System.out.println("Fetched the home page");
            } else {
                System.out.println("Failed to fetch the home page");
            }
        }
    }
}
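The ConnectionKeepAliveStrategy at the top of the listing reads the timeout parameter of the server's Keep-Alive header and falls back to 10 seconds. To show just that rule in isolation, here is a JDK-only sketch with plain string handling; KeepAliveDemo and keepAliveMillis are hypothetical names, and this simplified parser is an approximation of, not a replacement for, HttpClient's header element iterator.

```java
// Hypothetical helper class, for illustration only.
final class KeepAliveDemo {

    // Fallback used when the header is absent or carries no usable timeout.
    static final long DEFAULT_MILLIS = 10_000L;

    // Converts a Keep-Alive header value such as "timeout=5, max=100" to milliseconds.
    static long keepAliveMillis(String headerValue) {
        if (headerValue == null) {
            return DEFAULT_MILLIS;
        }
        for (String element : headerValue.split(",")) {
            String[] pair = element.trim().split("=", 2);
            if (pair.length == 2 && pair[0].trim().equalsIgnoreCase("timeout")) {
                try {
                    // The header expresses the timeout in seconds.
                    return Long.parseLong(pair[1].trim()) * 1000L;
                } catch (NumberFormatException ignore) {
                    // Malformed number: keep looking, then fall back to the default.
                }
            }
        }
        return DEFAULT_MILLIS;
    }

    public static void main(String[] args) {
        System.out.println(keepAliveMillis("timeout=5, max=100"));
        System.out.println(keepAliveMillis(null));
    }
}
```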

// Helper used to save the captcha image to a local file

package utils;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;

public class ImageUtils {

    /**
     * Reads an image stream fully into a byte[] and closes the stream.
     * @param inStream the stream to read
     * @return the bytes that were read
     * @throws Exception if reading fails
     */
    public static byte[] readImg(InputStream inStream) throws Exception {
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        // Temporary byte buffer
        byte[] buffer = new byte[1024];
        // Number of bytes returned by each read; -1 means the stream is exhausted
        int len;
        // try-with-resources closes the input stream even if a read fails
        try (InputStream in = inStream) {
            while ((len = in.read(buffer)) != -1) {
                // Append the first len bytes of the buffer to the in-memory output
                outStream.write(buffer, 0, len);
            }
        }
        // Return everything that was collected
        return outStream.toByteArray();
    }

    /**
     * Writes the image stream imgIs to the local file imgPath.
     * @param imgPath path of the file to write
     * @param imgIs the image stream
     * @throws Exception if reading or writing fails
     */
    public static void writeImg(String imgPath, InputStream imgIs) throws Exception {
        // Read the image as raw bytes, which works for any image format
        byte[] data = readImg(imgIs);
        // A relative path is resolved against the working directory (the project root by default)
        File imageFile = new File(imgPath);
        // Write the bytes; the output stream is closed automatically
        try (FileOutputStream outStream = new FileOutputStream(imageFile)) {
            outStream.write(data);
        }
    }
}
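As a side note, on Java 7 and later the same save-a-stream-to-disk job can be done with java.nio.file.Files.copy. The sketch below is one possible alternative, not what the crawler above uses; SaveImageDemo, save and roundTrip are names invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, for illustration only.
final class SaveImageDemo {

    // Copies the stream to the target file, overwriting it, and returns the byte count.
    static long save(InputStream in, Path target) {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the bytes to a temporary file, reads them back, and deletes the file.
    static byte[] roundTrip(byte[] data) {
        try {
            Path tmp = Files.createTempFile("captcha", ".jpg");
            try {
                save(new ByteArrayInputStream(data), tmp);
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in bytes; in the crawler the stream would come from the captcha response.
        byte[] copy = roundTrip(new byte[] {1, 2, 3});
        System.out.println(copy.length + " bytes written and read back");
    }
}
```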