A Simple Java Crawler Exercise: Three Crawling Methods for Beginners


This is a path every Java crawler beginner walks: the simplest ways to fetch web page content in Java. Naturally, these methods only handle a limited range of pages; sites that require login call for more complex handling, which I will not cover here. This post is written for beginners, and I hope it is of some help to people just starting to learn Java crawling.

1. Fetching page content with URLConnection:

Steps:
1. Build the URL.
2. Open the HTTP connection.
3. Get the response status code.
4. Read back the page content according to the status code.

Code:

package com.soft.crawler;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection; // use the public JDK class, not sun.net.www.protocol.http.HttpURLConnection
import java.net.URL;

public class Crawler {
    public static void main(String[] args) {
        String r;
        try {
            // 1. Build a URL object for the page to visit
            URL url = new URL("http://www.sina.com.cn");
            // 2. Open the HTTP connection and get the connection object
            HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
            // 3. Get the HTTP response status code
            int responseCode = urlConnection.getResponseCode();
            // 4. On success, read the page source from the connection's input stream
            if (responseCode == 200) {
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(urlConnection.getInputStream(), "utf-8"));
                while ((r = reader.readLine()) != null) {
                    System.out.println(r);
                }
            } else {
                System.out.println("Could not get the page source; server responded with code: " + responseCode);
            }
        } catch (Exception e) {
            System.out.println("Could not get the page source: " + e);
        }
    }
}
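In practice, some servers return 403 for requests that carry no User-Agent, and a stalled connection will block readLine() indefinitely. Below is a minimal sketch of how the same URLConnection approach could be hardened with a request header and timeouts; the User-Agent string and the timeout values are illustrative assumptions, not values from the original post:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CrawlerWithHeaders {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.sina.com.cn");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Some servers reject clients without a User-Agent; this value is an assumption
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (crawler-exercise)");
        // Fail fast instead of blocking forever (milliseconds; numbers are assumptions)
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        if (conn.getResponseCode() == 200) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "utf-8"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
        conn.disconnect();
    }
}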

2. Fetching page content with HttpClient:

Steps:

// Create a client, similar to opening a browser
HttpClient httpClient = new HttpClient();
// Create a GET method, similar to typing an address into the browser; path holds the URL
GetMethod getMethod = new GetMethod(path);
// Get the response status code
int statusCode = httpClient.executeMethod(getMethod);
// Get the response body
String result = getMethod.getResponseBodyAsString();
// Release the connection
getMethod.releaseConnection();

Code:

import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public class Crawler {
    private static HttpClient httpClient = new HttpClient();
    static GetMethod getmethod;

    public static boolean downloadPage(String path) throws HttpException, IOException {
        getmethod = new GetMethod(path);
        // Get the response status code
        int statusCode = httpClient.executeMethod(getmethod);
        if (statusCode == HttpStatus.SC_OK) {
            System.out.println("response=" + getmethod.getResponseBodyAsString());
            // Write the page to a local file
            FileWriter fwrite = new FileWriter("hello.txt");
            String pageString = getmethod.getResponseBodyAsString();
            // Release the connection
            getmethod.releaseConnection();
            fwrite.write(pageString, 0, pageString.length());
            fwrite.flush();
            // Close the file
            fwrite.close();
            return true;
        }
        return false;
    }

    /**
     * Test code
     */
    public static void main(String[] args) {
        // Fetch the specified page and print it
        try {
            Scanner in = new Scanner(System.in);
            System.out.println("Input the URL of the page you want to get:");
            String path = in.next();
            System.out.println("Program start!");
            Crawler.downloadPage(path);
            System.out.println("Program end!");
        } catch (HttpException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
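A side note: org.apache.commons.httpclient (Commons HttpClient 3.x) is an old library that was long ago superseded by Apache HttpComponents. As an alternative, not part of the original post, here is a sketch of the same download using the java.net.http.HttpClient that ships with JDK 11+, so no external jar is needed; the class name and the example URL are my assumptions:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class ModernDownloader {
    public static void main(String[] args) throws Exception {
        String path = "http://www.sina.com.cn"; // illustrative URL (assumption)
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(path)).GET().build();
        // Read the whole response body as a String
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 200) {
            System.out.println("response=" + response.body());
            // Write the page to a local file, mirroring the hello.txt step above
            Files.writeString(Path.of("hello.txt"), response.body());
        }
    }
}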

3. Fetching page content with Jsoup:

package com.soft.test;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Spider {
    public static void main(String[] args) throws IOException {
        String url = "http://www.baidu.com";
        Document document = Jsoup.connect(url).timeout(3000).get();
        // Use Document's select method to get the matching element collection (here, all <a> tags)
        Elements elements = document.select("a");
        // Get the first element in the collection
        Element element = elements.get(0);
        System.out.println(element);
    }
}
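Jsoup's real strength is its CSS selector support: once you have a Document, you can pull out exactly the elements you care about instead of printing raw HTML. As a follow-up sketch (the class name LinkExtractor is mine, not from the original post), here is how you might list every link's text and absolute URL from the same page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        Document document = Jsoup.connect("http://www.baidu.com").timeout(3000).get();
        // Select only anchors that actually carry an href attribute
        for (Element link : document.select("a[href]")) {
            // "abs:href" resolves relative links against the page's base URL
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }
    }
}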

Please credit the original source when reposting: https://www.6miu.com/read-51107.html
