```
Accept: text/plain
Accept-Charset: utf-8
Accept-Encoding: gzip, deflate
Accept-Language: en-US
Connection: keep-alive
Content-Length: 348
Content-Type: application/x-www-form-urlencoded
Date: Tue, 15 Nov 1994 08:12:31 GMT
Host: en.wikipedia.org:80
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36
Cookie: CP=H2; WMF-Last-Access=10-Jul-2017; WMF-Last-Access-Global=10-Jul-2017; GeoIP=HK:HCW:Hong_Kong:22.28:114.15:v4
```
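To see how request headers like these are set in code, here is a minimal sketch in the same Python 2 / urllib2 style as the crawler later in this article. The URL and the header values are placeholders for illustration only.

```python
# Minimal sketch (Python 2 / urllib2): send a GET request carrying custom headers.
# The URL and header values are placeholders, not taken from any real capture.
import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    'Accept-Language': 'en-US',
}

req = urllib2.Request('http://en.wikipedia.org/', headers=headers)
response = urllib2.urlopen(req)
print response.getcode()   # status code of the response, e.g. 200
```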
Headers worth noting in the HTTP RESPONSE:

```
Accept-Patch: text/example;charset=utf-8
Cache-Control: max-age=3600
Content-Encoding: gzip
Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT
Content-Language: da
Content-Length: 348
ETag: "737060cd8c284d8af7ad3082f209582d"
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Location: http://www.w3.org/pub/WWW/People.html
Set-Cookie: UserID=JohnDoe; Max-Age=3600; Version=1
Status: 200 OK
```
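Continuing the same sketch, the response headers can be read from the object that urlopen returns; the URL below is again a placeholder.

```python
# Minimal sketch (Python 2 / urllib2): inspect response headers.
import urllib2

response = urllib2.urlopen('http://en.wikipedia.org/')   # placeholder URL
info = response.info()                    # message object holding the headers
print info.getheader('Content-Type')      # e.g. text/html; charset=UTF-8
print info.getheader('Cache-Control')     # None if the header is absent
for line in info.headers:                 # raw "Name: value" header lines
    print line.strip()
```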
Capturing requests with the Postman extension for Chrome. The capture tools I use most often are Firebug (which seems to be no longer maintained), Postman, and Wireshark.
Postman's advantages are that it is simple to use, can auto-generate request code, and can forge and resend requests.
After installing the Postman extension for Chrome, open the Inspector and click the request you want to examine; the request/response data appears in the panel on the right:
The "Code" button in the right-hand panel generates request code:
Every HTTP request returns a status code.
- 2XX: success
- 3XX: redirection
- 4XX: client error
- 5XX: server error

- 400 Bad Request: the request has a syntax error and cannot be understood by the server
- 401 Unauthorized: the request is not authorized; this status code must be used together with the WWW-Authenticate header field
- 403 Forbidden: the server received the request but refuses to serve it
- 404 Not Found: the requested resource does not exist, e.g. a wrong URL was entered
- 500 Internal Server Error: the server hit an unexpected error
- 503 Service Unavailable: the server cannot handle the request right now; it may recover after a while
- 300 Multiple Choices: multiple resources are available; the response can be handled or discarded
- 301 Moved Permanently: permanent redirect
- 302 Found: redirect
- 304 Not Modified: the requested resource has not changed; the response can be discarded
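With urllib2, which the crawler below uses, a non-2XX status surfaces as an HTTPError whose code attribute holds the status code. A small sketch of branching on it (the URL is a placeholder):

```python
# Minimal sketch (Python 2 / urllib2): branch on the HTTP status code.
import urllib2

try:
    response = urllib2.urlopen('http://www.example.com/some-page')  # placeholder URL
    print 'status:', response.getcode()        # 2XX (and followed redirects) land here
except urllib2.HTTPError as e:
    if e.code == 404:
        print 'resource does not exist'
    elif e.code >= 500:
        print 'server-side error, maybe retry later:', e.code
    else:
        print 'client-side error:', e.code
except urllib2.URLError as e:
    print 'failed to reach the server:', e.reason
```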
As for whether breadth-first or depth-first crawling is better, the answer is: it depends.

In general, breadth-first works better, because:
- Important pages tend to be close to the seed nodes.
- Breadth-first makes it easier for multiple crawlers to work concurrently.
- The web is usually not very deep, and there are many paths that lead to a given page.

Of course, the standard answer is: combine breadth-first and depth-first.
Example code for breadth-first crawling:
```python
import urllib2
from collections import deque
import json
from lxml import etree
import httplib
import hashlib
from pybloomfilter import BloomFilter


class CrawlBSF:
    request_headers = {
        'host': "www.mafengwo.cn",
        'connection': "keep-alive",
        'cache-control': "no-cache",
        'upgrade-insecure-requests': "1",
        'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36",
        'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        'accept-language': "zh-CN,en-US;q=0.8,en;q=0.6"
    }

    cur_level = 0
    max_level = 5
    dir_name = 'iterate/'
    iter_width = 50
    downloaded_urls = []

    du_md5_file_name = dir_name + 'download.txt'
    du_url_file_name = dir_name + 'urls.txt'

    # Bloom filter used to deduplicate URLs that have already been downloaded
    download_bf = BloomFilter(1024 * 1024 * 16, 0.01)

    cur_queue = deque()     # URLs at the current depth
    child_queue = deque()   # URLs discovered for the next depth

    def __init__(self, url):
        self.root_url = url
        self.cur_queue.append(url)
        self.du_file = open(self.du_url_file_name, 'a+')
        try:
            # Reload the MD5 digests of previously downloaded URLs into the Bloom filter
            self.dumd5_file = open(self.du_md5_file_name, 'r')
            self.downloaded_urls = self.dumd5_file.readlines()
            self.dumd5_file.close()
            for urlmd5 in self.downloaded_urls:
                self.download_bf.add(urlmd5[:-2])   # strip the trailing '\r\n'
        except IOError:
            print "File not found"
        finally:
            self.dumd5_file = open(self.du_md5_file_name, 'a+')

    def enqueueUrl(self, url):
        self.child_queue.append(url)

    def dequeuUrl(self):
        try:
            url = self.cur_queue.popleft()
            return url
        except IndexError:
            # Current level exhausted: go one level deeper and swap in the child queue
            self.cur_level += 1
            if self.cur_level == self.max_level:
                return None
            if len(self.child_queue) == 0:
                return None
            self.cur_queue = self.child_queue
            self.child_queue = deque()
            return self.dequeuUrl()

    def getpagecontent(self, cur_url):
        print "downloading %s at level %d" % (cur_url, self.cur_level)
        try:
            req = urllib2.Request(cur_url, headers=self.request_headers)
            response = urllib2.urlopen(req)
            html_page = response.read()
            filename = cur_url[7:].replace('/', '_')
            fo = open("%s%s.html" % (self.dir_name, filename), 'wb+')
            fo.write(html_page)
            fo.close()
        except urllib2.HTTPError, Arguments:
            print Arguments
            return
        except httplib.BadStatusLine:
            print 'BadStatusLine'
            return
        except IOError:
            print 'IO Error at ' + cur_url
            return
        except Exception, Arguments:
            print Arguments
            return

        # Record the downloaded URL: MD5 digest into the Bloom filter and the log files
        dumd5 = hashlib.md5(cur_url).hexdigest()
        self.downloaded_urls.append(dumd5)
        self.dumd5_file.write(dumd5 + '\r\n')
        self.du_file.write(cur_url + '\r\n')
        self.download_bf.add(dumd5)

        # Extract all links on the page and enqueue the ones not seen before
        html = etree.HTML(html_page.lower().decode('utf-8'))
        hrefs = html.xpath(u"//a")

        for href in hrefs:
            try:
                if 'href' in href.attrib:
                    val = href.attrib['href']
                    if val.find('javascript:') != -1:
                        continue
                    if val.startswith('http://') is False:
                        if val.startswith('/'):
                            val = 'http://www.mafengwo.cn' + val
                        else:
                            continue
                    if val[-1] == '/':
                        val = val[0:-1]
                    if hashlib.md5(val).hexdigest() not in self.download_bf:
                        self.enqueueUrl(val)
                    else:
                        print 'Skip %s' % (val)
            except ValueError:
                continue

    def start_crawl(self):
        while True:
            url = self.dequeuUrl()
            if url is None:
                break
            self.getpagecontent(url)
        self.dumd5_file.close()
        self.du_file.close()


crawler = CrawlBSF("http://www.mafengwo.cn")
crawler.start_crawl()
```

Because the links contained in a page may already have been crawled, the queue of URLs waiting to be crawled needs to be deduplicated. Consider the following strategies:
- Database: save visited URLs in a database. Far too slow.
- HashSet: keep visited URLs in a HashSet, so checking whether a URL has been visited costs close to O(1). Fast, but memory-hungry (fine if memory is no object).
- One-way hash + HashSet: hash each URL with MD5, SHA-1, or another one-way hash first, then store the digest in the HashSet or database.
- Bit-Map: build a BitSet and map each URL to a single bit through a hash function.
- Redis: in real projects, a store such as Redis is used: the URL is one-way hashed and saved, and the check decides whether to crawl it (a small sketch follows below).
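A hedged sketch of the Redis variant, assuming the redis-py client and a local Redis instance; the key name visited_urls is made up for this example. It relies on Redis's set type: SADD returns 1 only when the digest was not already present, so one round trip both records the URL and answers "seen before?".

```python
# Minimal sketch of MD5 + Redis deduplication (Python 2, redis-py assumed installed).
# Host/port and the key name 'visited_urls' are placeholders for this example.
import hashlib
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

def should_crawl(url):
    digest = hashlib.md5(url).hexdigest()          # one-way hash of the URL
    # SADD returns 1 if the digest was newly added, 0 if it was already in the set.
    return r.sadd('visited_urls', digest) == 1

if should_crawl('http://www.mafengwo.cn/'):
    print 'not seen before, crawl it'
else:
    print 'already crawled, skip'
```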