This post walks through a Python crawler for scraping Weibo posts (the same approach works for images). Starting from a given seed user id, it collects the users that person follows and keeps crawling outward until a stopping condition is met.
I. Project structure:
1. main.py — the main crawl loop and overall control flow.
2. url_manager.py — manages the URL queue.
3. html_parser.py — bundles the page downloader, page parser, and post saver into one module. (They really ought to be separate modules, but I combined them here for convenience.)
II. Program walkthrough:
1. Main module main.py:
Code:
from craw4weibo import html_parser, url_manager


class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.parser = html_parser.HtmlParser()

    def craw(self, uid):
        count = 1
        root_url = 'https://weibo.cn/%s' % uid
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print 'craw %d : %s' % (count, new_url)
                # download the page, save its posts, and collect new URLs
                new_urls = self.parser.parse(new_url)
                self.urls.add_new_urls(new_urls)
                if count == 1:      # stop after the first user
                    break
                count = count + 1
            except:
                print "craw failed"


if __name__ == "__main__":
    root_uid = "#Your root id"
    spider = SpiderMain()
    spider.craw(root_uid)

The craw function in the SpiderMain class follows this logic:
1. Build the starting URL from the seed user id and add it to the URL manager.
2. While the new_url container in the URL manager is not empty, stay in the while loop.
3. Take a new url out of new_url and pass it to the parse function in html_parser. parse downloads and parses that page, saves the posts we want, and returns the newly discovered urls.
4. Add the urls returned in step 3 to the URL manager and check whether the stopping condition has been reached. Here I only crawl a single user, so count is capped at 1 (see the sketch after this list for a more flexible limit).
5. Repeat from step 2 until the while loop exits.
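If you later want to crawl more than one user, the hard-coded if count == 1: break can be replaced with a limit supplied by the caller. A minimal sketch, assuming a new max_users parameter that is my own addition and not part of the original code:

    # drop-in replacement for SpiderMain.craw shown above
    def craw(self, uid, max_users=1):
        count = 0
        root_url = 'https://weibo.cn/%s' % uid
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url() and count < max_users:
            try:
                new_url = self.urls.get_new_url()
                new_urls = self.parser.parse(new_url)   # saves posts, returns followee URLs
                self.urls.add_new_urls(new_urls)
                count += 1
            except Exception as e:
                print "craw failed:", e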
2. URL manager url_manager.py:
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
The functionality is simple: store new and old URLs and hand them out one at a time. The one thing to watch is that when adding a new URL you must check whether it already exists in either the unused-URL set or the used-URL set; if it does, the URL is ignored. A quick usage example of this deduplication is sketched below.
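A minimal usage sketch of UrlManager; the uid in the URL is just a placeholder:

    from craw4weibo import url_manager

    manager = url_manager.UrlManager()
    manager.add_new_url('https://weibo.cn/1234567890')
    manager.add_new_url('https://weibo.cn/1234567890')   # duplicate, silently ignored

    url = manager.get_new_url()     # moves the URL into old_urls
    manager.add_new_url(url)        # already in old_urls, ignored again
    print manager.has_new_url()     # False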
3. Page downloading, parsing, saving, and new-URL extraction — html_parser.py:
import sys
import os
import requests
import time
from lxml import etree


class HtmlParser(object):
    def _save_user_data(self, url):
        print url
        cookie = "#Your Cookie"      # fill in as a dict, see the note below
        headers = "#Your header"     # fill in as a dict, see the note below

        # fetch the user's home page with requests
        response = requests.get(url, cookies=cookie, headers=headers)
        print "code:", response.status_code
        html_cont = response.content

        # extract the current user_id from the URL
        res_url = url.split("/")
        num = len(res_url)
        user_id = res_url[num - 1]

        selector = etree.HTML(html_cont)
        # total number of pages of posts, taken from the paging input
        pageNum = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])
        print "page", pageNum

        result = ""
        word_count = 1
        times = 5
        one_step = pageNum / times
        for step in range(times):
            if step < times - 1:
                i = step * one_step + 1
                j = (step + 1) * one_step + 1
            else:
                i = step * one_step + 1
                j = pageNum + 1
            for page in range(i, j):
                try:
                    # download one page of posts (filter=1 keeps original posts only)
                    url = 'https://weibo.cn/%s?filter=1&page=%d' % (user_id, page)
                    lxml = requests.get(url, cookies=cookie, headers=headers).content
                    selector = etree.HTML(lxml)
                    content = selector.xpath('//span[@class="ctt"]')
                    for each in content:
                        text = each.xpath('string(.)')
                        # number entries starting from the third matched span
                        if word_count >= 3:
                            text = "%d: " % (word_count - 2) + text + "\n"
                        else:
                            text = text + "\n\n"
                        result = result + text
                        word_count += 1
                    print 'getting', page, 'page ok!'
                    sys.stdout.flush()
                except:
                    print page, 'error'
                print page, 'sleep'
                sys.stdout.flush()
                time.sleep(3)
            print 'round', step + 1, 'finished, pausing'
            time.sleep(10)

        try:
            # save all posts of this user to <user_id>.txt
            file_name = "%s.txt" % user_id
            fo = open(file_name, "wb")
            fo.write(result.encode('utf-8'))
            fo.close()
            print 'finished saving posts'
        except:
            print 'cannot write file'
        sys.stdout.flush()

    def _get_new_urls(self, url):
        new_urls = set()
        res_url = url.split("/")
        num = len(res_url)
        user_id = res_url[num - 1]
        cookie = "#Your Cookie"
        headers = "#Your header"

        # the user's follow list
        url = 'https://weibo.cn/%s/follow' % user_id
        response = requests.get(url, cookies=cookie, headers=headers).content
        selector = etree.HTML(response)
        pageNum = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])

        times = 5
        one_step = pageNum / times
        for step in range(times):
            if step < times - 1:
                i = step * one_step + 1
                j = (step + 1) * one_step + 1
            else:
                i = step * one_step + 1
                j = pageNum + 1
            for page in range(i, j):
                try:
                    url = 'https://weibo.cn/%s/follow?page=%d' % (user_id, page)
                    lxml = requests.get(url, cookies=cookie, headers=headers).content
                    selector = etree.HTML(lxml)
                    # note: no tbody in the path, see the XPath note below
                    content = selector.xpath('/html/body/table/tr/td[2]/a[1]')
                    for c in content:
                        temp_url = c.attrib['href']
                        if temp_url is not None:
                            new_urls.add(temp_url)
                    print 'getting follow', page, 'page ok!'
                    sys.stdout.flush()
                except:
                    print page, 'error'
                print page, 'sleep'
                sys.stdout.flush()
                time.sleep(3)
            print 'round', step + 1, 'finished, pausing'
            time.sleep(10)
        return new_urls

    def parse(self, url):
        if url is None:
            return
        print "saving"
        self._save_user_data(url)
        print "getting"
        new_urls = self._get_new_urls(url)
        print "finishing"
        return new_urls

The two main jobs here are saving the current user's posts and collecting the users they follow.
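One thing the "#Your Cookie" / "#Your header" placeholders hide is that requests expects cookies and headers as dicts, not strings. A minimal sketch of the shape I fill in; the cookie key 'SUB', the User-Agent string, and the uid in the URL are only illustrative examples, copy your own values out of a logged-in browser session:

    import requests

    # illustrative values only -- take them from your own logged-in browser session
    cookie = {'SUB': '<your weibo.cn session cookie value>'}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0 Safari/537.36',
    }

    response = requests.get('https://weibo.cn/1234567890', cookies=cookie, headers=headers)
    print response.status_code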
1. Notes on the post-saving code:
The download step is straightforward: the requests library sends requests disguised as a browser and fetches the post pages. All pages are downloaded and saved in 5 batches, so the page range for each batch has to be computed; a short worked example of that split follows.
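The split works like this: the first times - 1 batches each cover one_step pages, and the last batch picks up whatever remains. A standalone sketch of the same arithmetic as the code above, with an assumed pageNum of 13:

    pageNum = 13                  # assumed total page count, for illustration only
    times = 5
    one_step = pageNum // times   # 2 pages per batch (integer division)

    for step in range(times):
        i = step * one_step + 1
        if step < times - 1:
            j = (step + 1) * one_step + 1
        else:
            j = pageNum + 1       # last batch takes the remaining pages
        print "batch %d covers pages %d..%d" % (step + 1, i, j - 1)

    # prints: 1..2, 3..4, 5..6, 7..8, and finally 9..13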
Two things deserve closer attention:
1). Sleep intervals: Weibo's anti-crawler checks are quite sensitive these days; if you request too frequently the cookie expires quickly and you start getting 403 responses, so sleep intervals are necessary. Setting them to 60 and 300 seconds works, but it is very slow: sleeping one minute per page and five minutes per round means a user with 900+ pages takes a very long time. I have not solved this problem yet.
2). XPath quirks with selector: the XPath you copy out of Chrome's F12 developer tools has been adjusted by Chrome and may give you empty matches when you feed it to the lxml selector. For example:
content = selector.xpath('/html/body/table/tr/td[2]/a[1]')
In Chrome this path is shown as .../table/tbody/tr/..., which matches nothing against the raw HTML that weibo.cn actually serves, so the tbody part has to be deleted (a small demonstration is sketched below). For more on XPath, see the 菜鸟教程 (runoob) XPath tutorial.
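A quick way to convince yourself of the tbody issue, using a tiny hand-written HTML snippet rather than a real Weibo page:

    from lxml import etree

    # raw HTML as served: the table contains no tbody element
    html = '<html><body><table><tr><td>a</td><td><a href="/u/1">link</a></td></tr></table></body></html>'
    selector = etree.HTML(html)

    # the path Chrome shows (with tbody) finds nothing...
    print selector.xpath('/html/body/table/tbody/tr/td[2]/a[1]')   # []
    # ...while the same path without tbody matches the link
    print selector.xpath('/html/body/table/tr/td[2]/a[1]')         # [<Element a>]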
One final reminder: if you get a 403 Forbidden, or the crawler keeps printing page error, or you hit an index-out-of-range error while reading pageNum (the xpath returns an empty list, so indexing [0] fails), it usually means your cookie has expired or Sina has temporarily blocked your IP.
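A minimal sketch of how such failures could be detected up front, before the crawl silently produces empty files. The helper name check_login and the exact symptoms it tests (a 403 status, or the paging input missing because we were bounced to the login page) are my own assumptions, not part of the original code:

    import time
    import requests
    from lxml import etree

    def check_login(user_id, cookie, headers):
        """Return True if the cookie still looks valid for weibo.cn."""
        response = requests.get('https://weibo.cn/%s' % user_id,
                                cookies=cookie, headers=headers)
        if response.status_code == 403 or not response.content:
            return False                      # blocked, banned, or empty reply
        selector = etree.HTML(response.content)
        # assume the paging input disappears when redirected to the login page
        return len(selector.xpath('//input[@name="mp"]')) > 0

    # usage sketch: back off and retry a few times before giving up
    # for attempt in range(3):
    #     if check_login(user_id, cookie, headers):
    #         break
    #     time.sleep(300)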
References:
1. The imooc web crawler course
2. "Python crawler: scraping a given user's Weibo images and posts, with post classification, usage-habit analysis, and visualization charts"
Many thanks to both!
The full source code for this post can be downloaded from GitHub.
P.S. If anything in this post is wrong or unclear, corrections are welcome.