Writing a crawler in Python is very easy, and there is now a mature crawling framework: scrapy. In this post we will use scrapy to crawl the songs in our own Xiami playlist. Through this post you will learn:
basic crawler design, simulated login, keeping the login session alive, and a little bit of XPath (only scratching the surface, admittedly). This post assumes you have already installed scrapy by following the official documentation (or its Chinese translation) and have run the tutorial example. The first step is to create the project: the command scrapy startproject xiami creates a scrapy project named xiami under the current directory.
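For reference, the generated project should look roughly like this (the exact layout can differ slightly between scrapy versions):

    xiami/
        scrapy.cfg          # deploy configuration
        xiami/              # the project's Python package
            __init__.py
            items.py        # item definitions
            pipelines.py    # item pipelines
            settings.py     # project settings
            spiders/        # spider modules go here
                __init__.py

We will touch items.py, pipelines.py, settings.py, and add one file under spiders/.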
Working back from the requirements, decide first what to crawl. To keep things simple we will grab the page title, the user's name, and the names of the songs in the playlist. The order of this write-up follows the conventions of the official scrapy documentation.
Start by modifying items.py. An Item is the container that holds the scraped data; it is scrapy's own data structure and, much like a dictionary, stores data as key-value pairs. Once the Item is defined, we can use scrapy's machinery to save the data to various databases or to local files. Later on we will save the crawled playlist to a local JSON file.
Open items.py; the default code looks like this:

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    # ...

    import scrapy


    class XiamiItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        pass

Add the field definitions:

    class XiamiItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()   # page title
        name = scrapy.Field()    # user name
        song = scrapy.Field()    # song name from the playlist

At this point we have essentially just created an empty dictionary and declared its keys.
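Since an Item behaves like a dictionary, a quick interactive session is an easy way to see what we just defined. The sketch below is not part of the project code; it assumes you start Python from the project's top-level directory so that the xiami package is importable:

    from xiami.items import XiamiItem

    item = XiamiItem()
    item['title'] = 'xiami_music'    # assign fields exactly like dictionary keys
    item['song'] = 'Some Song'
    print item['song']               # Some Song
    print dict(item)                 # prints the fields as an ordinary dict
    # item['artist'] = 'x'           # would raise KeyError: only declared fields are allowed

The last line is the main difference from a plain dict: assigning to a field that was not declared with scrapy.Field() raises a KeyError, which catches typos early.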
Next we need to define a spider to crawl the data and hand it over to the Item. The spiders folder in the project currently contains only an empty __init__.py file; this folder holds our custom spider modules, and every .py file created under it is a spider module (one that does nothing yet, of course). Create a new Python file, xiami_spider.py — this will be the spider that crawls the Xiami playlist. Then define its basic elements:
    from scrapy.http import Request
    from scrapy.spiders import CrawlSpider, Rule


    class XiamiSpider(CrawlSpider):
        name = "xiaoxia"                       # spider name: "xiaoxia" (little shrimp)
        allowed_domains = ["xiami.com"]
        start_urls = [
            "http://www.xiami.com"
        ]
        account_number = '9839****8@qq.com'    # replace with your own account
        password = '123456'                    # replace with your own password

        # start_requests is overridden, so the Spider class no longer creates
        # Requests for the URLs in start_urls automatically.
        def start_requests(self):
            return [Request("https://login.xiami.com/member/login",
                            meta={'cookiejar': 1},
                            callback=self.post_login)]

In this new class we inherit from CrawlSpider rather than the plain Spider. CrawlSpider is a subclass of Spider, so it comes with extra functionality and lets us write a lot less code.

name defines the spider's name; we will need it later when we run the crawl. allowed_domains defines the scope of the crawl: if we did not restrict it to pages on the Xiami site, a spider that keeps following new links found in pages could end up crawling a lot of irrelevant content from unrelated sites. start_urls defines the initial URLs.

For a spider, the crawl loop goes roughly like this: the initial requests are obtained by calling start_requests(), which by default reads the URLs in start_urls, builds Requests with parse as the callback, and hands them to the engine. When a request has been downloaded, a response is generated and passed to that callback. Inside the callback you analyse the returned (page) content and return Item objects, Requests, or an iterable containing both. Returned Requests are processed by Scrapy in turn, which downloads the corresponding content and calls the configured callback (callbacks may be shared). Inside the callback you can use Selectors (or BeautifulSoup, lxml, or any other parser you prefer) to analyse the page and build items from the extracted data. Finally, the items returned by the spider are stored in a database (handled by some Item Pipeline) or written to a file using Feed exports.

Putting it all together, the complete xiami_spider.py looks like this:

    # coding=utf-8
    from scrapy.selector import Selector
    from scrapy.http import Request, FormRequest, HtmlResponse
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    from xiami.items import XiamiItem


    class XiamiSpider(CrawlSpider):
        name = "xiaoxia"                       # spider name: "xiaoxia" (little shrimp)
        allowed_domains = ["xiami.com"]
        start_urls = [
            "http://www.xiami.com"
        ]
        account_number = '9839****8@qq.com'    # replace with your own account
        password = '123456'                    # replace with your own password

        headers = {
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.8",
            "Connection": "keep-alive",
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
            "Referer": "https://login.xiami.com/member/login?spm=a1z1s.6843761.226669498.1.2iL1jx"
        }

        rules = (
            Rule(LinkExtractor(allow=('/space/lib-song',)), callback='parse_page', follow=True),
        )

        # start_requests is overridden, so the Spider class no longer creates
        # Requests for the URLs in start_urls automatically.
        def start_requests(self):
            return [Request("https://login.xiami.com/member/login",
                            meta={'cookiejar': 1},
                            callback=self.post_login)]

        # Post the login form
        def post_login(self, response):
            print 'Preparing login'
            # Grab the _xiamitoken field from the returned login page;
            # it must be sent back for the form submission to succeed.
            _xiamitoken = Selector(response).xpath('//input[@name="_xiamitoken"]/@value').extract_first()
            print 'login token: ', _xiamitoken
            # FormRequest.from_response is a helper scrapy provides for posting forms.
            # After a successful login, the after_login callback is invoked.
            return [FormRequest.from_response(response,
                                              meta={'cookiejar': response.meta['cookiejar']},
                                              headers=self.headers,
                                              formdata={
                                                  'source': 'index_nav',
                                                  '_xiamitoken': _xiamitoken,
                                                  'email': self.account_number,
                                                  'password': self.password
                                              },
                                              callback=self.after_login,
                                              dont_filter=True)]

        def after_login(self, response):
            print 'after login======='
            for url in self.start_urls:
                yield Request(url, meta={'cookiejar': response.meta['cookiejar']})

        # Build the follow-up Requests (overridden from CrawlSpider so the
        # cookiejar is carried along and the login session is kept alive)
        def _requests_to_follow(self, response):
            if not isinstance(response, HtmlResponse):
                return
            seen = set()
            for n, rule in enumerate(self._rules):
                links = [lnk for lnk in rule.link_extractor.extract_links(response)
                         if lnk not in seen]
                if links and rule.process_links:
                    links = rule.process_links(links)
                for link in links:
                    seen.add(link)
                    r = Request(url=link.url, callback=self._response_downloaded)
                    # overridden: pass the cookiejar on to every followed request
                    r.meta.update(rule=n, link_text=link.text, cookiejar=response.meta['cookiejar'])
                    yield rule.process_request(r)

        def parse_page(self, response):
            mysong_list = Selector(response)
            songs = mysong_list.xpath('//td[@class="song_name"]/a/@title').extract()
            print songs[0]
            for song in songs:
                item = XiamiItem()
                item['title'] = 'xiami_music'
                item['name'] = self.account_number
                item['song'] = song
                yield item
            # nexturl = mysong_list.xpath('//a[@class="p_redirect_l"]/@href').extract_first()
            # yield self.make_requests_from_url(nexturl)
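The only page-specific piece here is the XPath expression in parse_page. If you want to check an expression like this without running the whole crawl (the song-list page needs a login, so scrapy shell is not convenient here), you can feed a Selector some HTML directly. The fragment below is made up and only mimics the structure of Xiami's song-list table, but the expression is the same one used above:

    # coding=utf-8
    from scrapy.selector import Selector

    # A made-up fragment that mimics the song-list markup on the page.
    html = u'''
    <table>
      <tr><td class="song_name"><a title="Song A" href="/song/1">Song A</a></td></tr>
      <tr><td class="song_name"><a title="Song B" href="/song/2">Song B</a></td></tr>
    </table>
    '''

    # Every @title attribute of an <a> sitting inside a <td class="song_name">.
    print Selector(text=html).xpath('//td[@class="song_name"]/a/@title').extract()
    # [u'Song A', u'Song B']

The same kind of experiment can be done interactively with scrapy shell on any page that does not require a login.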
Next comes pipelines.py. The pipeline below drops duplicate songs and writes each remaining item to a local JSON-lines file:

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import json
    import codecs

    from scrapy.exceptions import DropItem


    class XiamiPipeline(object):
        def __init__(self):
            self.song_seen = set()
            self.file = codecs.open('xiamisongs.jl', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            """
            Called for every item by every pipeline component. It must return
            an Item (or any subclass) object or raise DropItem; a dropped item
            is not processed by the pipelines that follow.
            :param item:
            :param spider:
            :return:
            """
            # Filtering out incomplete data would look like:
            # if ...:
            #     return item
            # else:
            #     raise DropItem('reason')
            if spider.name == 'xiaoxia':
                if item['song'] in self.song_seen:
                    raise DropItem('Duplicate song found: %s' % item['song'])
                else:
                    self.song_seen.add(item['song'])
                # Save to a JSON-lines file (optional)
                line = json.dumps(dict(item), ensure_ascii=False) + '\n'
                self.file.write(line)
                return item

        def close_spider(self, spider):
            print 'spider close'
            self.file.close()
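One detail worth calling out is ensure_ascii=False: without it, Chinese song titles would be written to the file as \uXXXX escapes instead of readable text. A tiny stand-alone check (the values here are made up) shows the difference:

    # coding=utf-8
    import json

    item = {'title': 'xiami_music', 'name': 'you@example.com', 'song': u'示例歌曲'}

    print json.dumps(item)                      # "song" comes out as \u793a\u4f8b... escapes
    print json.dumps(item, ensure_ascii=False)  # "song" stays as the original characters

Because the characters are no longer plain ASCII, the file is opened with codecs.open(..., encoding='utf-8') in the pipeline above.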
Then settings.py. Most of it is the default template; the relevant changes are ROBOTSTXT_OBEY = False, a small DOWNLOAD_DELAY, COOKIES_ENABLED = True, and registering the pipeline in ITEM_PIPELINES:

    # -*- coding: utf-8 -*-

    # Scrapy settings for xiami project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     http://doc.scrapy.org/en/latest/topics/settings.html
    #     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'xiami'

    SPIDER_MODULES = ['xiami.spiders']
    NEWSPIDER_MODULE = 'xiami.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'xiami (+http://www.xiami.com)'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 0.25
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    COOKIES_ENABLED = True

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}

    # Enable or disable spider middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'xiami.middlewares.MyCustomSpiderMiddleware': 543,
    #}

    # Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'xiami.middlewares.MyCustomDownloaderMiddleware': 543,
    #}

    # Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'xiami.pipelines.XiamiPipeline': 300,   # 0-1000 controls the running order
    }

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Finally, a small script to start the crawl:

    #!/usr/bin/python
    # coding=utf-8
    # Script that kicks off the crawl
    from scrapy.cmdline import execute

    # execute('scrapy crawl xiaoxia'.split())
    execute('scrapy crawl xiaoxia -o xiamisongs.jl'.split())

This is still a draft; I will revise and polish it next week.
