Python Web Crawling: Using Scrapy


1. What is Scrapy
Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages.

2. Installing Scrapy
Scrapy can be installed with pip; installing it through Anaconda instead avoids most of the usual installation pitfalls.
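Either command below installs Scrapy; the conda route pulls in pre-built dependencies, which is what sidesteps most build problems (a minimal sketch, assuming pip or Anaconda is already set up):

pip install scrapy
# or, with Anaconda:
conda install -c conda-forge scrapy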

3. Common basic commands

Create a Scrapy project: scrapy startproject pro
Enter the project directory before creating a spider: cd pro
Create a spider named demo from the basic template, limited to the domain baidu.com: scrapy genspider -t basic demo baidu.com
Create a spider named demo1 from the crawl template (an automatic crawler that follows every link on a page and crawls each linked page in turn), limited to baidu.com: scrapy genspider -t crawl demo1 baidu.com
Choosing between the two templates: use the basic template for targeted (precise) crawling and the crawl template for generic, follow-all-links crawling.
Run the demo spider: scrapy crawl demo
For the remaining commands, run scrapy --help to see the usage information. A typical end-to-end session is sketched below.
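Putting those commands together, a session for the project used in the rest of this article looks roughly like this (the project and spider names are the ones from the examples below):

scrapy startproject FirstScrapy
cd FirstScrapy
scrapy genspider -t basic FirstSpider baidu.com
scrapy crawl FirstSpider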

4. XPath expressions

#!/usr/bin/python
# -*- coding: utf-8 -*-
# __author__ = 'ShenJun'
'''
XPath expressions:
/                                : start from the root and walk down one level at a time
text()                           : extract the text content of a tag
/html/head/title/text()          : starting from the root, get the text inside html/head/title
@                                : extract the value of an attribute
//                               : match tags anywhere in the current page
//li                             : match all li tags
//li[@class="hidden-xs"]/a/@href : for li tags whose class is "hidden-xs", get the href attribute of their child a tags
'''
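XPath expressions are easiest to try out interactively with scrapy shell; a minimal sketch (the selectors are just the examples from the note above and may return empty lists on other pages):

scrapy shell "https://www.baidu.com"
# then, inside the shell:
response.xpath("/html/head/title/text()").extract()            # page title text
response.xpath('//li[@class="hidden-xs"]/a/@href').extract()   # href attribute values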

5. Spider examples

Crawling Baidu's page title. Once the project and spider file have been created on the command line, the crawling logic lives mainly in the spider file under spiders/, here FirstSpider.py. items.py defines the containers that hold the scraped data, and pipelines.py handles post-processing of the scraped items, for example writing them to a database or to disk (this file is optional and the logic could stay in the spider itself, but using it is recommended to keep the code loosely coupled). settings.py holds the project configuration: when pipelines.py is used, the pipeline must be enabled in settings.py or it will not run, and ROBOTSTXT_OBEY needs to be set to False. The other files usually do not need to be touched.
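For reference, scrapy startproject FirstScrapy generates the standard layout sketched below; the spider file itself is added afterwards by scrapy genspider:

FirstScrapy/
    scrapy.cfg
    FirstScrapy/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            FirstSpider.py   # created by scrapy genspider -t basic FirstSpider baidu.com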

spiders/__init__.py

# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.

FirstSpider.py

# -*- coding: utf-8 -*-
import scrapy
from FirstScrapy.items import FirstscrapyItem


class FirstspiderSpider(scrapy.Spider):
    name = "FirstSpider"
    # domain the spider is allowed to crawl
    allowed_domains = ["baidu.com"]
    # start URL(s)
    start_urls = ['http://baidu.com/']

    # parse() holds the main crawling logic; the downloaded page is in response
    def parse(self, response):
        # instantiate the item class
        item = FirstscrapyItem()
        # store the extracted title text in the "content" field
        item["content"] = response.xpath("/html/head/title/text()").extract()
        yield item
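With the pipeline enabled in settings.py and ROBOTSTXT_OBEY set to False, the spider is run from the project directory; the built-in feed exporter can also dump the yielded items to a file:

scrapy crawl FirstSpider
scrapy crawl FirstSpider -o titles.json   # optional: save the items as JSON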

FirstScrapy/__init__.py is empty.

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

'''
Defines the fields (containers) for the scraped data
'''
import scrapy


class FirstscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    content = scrapy.Field()  # "content" field for the scraped text
    link = scrapy.Field()     # "link" field for the scraped URLs; a declared field does not have to be used, but it must be declared before use

middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class FirstscrapySpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

'''
Post-processing of the scraped items
'''

'''
class FirstscrapyPipeline(object):
    def process_item(self, item, spider):
        print(item["content"])  # work with the "content" field
        return item
'''


class FirstscrapyPipeline(object):
    def process_item(self, item, spider):
        for i in range(0, len(item["content"])):
            print(item["content"][i])  # print the scraped text
            print(item["link"][i])     # print the matching link (the spider must also fill in item["link"] for this to work)
        return item

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for FirstScrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

'''
Project configuration
'''
BOT_NAME = 'FirstScrapy'

SPIDER_MODULES = ['FirstScrapy.spiders']
NEWSPIDER_MODULE = 'FirstScrapy.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'FirstScrapy (+http://www.yourdomain.com)'

# Obey robots.txt rules
# robots exclusion protocol; set it to False if the spider should not obey it
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'FirstScrapy.middlewares.FirstscrapySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'FirstScrapy.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # pipelines are disabled by default and must be enabled here
    'FirstScrapy.pipelines.FirstscrapyPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = FirstScrapy.settings

[deploy]
#url = http://localhost:6800/
project = FirstScrapy

For brevity, files that keep their generated defaults are not shown again below.

Generic crawler: a spider built from the crawl template

Crawling all jokes from Qiushibaike (qiushibaike.com)

SecondDemo.py

# -*- coding: utf-8 -*-
'''
Automatic (follow-links) crawler.

Create it from the crawl template:
    crawl           : automatic-crawler template
    basic           : basic spider template
    SecondDemo      : spider name
    qiushibaike.com : domain of the site to crawl

    scrapy genspider -t crawl SecondDemo qiushibaike.com
'''
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from qsauto.items import QsautoItem
from scrapy.http import Request


class SeconddemoSpider(CrawlSpider):
    name = 'SecondDemo'
    allowed_domains = ['qiushibaike.com']
    # start_urls = ['http://qiushibaike.com/']

    # Send the first request with a browser-like User-Agent. Later requests do not
    # inherit this header, so the User-Agent must also be set in settings.py.
    def start_requests(self):
        ua = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/60.0'}
        yield Request('https://qiushibaike.com/', headers=ua)

    rules = (
        # link-extraction rule; follow=True means links found on crawled pages
        # are extracted and crawled as well
        Rule(LinkExtractor(allow=r'article'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        i["content"] = response.xpath("//div[@class='content']/text()").extract()
        i["link"] = response.xpath("//link[@rel='canonical']/@href").extract()
        print(i["content"])
        print(i["link"])
        print("")
        return i
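The allow argument of LinkExtractor is a regular expression matched against the URLs found on each page. The patterns below are only hypothetical, but they sketch how a parsing rule can be combined with a follow-only rule for pagination:

rules = (
    # hypothetical pattern: parse pages whose URL contains /article/<id>
    Rule(LinkExtractor(allow=r'/article/\d+'), callback='parse_item', follow=True),
    # hypothetical pattern: follow pagination links without calling a callback
    Rule(LinkExtractor(allow=r'/8hr/page/\d+'), follow=True),
)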

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QsautoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    content = scrapy.Field()  # field for the scraped text
    link = scrapy.Field()     # field for the scraped links; declare a field before using it

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class QsautoPipeline(object):
    def process_item(self, item, spider):
        return item

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for qsauto project
# (all other settings keep the values generated by scrapy startproject)

BOT_NAME = 'qsauto'

SPIDER_MODULES = ['qsauto.spiders']
NEWSPIDER_MODULE = 'qsauto.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# browser-like request header used for every request
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/60.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Item pipelines are left disabled in this project:
#ITEM_PIPELINES = {
#    'qsauto.pipelines.QsautoPipeline': 300,
#}

ThirdDemo.py

# -*- coding: utf-8 -*-
'''
Crawl every course on the Hellobi (天善智能) training site: course title,
number of students and course link.
'''
import scrapy
from project3.items import Project3Item
from scrapy.http import Request


class ThirddemoSpider(scrapy.Spider):
    name = "ThirdDemo"
    allowed_domains = ["hellobi.com"]
    start_urls = ['https://edu.hellobi.com/course/10']

    def parse(self, response):
        items = Project3Item()
        items["title"] = response.xpath('//ol[@class="breadcrumb"]/li[@class="active"]/text()').extract()
        items["link"] = response.xpath('//ul[@class="nav nav-tabs"]/li[@class="active"]/a/@href').extract()
        items["stu"] = response.xpath('//span[@class="course-view"]/text()').extract()
        yield items
        # crawl the remaining courses
        for i in range(1, 280):
            # build the URL of each course page
            url = "https://edu.hellobi.com/course/" + str(i)
            # queue a Request for it; callback is the function that parses the response
            yield Request(url, callback=self.parse)
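Scrapy filters duplicate requests by default, so yielding the same course URL more than once does not create an endless loop. If the results only need to be saved rather than processed, the feed exporter works here as well:

scrapy crawl ThirdDemo -o courses.csv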

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Project3Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    stu = scrapy.Field()
    link = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class Project3Pipeline(object):
    # called once when the pipeline is created
    def __init__(self):
        self.file = open("J:\\Program\\Python\\Python爬虫\\scrapy\\FirstScrapy\\project3\\content1\\1.txt", "a")

    def process_item(self, item, spider):
        print(item["title"])
        print(item["link"])
        print(item["stu"])
        print("---------------")
        # append the scraped fields to the file
        self.file.write(item["title"][0] + "\n" + item["link"][0] + "\n" + item["stu"][0] + "\n" + "-------------" + "\n")
        return item

    # called when the spider is closed
    def close_spider(self, spider):
        self.file.close()

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for project3 project
# (all other settings keep the values generated by scrapy startproject)

BOT_NAME = 'project3'

SPIDER_MODULES = ['project3.spiders']
NEWSPIDER_MODULE = 'project3.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'project3.pipelines.Project3Pipeline': 300,
}

db.py

# -*- coding: utf-8 -*-
'''
Log in to Douban automatically and handle the captcha.
'''
import scrapy
from scrapy.http import Request, FormRequest
import urllib.request


class DbSpider(scrapy.Spider):
    name = "db"
    allowed_domains = ["douban.com"]
    # start_urls = ['http://douban.com/']
    header = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/60.0'}

    def start_requests(self):
        # pretend to be a browser, set the callback, and enable the cookiejar so the
        # login session is kept across requests
        return [Request("https://www.douban.com/accounts/login", callback=self.parse, meta={"cookiejar": 1})]

    def parse(self, response):
        url = "https://www.douban.com/accounts/login"
        # look for a captcha image on the login page
        captcha = response.xpath("//img[@id='captcha_image']/@src").extract()
        if len(captcha) > 0:
            print("A captcha is present")
            # download the captcha locally and type it in by hand (semi-automatic login)
            localpath = "J:\\Program\\Python\\Python爬虫\\scrapy\\douban\\验证码\\captcha.png"
            urllib.request.urlretrieve(captcha[0], filename=localpath)
            print("Open the local captcha image and enter the code")
            captcha_value = input()
            # form data for the login request: account, password, captcha, redirect target
            data = {
                # account field
                "form_email": "528497934@qq.com",
                # password field
                "form_password": "sj528497934..",
                # captcha field
                "captcha-solution": captcha_value,
                # personal home page to redirect to after login
                "redir": "https://www.douban.com/people/174986115/",
            }
        else:
            print("No captcha this time")
            # form data for the login request
            data = {
                "form_email": "528497934@qq.com",
                "form_password": "sj528497934..",
                "redir": "https://www.douban.com/people/174986115/",
            }
        print("Logging in...")
        return [FormRequest.from_response(response,
                                          # keep the cookies from the login response
                                          meta={"cookiejar": response.meta["cookiejar"]},
                                          # browser-like headers
                                          headers=self.header,
                                          # form fields to submit
                                          formdata=data,
                                          # callback that runs after login
                                          callback=self.next,)]

    def next(self, response):
        print("Logged in; crawling the personal home page")
        title = response.xpath("/html/head/title/text()").extract()
        note = response.xpath("//div[@class='note']/text()").extract()
        print(title[0])
        print(note[0])
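Since this spider pauses and waits for the captcha to be typed in, it is easiest to run from a terminal; the --nolog option keeps Scrapy's log output from burying the input prompt:

scrapy crawl db --nolog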

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class DoubanPipeline(object):
    def process_item(self, item, spider):
        return item

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for douban project
# (all other settings keep the values generated by scrapy startproject)

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

dd.py

# -*- coding: utf-8 -*-
'''
Crawl book listings from Dangdang and write them to a MySQL database.
'''
import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request


class DdSpider(scrapy.Spider):
    name = "dd"
    allowed_domains = ["dangdang.com"]
    start_urls = ['http://dangdang.com/']

    def parse(self, response):
        items = DangdangItem()
        items["title"] = response.xpath("//a[@class='pic']/@title").extract()
        items["link"] = response.xpath("//a[@class='pic']/@href").extract()
        items["comment"] = response.xpath('//a[@class="search_comment_num"]/text()').extract()
        yield items  # hand the item over to the pipeline
        # queue the following search-result pages
        for i in range(2, 3):
            url = "http://search.dangdang.com/?key=python&act=input&page_index=" + str(i) + "#J_tab"
            yield Request(url, callback=self.parse)

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    comment = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


class DangdangPipeline(object):
    def process_item(self, item, spider):
        # Connect to the database. Remember to pass the port, and note that the
        # charset must be written "utf8"; pymysql does not accept "utf-8".
        # conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", passwd="123456", db="dd", charset="utf-8")
        conn = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            passwd='123456',
            db='dd',
            charset="utf8")
        cur = conn.cursor()
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            comment = item["comment"][i]
            cur.execute('insert into book(title,link,comment) values(%s,%s,%s)', (title, link, comment))
            conn.commit()
        cur.close()
        conn.close()
        return item
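The pipeline assumes a MySQL database named dd with a book table already exists. The schema below is only a hypothetical sketch that matches the INSERT statement above; it can be created once with pymysql (or directly in a MySQL client):

import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='123456', charset='utf8')
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS dd DEFAULT CHARACTER SET utf8")
cur.execute("USE dd")
# hypothetical column sizes; adjust to the data actually scraped
cur.execute("""
    CREATE TABLE IF NOT EXISTS book (
        id      INT AUTO_INCREMENT PRIMARY KEY,
        title   VARCHAR(255),
        link    VARCHAR(255),
        comment VARCHAR(100)
    )
""")
conn.commit()
cur.close()
conn.close()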

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for dangdang project
# (all other settings keep the values generated by scrapy startproject)

BOT_NAME = 'dangdang'

SPIDER_MODULES = ['dangdang.spiders']
NEWSPIDER_MODULE = 'dangdang.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}

Using PhantomJS

#!/usr/bin/python
# -*- coding: utf-8 -*-
# __author__ = 'ShenJun'
'''
PhantomJS is mainly used to fetch content that is rendered by JavaScript and
therefore hidden from a plain HTTP request.
'''
from selenium import webdriver
import time
import re
from lxml import etree

bs = webdriver.PhantomJS()  # start the headless browser
time.sleep(3)
url = "http://www.baidu.com"
bs.get(url)  # open the url
# save a screenshot of the rendered page
# bs.get_screenshot_as_file("J:\\Program\\Python\\爬虫笔记\\图片\\phantomjs.png")
data = bs.page_source  # source code of the rendered page

'''
# extract the title with a regular expression
pattitle = "<title>(.*?)</title>"
title = re.compile(pattitle).findall(data)
print(title)
'''

'''
To use XPath expressions with urllib or PhantomJS, parse the page source (data)
into an lxml tree first and then run the XPath query on the tree.
'''
edata = etree.HTML(data)
title2 = edata.xpath("/html/head/title/text()")
print(title2)

# save the page source to a test file
fh = open("J:\\Program\\Python\\爬虫笔记\\test.html", "wb")
fh.write(data.encode("utf-8"))
fh.close()
bs.quit()  # close the browser
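Note that PhantomJS support has been removed from recent Selenium releases. A minimal sketch of the same flow with headless Chrome, assuming Chrome and a matching chromedriver are installed:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless")           # run Chrome without opening a window
driver = webdriver.Chrome(options=opts)   # assumes chromedriver is on PATH
driver.get("http://www.baidu.com")
html = driver.page_source                 # rendered page source, same role as bs.page_source above
driver.quit()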

Deduplication with a Bloom filter

#!/usr/bin/python
# -*- coding: utf-8 -*-
# __author__ = 'ShenJun'
import bloom

'''
Mainly used to deduplicate URLs.
'''
url = "www.baidu.com"
# create a Bloom filter backed by about 1000000k of memory with an error rate of 0.001
bloo = bloom.BloomFilter(0.001, 1000000)
element = 'weiwei'
bloo.insert_element(element)
print(bloo.is_element_exist('weiwei'))
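If the bloom module used above is not available, the pybloom_live package provides the same functionality; a minimal sketch, assuming pip install pybloom_live:

from pybloom_live import BloomFilter

# capacity is the expected number of URLs, error_rate the acceptable false-positive rate
bf = BloomFilter(capacity=1000000, error_rate=0.001)

url = "www.baidu.com"
if url not in bf:    # membership test: False means the URL has not been seen yet
    bf.add(url)      # remember the URL so it is skipped next time

print(url in bf)     # True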
