Scrapy looks for configuration parameters in ini-style scrapy.cfg files in these standard locations:

- /etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-wide)
- ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) for global (user-wide) settings
- scrapy.cfg in the root of a Scrapy project (see the next section)

Settings from these files are merged in the listed order of preference: user-defined values take higher precedence than system-wide defaults, and project-wide settings, when defined, override all the others.

Scrapy also understands, and can be configured through, a number of environment variables. Currently these are:
SCRAPY_SETTINGS_MODULE (see Designating the settings)
SCRAPY_PROJECT
SCRAPY_PYTHON_SHELL (see Scrapy shell)
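As an illustration, SCRAPY_SETTINGS_MODULE acts as a plain environment-variable override of the project's settings module. A minimal pure-Python sketch of that lookup (the function name and the myproject.settings fallback are hypothetical; Scrapy's real resolution also consults scrapy.cfg):

```python
import os

def resolve_settings_module(default="myproject.settings"):
    # Prefer the SCRAPY_SETTINGS_MODULE environment variable when it is set;
    # otherwise fall back to the project's default settings module.
    # Hypothetical sketch -- not Scrapy's actual resolution code.
    return os.environ.get("SCRAPY_SETTINGS_MODULE", default)
```

Running a command with SCRAPY_SETTINGS_MODULE=other.settings in the environment would then load other.settings instead of the project default.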
Next, change into the new project directory:

    cd project_dir

Create a new spider (this generates a mydomain.py template file in the current directory):

    scrapy genspider mydomain mydomain.com

Some Scrapy commands (crawl, for example) must be run from inside a Scrapy project. Remember that you can always get more help about each command by running:

    scrapy <command> -h

For example: scrapy view -h. You can see all the available commands with:

    scrapy -h

Global commands (they work without a project, so no project needs to be specified):
startproject
    scrapy startproject <project_name> [project_dir]

genspider
    scrapy genspider [-t template] <name> <domain>

    Usage example:

    $ scrapy genspider -l
    Available templates:
      basic
      crawl
      csvfeed
      xmlfeed

    $ scrapy genspider example example.com
    Created spider 'example' using template 'basic'

    $ scrapy genspider -t crawl scrapyorg scrapy.org
    Created spider 'scrapyorg' using template 'crawl'

settings
    $ scrapy settings --get BOT_NAME
    scrapybot
    $ scrapy settings --get DOWNLOAD_DELAY
    0

runspider
    $ scrapy runspider myspider.py
    [ ... spider starts crawling ... ]

shell
    $ scrapy shell http://www.example.com/some/page.html
    [ ... scrapy shell starts ... ]

    $ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
    (200, 'http://www.example.com/')

    # The shell follows HTTP redirects by default
    $ scrapy shell --nolog http://httpbin.org/redirect-to?url=http://example.com/ -c '(response.status, response.url)'
    (200, 'http://example.com/')

    # You can disable this with --no-redirect
    # (only for the URL passed as a command line argument)
    $ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http://example.com/ -c '(response.status, response.url)'
    (302, 'http://httpbin.org/redirect-to?url=http://example.com/')

fetch
    scrapy fetch <url>

    $ scrapy fetch --nolog http://www.example.com/some/page.html
    [ ... html content ... ]

    $ scrapy fetch --nolog --headers http://www.example.com/
    {'Accept-Ranges': ['bytes'],
     'Age': ['1263'],
     'Connection': ['close'],
     'Content-Length': ['596'],
     'Content-Type': ['text/html; charset=UTF-8'],
     'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
     'Etag': ['"573c1-254-48c9c87349680"'],
     'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
     'Server': ['Apache/2.2.3 (CentOS)']}

view
    $ scrapy view http://www.example.com/some/page.html
    [ ... browser starts ... ]

version

Project-only commands (the project must already exist, and the command must be run from inside it):
crawl
    scrapy crawl <spider>

    Run a spider. Usage examples:
    $ scrapy crawl myspider
    [ ... myspider starts crawling ... ]

check
    scrapy check [-l] <spider>

    $ scrapy check -l
    first_spider
      * parse
      * parse_item
    second_spider
      * parse
      * parse_item

    $ scrapy check
    [FAILED] first_spider:parse_item
    >>> 'RetailPricex' field is missing

    [FAILED] first_spider:parse
    >>> Returned 92 requests, expected 0..4

list
    scrapy list

    Usage example:
    $ scrapy list
    spider1
    spider2

edit
    scrapy edit <spider>

parse
    $ scrapy parse http://www.example.com/ -c parse_item
    [ ... scrapy log lines crawling example.com spider ... ]

    >>> STATUS DEPTH LEVEL 1 <<<
    # Scraped Items  ------------------------------------------------------------
    [{'name': u'Example item',
      'category': u'Furniture',
      'length': u'12 cm'}]

    # Requests  -----------------------------------------------------------------
    []

bench
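The parse command output above separates whatever the callback yields into two buckets: scraped items and follow-up requests. A rough pure-Python sketch of that split, using a hypothetical RequestStub stand-in rather than Scrapy's real Request class:

```python
class RequestStub:
    """Hypothetical stand-in for scrapy.Request, used only to tell the two kinds apart."""
    def __init__(self, url):
        self.url = url

def split_callback_output(results):
    """Separate a callback's yielded values into (items, requests)."""
    items, requests = [], []
    for result in results:
        if isinstance(result, RequestStub):
            requests.append(result)
        else:
            items.append(result)
    return items, requests
```

For the example.com run shown above, the items bucket would hold the single furniture dict and the requests bucket would be empty.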