A recent project of mine used Scrapy, so here is a simple demo that uses Scrapy to crawl listings from Anjuke (安居客). For how to set up a Scrapy project and the basic Scrapy concepts, see https://blog.csdn.net/mingover/article/details/80717974
https://github.com/huawumingguo/scrapy_demo
Anjuke listings can be browsed without logging in. To keep things simple, we only crawl a single city, e.g. Guangzhou: https://guangzhou.anjuke.com/sale/rd1/?kw=&from=sugg
That URL is a paginated list of listings. We scrape the first page, then add both the detail-page URLs and the next-page URL to the crawl targets.
Once the pipeline receives the items, it writes them to MySQL or some other persistence layer.
Straight to the code:
```python
# -*- coding: utf-8 -*-
import datetime
import re
import logging

import scrapy
from scrapy.http import Request
from scrapy.loader import ItemLoader
from urllib import parse

from anjuke.items import *
from anjuke.utils.LoggerUtil import LogUtils

loggerDataName = "anjuke"
log_dataInfo_path = "logs/anjuke.log"
log = LogUtils.createLogger(loggerDataName, log_dataInfo_path)


class AnjukeGzSpider(scrapy.Spider):
    name = 'anjuke_gz'
    allowed_domains = ['anjuke.com']
    start_urls = ['https://guangzhou.anjuke.com/sale/']
    headers = {
        "HOST": "guangzhou.anjuke.com",
        "Referer": "https://guangzhou.anjuke.com",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
    }

    def parse(self, response):
        items = response.css("#houselist-mod-new > li")
        log.info("processing list page: %s, item count: %s" % (response.url, len(items)))
        # Walk the listing entries and queue up their detail pages.
        for itemnode in items:
            itemurl = itemnode.css('a.houseListTitle::attr(href)').extract()
            itemtitle = itemnode.css('a.houseListTitle::attr(title)').extract()
            log.info("list entry: %s, title: %s" % (itemurl, itemtitle))
            # print('%s, link: %s' % (itemtitle[0], itemurl[0]))
            yield Request(url=parse.urljoin(response.url, itemurl[0]),
                          meta={"refer_url": response.url},
                          callback=self.parse_detail,
                          headers=self.headers)

        # Queue the next page of the listing, if there is one.
        nexturlArr = response.css('#content > div.sale-left > div.multi-page > a.aNxt::attr(href)').extract()
        if nexturlArr:
            nexturl = nexturlArr[0]
            yield Request(url=parse.urljoin(response.url, nexturl),
                          callback=self.parse,
                          headers=self.headers)

    def parse_detail(self, response):
        # e.g. "https://guangzhou.anjuke.com/prop/view/A1285389340?from=filter&spread=commsearch_p&position=361&kwtype=filter&now_time=1529215280"
        refer_url = response.meta.get("refer_url", '')
        log.info("processing detail page: %s, refer_url: %s" % (response.url, refer_url))
        match_re = re.match(r".*view/(.*)\?.*", response.url)
        id = match_re.group(1)
        citemloader = OrderItemLoader(item=OrderItem(), response=response)
        citemloader.add_value("id", id)
        citemloader.add_css("title", "#content > div.clearfix.title-guarantee > h3::text")
        citemloader.add_css("community_id", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(1) > dd > a::attr(href)")
        citemloader.add_css("community_name", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(1) > dd > a::text")
        citemloader.add_css("area1", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(2) > dd > p > a:nth-child(1)::text")
        citemloader.add_css("area2", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(2) > dd > p > a:nth-child(2)::text")
        citemloader.add_css("build_time", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(3) > dd::text")
        citemloader.add_css("address", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(2) > dd > p::text")
        citemloader.add_css("housetype", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.first-col.detail-col > dl:nth-child(4) > dd::text")
        citemloader.add_css("housestructure", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.second-col.detail-col > dl:nth-child(1) > dd::text")
        citemloader.add_css("space", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.second-col.detail-col > dl:nth-child(2) > dd::text")
        citemloader.add_css("building_floors", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.second-col.detail-col > dl:nth-child(4) > dd::text")
        citemloader.add_css("house_floor", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.second-col.detail-col > dl:nth-child(4) > dd::text")
        citemloader.add_css("direction_face", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.second-col.detail-col > dl:nth-child(3) > dd::text")
        citemloader.add_css("unit_price", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.third-col.detail-col > dl:nth-child(1) > dd::text")
        citemloader.add_css("consult_first_pay", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.third-col.detail-col > dl:nth-child(2) > dd::text")
        citemloader.add_css("decoration_degree", "#content > div.wrapper > div.wrapper-lf.clearfix > div.houseInfoBox > div > div.houseInfo-wrap > div > div.third-col.detail-col > dl:nth-child(4) > dd::text")

        nowTime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        citemloader.add_value("refer_url", refer_url)
        citemloader.add_value("now_time", nowTime)
        item = citemloader.load_item()
        yield item
```
The most time-consuming part here was working out all the CSS selectors; most of the cleanup happens in the regular expressions inside the item definitions.
The ItemLoader is what actually turns the HTML into item fields. Tuning the selectors takes patience, and this is where most of the time went.
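The OrderItem and OrderItemLoader used above live in anjuke/items.py in the repo and are not reproduced here. Below is a rough sketch of what they might look like: the field names follow the spider, but the input processors, and especially the regexes, are illustrative only and the repo's versions may differ.

```python
# items.py (sketch) -- field names match the spider above; the
# input processors, especially the regexes, are illustrative only.
import re

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst


def strip_text(value):
    # Trim the whitespace/newlines that come with ::text extractions.
    return value.strip()


def extract_number(value):
    # Pull the first number out of strings like "32465元/平米" or "共8层".
    match = re.search(r"(\d+(?:\.\d+)?)", value)
    return match.group(1) if match else value


class OrderItemLoader(ItemLoader):
    # Keep only the first extracted value for every field.
    default_output_processor = TakeFirst()


class OrderItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field(input_processor=MapCompose(strip_text))
    community_id = scrapy.Field()
    community_name = scrapy.Field(input_processor=MapCompose(strip_text))
    area1 = scrapy.Field(input_processor=MapCompose(strip_text))
    area2 = scrapy.Field(input_processor=MapCompose(strip_text))
    build_time = scrapy.Field(input_processor=MapCompose(strip_text))
    address = scrapy.Field(input_processor=MapCompose(strip_text))
    housetype = scrapy.Field(input_processor=MapCompose(strip_text))
    housestructure = scrapy.Field(input_processor=MapCompose(strip_text))
    space = scrapy.Field(input_processor=MapCompose(strip_text, extract_number))
    building_floors = scrapy.Field(input_processor=MapCompose(extract_number))
    house_floor = scrapy.Field(input_processor=MapCompose(strip_text))
    direction_face = scrapy.Field(input_processor=MapCompose(strip_text))
    unit_price = scrapy.Field(input_processor=MapCompose(strip_text, extract_number))
    consult_first_pay = scrapy.Field(input_processor=MapCompose(strip_text))
    decoration_degree = scrapy.Field(input_processor=MapCompose(strip_text))
    refer_url = scrapy.Field()
    now_time = scrapy.Field()
```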
There are several options for persistence; JSON files or MySQL both work. Here are a few pipelines:
```python
# pipelines.py -- three persistence options
import codecs
import json

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class JsonWithEncodingPipeline(object):
    # Export items to a custom JSON file.
    def __init__(self):
        self.file = codecs.open('anjuke_orderlist.json', 'w', encoding="utf-8")

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        self.file.flush()
        return item

    def close_spider(self, spider):
        # Scrapy calls this automatically when the spider closes.
        self.file.close()


class MysqlPipeline(object):
    # Write to MySQL synchronously (blocks the crawl while inserting).
    # Note: this example still targets a different table (jobbole_article);
    # adapt the SQL to your own schema.
    def __init__(self):
        self.conn = MySQLdb.connect('192.168.0.106', 'root', 'root', 'article_spider',
                                    charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """
            insert into jobbole_article(title, url, create_date, fav_nums)
            VALUES (%s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (item["title"], item["url"],
                                         item["create_date"], item["fav_nums"]))
        self.conn.commit()
        return item


class MysqlTwistedPipline(object):
    # Insert into MySQL asynchronously through Twisted's connection pool.
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            # use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Hand the insert off to the thread pool so it does not block the reactor.
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle failures
        return item

    def handle_error(self, failure, item, spider):
        # Log exceptions raised by the asynchronous insert.
        print(failure)

    def do_insert(self, cursor, item):
        # Each item class builds its own insert SQL
        # (a sketch of get_insert_sql follows further below).
        insert_sql, params = item.get_insert_sql()
        insert_sql = insert_sql.strip()
        try:
            cursor.execute(insert_sql, params)
        except Exception as e:
            print(e)
```

On an unrelated note about virtualenvwrapper: workon's default directory is C:\Users\xxxx\Envs. To use a different location, point the WORKON_HOME environment variable at the directory you want; environments created earlier can simply be copied from the old directory into the new one.
Reference: http://www.bubuko.com/infodetail-2145085.html
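The MysqlTwistedPipline above expects each item to provide a get_insert_sql() method. The real one is in the repo's items.py and is not shown here; the following is a minimal, abridged sketch of what it might look like on OrderItem, where the table name anjuke_order and its column list are assumptions made purely for illustration.

```python
# Sketch only: the actual get_insert_sql() lives in the repo's items.py.
# The table `anjuke_order` and its columns are assumed for illustration.
import scrapy


class OrderItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field()
    community_name = scrapy.Field()
    unit_price = scrapy.Field()
    refer_url = scrapy.Field()
    now_time = scrapy.Field()
    # ... remaining fields omitted for brevity ...

    def get_insert_sql(self):
        # Return the SQL plus its parameters; the pipeline executes them.
        insert_sql = """
            insert into anjuke_order(id, title, community_name, unit_price, refer_url, now_time)
            values (%s, %s, %s, %s, %s, %s)
            on duplicate key update unit_price=values(unit_price), now_time=values(now_time)
        """
        params = (
            self.get("id", ""),
            self.get("title", ""),
            self.get("community_name", ""),
            self.get("unit_price", ""),
            self.get("refer_url", ""),
            self.get("now_time", ""),
        )
        return insert_sql, params
```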
You can debug the selectors in scrapy shell, passing a browser User-Agent:
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" https://guangzhou.anjuke.com/prop/view/A1275284529?from=filter&spread=commsearch_p&position=1&kwtype=filter&now_time=1529290160
To randomize the User-Agent during the crawl, we use the third-party fake_useragent library in a downloader middleware:
```python
# middlewares.py
from fake_useragent import UserAgent


class RandomUserAgentMiddlware(object):
    # Swap in a random User-Agent on every request.
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)

        tmpua = get_ua()
        request.headers.setdefault('User-Agent', tmpua)
```

Proxies themselves are just request.meta["proxy"] = "xxxx" on the request, but the free proxies you find online tend to expire quickly, so you need a proxy pool that fetches them dynamically and skips the ones that no longer work. See GetProxy.py in the sample code.
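A rough sketch of how such a proxy middleware could look, together with the settings needed to enable everything. The get_proxy import stands in for the repo's GetProxy.py and is an assumption, as are the priority numbers in the settings.

```python
# middlewares.py (sketch) -- assumes a helper that returns a proxy URL like
# "http://1.2.3.4:8080" or None; the real GetProxy.py in the repo may differ.
from anjuke.utils.GetProxy import get_proxy  # assumed helper name


class RandomProxyMiddleware(object):
    # Attach a proxy from the pool to every outgoing request.
    def process_request(self, request, spider):
        proxy = get_proxy()  # the pool should validate proxies and skip dead ones
        if proxy:
            request.meta["proxy"] = proxy
```

```python
# settings.py (illustrative priorities)
DOWNLOADER_MIDDLEWARES = {
    'anjuke.middlewares.RandomUserAgentMiddlware': 543,
    'anjuke.middlewares.RandomProxyMiddleware': 544,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the built-in UA middleware
}
RANDOM_UA_TYPE = "random"

ITEM_PIPELINES = {
    'anjuke.pipelines.MysqlTwistedPipline': 300,
}
```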
With response.css, ::text only extracts the text nodes, so the HTML markup is not captured. If you want the full HTML of the matched elements, just leave off ::text.
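A quick illustration in scrapy shell, using the list-page selector from the spider above:

```python
# ::text extracts only the text nodes of the matched elements ...
titles = response.css('a.houseListTitle::text').extract()
# ... while omitting ::text returns the full markup of each matched element.
links_html = response.css('a.houseListTitle').extract()
```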
Related reading:
https://blog.csdn.net/cjeric/article/details/73518782
https://www.cnblogs.com/geekard/archive/2012/10/04/python-string-endec.html
To make the crawl pausable and resumable, run it with a job directory:
scrapy crawl anjuke_gz -s JOBDIR=jobs/job1
To stop it, press Ctrl+C once and let Scrapy shut down gracefully.
Restart with the same JOBDIR and the crawl picks up where it left off.
One odd thing about Anjuke: I checked several cities and every single one shows exactly 50 pages of results, so the listing data looks questionable. Treat this code as a learning exercise only. See https://www.zhihu.com/question/20594581/answer/15585452
Installing sqlite3: there is a compatibility problem that pip alone cannot fix, because the sqlite3 module is part of the standard library and has to be compiled into Python against the system SQLite headers. See https://blog.csdn.net/sparkexpert/article/details/69944835 and https://github.com/sloria/TextBlob/issues/173
In practice it comes down to installing libsqlite3-dev and then recompiling Python 3:
1. sudo apt-get install libsqlite3-dev
2. (not run here) Or install the fuller set of packages suggested on the pyenv wiki: apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev xz-utils tk-dev
3. In the downloaded Python source tree, rebuild and install Python with: ./configure --enable-loadable-sqlite-extensions && make && sudo make install

Installing Twisted: it also has a quirk and cannot be installed directly with pip. Download it from https://pypi.org/project/Twisted/#files and install it manually (search online for the exact steps).
Dependencies used in this demo:
```
pip install scrapy
pip install fake_useragent
pip install requests
pip install kafka
pip install redis
pip install pillow
apt-get install libmysqlclient-dev
pip install mysqlclient
```
Reference: https://coding.imooc.com/class/92.html