Scrapy news crawlers: editing the content of scraped news pages

xiaoxiao, 2021-02-28

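General content

On any given news site, the title, author, and date of an article page almost always follow a fixed format. That is, every article on the same site puts its "title" and "author" in the same HTML tags, and those tags carry predictable class or id values. The article body is usually placed in a div with a fixed id or class as well. Take a news page from 投资界 as an example, such as this article: 徒子文化完成数千万人民币A轮融资，腾讯出资

The code to extract the title, date, source, author, and summary is as follows: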

    # Join a list of strings into a single string
    def get_text(self, texts):
        return "".join(texts).strip()

    def parse(self, response):
        # Assumption: a plain dict stands in for the project's Item class,
        # which the original snippet does not show
        item = {}

        # Title
        article_titles = response.xpath('//h1[@id="newstitle"]/text()').extract()
        if article_titles:
            print("article_title:" + article_titles[0])
            item["article_title"] = article_titles[0]

        # Date
        article_times = response.xpath('//div[@class="info"]/div[@class="box-l"]/span[@class="date"]/text()').extract()
        item["article_time"] = self.get_text(article_times)
        # print("article_time:" + item["article_time"])

        # Author and article source
        article_authors = response.xpath('//div[@class="info"]/div[@class="box-l"]/text()').extract()
        author_list = self.get_text(article_authors).split()
        # Guard against an empty list so a missing source does not raise IndexError
        item["article_src"] = author_list[0] if author_list else ""
        item["article_author"] = author_list[1] if len(author_list) > 1 else ""
        # print("article_src:" + item["article_src"])
        # print("article_author:" + item["article_author"])

        # Summary
        article_summarys = response.xpath('//div[@class="news-show-r"]/div[@class="subject"]/text()').extract()
        item["article_summary"] = self.get_text(article_summarys)
        # print("article_summary:" + item["article_summary"])
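The source/author step relies on the text nodes splitting cleanly on whitespace. A minimal sketch of that step as a standalone function, so it can be checked without fetching a page; the name `split_source_author` and the sample strings are hypothetical and not part of the original spider.

```python
def split_source_author(text_nodes):
    """Join raw text nodes, then split on whitespace into (source, author).

    Missing parts come back as empty strings instead of raising IndexError.
    """
    parts = "".join(text_nodes).split()
    source = parts[0] if parts else ""
    author = parts[1] if len(parts) > 1 else ""
    return source, author


if __name__ == "__main__":
    # Hypothetical text nodes, shaped like what .extract() might return
    print(split_source_author(["\n  SiteName  ", " Alice \n"]))
    print(split_source_author([]))
```

Note that a source or author name containing a space would be split into several parts, so this approach only suits sites whose byline fields are single tokens.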

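Replacing images

Sometimes we do not want the images inside the article body to be loaded from the original site's links, but from our own server. In that case we need to walk through every image in the article body and replace its URL. Here img is a Scrapy Selector object, and img.root is an lxml Element object; if you are interested, the lxml documentation has the details. The code is as follows: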

    # Continues inside parse(); requires "import urllib.request" at the top of the module (Python 3)

    # Find every img inside the article body
    article_content_imgs = response.xpath('//div[@id="news-content"]//img')
    # Download each image and point its src at our own copy
    for img in article_content_imgs:
        # Read the src attribute of the img
        img_src = self.get_text(img.xpath('./@src').extract())
        # Take the file extension, e.g. jpeg or png
        img_type = img_src.split(".")[-1].lower().strip()
        # Anything other than bmp, jpg, png, or jpeg defaults to jpg
        if img_type not in ("bmp", "jpg", "png", "jpeg"):
            img_type = "jpg"
        # Timestamp used as the file name (get_time_stamp is a helper of the spider, not shown here)
        timestamp = self.get_time_stamp()
        # Path of the image as referenced from the page HTML, relative to the site root
        abs_path = '/Uploads/Spider/News/Content/' + timestamp + '.' + img_type
        # Full path of the image on the file system
        save_path = 'D:/UserUploadFiles勿删' + abs_path
        # If the original src does not start with http, resolve it against the page URL so it can be downloaded
        if not img_src.startswith("http"):
            img_src = response.urljoin(img_src)
        # Download the original image to the file system
        urllib.request.urlretrieve(img_src, save_path)
        # Replace the src of the img with our own path
        img.root.attrib['src'] = abs_path
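Taking the extension with `split(".")[-1]` can pick up a query string such as `png?w=200`, which then falls back to jpg. A minimal sketch of a stricter variant using only the standard library; `image_type` and `absolute_src` are hypothetical helper names, and the URLs are made-up examples.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

ALLOWED_TYPES = ("bmp", "jpg", "png", "jpeg")


def image_type(img_src, default="jpg"):
    """Return the lowercase extension of img_src, or default if it is not allowed."""
    # Look only at the URL path so "?w=200" or "#frag" cannot pollute the extension
    path = urlsplit(img_src).path
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    return ext if ext in ALLOWED_TYPES else default


def absolute_src(page_url, img_src):
    """Resolve a relative or protocol-relative src against the page URL."""
    # urljoin returns img_src unchanged when it is already an absolute URL
    return urljoin(page_url, img_src)


if __name__ == "__main__":
    print(image_type("http://example.com/a/b.PNG?w=200"))
    print(absolute_src("http://example.com/news/1.html", "../img/a.png"))
```

Because `urljoin` leaves absolute URLs untouched, the explicit `startswith("http")` check becomes optional with this approach.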

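Extract the article content only after the src attributes have been modified on the lxml elements. At that point every img inside the div with id news-content already carries the replaced path:

    article_contents = response.xpath('//div[@id="news-content"]').extract()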
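The order matters because the extracted string is serialized from the element tree in its current state. A minimal sketch using lxml directly (the same kind of element that `img.root` exposes), assuming lxml is installed; the function name `rewrite_img_src` and the sample HTML are hypothetical.

```python
from lxml import html

SAMPLE = (
    '<div id="news-content"><p>Intro</p>'
    '<img src="http://origin.example/a.png"></div>'
)


def rewrite_img_src(fragment, new_src):
    """Point every img in the fragment at new_src and return the serialized HTML."""
    root = html.fromstring(fragment)
    for img in root.xpath('.//img'):
        # Mutating attrib changes the tree in place
        img.attrib['src'] = new_src
    # Serialization happens after the mutation, so it reflects the new src
    return html.tostring(root, encoding='unicode')


if __name__ == "__main__":
    print(rewrite_img_src(SAMPLE, '/Uploads/Spider/News/Content/x.png'))
```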

Removing unnecessary content from the article body

    # Remove img nodes whose src contains logo-head123.png
    head_imgs = response.xpath('//div[@class="content"]//img[contains(@src, "logo-head123.png")]')
    for img in head_imgs:
        img.root.drop_tree()

    # Remove p nodes whose text contains "授权请联络" ("for licensing, please contact")
    bottom_ps = response.xpath('//div[@class="content"]//p[contains(./text(), "授权请联络")]')
    for p in bottom_ps:
        p.root.drop_tree()

    # Remove the entire parent node of any span whose text contains "商务合作" ("business cooperation")
    bottom_spans = response.xpath('//div[@class="content"]//span[contains(./text(), "商务合作")]')
    for span in bottom_spans:
        span.root.getparent().drop_tree()
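`drop_tree()` is a method of lxml's HTML elements: it removes the element and its children but keeps the text that follows the element (its tail). A minimal sketch of that behavior with lxml directly, assuming lxml is installed; `strip_nodes` and the sample HTML are hypothetical.

```python
from lxml import html

SAMPLE = (
    '<div class="content">'
    '<p>Body <img src="/static/logo-head123.png"> text stays.</p>'
    '<p>Footer notice</p>'
    '</div>'
)


def strip_nodes(fragment, xpath_expr):
    """Drop every node matched by xpath_expr and return the serialized HTML."""
    root = html.fromstring(fragment)
    # xpath() returns a list computed up front, so dropping while looping is safe
    for node in root.xpath(xpath_expr):
        # drop_tree removes the node and its children but preserves its tail text
        node.drop_tree()
    return html.tostring(root, encoding='unicode')


if __name__ == "__main__":
    expr = './/img[contains(@src, "logo-head123.png")] | .//p[contains(text(), "Footer")]'
    print(strip_nodes(SAMPLE, expr))
```

One XPath 1.0 detail worth keeping in mind: `contains(text(), ...)` tests only the first text node of the element, so a phrase that appears after an inline child tag would not match.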

When reposting, please credit the original URL: https://www.6miu.com/read-76867.html
