First, a quick review of `getPaperList`, which fetches the list of the latest articles from the home page and returns `[[a, title], …]`:
```python
def getPaperList():
    url = 'https://economist.com'
    req = urllib.request.Request(url=url, headers=headers, method='GET')
    response = urllib.request.urlopen(req)
    html = response.read()
    selector = etree.HTML(html.decode('utf-8'))
    # Absolute XPath down to the <li> items of the latest-articles list
    goodpath = '/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/main[1]/div[1]/div[1]/div[1]/div[3]/ul[1]/li'
    art = selector.xpath(goodpath)
    awithtext = []
    try:
        for li in art:
            ap = li.xpath('article[1]/a[1]/div[1]/h3[1]/text()')   # article title
            a = li.xpath('article[1]/a[1]/@href')                  # relative link
            awithtext.append([a[0], ap[0]])
    except Exception as err:
        print(err, 'getPaperList')
    finally:
        return awithtext
```
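The snippets in this post rely on the imports and the `headers` dict set up earlier in this series. A minimal sketch of that assumed setup might look like the following (the exact `User-Agent` value is a placeholder, not the original configuration):

```python
# Assumed setup for the snippets in this post -- a sketch, not the original code.
import json
import os
import time
import urllib.request

from lxml import etree

# Pretend to be a regular browser so economist.com serves the normal page.
# The User-Agent string here is only a placeholder.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
```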
1. Next, analyze the HTML structure of the article page to be scraped

The annotations in the figure above mark:

1. flytitle-and-title__flytitle
2. the real title
3. the description
4. the `<p>` elements at the same DOM level, which together make up the article's body paragraphs
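To make those annotations concrete, here is a hand-made stand-in for that DOM shape (an assumption for illustration only, not the real economist.com markup); the extraction functions defined below are exercised against it later:

```python
# A hand-made stand-in for the article markup described above.
# Assumption: this is only an illustration, not the real economist.com HTML.
sample_article_html = '''
<html><body><article>
  <h1>
    <span>Flytitle (annotation 1)</span>
    <span>Real title (annotation 2)</span>
  </h1>
  <p>Description (annotation 3)</p>
  <div>
    <div>ads</div>
    <div>share buttons</div>
    <div>
      <p>First body paragraph (annotation 4).</p>
      <p>Second body paragraph (annotation 4).</p>
    </div>
  </div>
</article></body></html>
'''
```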
2. Scrape the article content:

```python
def getPaper(url):
    req = urllib.request.Request(url=url, headers=headers, method='GET')
    response = urllib.request.urlopen(req)
    html = response.read()
    selector = etree.HTML(html.decode('utf-8'))
    # Absolute XPath down to the <article> node of the page
    goodpath = '/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/main[1]/div[1]//div[2]/div[1]/article'
    article = selector.xpath(goodpath)
    return article
```
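A hypothetical call, just to show what `getPaper` hands back: a (possibly empty) list of matched `<article>` elements. The URL path here is borrowed from the JSON example later in this post; hard-coded absolute XPaths like the one above tend to break whenever the site layout changes, so checking for an empty result is worthwhile:

```python
# Hypothetical usage of getPaper (the URL is only an example).
article = getPaper('https://economist.com/blogs/graphicdetail/2018/04/daily-chart-18')
if article:
    # Dump the first few hundred characters of the matched <article> node.
    print(etree.tostring(article[0], pretty_print=True).decode('utf-8')[:300])
else:
    print('no <article> matched -- the hard-coded XPath may have gone stale')
```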
3. Extract the information for annotations 1, 2 and 3, giving `[1, 2, 3]`:

```python
def getHeadline(article):
    headline = []
    try:
        h1 = article[0].xpath('h1/span')       # flytitle and real title
        for item in h1:
            headline.append(item.text)
        p1 = article[0].xpath('p[1]/text()')   # description
        headline.append(p1[0])
    except Exception as err:
        print(err, 'getHeadline')
    finally:
        return headline
```
4. Extract the article body, `p = [p, p, p, …]`:

```python
def getContent(article):
    parr = []
    try:
        p = article[0].xpath('div[1]/div[3]/p/text()')  # body paragraphs
        for i in p:
            print(i)
            parr.append(i + '\n')
    except Exception as err:
        print(err, 'getContent')
    finally:
        return parr
```
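As a quick sanity check, the two extraction functions can be run against the hand-made skeleton from step 1 (again, a sketch against assumed markup, not the live site):

```python
# Exercise getHeadline/getContent against the sample markup defined earlier.
article = etree.HTML(sample_article_html).xpath('//article')
print(getHeadline(article))
# -> ['Flytitle (annotation 1)', 'Real title (annotation 2)', 'Description (annotation 3)']
print(getContent(article))
# -> ['First body paragraph (annotation 4).\n', 'Second body paragraph (annotation 4).\n']
```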
5. Now let the spider do its thing:

```python
if __name__ == '__main__':
    linkArr = getPaperList()
    time.sleep(10)
    tmpLast = []
    toDayDir = './mds/' + todayDate + '/papers/'
    if not os.path.exists(toDayDir):
        os.makedirs(toDayDir)
    for item in linkArr:
        if item[0] not in lastLst:           # skip links already crawled last time
            tmpLast.append(item[0])
            url = 'https://economist.com' + item[0]
            article = getPaper(url)
            headLine = getHeadline(article)
            try:
                paperRecords[strY][strM][strD].append([item[0], headLine[1]])
                content = getContent(article)
                paperName = '_'.join(item[1].split(' '))
                saveMd = toDayDir + paperName + '.md'
                result = headLine[1:]
                result.extend(content)
                output = '\n'.join(result)
                with open(saveMd, 'w') as fw:
                    fw.write(output)
                time.sleep(10)
            except Exception as err:
                print(err)
    paperRecords['lastLst'] = tmpLast
    with open('spiRecords.json', 'w') as fwp:
        json.dump(paperRecords, fwp)
```

6. A few notes on the data structures used in step 5:
First, the archive directory layout:
`mds/2018_04_29/papers`  # the dated directory and its `papers` subdirectory are created when the results are written out

Next, the structure of the JSON file that stores the crawl records; it is easiest to just show an example:
```json
{
  "a2018": {
    "a4": {
      "a29": [
        ["/blogs/graphicdetail/2018/04/daily-chart-18", "Success is on the cards for Nintendo"]
      ]
    }
  },
  "lastLst": [
    "/blogs/graphicdetail/2018/04/daily-chart-18",
    "/blogs/buttonwood/2018/04/affording-retirement"
  ]
}
```

`lastLst` is saved so that the same links are not crawled again on the next run. You could instead walk through all the archived records and filter out duplicates, but that costs more, takes a long stretch of extra code, and brings no obvious benefit.
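For reference, here is a sketch of how the globals used in step 5 (`paperRecords`, `lastLst`, `todayDate`, `strY`/`strM`/`strD`) might be prepared from `spiRecords.json`. This is an assumption based on the key names above, not the original initialization code:

```python
# Sketch (assumed, not the original code) of the globals used in step 5.
import datetime
import json

now = datetime.datetime.now()
strY, strM, strD = 'a%d' % now.year, 'a%d' % now.month, 'a%d' % now.day   # e.g. a2018 / a4 / a29
todayDate = now.strftime('%Y_%m_%d')                                      # e.g. 2018_04_29

with open('spiRecords.json') as fr:
    paperRecords = json.load(fr)

lastLst = paperRecords.get('lastLst', [])           # links crawled on the previous run
paperRecords.setdefault(strY, {}).setdefault(strM, {}).setdefault(strD, [])
```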
That wraps up crawling the articles. The next post will cover how the words in the articles are deduplicated.
Finally, on to today's "look it up as you read" article segment.