qcy shows you how to download the articles you wrote on 6miu: part three…


It's as if the site knew I was crawling its pages.

The article-scraping code from last time no longer works!

They seem to have found the bug I mentioned: requesting http://blog.csdn.net/qcyfred/article/list/100000000 in the address bar used to dump a huge number of articles onto a single returned page.

In short, the old code is dead!

No matter. If they can upgrade, so can I.

# -*- coding: utf-8 -*-
"""
Created on Thu Apr 27 00:03:25 2017

@author: qcy
"""
import re
import urllib.request  # importing urllib alone does not pull in the request submodule

import numpy as np
import pandas as pd

user_name = 'qcyfred'
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; '
                         'rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

art_url = []
art_titles = []
art_dic = {}

try:
    k = 0
    while True:
        k = k + 1
        # Fetch the k-th page of the article list.
        req = urllib.request.Request(
            url='http://blog.csdn.net/%s/article/list/%d' % (user_name, k),
            headers=headers)
        data = urllib.request.urlopen(req).read()
        all_the_text = data.decode('utf-8')

        # Collect the unique article ids appearing on this page.
        p1 = r'/%s/article/details/\d+' % user_name
        ids = list(set(re.findall(p1, all_the_text)))
        if not ids:
            # Past the last page: no ids found, so stop instead of looping forever.
            break

        for s1 in ids:
            # Pull the link text (the article title) for this id.
            p2 = '<a href="' + s1 + r'">([\w\W]+?)</a>'
            list2 = re.findall(p2, all_the_text)
            if len(list2) == 0:
                continue
            res = list2[0].replace(' ', '')
            print(res)
            art_dic[s1] = res
            art_url.append('http://blog.csdn.net' + s1)
            art_titles.append(res)
except Exception as e:
    print(e)

# Assemble the collected urls and titles into one table and export it.
df1 = pd.DataFrame(art_url, columns=['url'])
df2 = pd.DataFrame(art_titles, columns=['title'])
df = pd.concat([df1, df2], axis=1)
df.sort_values('url', inplace=True)
df.index = np.arange(1, len(df) + 1)
df.index.name = 'article_id'
df.to_excel('article.xlsx')
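The heart of the script is the two-step regex extraction and the final DataFrame assembly. Here is a self-contained sketch of those steps run against a made-up HTML fragment, so you can see what ends up in the spreadsheet without hitting the network. The article ids and titles below are invented purely for illustration; they are not real CSDN posts.

```python
import re

import numpy as np
import pandas as pd

user_name = 'qcyfred'

# Invented fragment standing in for one page of the article list.
html = '''
<a href="/qcyfred/article/details/111">  First post  </a>
<a href="/qcyfred/article/details/222">Second post</a>
'''

# Step 1: collect the unique article ids on the page.
ids = sorted(set(re.findall(r'/%s/article/details/\d+' % user_name, html)))

# Step 2: for each id, pull the link text as the title.
urls = []
titles = []
for s1 in ids:
    m = re.findall('<a href="' + s1 + r'">([\w\W]+?)</a>', html)
    if not m:
        continue
    urls.append('http://blog.csdn.net' + s1)
    titles.append(m[0].replace(' ', ''))  # same whitespace-crunching as the script above

# Same table layout the script writes to article.xlsx.
df = pd.DataFrame({'url': urls, 'title': titles})
df.sort_values('url', inplace=True)
df.index = np.arange(1, len(df) + 1)
df.index.name = 'article_id'
print(df)
```

Note that `replace(' ', '')` deletes every space in the title, which is harmless for Chinese titles but will crunch English ones together; swap in `.strip()` if that bothers you.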

Done!!

Please credit the original when reposting: https://www.6miu.com/read-12880.html
