Step06: Scrape Zhaopin job listings with selenium + BeautifulSoup and store them in Excel/CSV files


Scraping Zhaopin job listings

The development environment used here is Python 3.6.5 + PyCharm; of course, the code is for reference only. Full source: download from my GitHub.

1. Fetching the target page's HTML source

Since Firefox is the browser used here, you need to download its driver, geckodriver.exe, and add the directory containing that exe to the Windows PATH environment variable.

from selenium import webdriver

def get_content(arcurl):
    # Launch Firefox (via geckodriver), load the page, and return the rendered source
    browser = webdriver.Firefox()
    browser.get(arcurl)
    html = browser.page_source
    browser.close()
    return html
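If you would rather not edit PATH, Selenium 3 (the version contemporary with this Python 3.6 setup) also accepts the driver location directly. A minimal sketch, assuming a hypothetical install path of C:\tools\geckodriver.exe:

from selenium import webdriver

# Hypothetical driver path: point this at wherever you saved geckodriver.exe
browser = webdriver.Firefox(executable_path=r'C:\tools\geckodriver.exe')
browser.get('https://sou.zhaopin.com/')
print(browser.title)
browser.quit()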

2. Parsing the page and extracting the data

Parsing is done with the BeautifulSoup library; the specific API calls are explained in detail in the official BeautifulSoup documentation.
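As a quick illustration of the two BeautifulSoup calls the parser below relies on, select() with a CSS selector and get_text(), here is a toy example against made-up HTML (not Zhaopin's real markup):

from bs4 import BeautifulSoup

snippet = '<div id="listContent"><div class="jobName"><a href="/job/1">Java Dev</a></div></div>'
soup = BeautifulSoup(snippet, "lxml")
node = soup.select('#listContent div.jobName')[0]  # CSS selector, returns a list of tags
print(node.get_text())                    # Java Dev
print(node.select('a')[0].attrs['href'])  # /job/1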

from bs4 import BeautifulSoup

def parse_page_shezhao(html):
    soup = BeautifulSoup(html, "lxml")
    message = []
    message_dict = []
    div_list = soup.select('#listContent > div')
    for div in div_list:
        messdict = {}
        # Defaults so a posting with a missing field does not raise NameError
        jobname = job_link = company_name = company_link = jobadr = ''
        div_infobox = div.select('div.listItemBox > div.infoBox')
        if len(div_infobox) > 0:
            # Job title and detail-page link
            nameBox = div_infobox[0].select('.nameBox > div.jobName')
            if len(nameBox) > 0:
                jobname = nameBox[0].get_text()
                job_link = nameBox[0].select('a')[0].attrs['href']
            # Company name and company-page link
            companyBox = div_infobox[0].select('.nameBox > div.commpanyName')
            if len(companyBox) > 0:
                company_name = companyBox[0].get_text()
                company_link = companyBox[0].select('a')[0].attrs['href']
            # Job summary text, with the company summary appended
            jobDesc = div_infobox[0].select('.descBox > div.jobDesc')
            if len(jobDesc) > 0:
                jobadr = jobDesc[0].get_text()
            commpanyDesc = div_infobox[0].select('.descBox > div.commpanyDesc')
            if len(commpanyDesc) > 0:
                jobadr += " " + commpanyDesc[0].get_text()
            # Welfare tags, followed by the company status
            desc = ""
            job_welfare = div_infobox[0].select('div > div.job_welfare > div')
            for xvar in job_welfare:
                desc += xvar.get_text() + "; "
            commpanyStatus = div_infobox[0].select('div > div.commpanyStatus')
            if len(commpanyStatus) > 0:
                desc += "【" + commpanyStatus[0].get_text() + "】"
            messdict['职位链接'] = job_link
            messdict['职位'] = jobname
            messdict['公司'] = company_name
            messdict['公司链接'] = company_link
            messdict['相关性质'] = jobadr
            messdict['职责描述'] = desc
            message.append([job_link, jobname, company_name, company_link, jobadr, desc])
            message_dict.append(messdict)
    return message, message_dict
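The function returns the same records in two shapes: message, a list of lists in column order (consumed by the Excel writer in step 3), and message_dict, a list of dicts keyed by the CSV headers (consumed by csv.DictWriter). A quick sanity check, assuming html was fetched as in step 1:

items, others = parse_page_shezhao(html)
print(len(items), "postings parsed")
if others:
    # Print the job title and company of the first record
    print(others[0]['职位'], others[0]['公司'])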

3. Saving the data to a file

First option: save to a CSV file. The csv module ships with Python's standard library, so no pip install is needed.

import csv

def write_csv_headers(path, headers):
    # Write the header row
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        f_csv.writeheader()

def write_csv_rows(path, headers, rows):
    # Write the data rows
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        f_csv.writerows(rows)

def csv_write(csv_name, headers, html):
    write_csv_headers(csv_name, headers)
    items, others = parse_page_shezhao(html)
    # DictWriter consumes the list of dicts, keyed by the headers
    write_csv_rows(csv_name, headers, others)
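The gb18030 encoding is what lets Excel on a Chinese-locale Windows machine open the CSV without garbled characters; plain UTF-8 without a BOM often displays mojibake there. A minimal usage sketch (the URL is a placeholder, not the exact test URL from step 5):

headers = ['职位链接', '职位', '公司', '公司链接', '相关性质', '职责描述']
html = get_content('https://sou.zhaopin.com/?kw=java&kt=3')  # placeholder URL
csv_write('java_jobs.csv', headers, html)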

Second option: save to an .xls file, i.e. an Excel file, which requires pip3 install xlwt.

import xlwt

def excel_write(filename, html):
    # Create the workbook, declaring utf-8 encoding
    wb = xlwt.Workbook(encoding='utf-8')
    # Create a worksheet
    ws = wb.add_sheet('sheet1')
    # Header row
    headData = ['LINK_URL', '职位', '公司', '公司链接', '相关性质', '职责描述']
    # Write the header cells in bold
    for colnum in range(0, len(headData)):
        ws.write(0, colnum, headData[colnum], xlwt.easyxf('font: bold on'))
    # Data rows start at row 2 (index 1)
    index = 1
    items, others = parse_page_shezhao(html)
    for item in items:
        print(item)
        for i in range(0, len(headData)):
            # ws.write(row, column, value)
            ws.write(index, i, item[i])
        index += 1
    # Save the workbook to disk
    wb.save(filename)
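One caveat with xlwt: it only writes the legacy .xls format, which caps a worksheet at 65,536 rows, so a library such as openpyxl would be needed for .xlsx output. Usage mirrors the CSV path (again with a placeholder URL):

html = get_content('https://sou.zhaopin.com/?kw=java&kt=3')  # placeholder URL
excel_write('java_jobs.xls', html)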

4. main

def main():
    url = input("Enter the URL:\n")
    filename = input("Enter the output file name:\n")
    html = get_content(url)
    csv_name = filename + '.csv'
    headers = ['职位链接', '职位', '公司', '公司链接', '相关性质', '职责描述']
    csv_write(csv_name, headers, html)
    excel_name = filename + ".xls"
    excel_write(excel_name, html)

if __name__ == '__main__':
    main()

5. Results

Run the script, then enter the URL and the output file name:

Test URL: https://sou.zhaopin.com/?pageSize=60&jl=765&sf=10001&st=15000&kw=java&kt=3&=10001
Test file name: java岗位信息表

Output: the script produces java岗位信息表.csv and java岗位信息表.xls, one row per job posting.

