Python Web Scraping (1): An Introduction to BeautifulSoup

xiaoxiao, 2021-02-28

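The BeautifulSoup library takes its name from the poem of the same name in Lewis Carroll's Alice's Adventures in Wonderland. BeautifulSoup formats and organizes messy web content by locating HTML tags, and it exposes the XML structure as simple, easy-to-use Python objects.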

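I. Installing BeautifulSoup

1. Windows

① Install pip (select it as an option when installing Python 3).
② Install bs4 with pip from the command line: `pip install beautifulsoup4`

The examples below also use the lxml parser and the requests library, which are separate packages: `pip install lxml requests`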
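
After installation, one quick way to confirm that the packages can be imported is a check like the one below. This is a minimal sketch: the helper name `is_installed` is hypothetical, and lxml and requests are listed only because the examples in this article use them.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this import name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Names as they are imported, which can differ from the pip package name
# (the beautifulsoup4 package is imported as bs4).
for name in ("bs4", "lxml", "requests"):
    print(name, "installed" if is_installed(name) else "missing")
```

If a line reports "missing", running the corresponding pip command again is usually enough.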

II. Running BeautifulSoup

Parsing a local web page

    from bs4 import BeautifulSoup  # bs4 is short for BeautifulSoup4

    # Open the local HTML file with open()
    with open(r'E:\PycharmProjects\web_prase\new_index.html') as web_data:
        soup = BeautifulSoup(web_data.read(), 'lxml')  # parse the page with lxml
        print(soup.h2)

The output is:

<h2>Article</h2>
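
To try the same idea without the local file, the HTML can be passed in as a string. The following is only a sketch: the sample markup is made up for illustration (it is not the content of new_index.html), and it uses Python's built-in `html.parser` instead of lxml so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Made-up sample page, standing in for the local file.
sample_html = """
<html>
  <body>
    <h1>Sample page</h1>
    <h2>Article</h2>
    <h2>Second heading</h2>
  </body>
</html>
"""

# html.parser ships with Python, so lxml is not needed for this sketch.
soup = BeautifulSoup(sample_html, "html.parser")

first_h2 = soup.h2           # attribute access returns the first <h2> tag
print(first_h2)              # the whole tag, markup included
print(first_h2.get_text())   # only the text inside the tag
```

Note that `soup.h2` returns only the first matching tag, even though the sample contains two `<h2>` elements.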

Parsing an online web page

    from bs4 import BeautifulSoup
    import requests

    url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
    web_data = requests.get(url)  # fetch the page with the requests library
    soup = BeautifulSoup(web_data.text, 'lxml')
    print(soup)

Or:

    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    # Fetch the page with the urllib module
    html = urlopen('https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html')
    soup = BeautifulSoup(html.read(), 'lxml')

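III. Reliable network connections

    html = urlopen('https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html')

Two main things can go wrong with this line:

① The page does not exist on the server (or an error occurs while retrieving it).
② The server does not exist.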

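In the first case, the server returns an HTTP error such as "404 Page Not Found" or "500 Internal Server Error". In all such cases urlopen raises an HTTPError exception, which can be handled like this:

    from urllib.request import urlopen
    from urllib.error import HTTPError

    try:
        html = urlopen('https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html')
    except HTTPError as e:
        print(e)
        # return None, stop the program, or fall back to another plan
    else:
        # the program continues
        pass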

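If the server does not exist (that is, https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html cannot be opened, or the URL was mistyped), urlopen does not return None; it raises a URLError. A check such as "if html is None" therefore never runs, and the exception has to be caught instead. HTTPError is a subclass of URLError, so when both are handled separately, the HTTPError clause must come first.

    from urllib.request import urlopen
    from urllib.error import URLError

    try:
        html = urlopen('https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html')
    except URLError as e:
        print("URL is not found")
    else:
        # the program continues
        pass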
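
The two failure cases can be combined into a single helper. The sketch below is one possible arrangement, not the only one: the function name `fetch_html` and its `opener` parameter are hypothetical, and `opener` exists only so the function can be exercised without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, opener=urlopen):
    """Return the response body as bytes, or None if the request fails.

    `opener` defaults to urllib's urlopen; passing another callable is
    an illustrative hook for trying the function offline.
    """
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # This clause comes first because HTTPError subclasses URLError.
        print("HTTP error:", e.code)
        return None
    except URLError as e:
        # The server could not be reached (bad host name, no connection, ...).
        print("Server not reachable:", e.reason)
        return None
    else:
        return response.read()
```

With this helper, the calling code can test the return value against None, for example `html = fetch_html(url)` followed by `if html is None:`, because the helper itself converts the exceptions into a None result.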

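IV. Parsing complex HTML

When searching a complex page for information, you first have to chip away the parts of the page you do not need before you can reach the target.

Finding tags by their attributes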

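CSS differentiates HTML elements, so that elements with otherwise identical markup can be styled differently. For example, some tags look like this:

<span class="green"></span>

while others look like this:

<span class="red"></span>

A web scraper can easily tell the two kinds of tags apart by the value of the class attribute.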
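
As a rough illustration of selecting tags by class, the sketch below uses BeautifulSoup's `find_all` on a small made-up snippet. The sample markup and the variable names are assumptions for demonstration only, and `html.parser` is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up markup containing both kinds of <span> tags.
sample_html = (
    '<p>'
    '<span class="green">first green</span>'
    '<span class="red">only red</span>'
    '<span class="green">second green</span>'
    '</p>'
)

soup = BeautifulSoup(sample_html, "html.parser")

# Two equivalent ways to filter on the class attribute:
green_tags = soup.find_all("span", {"class": "green"})  # attribute dictionary
red_tags = soup.find_all("span", class_="red")          # class_ keyword

# get_text() strips the markup and keeps only the text content.
green_text = [tag.get_text() for tag in green_tags]
red_text = [tag.get_text() for tag in red_tags]

print(green_text)
print(red_text)
```

The keyword is spelled `class_` with a trailing underscore because `class` is a reserved word in Python.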