BeautifulSoup Study Notes 2

xiaoxiao2021-02-28  118

1 Searching the tree: filters

We again use the "Alice in Wonderland" snippet as the example:

>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

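The previous note searched for tags by tag name. A tag has many attributes, and name is one of them. Note that accessing a tag by its name as an attribute of the soup (for example soup.a) returns only the first tag with that name: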

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc,"html")
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

The find_all() method returns all of the a tags.

>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> # findAll() is the older name for the same method and gives the same result
>>> soup.findAll('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The find_all() method looks through all tag descendants of the current tag and checks whether each one matches the filter.
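The difference between attribute access and find_all() can be sketched as follows. This is a minimal illustration, assuming the built-in "html.parser" backend (chosen here only so that the result does not depend on which third-party parser is installed); the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Assumption: "html.parser" is used instead of the notes' "html" for determinism.
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access returns only the first <a> tag.
first_link = soup.a

# find_all() returns every <a> tag below the object it is called on.
link_ids = [a["id"] for a in soup.find_all("a")]

# Called on the first <p>, find_all() only sees that paragraph's descendants,
# and the first <p> contains no links.
links_in_first_p = soup.p.find_all("a")

print(first_link["id"])
print(link_ids)
print(len(links_in_first_p))
```

Because the search starts from the object find_all() is called on, narrowing to a sub-tag first is a simple way to limit the scope of a search.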

Kinds of filters: four kinds are commonly listed: a string, a regular expression, a list, and a function. The value True (section 1.4) can also be used as a filter.

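1.1 Strings

A string is the simplest filter. The examples above passed the string 'a' to the search method; Beautiful Soup looks for tags whose name matches that string exactly.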

>>> soup.find_all('b')
[<b>The Dormouse's story</b>]
>>> soup.find_all('title')
[<title>The Dormouse's story</title>]
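A small sketch of what "exact" means for a string filter: 'b' matches the b tag but not body, and a partial name matches nothing. The markup below is a trimmed stand-in for html_doc, and "html.parser" is an assumption made for reproducibility.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the notes' html_doc (an assumption, for brevity).
html = "<html><head><title>T</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# A string filter compares against the whole tag name.
exact = [tag.name for tag in soup.find_all("b")]
partial = [tag.name for tag in soup.find_all("bo")]

print(exact)    # only the <b> tag, not <body>
print(partial)  # nothing is named "bo"
```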

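1.2 Regular expressions

If you pass a compiled regular expression, Beautiful Soup filters tag names with the pattern's search() method, so a pattern can match anywhere in the name unless it is anchored. The pattern "^b" below is anchored and therefore matches only names that start with "b".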

>>> import re
>>> tags = soup.find_all(re.compile("^b"))
>>> tags
[<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>, <b>The Dormouse's story</b>]
>>> for tag in tags:
...     print(tag.name)
...
body
b
>>> # text= is the older spelling of the string= argument
>>> text1 = soup.find_all(text=re.compile("sisters"))
>>> text1
['Once upon a time there were three little sisters; and their names were\n']
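The effect of anchoring can be checked with a short sketch. It assumes a trimmed stand-in for html_doc and the "html.parser" backend; the helper names() is hypothetical. An unanchored pattern matches a tag name if the pattern occurs anywhere in it, which is the behaviour of re.search() rather than re.match().

```python
import re

from bs4 import BeautifulSoup

# Trimmed stand-in for the notes' html_doc (an assumption, for brevity).
html = "<html><head><title>T</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")


def names(pattern):
    """Return the names of all tags whose name matches the given pattern."""
    return [tag.name for tag in soup.find_all(re.compile(pattern))]


anchored = names("^b")   # names that start with "b"
unanchored = names("d")  # names that contain "d" anywhere

print(anchored)
print(unanchored)
```

Here "d" finds head and body even though neither name starts with "d", so anchoring with "^" is needed whenever a prefix match is intended.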

1.3 Lists

If you pass a list, Beautiful Soup returns every tag that matches any item in the list.

>>> soup.find_all(['a','b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
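One detail visible in the output above is that results come back in document order, not in the order of the list. A minimal sketch, assuming a trimmed stand-in for html_doc and the "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the notes' html_doc (an assumption, for brevity).
html = "<html><head><title>T</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# "b" is listed first, but <title> appears earlier in the document.
found = [tag.name for tag in soup.find_all(["b", "title"])]

print(found)
```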

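1.4 True

The value True matches every tag. The code below finds all the tags in the document, but none of the text strings.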

>>> for tag in soup.find_all(True):
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
a
p
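To see that True selects tags only, the result can be compared with the full list of descendants, which also contains the text nodes. This is a sketch under the same assumptions as before (trimmed markup, "html.parser"); the variable names are hypothetical.

```python
from bs4 import BeautifulSoup, Tag

# Trimmed stand-in for the notes' html_doc (an assumption, for brevity).
html = "<html><head><title>T</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

tags = soup.find_all(True)
tag_names = [tag.name for tag in tags]

# .descendants yields tags and text nodes alike.
node_count = len(list(soup.descendants))

print(tag_names)
print(all(isinstance(tag, Tag) for tag in tags))
print(node_count > len(tags))  # the extra nodes are the text strings
```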

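1.5 Functions

You can define your own function as a filter. It takes a tag as its only argument and returns True if the tag matches, False otherwise: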

>>> from bs4 import NavigableString
>>> def surrounded_by_strings(tag):
...     return (isinstance(tag.next_element, NavigableString)
...             and isinstance(tag.previous_element, NavigableString))
...
>>> for tag in soup.find_all(surrounded_by_strings):
...     print(tag.name)
...
body
p
a
a
a
p
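A function filter can also inspect a tag's attributes. The sketch below keeps tags that have a class attribute but no id attribute; the markup is a shortened, hypothetical variant of html_doc, and "html.parser" is again an assumption.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical variant of the notes' html_doc.
html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story"><a class="sister" id="link1">Elsie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Match tags that define a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
matched_names = [tag.name for tag in matches]
matched_classes = [tag["class"] for tag in matches]

print(matched_names)
print(matched_classes)
```

The a tag is skipped because it carries an id, and the b tag is skipped because it has no class, so only the two p tags remain.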

The next note covers the find_all() method in detail.

When reposting, please credit the original URL: https://www.6miu.com/read-40540.html
