BeautifulSoup Study Notes 2

xiaoxiao 2021-02-28

1 Searching the tree: filters

As before, the "Alice in Wonderland" document serves as the example:

>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

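The previous note searched for tags by tag name. A tag has many attributes, and name is one of them. Note that accessing a tag by its name, as in soup.a, returns only the first tag with that name: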

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "html.parser")
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

The find_all() method returns all of the a tags.

>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> # findAll() is the older name for the same method and gives the same result
>>> soup.findAll('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

find_all() looks through all of the current tag's descendants (not only its direct children) and returns those that match the filter.
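To see what "the current tag's descendants" means in practice, here is a minimal sketch. The HTML snippet is made up for illustration (it is not the Alice document), and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two <p> tags inside the <div> (one nested deeper), one outside.
html = "<div><p>one</p><span><p>two</p></span></div><p>three</p>"
soup = BeautifulSoup(html, "html.parser")

# Called on a tag, find_all() only looks below that tag, at any depth.
inside_div = [p.get_text() for p in soup.div.find_all("p")]

# Called on the soup object, it looks through the whole document.
everywhere = [p.get_text() for p in soup.find_all("p")]

print(inside_div)   # ['one', 'two']
print(everywhere)   # ['one', 'two', 'three']
```

The nested "two" paragraph is found from the div even though it is not a direct child.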

Kinds of filters: the common ones are a string, a regular expression, a list, True, and a function.

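1.1 A string

A string is the simplest filter. The calls above passed the string 'a' to the search method; BeautifulSoup then looks for tags whose name matches that string exactly.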

>>> soup.find_all('b')
[<b>The Dormouse's story</b>]
>>> soup.find_all('title')
[<title>The Dormouse's story</title>]
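A small sketch of the point that a string filter compares against the whole tag name rather than a prefix. The snippet below is a made-up example, parsed with "html.parser" as an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet containing both <body> and <b>.
html = "<html><body><p><b>bold</b> and <a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# "b" matches only tags named exactly "b"; <body> is not included.
b_names = [tag.name for tag in soup.find_all("b")]
print(b_names)  # ['b']
```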

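1.2 A regular expression

If you pass a compiled regular expression, Beautiful Soup filters tag names with the pattern's search() method, so the pattern may match anywhere in the name unless it is anchored. The pattern "^b" below therefore matches only names that start with b.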

>>> import re
>>> tags = soup.find_all(re.compile("^b"))
>>> tags
[<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>, <b>The Dormouse's story</b>]
>>> for tag in tags:
        print(tag.name)

body
b
>>> # text= searches the strings in the document; since bs4 4.4.0 this argument is also available as string=
>>> text1 = soup.find_all(text=re.compile("sisters"))
>>> text1
['Once upon a time there were three little sisters; and their names were\n']
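To check how a pattern is applied to tag names, the following sketch compares an anchored and an unanchored pattern. The HTML is a made-up snippet and "html.parser" is assumed. As far as I can tell from current bs4 releases, the unanchored pattern matches anywhere in the name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet with tag names html, head, title, body, b.
html = "<html><head><title>T</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored: only names that start with "b".
anchored = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any name that contains "t" somewhere.
unanchored = [tag.name for tag in soup.find_all(re.compile("t"))]

print(anchored)    # ['body', 'b']
print(unanchored)  # ['html', 'title']
```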

1.3 A list

If you pass a list, Beautiful Soup returns every tag whose name matches any item in the list.

>>> soup.find_all(['a','b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
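Note that the b tag comes first in the result even though 'a' is listed first. A minimal sketch of this, using a made-up snippet and "html.parser" as assumptions: results follow document order, not the order of the list.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <b> appears before <a> in the document.
html = "<p><b>bold</b><i>italic</i><a href='#'>link</a></p>"
soup = BeautifulSoup(html, "html.parser")

# The list says a-then-b, but matches are returned as they occur in the document.
names = [tag.name for tag in soup.find_all(["a", "b"])]
print(names)  # ['b', 'a']
```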

1.4 True

True matches any value. The code below finds every tag in the document, but none of the text strings:

>>> for tag in soup.find_all(True):
        print(tag.name)

html
head
title
body
p
b
p
a
a
a
p
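True can also be given as the value of an attribute filter, which selects tags that have the attribute regardless of its value. The sketch below uses a made-up snippet and assumes "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: only the <p> carries an id attribute.
html = '<p id="intro">Hello <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")

# As a name filter, True matches every tag (text strings are not returned).
all_names = [tag.name for tag in soup.find_all(True)]

# As an attribute value, True matches any tag that has that attribute.
with_id = [tag.name for tag in soup.find_all(id=True)]

print(all_names)  # ['p', 'b']
print(with_id)    # ['p']
```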

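1.5 A function

You can define your own function and use it as a filter. It takes a tag as its only argument and returns True if the tag matches: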

>>> from bs4 import NavigableString
>>> def surrounded_by_strings(tag):
        return (isinstance(tag.next_element, NavigableString)
                and isinstance(tag.previous_element, NavigableString))

>>> for tag in soup.find_all(surrounded_by_strings):
        print(tag.name)

body
p
a
a
a
p
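Another function filter, as a sketch: select tags that have a class attribute but no id. The function name has_class_but_no_id and the snippet are illustrative, and "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

def has_class_but_no_id(tag):
    """Return True for tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")

# Hypothetical snippet: the <a> has both class and id, the <p> tags only class.
html = ('<p class="title">t</p>'
        '<a class="sister" id="link1">Elsie</a>'
        '<p class="story">s</p>')
soup = BeautifulSoup(html, "html.parser")

# find_all() calls the function once per tag and keeps the tags it accepts.
matches = [tag.name for tag in soup.find_all(has_class_but_no_id)]
print(matches)  # ['p', 'p']
```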

The next note covers the find_all() method in detail.

When reposting, please credit the original: https://www.6miu.com/read-40540.html
