BeautifulSoup Study Notes 1


Beautiful Soup transforms a complex HTML document into a tree of Python objects. Every node in the tree is a Python object, and those objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

The HTML snippet below, a passage based on Alice in Wonderland, will serve as the running example for introducing these objects. (The Beautiful Soup library takes its name from the poem of the same name in Lewis Carroll's Alice's Adventures in Wonderland.)

>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
... <p class="story">...</p>
... """

1 BeautifulSoup

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "html.parser")
>>> soup
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object: it supports most of the methods described under "Navigating the tree" and "Searching the tree". Because the BeautifulSoup object is not a real HTML or XML tag, it has no name and no attributes, but it is sometimes convenient to look at its .name, so it has been given the special value "[document]":

>>> soup.name
'[document]'
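Because the soup behaves like a Tag, the tree-searching methods can be called on it directly. A small sketch continuing the session above (find_all and attribute-style tag access are standard Beautiful Soup features):

>>> soup.find_all("a")      # tree-searching methods work on the soup itself
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.title              # so does attribute-style access to tags
<title>The Dormouse's story</title>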

2 Tag

A Tag object corresponds to an XML or HTML tag in the original document:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "html.parser")
>>> tag1 = soup.a
>>> tag1
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> type(tag1)
<class 'bs4.element.Tag'>

A Tag has many methods and attributes:

>>> dir(tag1)
['HTML_FORMATTERS', 'XML_FORMATTERS', '__bool__', '__call__', '__class__', '__contains__', '__copy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_strings', '_attr_value_as_string', '_attribute_checker', '_find_all', '_find_one', '_formatter_for_name', '_is_xml', '_lastRecursiveChild', '_last_descendant', '_select_debug', '_selector_combinators', '_should_pretty_print', '_tag_name_matches_and', 'append', 'attribselect_re', 'attrs', 'can_be_empty_element', 'childGenerator', 'children', 'clear', 'contents', 'decode', 'decode_contents', 'decompose', 'descendants', 'encode', 'encode_contents', 'extract', 'fetchNextSiblings', 'fetchParents', 'fetchPrevious', 'fetchPreviousSiblings', 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings', 'find_all', 'find_all_next', 'find_all_previous', 'find_next', 'find_next_sibling', 'find_next_siblings', 'find_parent', 'find_parents', 'find_previous', 'find_previous_sibling', 'find_previous_siblings', 'format_string', 'get', 'getText', 'get_attribute_list', 'get_text', 'has_attr', 'has_key', 'hidden', 'index', 'insert', 'insert_after', 'insert_before', 'isSelfClosing', 'is_empty_element', 'known_xml', 'name', 'namespace', 'next', 'nextGenerator', 'nextSibling', 'nextSiblingGenerator', 'next_element', 'next_elements', 'next_sibling', 'next_siblings', 'parent', 'parentGenerator', 'parents', 'parserClass', 'parser_class', 'prefix', 'preserve_whitespace_tags', 'prettify', 'previous', 'previousGenerator', 'previousSibling', 'previousSiblingGenerator', 'previous_element', 'previous_elements', 'previous_sibling', 'previous_siblings', 'quoted_colon', 'recursiveChildGenerator', 'renderContents', 'replaceWith', 'replaceWithChildren', 'replace_with', 'replace_with_children', 'select', 'select_one', 'setup', 'string', 'strings', 'stripped_strings', 'tag_name_re', 'text', 'unwrap', 'wrap']
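A few of the names in that listing come up constantly. As a small sketch, continuing the tag1 session above, here is what some of them return:

>>> tag1.get("href")            # dictionary-style access, returns None if missing
'http://example.com/elsie'
>>> tag1.text                   # all text inside the tag as one string
'Elsie'
>>> tag1.parent.name            # the enclosing tag
'p'
>>> tag1.find_next_sibling("a") # the next <a> at the same level
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>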

2.1 .name

Every tag has a name, accessible as .name:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "html.parser")
>>> tag1 = soup.a
>>> tag1.name
'a'

2.2 .attrs

A tag may have any number of attributes. You can access a tag’s attributes by treating the tag like a dictionary:

>>> tag1.attrs
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
>>> tag1["class"]
['sister']
>>> tag1["class"] = "SISTER"
>>> tag1
<a class="SISTER" href="http://example.com/elsie" id="link1">Elsie</a>
>>> del tag1['id']
>>> tag1
<a class="SISTER" href="http://example.com/elsie">Elsie</a>
>>> tag1['id']
Traceback (most recent call last):
  File "<pyshell#24>", line 1, in <module>
    tag1['id']
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 'id'
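Since indexing a missing attribute raises KeyError, the get() method (visible in the dir() listing above) is the safer lookup when an attribute may be absent. A short sketch, continuing from the session above:

>>> tag1.get('id') is None      # 'id' was deleted above, so get() returns None
True
>>> tag1.get('id', 'no-id')     # or supply your own default
'no-id'
>>> tag1.has_attr('href')       # check for an attribute without touching its value
True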

2.3 Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "html.parser")
>>> tag2 = soup.a
>>> tag2
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> tag2['class'] = ['body', 'strikeout']
>>> tag2
<a class="body strikeout" href="http://example.com/elsie" id="link1">Elsie</a>
>>> tag2['class']
['body', 'strikeout']
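The session above builds the list by hand; the same splitting happens when the class value comes straight from the parsed markup. A small standalone sketch, using html.parser as elsewhere in these notes:

>>> from bs4 import BeautifulSoup
>>> css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
>>> css_soup.p['class']     # class is multi-valued, so the value is split into a list
['body', 'strikeout']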

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

>>> tag2['class'] = "body strikeout"
>>> tag2
<a class="body strikeout" href="http://example.com/elsie" id="link1">Elsie</a>
>>> tag2['class']
'body strikeout'
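The assignment above mainly shows that Beautiful Soup stores whatever value you give it; the rule quoted above is easier to see with an attribute such as id, which no version of HTML defines as multi-valued. A small standalone sketch:

>>> from bs4 import BeautifulSoup
>>> id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
>>> id_soup.p['id']         # id is not multi-valued, so the value stays a single string
'my id'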

(If you parse a document as XML, there are no multi-valued attributes.)
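A sketch of that last point; note that the "xml" parser requires the lxml library to be installed:

>>> from bs4 import BeautifulSoup
>>> xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
>>> xml_soup.p['class']     # parsed as XML, class is just an ordinary string
'body strikeout'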

3 NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', "html.parser")
>>> tag2 = soup.p
>>> tag2
<p>Back to the <a rel="index">homepage</a></p>
>>> tag2.a.string
'homepage'
>>> type(tag2.a.string)
<class 'bs4.element.NavigableString'>

Strings matter, of course: most of the useful information in a web page is text contained inside tags, and the example above only shows the simplest way of getting at the string inside a single tag.
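For pulling text out of a larger tag or a whole document, the standard get_text() method and find_all() are the usual tools. A small sketch, re-parsing the html_doc snippet from the beginning of these notes:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "html.parser")
>>> soup.title.get_text()                       # all text inside <title>
"The Dormouse's story"
>>> [a.get_text() for a in soup.find_all("a")]  # the text of every link
['Elsie', 'Lacie', 'Tillie']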

4 Comments and other special strings

A Comment object is a special type of NavigableString:

>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
>>> soup3 = BeautifulSoup(markup, "html.parser")
>>> comment = soup3.b.string
>>> comment
'Hey, buddy. Want to buy a used parser?'
>>> type(comment)
<class 'bs4.element.Comment'>

When it appears as part of an HTML document, a Comment object is rendered with special formatting:

>>> soup3.b.prettify()
'<b>\n <!--Hey, buddy. Want to buy a used parser?-->\n</b>'
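One common practical use of the Comment class is collecting (or stripping) every comment in a document by checking each string's type inside find_all; a minimal sketch continuing the session above (the argument is called string= in recent versions of Beautiful Soup, text= in older ones):

>>> from bs4 import Comment
>>> soup3.find_all(string=lambda text: isinstance(text, Comment))
['Hey, buddy. Want to buy a used parser?']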

Beautiful Soup defines other classes that may show up in XML documents: CData, ProcessingInstruction, Declaration, and Doctype. Like Comment, these classes are subclasses of NavigableString that add something extra to the string. Here is an example that replaces the comment with a CDATA block:

>>> from bs4 import CData
>>> cdata = CData("Paranoia")
>>> comment.replace_with(cdata)
'Hey, buddy. Want to buy a used parser?'
>>> soup3
<b><![CDATA[Paranoia]]></b>
>>> soup3.b.prettify()
'<b>\n <![CDATA[Paranoia]]>\n</b>'

I should have written up these notes a couple of days ago, but I was not in the mood. Also, from now on there will be no more winter or summer vacations, so I might as well treat this as one extra summer break: learn what needs to be learned, spend more time with family, and then go look for a job.

