BeautifulSoup学习笔记7

xiaoxiao2021-02-28  20

1 parsers

刚看bs4文档,就看到这样的代码:

>>> from bs4 import BeautifulSoup >>> soup = BeautifulSoup(html_doc,"html.parser") >>> soup.a <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

这里的html.parser是Python标准库中的HTML解析器。BeautifulSoup还支持第三方的解析器,比如lxml和html5lib。

一般来说,lxml解析速度最快,效果会更好。

截图来自https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

2 Encodings

每个网页都有一种特定的编码方式,国内常用的网页编码方式有utf-8和gb2312两种。 使用BeautifulSoup解析后,文档都会被转换成Unicode。

2.1 网页源码的前几行会有相关的编码信息:

<meta charset="utf-8"> <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

2.2 original_encoding属性记录了自动识别编码的结果

In [7]: soup.original_encoding Out[7]: 'gb2312'

2.3 通过Beautiful Soup输出文档时,不管输入文档是什么编码方式,输出编码均为UTF-8编码:

如果不想用utf-8格式输出,可以自己传入编码方式参数:

2.4 UnicodeDammit

A class for detecting the encoding of a *ML document and converting it to a Unicode string. If the source encoding is windows-1252, can replace MS smart quotes with their HTML or XML equivalents.

>>> from bs4 import UnicodeDammit >>> dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!") >>> print(dammit.unicode_markup) Sacré bleu! >>> dammit.original_encoding >>> >>> snowmen = (u"\N{SNOWMAN}" * 3) >>> quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}") >>> doc = snowmen.encode("utf8") + quote.encode("windows_1252") >>> print(doc) b'\xe2\x98\x83\xe2\x98\x83\xe2\x98\x83\x93I like snowmen!\x94' >>> print(doc.decode("windows_1252")) ☃☃☃“I like snowmen!” >>> >>> >>> new_doc = UnicodeDammit.detwingle(doc) >>> print(new_doc.decode('utf-8')) ☃☃☃“I like snowmen!” >>>

UnicodeDammit.detwingle() 方法能解码包含在UTF-8编码中的Windows-1252编码内容,这已经解决了最常见的一类问题.

在创建 BeautifulSoup 或 UnicodeDammit 对象前一定要先对文档调用 UnicodeDammit.detwingle() 确保文档的编码方式正确。如果尝试去解析一段包含Windows-1252编码的UTF-8文档,就会得到一堆乱码,比如: ☃☃☃“I like snowmen!”。

>>> help(UnicodeDammit) Help on class UnicodeDammit in module bs4.dammit: class UnicodeDammit(builtins.object) | A class for detecting the encoding of a *ML document and | converting it to a Unicode string. If the source encoding is | windows-1252, can replace MS smart quotes with their HTML or XML | equivalents. | | Methods defined here: | | __init__(self, markup, override_encodings=[], smart_quotes_to=None, is_html=False, exclude_encodings=[]) | Initialize self. See help(type(self)) for accurate signature. | | find_codec(self, charset) | | ---------------------------------------------------------------------- | Class methods defined here: | | detwingle(in_bytes, main_encoding='utf8', embedded_encoding='windows-1252') from builtins.type | Fix characters from one encoding embedded in some other encoding. | | Currently the only situation supported is Windows-1252 (or its | subset ISO-8859-1), embedded in UTF-8. | | The input must be a bytestring. If you've already converted | the document to Unicode, you're too late. | | The output is a bytestring in which `embedded_encoding` | characters have been converted to their `main_encoding` | equivalents. ......

最后,Python中安装了 chardet 或 cchardet 那么编码检测功能的准确率将大大提高。

3 diagnose()

使用diagnose()方法,BeautifulSoup会输出一份报告,说明不同的解析器会怎样处理这段文档,并标出当前的解析过程会使用哪种解析器:

In [19]: test_doc = "\xc3\xc3\xd7\xd3\xcd\xbc - \xc7\xe5\xb4\xbf\xc3\xc0\xc5\xae,\xbf\xc9\xb0\xae\xc3\xc0\xc5\xae,\xc3\xc0\xc5\xae\xcd\xbc\xc6\xac" ...: In [20]: from bs4.diagnose import diagnose In [21]: diagnose(test_doc) Diagnostic running on Beautiful Soup 4.6.0 Python version 3.6.1 |Anaconda 4.4.0 (32-bit)| (default, May 11 2017, 14:16:49) [MSC v.1900 32 bit (Intel)] Found lxml version 3.7.3.0 Found html5lib version 0.999 Trying to parse your markup with html.parser Here's what html.parser did with the markup: ÃÃ×Óͼ - Çå´¿ÃÀÅ®,¿É°®ÃÀÅ®,ÃÀŮͼƬ -------------------------------------------------------------------------------- Trying to parse your markup with html5lib Here's what html5lib did with the markup: <html> <head> </head> <body> ÃÃ×Óͼ - Çå´¿ÃÀÅ®,¿É°®ÃÀÅ®,ÃÀŮͼƬ </body> </html> -------------------------------------------------------------------------------- Trying to parse your markup with lxml Here's what lxml did with the markup: <html> <body> <p> ÃÃ×Óͼ - Çå´¿ÃÀÅ®,¿É°®ÃÀÅ®,ÃÀŮͼƬ </p> </body> </html> -------------------------------------------------------------------------------- Trying to parse your markup with ['lxml', 'xml'] Here's what ['lxml', 'xml'] did with the markup: <?xml version="1.0" encoding="utf-8"?> -------------------------------------------------------------------------------- C:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available XML parser for this system ("lxml-xml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. The code that caused this warning is on line 231 of the file C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\ipython\start_kernel.py. To get rid of this warning, change code that looks like this: BeautifulSoup(YOUR_MARKUP}) to this: BeautifulSoup(YOUR_MARKUP, "lxml-xml") markup_type=markup_type))
转载请注明原文地址: https://www.6miu.com/read-200152.html

最新回复(0)