Learning Python Web Scraping: Day 12

xiaoxiao  2021-02-28

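Today I learned how to use BeautifulSoup functions to fetch specific nodes, and how to navigate from the current node to its child, descendant, sibling, and parent nodes.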

Exercise 1: use findAll to extract only the text inside <span class="green"> tags. While I was at it, I also extracted the contents of the class='red' tags.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    r = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
    # Name the parser explicitly so bs4 does not warn about guessing one
    bsObj = BeautifulSoup(r, 'html.parser')
    persons = bsObj.findAll('span', {'class': 'green'})
    conversations = bsObj.findAll('span', {'class': 'red'})
    for name in persons:
        print(name.get_text())
    print('\n')
    for talks in conversations:
        print(talks.get_text())

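Exercise 2: finding HTML elements by matching content

Finding HTML elements is what I practiced yesterday with the find/findAll functions. Their tag and attributes arguments are enough to retrieve most tags. Beyond that, the text argument (used in the example below) matches on text content instead of tag properties. Note that a plain string passed as text must equal a tag's text exactly; it is not a substring match.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    r = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
    bsObj = BeautifulSoup(r, 'html.parser')
    # Collect every string in the page that is exactly 'the prince'
    test = bsObj.findAll(text='the prince')
    print(len(test))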
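To make the exact-match behaviour of the text argument concrete, here is a minimal offline sketch. The HTML snippet is invented for illustration, and the code assumes bs4 4.4 or later, where the same argument is also available under the name `string`. Passing a compiled regular expression instead of a plain string is one way to get substring matching.

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet for illustration; not taken from the practice page.
html = (
    '<div>'
    '<span class="green">the prince</span>'
    '<span class="green">the prince\'s father</span>'
    '<p>Then the prince spoke.</p>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# A plain string only matches text that is exactly equal to it.
exact = soup.find_all(string='the prince')

# A compiled regex is searched inside each string, so substrings match too.
partial = soup.find_all(string=re.compile('the prince'))

print(len(exact))
print(len(partial))
```

Both calls return the matching strings themselves, not the tags that contain them; use `.parent` on a result to get back to its tag.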

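Exercise 3: child tags and descendant tags (note the difference)

A child tag sits exactly one level below its parent tag, while a descendant tag is any tag at any level below the parent. Every child tag is a descendant tag, but not every descendant tag is a child tag.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    r = urlopen('http://www.pythonscraping.com/pages/page3.html')
    bsObj = BeautifulSoup(r, 'html.parser')
    # Direct children of the table only
    for child in bsObj.find('table', {'id': 'giftList'}).children:
        print(child)
    print('\n')
    # Everything nested under the table, at any depth
    for descendant in bsObj.find('table', {'id': 'giftList'}).descendants:
        print(descendant)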
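The difference between children and descendants is easier to see on a tiny table that needs no network access. This is only a sketch: the two-row table below is invented, and only its id mirrors the one used in the exercise.

```python
from bs4 import BeautifulSoup, Tag

# Invented two-row table; only the id mirrors the practice page.
html = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Cost</th></tr>'
    '<tr><td>Vase</td><td>$15.00</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id': 'giftList'})

# .children yields only the nodes one level below the table.
child_tags = [c.name for c in table.children if isinstance(c, Tag)]

# .descendants walks every level; text nodes are filtered out here.
descendant_tags = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```

The rows appear in both lists, while the cells appear only among the descendants.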

Exercise 4: getting sibling nodes with next_siblings

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    r = urlopen('http://www.pythonscraping.com/pages/page3.html')
    bsObj = BeautifulSoup(r, 'html.parser')
    # Start from the first row; next_siblings yields only the nodes after it
    for sibling in bsObj.find('table', {'id': 'giftList'}).tr.next_siblings:
        print(sibling)
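One detail worth checking by hand is that a node is never its own sibling, so starting from the first row skips that row. The sketch below uses an invented three-row table (the row ids are hypothetical) and also shows previous_siblings, the mirror image of next_siblings.

```python
from bs4 import BeautifulSoup

# Invented table; the row ids are hypothetical and exist only for this demo.
html = (
    '<table id="giftList">'
    '<tr id="header"><th>Item</th></tr>'
    '<tr id="gift1"><td>Vase</td></tr>'
    '<tr id="gift2"><td>Lamp</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, 'html.parser')

first_row = soup.find('table', {'id': 'giftList'}).tr
# Everything after the first row; the first row itself is not included.
later_ids = [row['id'] for row in first_row.next_siblings]

last_row = soup.find('tr', {'id': 'gift2'})
# previous_siblings walks backwards, nearest sibling first.
earlier_ids = [row['id'] for row in last_row.previous_siblings]

print(later_ids)
print(earlier_ids)
```

On a real page the rows are usually separated by whitespace, so the siblings will also include text nodes, which do not support `row['id']`.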

Exercise 5: working with parent nodes via parent/parents

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    r = urlopen('http://www.pythonscraping.com/pages/page3.html')
    bsObj = BeautifulSoup(r, 'html.parser')
    # From the image, go up to its enclosing node, then one node to the left
    money = bsObj.find('img', {'src': '../img/gifts/img1.jpg'}).parent.previous_sibling
    print(money.get_text())

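The exercise only uses parent, so here is a small offline sketch that also shows parents. The one-row table is invented; only the image path is copied from the exercise above.

```python
from bs4 import BeautifulSoup

# Invented one-row table; only the image path is copied from the exercise.
html = (
    '<table><tr>'
    '<td>Vase</td><td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"></td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')
img = soup.find('img', {'src': '../img/gifts/img1.jpg'})

# .parent is the single enclosing tag: the cell that holds the image.
cell = img.parent
# The cell immediately before it holds the price.
price = cell.previous_sibling.get_text()

# .parents walks upward through every ancestor, ending at the document itself.
ancestors = [p.name for p in img.parents]

print(price)
print(ancestors)
```

The last entry of `ancestors` is the BeautifulSoup object, which reports its name as `[document]`.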
When reposting, please cite the original URL: https://www.6miu.com/read-37264.html
