以网站http://www.pythonscraping.com/pages/page3.html为例。
<html> <head> <style> img{ width:75px; } table{ width:50%; } td{ margin:10px; padding:10px; } .wrapper{ width:800px; } .excitingNote{ font-style:italic; font-weight:bold; } </style> </head> <body> <div id="wrapper"> <img src="../img/gifts/logo.jpg" style="float:left;"> <h1>Totally Normal Gifts</h1> <div id="content">Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is hand-curated by well-paid, free-range Tibetan monks.<p> We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br> 123 Main St.<br> Abuja, Nigeria </br>We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</div> <table id="giftList"> <tr><th> Item Title </th><th> Description </th><th> Cost </th><th> Image </th></tr> <tr id="gift1" class="gift"><td> Vegetable Basket </td><td> This vegetable basket is the perfect gift for your health conscious (or overweight) friends! <span class="excitingNote">Now with super-colorful bell peppers!</span> </td><td> $15.00 </td><td> <img src="../img/gifts/img1.jpg"> </td></tr> <tr id="gift2" class="gift"><td> Russian Nesting Dolls </td><td> Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span> </td><td> $10,000.52 </td><td> <img src="../img/gifts/img2.jpg"> </td></tr> <tr id="gift3" class="gift"><td> Fish Painting </td><td> If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span> </td><td> $10,005.00 </td><td> <img src="../img/gifts/img3.jpg"> </td></tr> <tr id="gift4" class="gift"><td> Dead Parrot </td><td> This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span> </td><td> $0.50 </td><td> <img src="../img/gifts/img4.jpg"> </td></tr> <tr id="gift5" class="gift"><td> Mystery Box </td><td> If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span> </td><td> $1.50 </td><td> <img src="../img/gifts/img6.jpg"> </td></tr> </table> </p> <div id="footer"> © Totally Normal Gifts, Inc. <br> +234 (617) 863-0736 </div> </div> </body> </html>它将打印giftList表格中所有产品的数据行。
注意区分子标签和后代标签。所有的子标签都是后代标签,但不是所有的后代标签都是子标签。
next_slibings()函数使得收集表格数据更简单。
from urllib.request import urlopen from bs4 import BeautifulSoup html = urlopen("http://www.pythonscraping.com/pages/page3.html") bsObj = BeautifulSoup(html) for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings: print(sibling)打印除标题行以外的标签。因为自己不能是自己的兄弟标签。
parent和parents,虽然很少用,但有时候也会用到。
from urllib.request import urlopen from bs4 import BeautifulSoup html = urlopen("http://www.pythonscraping.com/pages/page3.html") bsObj = BeautifulSoup(html) print(bsObj.find("img",{"src":"../img/gifts/img1.jpg" }).parent.previous_sibling.get_text())这段代码会打印¥15.00.
过程如下:
