What I really needed was a list, or table, of all their teas and their prices. I figured, "this would make a fun little Python scripting project. Why the heck not?"
So I did a quick Google search and found a Python HTML library called Beautiful Soup. It offers a simple and elegant way to parse XML/HTML and perform various operations on it. Here is a quick and dirty example of using Beautiful Soup to parse webpages:
Download Beautiful Soup
The latest version can be found here. You can also visit the website link above to see which version best fits your needs. Save it to the directory of your script.
Let's say the website we are trying to pull has the URL http://www.random.com and looks like this:
A Random Site A Random Website!
Python code for fetching and parsing the page
    from BeautifulSoup import BeautifulSoup
    import urllib

    pageFile = urllib.urlopen("http://www.random.com")
    pageHtml = pageFile.read()
    pageFile.close()

    soup = BeautifulSoup("".join(pageHtml))

All this does is fetch the webpage HTML by using Python's built-in urllib. Once we have the actual HTML for the page, we create a new BeautifulSoup object to take advantage of its simple API.
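The snippet above uses the Beautiful Soup 3 import style and Python 2's urllib. For anyone on Python 3, here is a rough equivalent, assuming the Beautiful Soup 4 package (installed as beautifulsoup4, imported as bs4) is available. The helper names fetch_soup and parse_html are my own for illustration and are not part of either library.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed


def parse_html(page_html):
    """Parse an HTML string (or bytes) into a BeautifulSoup object."""
    # "html.parser" is the parser that ships with Python itself.
    return BeautifulSoup(page_html, "html.parser")


def fetch_soup(url):
    """Download a page and return it parsed (hypothetical helper)."""
    # The with-block closes the connection even if read() fails.
    with urlopen(url) as page_file:
        page_html = page_file.read()
    return parse_html(page_html)


# Example call (needs network access, so it is left commented out):
# soup = fetch_soup("http://www.random.com")
```

Splitting the parsing step from the download makes it easy to try the parser on a literal HTML string without touching the network.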
Useful BeautifulSoup API calls for reading HTML elements
Let's say I want to get a list of all the menu items on the page. I could do the following:

    soup.findAll("li", {"class":"menuItem"})  # returns a list of every <li> element with class "menuItem"
You can also call the same function as a method of another element. For example,
    menu1 = soup.find("ul", {"id":"menu1"})  # menu1 is the first <ul> element whose id is "menu1"

Here we use the find() method to get a single ul element with id = menu1. menu1 is an element we can treat as its own BeautifulSoup!

    menu1.findAll("li")  # returns a list of the <li> elements inside menu1
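To make the difference between searching the whole document and searching inside one element concrete, here is a small self-contained sketch. It assumes Beautiful Soup 4, where findAll is spelled find_all, and it runs against a made-up stand-in page: the menu entries and the second menu are invented for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical stand-in for the example page; the menu entries are invented.
SAMPLE_HTML = """
<html>
  <head><title>A Random Site</title></head>
  <body>
    <h1>A Random Website!</h1>
    <ul id="menu1">
      <li class="menuItem">Home</li>
      <li class="menuItem">About</li>
    </ul>
    <ul id="menu2">
      <li class="menuItem">Contact</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Search the whole document: every <li> with class "menuItem".
all_items = soup.find_all("li", {"class": "menuItem"})

# Search inside one element: only the <li> tags under <ul id="menu1">.
menu1 = soup.find("ul", {"id": "menu1"})
menu1_items = menu1.find_all("li")

print(len(all_items))    # 3
print(len(menu1_items))  # 2
```

Note that find() returns None when nothing matches, so on a real page it is worth checking menu1 before calling methods on it.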
Now, suppose we want the content that lies between the opening and closing tags of an element. We can do this as follows:

    items = menu1.findAll("li")  # items is a list of the <li> elements inside menu1
Note: Beautiful Soup always gives you Unicode strings.
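As one possible way to pull the text out of each element, here is a sketch using Beautiful Soup 4's .string attribute and get_text() method. The markup is invented for illustration, and this is not necessarily how the original example did it.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical menu markup; the entries are invented for illustration.
html = '<ul id="menu1"><li>Home</li><li>About <b>us</b></li></ul>'
soup = BeautifulSoup(html, "html.parser")
items = soup.find("ul", {"id": "menu1"}).find_all("li")

# .string works when the tag contains exactly one piece of text.
first_text = items[0].string

# The second <li> mixes text and a nested tag, so .string is None there;
# get_text() joins all the nested text instead.
second_text = items[1].get_text()

# Collect the text of every item in one pass.
texts = [li.get_text() for li in items]
print(texts)  # ['Home', 'About us']
```

On Python 3 the values you get back behave like ordinary str objects, which are Unicode by default.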
There you have it. A very simple quick start to fetching a webpage and parsing it using Beautiful Soup!