Tuesday, August 16, 2011

Quick Tutorial: Parsing a website in Python with BeautifulSoup

I was recently doing some online shopping at one of my favorite tea retailers, Remedy Teas. Although I'm very much in love with the cafe and the teas they provide, I wasn't pleased to find the online store lacking a full list of their available teas. Instead, each category of tea has its own webpage. In turn, each tea has its own webpage with relevant information like steep time, water temperature, tags, etc.

What I really needed was a list, or table, of all their teas and their prices. I figured, "This would make a fun little Python scripting project. Why the heck not?"

So I did a quick Google search and found a Python HTML library called Beautiful Soup. It offers a simple and elegant way to parse XML/HTML and do various operations on it. Here is a quick and dirty example of using Beautiful Soup to parse webpages:

Download Beautiful Soup


The latest version can be found here. You can also visit the website link above to see which version best fits your needs. Save it to the directory of your script.

Let's say the page we are trying to pull lives at http://www.random.com and looks like this:

    
        A Random Site
    
    
        

A Random Website!

Python code for fetching and parsing the page

from BeautifulSoup import BeautifulSoup
import urllib

# Fetch the raw HTML for the page
pageFile = urllib.urlopen("http://www.random.com")
pageHtml = pageFile.read()
pageFile.close()

# read() already returns a single string, so it can be passed straight in
soup = BeautifulSoup(pageHtml)
All this does is fetch the webpage's HTML using Python's built-in urllib. Once we have the actual HTML for the page, we create a new BeautifulSoup object to take advantage of its simple API.
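
If you are on Python 3, the same fetch-then-parse step might look roughly like the sketch below. It assumes the newer Beautiful Soup 4 package (imported as bs4) rather than the version used in this post, and fetch_html and make_soup are made-up helper names for illustration only. The inline HTML string stands in for a real download so the snippet runs without a network connection.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed


def fetch_html(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    # The with-block closes the connection for us, like pageFile.close()
    with urlopen(url) as page_file:
        return page_file.read()


def make_soup(html):
    """Parse HTML (str or bytes) with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


# Parsing an inline string works the same way as parsing a downloaded page
soup = make_soup("<html><head><title>A Random Site</title></head><body></body></html>")
print(soup.title.string)  # A Random Site
```

For a live page you would call make_soup(fetch_html("http://www.random.com")) instead.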

Useful BeautifulSoup API calls for reading HTML elements

Let's say I want a list of all the menu items on the page. I could do the following:
soup.findAll("li", {"class":"menuItem"})

# returns a list of the six li elements with class "menuItem"
Using the findAll() method, we get a whole list of elements with the tag li and the attribute class = menuItem. Nice and easy!
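
To make that concrete, here is a small self-contained sketch. The markup is made up for illustration (the menu labels in particular are invented), and it assumes Beautiful Soup 4 on Python 3, where the method is spelled find_all.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical page: two menus whose items all share the class "menuItem"
SAMPLE_HTML = """
<html>
  <head><title>A Random Site</title></head>
  <body>
    <ul id="menu1">
      <li class="menuItem">Home</li>
      <li class="menuItem">About</li>
      <li class="menuItem">Contact</li>
    </ul>
    <ul id="menu2">
      <li class="menuItem">Green</li>
      <li class="menuItem">Black</li>
      <li class="menuItem">Herbal</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match on tag name plus an attribute dictionary
menu_items = soup.find_all("li", {"class": "menuItem"})
labels = [item.get_text(strip=True) for item in menu_items]

print(len(menu_items))  # 6
print(labels)  # ['Home', 'About', 'Contact', 'Green', 'Black', 'Herbal']
```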

You can also call the same function as a method of another element. For example,
menu1 = soup.find("ul", {"id":"menu1"})

# menu1 is the ul element with id "menu1"
Here we use the find() method to get a single ul element with id = menu1. menu1 is an element we can search just like the BeautifulSoup object itself!
menu1.findAll("li")

# returns a list of the three li elements inside menu1
Using the findAll() method on menu1, we find every li element nested inside it, and nothing from the rest of the page.
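
Here is a rough sketch of how scoping the search changes the result. It again assumes Beautiful Soup 4 on Python 3, and the two-menu markup and its labels are invented for the example.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical markup: two menus, so scoping the search makes a difference
html = (
    '<ul id="menu1"><li>One</li><li>Two</li><li>Three</li></ul>'
    '<ul id="menu2"><li>Four</li><li>Five</li><li>Six</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# Search the whole page, then only inside menu1
menu1 = soup.find("ul", {"id": "menu1"})
menu1_labels = [li.get_text() for li in menu1.find_all("li")]
all_labels = [li.get_text() for li in soup.find_all("li")]

print(menu1_labels)  # ['One', 'Two', 'Three']
print(len(all_labels))  # 6

# find() returns None when nothing matches, so check before chaining calls
missing = soup.find("ul", {"id": "menu3"})
print(missing is None)  # True
```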

Now, suppose we want the content that lies between the opening and closing tags of an element. We can get it like this:
items = menu1.findAll("li")

# items is a list of the three li elements inside menu1

items[0].contents[0].strip()
This takes the first li item in menu1, grabs the first thing between its opening and closing tags, and strips the surrounding whitespace. If the li happens to hold nested elements as well as text, contents gives you those as elements too. That's why contents is a list.
Note: Beautiful Soup always gives you Unicode strings.
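
A quick sketch of what contents looks like when an li mixes text and a nested element. The markup is made up, and the code assumes Beautiful Soup 4 on Python 3; get_text() is shown as one way to flatten everything into a single string.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical markup: the second li mixes plain text with a nested <b> tag
html = '<ul id="menu1"><li> Home </li><li>Teas <b>new</b> today</li></ul>'
soup = BeautifulSoup(html, "html.parser")
items = soup.find("ul", {"id": "menu1"}).find_all("li")

# A text-only li has a single string in contents
first_text = items[0].contents[0].strip()
print(first_text)  # Home

# A mixed li yields strings and elements side by side
mixed = items[1].contents
kinds = [type(node).__name__ for node in mixed]
print(kinds)  # ['NavigableString', 'Tag', 'NavigableString']

# get_text() joins all the text, including text inside nested tags
full_text = items[1].get_text()
print(full_text)  # Teas new today
```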

There you have it. A very simple quick start to fetching a webpage and parsing it using Beautiful Soup!
