Archive for November, 2008

Using Beautiful Soup for Screen Scraping

posted on November 12th, 2008 by Greg in Personal Projects

I’ve been curious to learn more about screen scraping for some time. And then I heard about a python script that is great for parsing html. Since I’ve also been learning python, I thought now was the perfect time to explore some scraping.

In the past I had some trouble with using php to parse the magic the gathering official site for new card info when working on my mtg card database. I didn’t spend much time trying to figure that out, but using python I didn’t have a problem.

After copying Beautiful Soup to my python path I started typing in some python at the command line.

from BeautifulSoup import BeautifulSoup as BSoup
import urllib
url  = 'http://ww2.wizards.com/gatherer/Index.aspx?setfilter=Shards%20of%20Alara&output=Spoiler'
html = urllib.urlopen(url).read()
soup = BSoup(html)
for tr in soup.fetch('tr'):
    if tr.td:
        print tr.td.string

This would output all of the magic card names on the page (and some other stuff). Here is another example: getting image urls when knowing the value of the id attribute on the img tags.

url  = 'http://ww2.wizards.com/gatherer/CardDetails.aspx?&id=175000'
html = urllib.urlopen(url).read()
soup = BSoup(html)
for img in soup.findAll(id='_imgCardImage'):
    print img['src']

With a little more time I could get all the cards and their images and fill up my database. I just have to find the time now.

Update 12/28/8

I just heard about Scrapy. Now I need to try it out with a project.