Using Beautiful Soup for Screen Scraping

posted on November 12th, 2008 by Greg in Personal Projects

I’ve been curious about screen scraping for some time. Then I heard about Beautiful Soup, a Python library that is great for parsing HTML. Since I’ve also been learning Python, now seemed like the perfect time to explore some scraping.

In the past I had some trouble using PHP to parse the official Magic: The Gathering site for new card info while working on my MTG card database. I didn’t spend much time trying to figure that out, but with Python I didn’t have a problem.

After copying Beautiful Soup onto my Python path, I started typing some Python at the command line.

# Python 2 with Beautiful Soup 3
from BeautifulSoup import BeautifulSoup as BSoup
import urllib

url  = 'http://ww2.wizards.com/gatherer/Index.aspx?setfilter=Shards%20of%20Alara&output=Spoiler'
html = urllib.urlopen(url).read()
soup = BSoup(html)
# Print the text of the first cell in every table row
for tr in soup.findAll('tr'):
    if tr.td:
        print tr.td.string

This prints all of the Magic card names on the page (plus the first cell of any other table rows). Here is another example: getting image URLs when you know the value of the id attribute on the img tags.

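# Continues the session above (BSoup and urllib are already imported)
url  = 'http://ww2.wizards.com/gatherer/CardDetails.aspx?&id=175000'
html = urllib.urlopen(url).read()
soup = BSoup(html)
# Match on the id attribute and print each image's src
for img in soup.findAll(id='_imgCardImage'):
    print img['src']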

With a little more time cooking the soups I could get all the cards and their images and fill up my database. I just have to find the time now.
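
For readers on Python 3 without Beautiful Soup installed, the same two lookups can be approximated with the standard library. This is only a sketch that swaps in `html.parser` for Beautiful Soup; the `CardPageParser` class, the sample HTML, and the card names in it are made up for illustration and are not taken from the Gatherer site.

```python
from html.parser import HTMLParser


class CardPageParser(HTMLParser):
    """Collect the first <td> text of each <tr>, and the src of <img> tags with a given id."""

    def __init__(self, image_id):
        super().__init__()
        self.image_id = image_id
        self.first_cells = []        # text of the first cell in each row
        self.image_urls = []         # src values of matching images
        self._cells_seen = 0         # number of <td> tags seen in the current row
        self._in_first_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self._cells_seen = 0
        elif tag == "td":
            self._cells_seen += 1
            if self._cells_seen == 1:
                self._in_first_cell = True
                self._buffer = []
        elif tag == "img" and attrs.get("id") == self.image_id and "src" in attrs:
            self.image_urls.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "td" and self._in_first_cell:
            self._in_first_cell = False
            self.first_cells.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._in_first_cell:
            self._buffer.append(data)


# Hypothetical stand-in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><td>Example Card A</td><td>Creature</td></tr>
  <tr><td>Example Card B</td><td>Instant</td></tr>
  <tr><th>Header row without cells</th></tr>
</table>
<img id="_imgCardImage" src="/images/175000.jpg">
<img id="logo" src="/images/logo.png">
"""

parser = CardPageParser("_imgCardImage")
parser.feed(SAMPLE_HTML)
print(parser.first_cells)
print(parser.image_urls)
```

Unlike Beautiful Soup, this parser does no repair of badly nested markup, so it is likely to be less forgiving on real-world pages.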

Going with Slicehost Instead of AWS EC2

posted on October 14th, 2008 by Greg in Personal Projects

I ran into some trouble with Python 2.4 and the Django code I was using. The previous server had 2.5 and I hadn’t noticed any problems, so I tried upgrading to 2.5 and changing the default Python version on Debian (this was Debian Etch). I had difficulty getting a few of the site-packages, such as mod_python, to work with 2.5 as the default, so I decided to move to Debian Lenny even though it isn’t as well supported. While doing that I ran into a problem where Lenny doesn’t work well with XFS and Amazon’s Elastic Block Store. Amazon is looking into the matter, but while trying to figure that out I realized that AWS doesn’t come with support. There is an extra package you have to purchase, which starts at $100 a month.

That made Amazon look less awesome, since I know I am going to need support at some point. I decided to compare prices and features again. I ended up revisiting Slicehost since I now knew a lot more about setting up a server than I did before.

I posted the steps that I took to set up Apache, MySQL, Django, and a few other things on a clean Ubuntu machine on Code Spatter.

Now I have a WebFaction account for testing and subversion hosting and I’m using the Slicehost account for the live version of the site.

Subversion makes it easy to commit on one server and update on the other once the code is stable. I should explore a distributed version control system like Git, since it might help with this in the future.

Update October 21, 2008

The AWS developer community seems to be a good alternative to having direct support from Amazon. The people there are knowledgeable and Amazon reps post frequently. Here is a quote from someone at Amazon about the issue I was having:

We are still investigating the issue and will post an analysis a little later and a workaround.  Basically the problem revolves around the interaction between very specific kernel versions, XFS and our version of Xen.

Even though my slice is running fine, I will still be keeping AWS in mind.

Main Page Updater for Emergencies

posted on October 3rd, 2008 by Greg in CDWS Projects

At a large institution like UCF, it is good to have a plan for emergencies. I set up a simple form that will update the main page at http://ucf.edu in an emergency so that important information can be released as fast as possible.

The main page is an HTML file that is copied every few minutes from our database-driven application. This speeds up the website and cuts down on processor utilization considerably. I added a simple check to our cron job: if the site is in emergency mode, it pulls from our separate emergency page instead. This emergency page is created with a simple form and a simple template file.

I built this page updater to be reliable and simple so that there is little turnaround time between an emergency arising and the information being available. The form edits files on the filesystem instead of using a database, which would add complexity. There is a field for the important information, which is inserted into the pre-built template when the user hits preview. Once the user is satisfied with how it looks, a button enables or disables the page. The button updates the status that the cron job looks for, and the main page changes in under a minute.
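
The flow described here (a status file, a template filled in from the form, and a cron step that picks which page to publish) might look roughly like the following in Python. This is a sketch under assumptions, not the code running at UCF: the file names, the "on"/"off" flag format, the template, and every function name are hypothetical.

```python
import html
import os
import shutil
import tempfile
from pathlib import Path
from string import Template

# Hypothetical template; the real one is a pre-built HTML file.
EMERGENCY_TEMPLATE = Template(
    "<html><body><h1>Emergency</h1><p>$message</p></body></html>"
)


def render_emergency_page(message, emergency_page):
    """Fill the template with the operator's message (the preview step)."""
    page = EMERGENCY_TEMPLATE.substitute(message=html.escape(message))
    Path(emergency_page).write_text(page)


def set_emergency_mode(flag_file, enabled):
    """Record the status that the cron step looks for (the enable/disable button)."""
    Path(flag_file).write_text("on" if enabled else "off")


def publish_main_page(flag_file, normal_page, emergency_page, live_page):
    """Cron step: copy whichever page is active over the live main page."""
    flag = Path(flag_file)
    emergency_on = flag.exists() and flag.read_text().strip() == "on"
    source = emergency_page if emergency_on else normal_page
    live = Path(live_page)
    # Copy next to the live file, then rename over it, so visitors
    # never see a half-written page.
    staging = live.with_name(live.name + ".tmp")
    shutil.copyfile(source, staging)
    os.replace(staging, live)
    return "emergency" if emergency_on else "normal"


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    flag, normal = root / "status.txt", root / "normal.html"
    emergency, live = root / "emergency.html", root / "index.html"
    normal.write_text("<html><body>Welcome</body></html>")

    first = publish_main_page(flag, normal, emergency, live)
    render_emergency_page("Campus closed & classes cancelled", emergency)
    set_emergency_mode(flag, True)
    second = publish_main_page(flag, normal, emergency, live)
    live_text = live.read_text()

print(first, second)
```

Keeping the status in a plain file means the cron step has no database dependency, which matches the goal of keeping the emergency path simple.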

AWS, EBS, S3, EC2, Debian, Django, Apache, and mod_python

posted on September 23rd, 2008 by Greg in Personal Projects

Yesterday I dove into Amazon’s web services to evaluate them as a solution for a project I’m working on. To start off, I followed a guide to set up a Django development server on a default Amazon Machine Image. Then I decided to go with a Debian AMI and build a full production server. I used apt-get to install the newest versions of Apache, Python, MySQL, mod_python, Subversion, and a few others. Debian turned out to be a lot easier than some other flavors of Linux I have used.

After getting the instance configured the way I wanted it, I saved an image of it to my storage bucket so I could bring it up at any time instead of paying ten cents an hour until I need it.

A recent post updates the Amazon Adventure.

Social Network Built with Django

posted on August 27th, 2008 by Greg in Personal Projects

I have been learning Python and Django in order to build a social network. So far, I have created the ability for users to

  • create an account with e-mail activation
  • log in and out
  • add other users as friends and confirm friendship that other users requested
  • send/reply/forward messages

This was the base for a niche social network to be built upon.
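
As an illustration of the request-then-confirm friendship flow in the list above, here is a minimal in-memory sketch in plain Python. It is not the Django models from the project; the `FriendshipBook` class, its method names, and the user names are invented for this example.

```python
class FriendshipBook:
    """Tracks pending friend requests and confirmed friendships in memory."""

    def __init__(self):
        self.pending = set()   # (requester, recipient) pairs awaiting confirmation
        self.friends = set()   # one frozenset({a, b}) per confirmed friendship

    def request(self, requester, recipient):
        """Record that requester has asked recipient to be friends."""
        if requester == recipient:
            raise ValueError("users cannot befriend themselves")
        if not self.are_friends(requester, recipient):
            self.pending.add((requester, recipient))

    def confirm(self, recipient, requester):
        """Recipient accepts a request that requester sent earlier."""
        if (requester, recipient) not in self.pending:
            raise KeyError("no pending request from %s to %s" % (requester, recipient))
        self.pending.discard((requester, recipient))
        self.friends.add(frozenset((requester, recipient)))

    def are_friends(self, a, b):
        """Friendship is symmetric, so the order of a and b does not matter."""
        return frozenset((a, b)) in self.friends


book = FriendshipBook()
book.request("alice", "bob")
before = book.are_friends("alice", "bob")
book.confirm("bob", "alice")
after = book.are_friends("bob", "alice")
print(before, after)
```

In a Django project the two sets would more likely be database tables, but the two-step state change stays the same.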

Soon after completing those features, I discovered Elgg, an open source social network written in PHP. It has all of those features and more. I am now looking into using it and modifying it for the original goal.

We’ve gone back to Django since Elgg wasn’t the easiest thing to modify. I was hoping they might have used a common PHP framework like CakePHP or CodeIgniter. More on the Django developments in another post soon. On CodeSpatter I have posted about what I learned about Python, PIL, and Django working together.

Update November 12, 2008

If you are looking for an open source social network written in Django, Pinax is looking really good right now. They have combined many reusable Django apps into one slick project. Cloud27 is set up as an example of all the features included in Pinax. Contact importing is one feature that I will be adding to my own social app, which I built before I knew about Pinax.

UCF.edu v4 Middle End

posted on June 23rd, 2008 by Greg in CDWS Projects

I say middle end (even though it’s not an end) since I didn’t work on the front-end skin or on the content-management back end. The new site was launched a few days ago (June 21, 2008) and uses an installation of InQuira Information Manager as the content management system. The CMS comes with a JSP tag library, which I used to extract data from the CMS and format it for the front-end layout. I was also responsible for designing the structure of the channels, categories, and schema in InfoManager.

Information and Knowledge Engineering 2008

posted on April 16th, 2008 by Greg in Class Projects

With the help of Dr. Orooji, I have written a paper about the programming team website I created. It has been accepted to the Information and Knowledge Engineering Conference, which is part of the larger World Congress in Computer Science, Computer Engineering, and Applied Computing (WorldComp). This year’s event (2008) will be held in Las Vegas from July 14 to July 17.

Along with getting the paper published, I will have a 20-minute slot for a presentation.

Links and more information will be posted here as they become available.

Updates

The times for the presentations have been set. There is a PDF file listing all of the different conferences and their presentations. I am on page 132, which shows me presenting on the first day at 4pm.

Code Spatter

posted on April 1st, 2008 by Greg in Personal Projects

Code Spatter is a personal project that I started because I thought it would be useful to have a weblog about projects and other web development topics for my co-workers and me to use. It was also a chance to use CyTE in a practical application and to start development on MorfU. Both are open source projects that I develop for.

Tragedy Guild Website

posted on April 1st, 2008 by Greg in Personal Projects

The Guild

Tragedy was a guild in World of Warcraft that had up to 40 members in a single raid event as often as 4-5 nights a week. There was a lot of information that needed to be saved from the raids. It was important to know which members attended them and which monsters were defeated that evening. The monsters would drop loot and it was necessary to know who received the loot. There was a game modification that would store all of this data, but there wasn’t an easy way to get this information onto the website.

TechStream (aka ToBeDone 2.0)

posted on April 1st, 2008 by Greg in CDWS Projects

Workflow Management

To Be Done is a Web-based workflow tool that manages the collection, tracking, and processing of work requests. It is written in PHP and uses a MySQL database. It facilitates the collaboration between teams by enabling team members to create requests for other teams’ members to complete. Time-to-completion data is stored when a user completes a request and can be used to display totals, percentages, and averages of requests and hours in a report that can be generated automatically. The report that is generated can also display specific information per user and per course.
