November 5, 2008

Screen Scraping with PHP and cURL

A couple of weeks ago I happened to meet two local political activists who are trying to get the parties that sit in opposition to the recently re-elected Conservatives to form a coalition government. They have started a movement called ‘Canadians for a Progressive Coalition’ and have a Facebook group by the same name. Their aim is to have constituents lobby their Members of Parliament or the presidents of their local riding associations, asking for their support and asking that they lobby their own parties in turn.

When they mentioned that they were in the midst of creating a website for their group, I let them know that I was into web development. They asked if I might be interested in helping with some of the site’s functionality. Basically, they were looking for a way for users to find out who their Member of Parliament is from their postal code, without being redirected to another website. Thinking this would be a cool and kind of fun project (I lean a bit to the left), I said I would look into it and we exchanged contact information.

I assumed that Statistics Canada would provide a simple API so that this sort of information would be easily accessible to the public. After all, it is the public’s tax dollars that go into the creation and maintenance of this data set.

Not so.

The Canadian government charges an exorbitant and prohibitive fee of $2,500 for the first year and $500 for each subsequent year for a licensed copy of the Postal Codes by Federal Ridings File (PCFRF). Interestingly enough, political parties do not pay for this database, while advocacy groups and other non-profits do. Personally, I find this unacceptable in a participatory democracy such as Canada’s, and I would encourage anyone who feels the same way to visit Digital Copyright Canada’s website and join in petitioning the government to make this information freely accessible to all.

Anyway, I was still trying to find a way to achieve my ends without violating copyright law. The solution I came up with was to employ a method known as ‘screen scraping’. With this technique, the user submits a request for the information as they normally would. The difference is that, instead of querying a database directly, the request is passed to another website (in this case, the Parliament of Canada website), which serves the requested information back as HTML. At that point we take the HTML source, dump it to a text file and parse the file to get the information we are after. Finally, we ‘re-display’ the parsed information to the user.
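To make the parse step a little more concrete, here is a minimal sketch of pulling one value out of a page of HTML. It is only an illustration under stated assumptions: the function name `extractByClass` and the `mp-name` class are made up, the real Parliament of Canada markup will differ, and the sketch parses a string in memory rather than a dumped text file.

```php
<?php
// Sketch of the parse step. The "mp-name" class used in the usage note is a
// hypothetical stand-in, not the real structure of the scraped page.

/**
 * Return the trimmed text of the first element carrying the given CSS class,
 * or null when the HTML is empty or nothing matches.
 * $class is assumed to be a simple token without quotes.
 */
function extractByClass($html, $class)
{
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word inside the class attribute.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}
```

Calling `extractByClass($html, 'mp-name')` on a page containing `<span class="mp-name">Jane Doe</span>` would hand back the text inside that element, which is the piece that gets re-displayed to the user.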


And the class:

This is done using PHP5 and cURL (PHP’s wrapper for the libcurl library). I have included the source code, which can be found here. I know the OOP is a bit of overkill for such a simple routine, but it’s the style I like working in.
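For anyone who wants a feel for the fetch side, a minimal sketch of a cURL wrapper in the same object-oriented spirit might look like the following. It is not the class from the download: `PageFetcher`, `buildUrl`, the base URL passed to the constructor and the `postalcode` query parameter are all hypothetical names chosen for illustration.

```php
<?php
// Sketch only: the endpoint and the "postalcode" parameter name are
// placeholders, not the real lookup URL of the scraped site.

class PageFetcher
{
    private $baseUrl;

    public function __construct($baseUrl)
    {
        $this->baseUrl = $baseUrl;
    }

    /** Build the lookup URL for a postal code such as "k1a 0a6". */
    public function buildUrl($postalCode)
    {
        // Strip whitespace and upper-case, so "k1a 0a6" becomes "K1A0A6".
        $normalized = strtoupper(preg_replace('/\s+/', '', $postalCode));

        return $this->baseUrl . '?' . http_build_query(array('postalcode' => $normalized));
    }

    /** Request the page for a postal code and return its HTML as a string. */
    public function fetch($postalCode)
    {
        $handle = curl_init($this->buildUrl($postalCode));
        curl_setopt_array($handle, array(
            CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
            CURLOPT_FOLLOWLOCATION => true, // follow redirects
            CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
        ));

        $html = curl_exec($handle);
        if ($html === false) {
            throw new RuntimeException('cURL request failed: ' . curl_error($handle));
        }

        // The handle is released when $handle goes out of scope.
        return $html;
    }
}
```

Usage would be along the lines of `$fetcher = new PageFetcher('https://example.org/lookup'); $html = $fetcher->fetch('K1A 0A6');`, with the returned string then handed to whatever does the parsing.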

I should mention that this is an extremely unreliable method of gathering information, as any change made to the HTML of the ‘scraped’ site could render the script useless.
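One way to soften that fragility, sketched here as an assumption rather than something the script actually does, is to check for a few known landmarks in the page before parsing and to fail loudly when they are gone. The names `LayoutChangedException` and `assertLayoutUnchanged`, and any marker strings passed in, are hypothetical.

```php
<?php
// Sketch: the markers are hypothetical; in practice they would be pieces of
// text or markup on the scraped page that are expected to stay stable.

class LayoutChangedException extends RuntimeException
{
}

/**
 * Throw if any expected marker string is missing from the fetched HTML,
 * so a redesign of the scraped site shows up as an error rather than
 * as silently wrong or empty results.
 */
function assertLayoutUnchanged($html, array $markers)
{
    foreach ($markers as $marker) {
        if (strpos($html, $marker) === false) {
            throw new LayoutChangedException(
                'Expected marker not found in scraped page: ' . $marker
            );
        }
    }
}
```

This does not make scraping reliable, but it turns a quiet breakage into an exception that can be logged or shown to the user as ‘lookup temporarily unavailable’.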

I don’t know whether ‘Canadians for a Progressive Coalition’ will ever use this, since there may be some concern that it is indeed a violation of copyright law, but it was a fun little exercise nonetheless.
