An Introduction To Web Scraping

Why

After working on Notice for a few months, we began looking at our plans for the coming Winter term. To maximize our efforts by targeting only professors who taught large classes (and many of them), we needed a way to sort and filter classes at Oregon State University by the number of students enrolled.

While the university provides this information on its Course Catalog site, the sorting functionality is limited, and most of the data requires descending through numerous subpages to find a single class at a time.

We decided that the need for a web scraper was real and that the effort would be worth it.

How

Since I had been working primarily in Ruby at the time, that's what I decided to write it in. After some quick background searching, I found Nokogiri for HTML/XML parsing and open-uri for easy HTTP requests.
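
A minimal fetch-and-parse with those two libraries looks something like the following; the URL and CSS selector are placeholders, not the real catalog markup:

    require 'open-uri'
    require 'nokogiri'

    # Fetch a page and parse it into a searchable document. The URL and
    # selector here are placeholders, not the actual catalog structure.
    # (On Rubies before 2.5, the idiom was Kernel#open rather than URI.open.)
    doc = Nokogiri::HTML(URI.open('https://catalog.example.edu/departments'))
    doc.css('a.department').each do |link|
      puts "#{link.text.strip} -> #{link['href']}"
    end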

I set off by parsing the department pages, then descending into the class pages, and finally gathering the actual section data.
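
The descent itself was just nested loops over links. A sketch of that shape, continuing with the requires above; every selector and column name here is an assumption made for illustration:

    def scrape_sections(base_url)
      sections = []
      index = Nokogiri::HTML(URI.open(base_url))
      index.css('a.department').each do |dept|
        dept_page = Nokogiri::HTML(URI.join(base_url, dept['href']).open)
        dept_page.css('a.course').each do |course|
          course_page = Nokogiri::HTML(URI.join(base_url, course['href']).open)
          course_page.css('tr.section').each do |row|
            sections << {
              course:   course.text.strip,
              enrolled: row.at_css('td.enrolled')&.text.to_i,
              capacity: row.at_css('td.capacity')&.text.to_i,
            }
          end
        end
      end
      sections
    end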

Once I had a rough prototype, I tried it out, only to discover that at its current per-class speed a full run would take somewhere in the range of a few hours… something that wasn't acceptable for this use case.

Remembering articles I had read about Ruby's GIL, I knew that using threads on MRI wasn't going to make much of a speed difference, so I looked at both Rubinius and JRuby as alternative runtimes. I determined that making effective use of threads on the JVM would keep me from writing idiomatic Ruby, so I settled on Rubinius and began making the primary parts of the script concurrent.
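
The concurrency followed a standard pattern: a thread-safe queue of work drained by a small pool of worker threads. A minimal sketch, where department_urls and process_department are hypothetical stand-ins for the real pieces:

    require 'thread'  # Queue lives here on older Rubies; built in on newer ones

    # A work-queue sketch: on Rubinius these threads can run in parallel,
    # since there is no GIL. The stdlib Queue is thread-safe.
    queue = Queue.new
    department_urls.each { |url| queue << url }

    workers = Array.new(8) do
      Thread.new do
        loop do
          url = queue.pop(true) rescue break  # non-blocking pop; stop when drained
          process_department(url)             # hypothetical per-department parser
        end
      end
    end
    workers.each(&:join)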

To top it all off, I added a progress bar library to show a rough estimate of how far along a run was.
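
One common choice for this is the ruby-progressbar gem; a minimal sketch of wiring it in, with course_urls and process_course as placeholder names and the gem itself an assumption rather than necessarily the library used here:

    require 'ruby-progressbar'  # gem install ruby-progressbar

    # total gives the denominator for the rough estimate; which progress
    # bar library was actually used is an assumption in this sketch.
    bar = ProgressBar.create(title: 'Courses', total: course_urls.size)
    course_urls.each do |url|
      process_course(url)  # hypothetical per-course work
      bar.increment
    end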

Results

It took a couple of hours (spread over a few days) to get it working correctly, but the end result ran in about 7 minutes on a 512 MB DigitalOcean droplet, which was good enough for me.

You can find the code on GitHub at jonahgeorge/osu-course-scraper, with the results of each run in the /data folder as timestamped CSV files.
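
For reference, producing a timestamped CSV like those in /data takes only a few lines with Ruby's standard library; the filename pattern and columns below are illustrative:

    require 'csv'

    # Write results to a timestamped CSV mirroring the /data layout.
    # The column names and exact timestamp format are assumptions.
    filename = "data/#{Time.now.strftime('%Y-%m-%d_%H-%M-%S')}.csv"
    CSV.open(filename, 'w') do |csv|
      csv << %w[course enrolled capacity]
      sections.each { |s| csv << [s[:course], s[:enrolled], s[:capacity]] }
    end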