Scraping the Web with Ruby
by Nick Gauthier
When you run into a site that doesn’t have an API, but you’d like to use the site’s data, sometimes all you can do is scrape it! In this article, we’ll cover using Capybara and PhantomJS along with some standard libraries like CSV, GDBM, and OpenStruct to turn a website’s content into CSV data.
As an example, I’ll be scraping my own site. If you’re planning to scrape someone else’s site, be aware of its Terms of Service: many sites don’t allow scraping, so be a good web citizen and respect the wishes of the site’s owners.
Our goal here is to dump out a CSV file of the blog articles here on ngauthier.com that contains the titles, dates, URLs, summaries, and full body text. Let’s get started!
Scraping with Capybara
The first thing we’re going to do is get Capybara running and just dump the fields from the front page out to the screen to make sure everything’s working.
First, we’re going to load up Capybara and Poltergeist. Poltergeist is a Ruby gem that wraps PhantomJS so that we can run a real (headless) browser against external pages.
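If you don’t have the gems installed yet, a minimal Gemfile might look something like the sketch below; keep in mind PhantomJS itself is a separate binary that needs to be installed and available on your PATH.

# Gemfile: a minimal setup for this kind of script
source "https://rubygems.org"

gem "capybara"
gem "poltergeist"  # wraps the separately-installed phantomjs binary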
#!/usr/bin/env ruby
require 'capybara'
require 'capybara/poltergeist'

include Capybara::DSL
Capybara.default_driver = :poltergeist
Next up, we’re going to visit my site. Then we’ll iterate over every post, pull out the fields we want using CSS selectors, and dump them to the screen.
visit "http://ngauthier.com/" all(".posts .post").each do |post| title = post.find("h3 a").text url = post.find("h3 a")["href"] date = post.find("h3 small").text summary = post.find("p.preview").text puts title puts url puts date puts summary puts "" end
When we run it, we can see that our scraper is loading up the homepage and finding the info:
Using Docker to Parallelize Rails Tests
/2013/10/using-docker-to-parallelize-rails-tests.html
2013-10-13
Docker is a new way to containerize services. The primary use so far has been for deploying services in a very thin container. I experimented with using it for Rails Continuous Integration so that I could run tests within a consistent environment, and then I realized that the containers provide excellent encapsulation to allow for parallelization of test suites. There are three things we'll have to do to run a Rails test suite in parallel in docker: Create a Dockerfile for a system that ca...

PostGIS and Rails: A Simple Approach
/2013/08/postgis-and-rails-a-simple-approach.html
2013-08-18
PostGIS is a geospatial extension library for PostgreSQL that allows you to perform a ton of geometric and geographic operations on your data at high speeds. For example: Compute the distance between two points Find all the points within X meters of point P Determine which points are enclosed in polygon P A million other things In Ruby land, there is a gem called RGeo that provides a ton of objects and methods for handling Geospatial objects. In Rails, there are a number of ActiveRecord ...

...
Exporting CSV
Our next goal is to export some CSV. All we’ll do is load up the csv standard library and write CSV to standard out.
#!/usr/bin/env ruby
require 'capybara'
require 'capybara/poltergeist'
require 'csv'

include Capybara::DSL
Capybara.default_driver = :poltergeist

visit "http://ngauthier.com/"

CSV do |csv|
  csv << ["Title", "URL", "Date", "Summary"]
  all(".posts .post").each do |post|
    title   = post.find("h3 a").text
    url     = post.find("h3 a")["href"]
    date    = post.find("h3 small").text
    summary = post.find("p.preview").text
    csv << [title, url, date, summary]
  end
end
Now, our program is unixy, so we can run it and redirect its output to a CSV file of our choosing. Nice!
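For example, assuming the script is saved as scrape.rb (the filename here is just for illustration):

$ ruby scrape.rb > articles.csv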
Full Articles via Multi-Pass
So far, we’ve only pulled the summaries of the posts, but we want the whole content. For this, we’re going to need to do a two-pass scrape. First, we’ll scrape the summaries like we already are doing. Second, we’ll visit each post’s url and grab the post’s body.
To do this, we’re going to keep track of an articles array and store the articles temporarily as hashes.
#!/usr/bin/env ruby
require 'capybara'
require 'capybara/poltergeist'
require 'csv'

include Capybara::DSL
Capybara.default_driver = :poltergeist

visit "http://ngauthier.com/"

articles = []

# Pass 1: summaries and info
all(".posts .post").each do |post|
  title   = post.find("h3 a").text
  url     = post.find("h3 a")["href"]
  date    = post.find("h3 small").text
  summary = post.find("p.preview").text
  articles << {
    title:   title,
    url:     url,
    date:    date,
    summary: summary
  }
end

# Pass 2: full body of article
articles.each do |article|
  visit "http://ngauthier.com#{article[:url]}"
  article[:body] = find("article").text
end

# Output CSV
CSV do |csv|
  csv << ["Title", "URL", "Date", "Summary", "Body"]
  articles.each do |article|
    csv << [
      article[:title],
      article[:url],
      article[:date],
      article[:summary],
      article[:body]
    ]
  end
end
That wasn’t so bad! The main issue now is robustness. My blog is fast (static sites with jekyll woo!) and I only have a couple posts. But imagine the scrape took an hour. If anything crashed, took too long, or the site went down, we’d waste a ton of time.
Handling Interruptions with GDBM
Enter GDBM. GDBM is a standard unix library that provides a simple file-based key-value store. It’s like a hash, except that both keys and values must be strings. What we’re going to do is replace our articles array with a GDBM store. For the key we’ll use the url, and for the value we’ll dump the article to JSON.
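To get a feel for GDBM on its own, here’s a minimal sketch; the file name and keys are made up for illustration:

require 'gdbm'

db = GDBM.new("scratch.db")                 # creates/opens the file in the current directory
db["/some/post.html"] = '{"title":"Post"}'  # keys and values must be strings
db["/some/post.html"]                       # => '{"title":"Post"}'
db["/missing.html"]                         # => nil
db.each { |key, value| puts key }           # iterates key-value pairs, like a hash
db.close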
Now, if the program crashes, when we resume we want to skip over any articles we’ve already processed. Let’s take a look:
#!/usr/bin/env ruby
require 'capybara'
require 'capybara/poltergeist'
require 'csv'
require 'gdbm'
require 'json'

include Capybara::DSL
Capybara.default_driver = :poltergeist

visit "http://ngauthier.com/"

articles = GDBM.new("articles.db")
We’ve required gdbm, and now articles is a GDBM store backed by a file in the current directory called articles.db. We can now use articles like a hash, and GDBM will sync it to the file system whenever we write to a key. Neat!
# Pass 1: summaries and info
all(".posts .post").each do |post|
  title   = post.find("h3 a").text
  url     = post.find("h3 a")["href"]
  date    = post.find("h3 small").text
  summary = post.find("p.preview").text

  next if articles[url]

  articles[url] = JSON.dump(
    title:   title,
    url:     url,
    date:    date,
    summary: summary
  )
end
Now we have a next if that checks whether we already have the article. We don’t want to store it if we already have it, because that would overwrite the article (and we may have already fetched the body).

When we store it, we JSON.dump our hash so that it’s a string.
# Pass 2: full body of article
articles.each do |url, json|
  article = JSON.load(json)
  next if article["body"]

  visit "http://ngauthier.com#{url}"
  has_content?(article["title"]) or raise "couldn't load #{url}"
  article["body"] = find("article").text
  articles[url] = JSON.dump(article)
end
When we iterate a GDBM store it gives us a key-value pair, like a hash. So, we have to load up the article from JSON before we can work with it. Our keys also become strings instead of symbols, because it was loaded from JSON.
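For example, here is the round trip in isolation (the values are made up for illustration):

require 'json'

json = JSON.dump(title: "A Post", url: "/a-post.html")
# => '{"title":"A Post","url":"/a-post.html"}'
JSON.load(json)
# => {"title"=>"A Post", "url"=>"/a-post.html"}
# note the string keys coming back out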
We have another next if that skips visiting the page if we have the body already. Additionally, we check the page for the title. This gives the page time to load, as opposed to accidentally scraping the previous page’s content.
Finally, we have to dump the article to JSON to store it again.
# Output CSV
CSV do |csv|
  csv << ["Title", "URL", "Date", "Summary", "Body"]
  articles.each do |url, json|
    article = JSON.load(json)
    csv << [
      article["title"],
      article["url"],
      article["date"],
      article["summary"],
      article["body"]
    ]
  end
end
When we output to CSV, we have to load up the JSON to construct our CSV output.
Now that GDBM is running, the first time I ran the scrape it took 7.3 seconds, and the second time, it took 1.7 seconds because it only hit the index and skipped scraping. I can also ctrl-c the program while it’s working, and re-run it and it will resume. Nice!
Object Oriented
OK, at this point the code is pretty ugly and scripty, and some things are getting annoying. Let’s clean it up.
First Pass: Base Class, Instance Variables and Method Splitting
First off, we’re doing everything in the global scope, and Capybara complains every time we include Capybara::DSL in the global scope because it extends it with methods like find and visit. Convenient, but messy. Let’s create a base class called NickBot, and we’ll split each phase into a method for readability.
#!/usr/bin/env ruby
require 'capybara'
require 'capybara/poltergeist'
require 'csv'
require 'gdbm'
require 'json'

class NickBot
  include Capybara::DSL

  def initialize(io = STDOUT)
    Capybara.default_driver = :poltergeist
    @articles = GDBM.new("ngauthier.db")
    @io = io
  end

  def scrape
    get_summaries
    get_bodies
    output_csv
  end

  def get_summaries
    visit "http://ngauthier.com/"
    all(".posts .post").each do |post|
      title   = post.find("h3 a").text
      url     = post.find("h3 a")["href"]
      date    = post.find("h3 small").text
      summary = post.find("p.preview").text

      next if @articles[url]

      @articles[url] = JSON.dump(
        title:   title,
        url:     url,
        date:    date,
        summary: summary
      )
    end
  end

  def get_bodies
    @articles.each do |url, json|
      article = JSON.load(json)
      next if article["body"]

      visit "http://ngauthier.com#{url}"
      has_content?(article["title"]) or raise "couldn't load #{url}"
      article["body"] = find("article").text
      @articles[url] = JSON.dump(article)
    end
  end

  def output_csv
    CSV(@io) do |csv|
      csv << ["Title", "URL", "Date", "Summary", "Body"]
      @articles.each do |url, json|
        article = JSON.load(json)
        csv << [
          article["title"],
          article["url"],
          article["date"],
          article["summary"],
          article["body"]
        ]
      end
    end
  end
end

NickBot.new(STDOUT).scrape
OK, that’s better, but really all we did was push our veggies around our plate. We didn’t eat any of them.
Second Pass: Article Class
Let’s start with an Article class that will wrap up parsing a Capybara node and dumping itself automatically. GDBM calls an object’s to_str to dump it, so we can hook in there to dump ourselves to JSON:
#!/usr/bin/env ruby
require 'capybara'
require 'capybara/poltergeist'
require 'csv'
require 'gdbm'
require 'json'
require 'ostruct'

class NickBot
  include Capybara::DSL

  def initialize(io = STDOUT)
    Capybara.default_driver = :poltergeist
    @articles = GDBM.new("ngauthier.db")
    @io = io
  end

  def scrape
    get_summaries
    get_bodies
    output_csv
  end

  def get_summaries
    visit "http://ngauthier.com/"
    all(".posts .post").each do |post|
      article = Article.from_summary(post)
      next if @articles[article.url]
      @articles[article.url] = article
    end
  end

  def get_bodies
    @articles.each do |url, json|
      article = Article.new(JSON.load(json))
      next if article.body

      visit "http://ngauthier.com#{url}"
      has_content?(article.title) or raise "couldn't load #{url}"
      article.body = find("article").text
      @articles[url] = article
    end
  end

  def output_csv
    CSV(@io) do |csv|
      csv << ["Title", "URL", "Date", "Summary", "Body"]
      @articles.each do |url, json|
        article = Article.new(JSON.load(json))
        csv << [
          article.title,
          article.url,
          article.date,
          article.summary,
          article.body
        ]
      end
    end
  end

  class Article < OpenStruct
    def self.from_summary(node)
      new(
        title:   node.find("h3 a").text,
        url:     node.find("h3 a")["href"],
        date:    node.find("h3 small").text,
        summary: node.find("p.preview").text,
      )
    end

    def to_str
      to_h.to_json
    end
  end
end

NickBot.new(STDOUT).scrape
We’re using OpenStruct here to cheaply get a hash-to-object conversion so we can use dot notation to access fields. This makes it feel way more like an object. It also means we could replace the accessors later transparently.
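If OpenStruct is unfamiliar, here’s roughly how it behaves on its own (the values are made up for illustration):

require 'ostruct'
require 'json'

article = OpenStruct.new(title: "A Post", url: "/a-post.html")
article.title         # => "A Post"
article.body = "..."  # new fields can be assigned on the fly
article.to_h          # => {:title=>"A Post", :url=>"/a-post.html", :body=>"..."}
article.to_h.to_json  # the JSON string our Article#to_str hands to GDBM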
This is better, but we still have a few issues with entanglement between NickBot and Article:
- NickBot has to remember to load an Article from JSON
- NickBot holds the database information where Articles are stored
- NickBot has to know how GDBM works in order to iterate
Let’s clean this up next.
Third Pass: Active Record Pattern
We’re going to implement the Active Record Pattern. No, I’m not going to require 'activerecord'; I’m talking about the classic Active Record Pattern from Patterns of Enterprise Application Architecture. The key here is to mix the database behavior into the Article class so it can handle iteration, storage, and retrieval without NickBot having to understand how it works.
#!/usr/bin/env ruby
require 'capybara'
require 'capybara/poltergeist'
require 'csv'
require 'gdbm'
require 'json'
require 'ostruct'

class NickBot
  include Capybara::DSL

  def initialize(io = STDOUT)
    Capybara.default_driver = :poltergeist
    @io = io
  end

  def scrape
    visit "http://ngauthier.com/"
    all(".posts .post").each do |post|
      article = Article.from_summary(post)
      next unless article.new_record?
      article.save
    end

    Article.each do |article|
      next if article.body

      visit "http://ngauthier.com#{article.url}"
      has_content?(article.title) or raise "couldn't load #{article.url}"
      article.body = find("article").text
      article.save
    end

    CSV(@io) do |csv|
      csv << ["Title", "URL", "Date", "Summary", "Body"]
      Article.each do |article|
        csv << [
          article.title,
          article.url,
          article.date,
          article.summary,
          article.body
        ]
      end
    end
  end
Let’s start by looking at the usage of Article above. We’ve added Article#new_record?, Article#save, and Article.each. But also notice that NickBot no longer has an @articles instance variable. Cool! Now NickBot doesn’t care at all how Articles are persisted.
Let’s take a look at our new Article class.
  class Article < OpenStruct
    DB = GDBM.new("articles.db")

    def self.from_summary(node)
      new(
        title:   node.find("h3 a").text,
        url:     node.find("h3 a")["href"],
        date:    node.find("h3 small").text,
        summary: node.find("p.preview").text,
      )
    end

    def self.each
      DB.each do |url, json|
        yield Article.new(JSON.load(json))
      end
    end

    def save
      DB[url] = to_h.to_json
    end

    def new_record?
      DB[url].nil?
    end
  end
end
It still inherits from OpenStruct, because that gives us our initialize(attributes) and to_h for cheap. But now we have a DB constant that is set when the code loads up and connects to our GDBM file. This simple constant allows the Article class, and also Article instances, to access the GDBM database file.
We can now write an easy each that iterates the database and does the JSON load. We can also write our save that dumps to JSON. And new_record? is simply a check for an existing key.
The cool thing is, we could swap our GDBM implementation out for any other type of persistence. We could use PostgreSQL if we needed more structure, or maybe Redis if we wanted to run it on Heroku, with minimal changes.
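As a rough, untested sketch of what that swap might look like (assuming the redis gem and a running Redis server; the key prefix is just for illustration), only the persistence methods on Article would change:

  class Article < OpenStruct
    DB = Redis.new  # connects to localhost:6379 by default; requires the redis gem

    # from_summary stays exactly the same as in the GDBM version

    def self.each
      DB.keys("article:*").each do |key|
        yield Article.new(JSON.load(DB.get(key)))
      end
    end

    def save
      DB.set("article:#{url}", to_h.to_json)
    end

    def new_record?
      DB.get("article:#{url}").nil?
    end
  end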
Wrap-up
I’ve found GDBM to be a really useful little library, because when I’m writing utility scripts I don’t want to maintain a PostgreSQL database, but I do need some persistence and reliability. It’s a great step up from repeatedly writing to a CSV file (which is pretty slow and also error-prone when the program crashes during a write!).
I continue to find wonderful nuggets of awesome in Ruby’s standard library all the time!
Final source code available as a gist.