Importing data quickly in Ruby on Rails applications

By Zach Dennis on June 27, 2016

Application frameworks often provide their own out-of-the-box services for interacting with databases. This works great when processing data in a onesie-twosie fashion, but what about large batches of data?

I ran into this problem over a decade ago with the not-yet-1.0 Ruby on Rails framework. I had a lot of data that I needed to routinely integrate into a local application, but ActiveRecord's mechanism for creating records was too slow: it was taking several hours. At the time I wrote a library for importing large batches of data with ActiveRecord in an efficient manner. The job that had taken several hours dropped to under two minutes. That library is known today as activerecord-import, and it's available as a Ruby gem.

Over the years activerecord-import has become the de facto standard for efficiently importing large batches of data with ActiveRecord. It has been maintained by an awesome community of contributors. Not only has it kept up with Rails development, but it also works with several databases, the big three being MySQL, SQLite3, and PostgreSQL. Oh, and it runs on JRuby too. Plus, if you're using an ActiveRecord adapter compatible with these databases (e.g. mysql2_makara, mysql2spatial, postgis, postgresql_makara, seamless_database_pool, spatialite), those all work too. If you don't see the adapter you're using listed, let us know.

So, back to the goal of this article: you've got lots of data and you want to import it efficiently. Let's start with a simple example, walk through modifications that improve performance, and then step back and look at some benchmarks.

A Simple Example

Let's import 100,000 books with the following schema:

create_table :books do |t|
  t.column :name, :string, null: false
  t.column :description, :string
end

Here's a simple solution using ActiveRecord:

class Book < ActiveRecord::Base
end

# convert_csv_to_book_attributes parses our CSV and returns
# an array of attribute hashes
convert_csv_to_book_attributes.each do |attrs|
  Book.create!(attrs)
end

The above code is dead simple, but it also takes ~97 seconds to run. That's a looooong time.
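For reference, here is one way such a helper could look. This is a minimal sketch, assuming a books.csv file with name and description headers; any parsing code that produces attribute hashes will do:

require 'csv'

# Hypothetical helper: parses books.csv and returns an array of
# attribute hashes like { name: 'Book #1', description: 'Good book' }
def convert_csv_to_book_attributes(path = 'books.csv')
  CSV.foreach(path, headers: true).map do |row|
    { name: row['name'], description: row['description'] }
  end
end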

Why is ActiveRecord slow?

Every time you create a record with ActiveRecord, a single INSERT statement is generated and sent off to the database.

This means we send 100,000 separate statements to the database. The database then has to parse 100,000 individual statements, open and close the table for writing 100,000 times, write the data, insert into and update indexes 100,000 times, and possibly acquire and release 100,000 row-based locks. That is a lot of work for the database to do, and it takes an enormous amount of time to do it all.
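To make that concrete, here is roughly what hits the database in each case. This is illustrative only; the exact SQL depends on your adapter and your data:

# With Book.create!, the database receives 100,000 single-row statements:
#
#   INSERT INTO books (name, description) VALUES ('Book #1', 'Good book')
#   INSERT INTO books (name, description) VALUES ('Book #2', 'Great book')
#   ... 99,998 more ...
#
# With activerecord-import (shown below), it receives a handful of
# multi-row statements instead:
#
#   INSERT INTO books (name, description) VALUES
#     ('Book #1', 'Good book'),
#     ('Book #2', 'Great book'),
#     ...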

Let's make a small change to speed this up.

Speed increase by importing models with validations

Instead of using create!, let's build up our Book instances in memory and pass them to Book.import:

# convert_csv_to_book_attributes parses our CSV and returns
# an array of attribute hashes
books = convert_csv_to_book_attributes.map do |attrs|
  Book.new(attrs)
end

Book.import books

This took ~5 seconds. Wow – that's a 19x speed increase!

By default, the import method will continue to enforce validations, and it will figure out how to serialize all of the Book instances into efficient multi-row INSERT statements.
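Because validations still run, invalid records are simply left out of the insert. If you want to see what was skipped, you can inspect the value import returns. Something along these lines should work; failed_instances and num_inserts come from the result object activerecord-import returns, so treat this as a sketch against the 0.14-era API:

result = Book.import books

# How many INSERT statements were actually executed
puts "#{result.num_inserts} insert statements executed"

# Models that failed validation and were not imported
result.failed_instances.each do |book|
  puts "Skipped #{book.name.inspect}: #{book.errors.full_messages.join(', ')}"
end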

One thing to note: validations like uniqueness that check against the database by issuing SELECT queries will not know about duplicates in your dataset. If you do not have DB-level constraints enforcing uniqueness you may get duplicates if your imported dataset contains duplicate values.
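If that matters for your data, the usual fix is a database-level unique index so the database itself rejects duplicates. Here's a sketch of such a migration, assuming name is the column that must be unique:

class AddUniqueIndexToBooksName < ActiveRecord::Migration
  def change
    # Let the database, not ActiveRecord, enforce uniqueness of book names
    add_index :books, :name, unique: true
  end
end

Keep in mind that with a unique index in place a duplicate row will cause the import to raise a constraint error; handling those conflicts gracefully is one of the topics teased at the end of this post.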

Speed increase by importing models without validations

Sometimes when preparing large batches of data we've already cleaned the data ahead of time, we can trust that it's valid based on where it came from, or we have DB-level constraints that enforce validity. When that's the case we don't need to run ActiveRecord validations during the import.

For an extra performance boost we can turn validations off during the import process:

# convert_csv_to_book_attributes parses our CSV and returns
# an array of attribute hashes
books = convert_csv_to_book_attributes.map do |attrs|
  Book.new(attrs)
end

Book.import books, validate: false

This took ~4.6 seconds on our 100,000 books for a 21x speed increase.

Here we've passed validate: false, telling Book.import to skip validations. The validate option also accepts true to enforce validations, but since that is the default it can be omitted when you want to keep validations.

Speed increase by importing columns and values with validations

Sometimes we already have the data as an array of values, and all we need to do is match those values up with the columns they should be imported into. If you want to skip building up the in-memory Book instances yourself, you can pass an array of columns and an array of values into import:

columns = [:name, :description]

# E.g. [ ['Book #1', 'Good book'], ['Book #2', 'Great Book'], ...]
array_of_book_attrs = convert_csv_to_book_attributes

Book.import columns, array_of_book_attrs, validate: true

This takes ~7.5 seconds. It's a huge improvement over the original 97 seconds, but it's a step back from just passing in models.

Note: This is a temporary performance hit that will likely be fixed in the 0.15 version of activerecord-import since the recently released 0.14 included a performance optimization when passing in model instances directly. That same optimization has yet to be applied to passing in columns and values.
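One more practical note for this form: convert_csv_to_book_attributes now needs to return rows of values in the same order as columns rather than attribute hashes. If your parsing code hands back hashes (like the sketch earlier), a little glue code lines things up; attribute_hashes here is a hypothetical variable holding those hashes:

columns = [:name, :description]

# Turn each attributes hash into a row of values in column order
array_of_book_attrs = attribute_hashes.map do |attrs|
  columns.map { |column| attrs[column] }
end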

Speed increase by importing columns and values without validations

When we don't need to build up in-memory Book instances or run validations, we can get another performance boost by passing in an array of columns and an array of values and turning off validations:

columns = [:name, :description]

# E.g. [ ['Book #1', 'Good book'], ['Book #2', 'Great Book'], ...]
array_of_book_attrs = convert_csv_to_book_attributes

Book.import columns, array_of_book_attrs, validate: false

This takes our import time down to ~2.5 seconds. This is the biggest performance improvement yet – a 38x speed increase!

The savings we see here come from not having to create in-memory Book instances. Since we're not validating, we can go directly to generating performant SQL and skip a lot of ActiveRecord hoopla.

Benchmarking

The above examples are the simplest ways to improve batch loading with activerecord-import. Let's use the above schema and see how performance is affected across MySQL (InnoDB), PostgreSQL, and SQLite3 databases.

For a quick reference, here's our schema again:

create_table :books do |t|
  t.column :name, :string, null: false
  t.column :description, :string
end
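If you'd like to reproduce numbers like these yourself, a simple harness built on Ruby's Benchmark module is enough. This is a sketch rather than the exact script used here; build_book_attributes is a hypothetical helper that generates n attribute hashes:

require 'benchmark'

[10, 100, 1_000, 10_000, 100_000].each do |n|
  attrs = build_book_attributes(n) # hypothetical test-data generator

  Book.delete_all
  create_time = Benchmark.realtime do
    attrs.each { |a| Book.create!(a) }
  end

  Book.delete_all
  import_time = Benchmark.realtime do
    Book.import attrs.map { |a| Book.new(a) }, validate: false
  end

  puts format('%-8d create!: %7.3fs  import: %7.3fs', n, create_time, import_time)
end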

MySQL Results (ActiveRecord 4.2.4, times in seconds)


# of records  ActiveRecord #create  import(models)  import(models)   import(cols, vals)  import(cols, vals)
                                    w/ validations  w/o validations  w/ validations      w/o validations
10            0.017                 0.001           0.001            0.002               0.001
100           0.119                 0.006           0.006            0.009               0.004
1,000         0.94                  0.05            0.043            0.08                0.025
10,000        9.703                 0.582           0.433            0.81                0.248
100,000       97.426                4.965           4.662            7.491               2.47

PostgreSQL Results (ActiveRecord 4.2.4, times in seconds)

# of records  ActiveRecord #create  import(models)  import(models)   import(cols, vals)  import(cols, vals)
                                    w/ validations  w/o validations  w/ validations      w/o validations
10            0.034                 0.002           0.001            0.002               0.001
100           0.108                 0.014           0.009            0.009               0.003
1,000         1.075                 0.066           0.064            0.074               0.031
10,000        10.503                0.728           0.594            0.764               0.273
100,000       104.788               7.324           6.829            7.449               2.841

SQLite3 Results (ActiveRecord 4.2.4, times in seconds)

# of records  ActiveRecord #create  import(models)  import(models)   import(cols, vals)  import(cols, vals)
                                    w/ validations  w/o validations  w/ validations      w/o validations
10            0.022                 0.002           0.002            0.002               0.001
100           0.168                 0.009           0.007            0.01                0.003
1,000         1.613                 0.069           0.057            0.079               0.027
10,000        16.894                0.764           0.594            0.783               0.28
100,000       164.348               7.45            6.621            7.531               2.562

Summary

With only a few lines of code we saw how activerecord-import can speed up our large data imports by 13x to nearly 40x. With a minimal API and almost no effort required to use it, activerecord-import is a developer's best friend. You should spend your time building the application and giving users fresh data, not waiting for data to import.

As great as this is, you may be thinking: how do I handle conflicts? Maybe you want to ignore duplicates, or perhaps you want to update particular columns when duplicates exist. In upcoming posts I'll be covering these topics and a few others:

  • Handling conflicts: Importing large datasets and updating existing records
  • Importing associations
  • Recursive imports
  • Controlling the INSERT batch size
  • Synchronizing inserts with existing in-memory models
  • Updating or disabling updates to timestamp columns

If there are any particular topics you'd like to see covered, go ahead and post to GitHub.

Happy coding!