Application frameworks often provide their own out-of-the-box services for interacting with databases. This works great when processing data in a onesie-twosie fashion, but what about large batches of data?
I ran into this problem over a decade ago with the not-yet-1.0 Ruby on Rails framework. I had a lot of data that I needed to routinely integrate into a local application, but ActiveRecord's mechanism for creating records was too slow: it was taking several hours. At that time I wrote a library for importing large batches of data with ActiveRecord in an efficient manner. The job that took several hours now took under two minutes. That library is known today as activerecord-import, and it's available as a rubygem.
Over the years activerecord-import has become the de facto standard for efficiently importing large batches of data with ActiveRecord. It has been maintained by an awesome community of contributors. Not only has it kept up with Rails development, but it works with several databases, the big three being MySQL, SQLite3, and PostgreSQL. Oh, and it runs on JRuby too. Plus, if you're using any ActiveRecord adapter compatible with these databases – e.g. mysql2_makara, mysql2spatial, postgis, postgresql_makara, seamless_database_pool, spatialite – then those all work too. If you don't see an adapter you're using listed, let us know.
So back to the goal of this article: you've got lots of data that you want to import efficiently. Let's start with a simple example, walk through modifications to improve performance, then step back and look at some benchmarks.
A Simple Example
Let's import 100,000 books with the following schema:
create_table :books do |t|
t.column :name, :string, null: false
t.column :description, :string
end
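Throughout the examples I'll lean on a helper called convert_csv_to_book_attributes. It isn't part of activerecord-import – it's just an assumed method for this article – but a minimal sketch using Ruby's standard CSV library could look something like this:
require 'csv'

# Hypothetical helper: parse books.csv and return an array of
# attribute hashes like { name: "...", description: "..." }
def convert_csv_to_book_attributes
  CSV.foreach('books.csv', headers: true).map do |row|
    { name: row['name'], description: row['description'] }
  end
end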
Here's a simple solution using ActiveRecord:
class Book < ActiveRecord::Base
end
# convert_csv_to_book_attributes is a method that parses the CSV
# and returns an array of attribute hashes
convert_csv_to_book_attributes.each do |attrs|
Book.create!(attrs)
end
The above code is dead simple, but it also takes ~97 seconds to run. That's a looooong time.
Why is ActiveRecord slow?
Every time you create a record with ActiveRecord, a single INSERT statement is generated and sent off to the database.
This means that we send 100,000 separate statements to the database; the database has to parse 100,000 individual statements, open and close the table 100,000 times for writing, write the data, insert/update indexes 100,000 times, and possibly acquire/release 100,000 row-based locks. That is a lot of work for the database, and it takes an enormous amount of time to do it all.
Let's make a small change to speed this up.
Speed increase by importing models with validations
Instead of using create!, let's build up our Book instances in memory and pass them into Book.import:
# convert_csv_to_book_attributes is a method that parses the CSV
# and returns an array of attribute hashes
books = convert_csv_to_book_attributes.map do |attrs|
Book.new(attrs)
end
Book.import books
This took ~5 seconds. Wow – that's a 19x speed increase!
By default, the import method will continue to enforce validations, and it will figure out how to serialize all of the Book instances into a small number of efficient multi-row INSERT statements.
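One nice side effect of keeping validations on: import hands back any records that fail validation instead of raising. A minimal sketch of checking for them, assuming the books array from the example above:
result = Book.import books

# Instances that fail validation are skipped and returned on the
# result object rather than raising an exception
result.failed_instances.each do |book|
  puts "skipped invalid book: #{book.name}"
end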
One thing to note: validations like uniqueness that check against the database by issuing SELECT queries will not know about duplicates in your dataset. If you do not have DB-level constraints enforcing uniqueness you may get duplicates if your imported dataset contains duplicate values.
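If that's a concern, a DB-level unique constraint is the reliable guard. As a sketch, a migration like the one below (assuming name is what should be unique for our books table) lets the database reject duplicates no matter how the rows arrive:
class AddUniqueIndexToBooks < ActiveRecord::Migration
  def change
    # The database rejects any row whose name already exists,
    # including duplicates within a single import batch
    add_index :books, :name, unique: true
  end
end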
Speed increase by importing models without validations
Sometimes when preparing large batches of data we've cleaned the data ahead of time, can trust that it's valid based on where it came from, or have DB-level constraints that enforce validations. When we do, we don't need to run ActiveRecord validations during the import.
For an extra performance boost we can turn validations off during the import process:
# convert_csv_to_book_attributes is a method that parses the CSV
# and returns an array of attribute hashes
books = convert_csv_to_book_attributes.map do |attrs|
Book.new(attrs)
end
Book.import books, validate: false
This took ~4.6 seconds on our 100,000 books for a 21x speed increase.
Here we've set validate: false, telling Book.import to skip validations. The validate option also accepts true to enforce validations, but that is the default so it can be omitted when you want to keep validations.
Speed increase by importing columns and values with validations
Sometimes we already have the data in an array of values and all we need to do is match up which columns they get imported with. If you want to skip building up the in-memory Book instances yourself, you can pass an array of columns and an array of values into import:
columns = [:name, :description]
# E.g. [ ['Book #1', 'Good book'], ['Book #2', 'Great Book'], ...]
array_of_book_attrs = convert_csv_to_book_attributes
Book.import columns, array_of_book_attrs, validate: true
This takes ~7.5 seconds. It's a huge improvement over the original 97 seconds, but it's a step back from just passing in models.
Note: This is a temporary performance hit that will likely be fixed in the 0.15 version of activerecord-import since the recently released 0.14 included a performance optimization when passing in model instances directly. That same optimization has yet to be applied to passing in columns and values.
Speed increase by importing columns and values without validations
When we don't need to build up in-memory Book instances or run validations, we can get another performance boost by passing in an array of columns, an array of values, and turning off validations:
columns = [:name, :description]
# E.g. [ ['Book #1', 'Good book'], ['Book #2', 'Great Book'], ...]
array_of_book_attrs = convert_csv_to_book_attributes
Book.import columns, array_of_book_attrs, validate: false
This takes our import time down to ~2.5 seconds. This is the biggest performance improvement yet – a 38x speed increase!
The savings we see here come from not having to create in-memory Book instances. Since we're not validating, we can go directly to generating performant SQL and skip a lot of ActiveRecord hoopla.
Benchmarking
The above examples are the simplest ways to improve batch loading with activerecord-import. Let's use the above schema and see how performance is affected across MySQL (InnoDB), PostgreSQL, and SQLite3 databases.
For a quick reference, here's our schema again:
create_table :books do |t|
t.column :name, :string, null: false
t.column :description, :string
end
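If you'd like to reproduce numbers like the ones below on your own hardware, here's a minimal sketch of a timing run using Ruby's standard Benchmark module (convert_csv_to_book_attributes is still the assumed helper from earlier):
require 'benchmark'

books = convert_csv_to_book_attributes.map { |attrs| Book.new(attrs) }

# Time a single bulk import without validations
seconds = Benchmark.realtime do
  Book.import books, validate: false
end

puts "imported #{books.size} books in #{seconds.round(2)} seconds"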
MySQL Results
Using ActiveRecord 4.2.4 (seconds)
# of records | ActiveRecord #create | import(models) w/validations | import(models) w/o validations | import(cols, vals) w/validations | import(cols, vals) w/o validations |
---|---|---|---|---|---|
10 | 0.017 | 0.001 | 0.001 | 0.002 | 0.001 |
100 | 0.119 | 0.006 | 0.006 | 0.009 | 0.004 |
1,000 | 0.94 | 0.05 | 0.043 | 0.08 | 0.025 |
10,000 | 9.703 | 0.582 | 0.433 | 0.81 | 0.248 |
100,000 | 97.426 | 4.965 | 4.662 | 7.491 | 2.47 |
PostgreSQL Results
Using ActiveRecord 4.2.4 (seconds)
# of records | ActiveRecord #create | import(models) w/validations | import(models) w/o validations | import(cols, vals) w/validations | import(cols, vals) w/o validations |
---|---|---|---|---|---|
10 | 0.034 | 0.002 | 0.001 | 0.002 | 0.001 |
100 | 0.108 | 0.014 | 0.009 | 0.009 | 0.003 |
1,000 | 1.075 | 0.066 | 0.064 | 0.074 | 0.031 |
10,000 | 10.503 | 0.728 | 0.594 | 0.764 | 0.273 |
100,000 | 104.788 | 7.324 | 6.829 | 7.449 | 2.841 |
SQLite3 Results
Using ActiveRecord 4.2.4 (seconds)
# of records | ActiveRecord #create | import(models) w/validations | import(models) w/o validations | import(cols, vals) w/validations | import(cols, vals) w/o validations |
---|---|---|---|---|---|
10 | 0.022 | 0.002 | 0.002 | 0.002 | 0.001 |
100 | 0.168 | 0.009 | 0.007 | 0.01 | 0.003 |
1,000 | 1.613 | 0.069 | 0.057 | 0.079 | 0.027 |
10,000 | 16.894 | 0.764 | 0.594 | 0.783 | 0.28 |
100,000 | 164.348 | 7.45 | 6.621 | 7.531 | 2.562 |
Summary
With only a few lines of code we saw how activerecord-import can speed up our large data imports by 13x to nearly 40x. With a minimal API that takes almost no effort to use, activerecord-import is a developer's best friend. You should spend your time building the application and giving users fresh data – not waiting for data to import.
As great as this is, you may be thinking: how do I handle conflicts? Maybe you want to ignore duplicates, or perhaps you want to update particular columns when duplicates exist. In upcoming posts I'll be covering these topics and a few others:
- Handling conflicts: Importing large datasets and updating existing records
- Importing associations
- Recursive imports
- Controlling the INSERT batch size
- Synchronizing inserts with existing in-memory models
- Updating or disabling updates to timestamp columns
If there are any particular topics you'd like to see covered, go ahead and post to GitHub or send us a tweet.
Happy coding!