The other week we took a look at importing data quickly into Ruby on Rails applications and saw how activerecord-import can speed up importing large sets of data, by 13x to nearly 40x, with just a few lines of code. One thing that post didn't cover was how to handle conflicts with our data. Should we ignore the duplicates, update the duplicates, or let it fail noisily?
In this post we'll take a look at updating the duplicates.
A Simple Example
Let's say that we have an authors database with the following schema:
We want to pull in an updated authors feed that will update the author names in our system. We may have misspellings, use initials where the author prefers their name expanded or vise versa, or perhaps the author has changed their name. Whatever it is we don't own the authors data so we want an upstream dataset to be used to make sure we've got up-to-date authors information.
In the above schema we are using the key
field as the globally unique identifier for each author.
For the simplicity of this example our authors data already has a key
that matches up with a corresponding key
in the upstream data feed that we're going to import.
Running the import
For the time being let's assume that the only piece of author information we're interested in is the name. We can utilize the :on_duplicate_key_update
option of activerecord-import's import
method to specify that we want to update name
when a duplicate is found:
The above code will efficiently INSERT
new records into the database or update the name
column when a duplicate is found.
How the above import works (MySQL)
If the data being inserted would cause a duplicate value in a UNIQUE
index or PRIMARY KEY
then MySQL will perform an UPDATE
on the existing row. It will update the columns based on the list of columns you provide to the :on_duplicate_key_update
option.
This relies on the underlying INSERT ... ON DUPLICATE KEY UPDATE functionality provided by MySQL.
How the above import works (PostgreSQL)
If the data being inserted would cause a duplicate value on the PRIMARY KEY
field then PostgreSQL will perform an UPDATE
on the existing row. It will update the columns based on the list of columns you provide to the :on_duplicate_key_update
option.
The above example would fail in PostgreSQL if key
weren't the primary key on the authors table, even if key
had a UNIQUE
index or a UNIQUE
constraint.
This relies on the underlying INSERT ... ON CONFLICT functionality provided by PostgreSQL (only available in 9.5 and higher).
Specifying how to detect duplicates with PostgreSQL
With PostgreSQL activerecord-import supports passing a hash to :on_duplicate_key_update
. The available options are:
- columns — the array of columns to be updated when there is a duplicate
- conflict_target — the column(s) or index expression that PostgreSQL can use to infer an index from for detecting duplicates. Don't pass the name of the index here, just the name of the columns in the index. Use this or
constraint_name
, but not both. - constraint_name — the name of the
CONSTRAINT
to use to detect duplicates. Unlikeconflict_target
, do not pass in the column name(s) that make(s) up the constraint; instead pass in the name of the constraint. Use this orconflict_target
, but not both.
In the earlier example we added a UNIQUE
index on authors.key
. If the primary key on the table was the id
field we would use the following import
call to ensure our author names got updated:
To see how constraint_name
is used let's remove the unique index on the authors.key
and add an actual CONSTRAINT
:
Here's the updated call to import
:
In case you're wondering, ActiveRecord doesn't provide any methods for creating actual PostgreSQL database constraints, so that is why the above schema change executed raw SQL.
How about SQLite3?
SQLite3 doesn't provide an equivalent upsert implementation. The closest thing it currently supports is INSERT OR REPlACE
. Rather than updating existing columns it can be used to replace an entire row with a new row.
activerecord-import currently doesn't provide any support for this SQLite3 feature.
Related bits
Why validate: false
in the above examples?
Let's say that the author model looked like this:
The above uniqueness validation will run a query every time valid?
is called on an author instance. Unfortunately, activerecord-import is not able to batch validate. Instead, it will try to validate each author instance individually. That will cause 10,000 SELECT ... FROM authors WHERE key = ...
queries to hit the database before the import.
In the context of the above example we didn't need to run this validation since we're relying on a database-level UNIQUE
index/constraint in order to trigger an update. Because of this we were able to the import without validations.
If we would have kept validations turned on nothing bad would have happened except the import would have gone a wee bit slower.
Use database level indexes/constraints for uniqueness
The validation helpers provided by Rails are useful, but they're not enough for fighting the war against duplicate data.
On its own a validates :column, uniqueness: true
line in an ActiveRecord model won't actually ensure that you have unique data, nor will it be enough to utilize the :on_duplicate_key_update
option in activerecord-import.
Summary
We saw previously how activerecord-import's on_duplicate_key_update
option can be used to tell the database what columns to update when it finds a duplicate.
Since MySQL and PostgreSQL support different call semantics this resulted in a few variations on how on_duplicate_key_update
can be used. For MySQL, it's a simple collection of columns to update, whereas with PostgreSQL it could be that or it could be a Hash of :columns
and either :conflict_target
or :constraint_name
.
Now that we know how to leverage the speed and efficiency of activerecord-import with new data as well as updating existing data, we'll take a look next at ignoring unique key and constraint violations.
If there are any particular topics you'd like to see covered go ahead and post to Github or send us a tweet.
Happy coding!
Image credit: Thomas Quine