Recently, we needed to iterate over a fairly large data set (on the order of millions of rows) and do the ever-common operation: if a row isn’t in the database, insert it; if it’s already there, just update some fields. It’s a pattern that is very common for things like log files, where often only a timestamp needs to be updated on an existing row.
The obvious approach of doing a SELECT followed by either an UPDATE or an INSERT is too slow for even moderately large datasets. A better way to accomplish this is with MySQL’s ON DUPLICATE KEY UPDATE clause. By creating a unique key on the fields that identify each row, this syntax provides two specific benefits:
- Allows many rows to be batched into a single query (and transaction) for large data
- Increases overall performance by avoiding the round trip of two separate queries per row
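As a sketch of what this looks like, assume a hypothetical `logs` table with a unique key on a `source` column (the table and column names here are illustrative, not from our actual schema):

```sql
-- The UNIQUE key on `source` is what triggers the
-- ON DUPLICATE KEY UPDATE path instead of a duplicate-key error.
CREATE TABLE logs (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    source    VARCHAR(255) NOT NULL,
    last_seen DATETIME     NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uniq_source (source)
);

-- Insert new rows, or just bump the timestamp on rows that already
-- exist; listing multiple tuples batches many rows into one query.
INSERT INTO logs (source, last_seen)
VALUES
    ('app-server-1', NOW()),
    ('app-server-2', NOW())
ON DUPLICATE KEY UPDATE
    last_seen = VALUES(last_seen);
```

The `VALUES(last_seen)` expression refers to the value that *would* have been inserted for that row, which is what lets a single statement serve both the insert and the update case.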
These benefits are especially helpful when your dataset is too large to fit into memory. The obvious drawback is that this method may put additional load on your database server. As with anything else, it’s worth testing against your own situation, but for us, ON DUPLICATE KEY UPDATE was the way to go.
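To see the upsert pattern end to end, here is a small runnable sketch using Python’s standard-library sqlite3 module. SQLite spells the clause `ON CONFLICT ... DO UPDATE` rather than MySQL’s `ON DUPLICATE KEY UPDATE` (and requires SQLite 3.24+), and the table and column names are hypothetical, but the behavior is the same: duplicates on the unique key update in place instead of failing.

```python
import sqlite3

# In-memory database with a hypothetical log table; `source` is the
# unique key that drives insert-or-update behavior.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE logs (
           source    TEXT PRIMARY KEY,
           last_seen TEXT NOT NULL,
           hits      INTEGER NOT NULL DEFAULT 1
       )"""
)

rows = [
    ("app-server-1", "2024-01-01"),
    ("app-server-2", "2024-01-01"),
    ("app-server-1", "2024-01-02"),  # duplicate source: updates, not inserts
]

# executemany batches all rows in one transaction; the ON CONFLICT clause
# (SQLite's equivalent of MySQL's ON DUPLICATE KEY UPDATE) updates the
# existing row instead of raising a unique-constraint error.
conn.executemany(
    """INSERT INTO logs (source, last_seen) VALUES (?, ?)
       ON CONFLICT(source) DO UPDATE SET
           last_seen = excluded.last_seen,
           hits      = hits + 1""",
    rows,
)
conn.commit()

result = {s: (ts, h) for s, ts, h in conn.execute("SELECT * FROM logs")}
print(result)
# → {'app-server-1': ('2024-01-02', 2), 'app-server-2': ('2024-01-01', 1)}
```

Note that `excluded.last_seen` plays the same role as MySQL’s `VALUES(last_seen)`: it names the value the failed insert would have written.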