Auto-incrementing + unique + clustered + integer columns are one of the best candidates for ID columns. However, if you take auto-increments for granted, you’ll be in for some nasty surprises when you employ them in multi-master and sharded setups.
There are scenarios with both InnoDB and MyISAM where the auto_increment value ends being reused. In particular, InnoDB resets the next auto_increment value to its highest value in the table + 1 after a server restart (because the in-memory current value of auto_increment is not persisted). This means that if you delete the highest value(s) in the table, then restart, you can get the same value for auto_increment again.
This poses problems in case of data migration and archiving. If you move data to another shard, there are chances of ID collision and while archiving, new data and archived data might have two different entries with the same key.
To avoid such issues, here are a few approaches with their pros and cons discussed alongside:
Approach#1 - Using GUID
This gives good probabilistic guarantee of uniqueness of ID values across shards and masters but following practical issues make it less desirable:
- Column size becomes four times larger than an integer column. If indexes don’t fit into memory, it’s a severe performance hit.
- GUIDs being non-sequential leads to poor query performance because of increased disk I/O.
- Non-sequential nature of GUIDs might not be acceptable at all in some cases.
- MySQL doesn’t have native support for GUIDs.
A libarary providing sequential/partially sequential GUIDs would fix problem #2 and #3 but problem #1 remains and is a big reason to avoid GUIDs in many cases.
Approach#2 - Using auto_increment_increment and auto_increment_offset
Use auto_increment_increment and auto_increment_offset system variables in MySQL. This avoids having to write your own system for managing the range of IDs.
auto_increment_increment is the value for delta between two ID values and auto_increment_offset is the value to start from. Essentially, ID values are generated by finding the least value in the series that is greater than the maximum existing value in AUTO_INCREMENT column where series is defined by:
auto_increment_offset + N × auto_increment_increment
If you set auto_increment_increment to a reasonably high number, the chances of ID collision on row deletion and server restart are diminished but most of other problems still remain. And if you have lot of shards, you might not have the luxury of setting this to a high value.
Approach#3 - Using Node Identifier
Have a node ID column in all the tables so that the primary key becomes a composite of the node identifier and the ID column. Your tables would like:
CREATE TABLE `my_table` ( `node_id` bigint(20) NOT NULL, `my_table_id` bigint(20) NOT NULL AUTO_INCREMENT, `some_other_column` varchar(255) NOT NULL, PRIMARY KEY (`node_id`,`my_table_id`), UNIQUE KEY `my_table_id` (`my_table_id`) ) ENGINE=InnoDB;
Now you need both node_id and my_table_id values of an entry to look up. This scheme solves the problem of uniqueness across shards/masters but it still involves lot of work when data has to be moved across shards as the node id changes in that case and hence my_table_id has to change as well to retain uniqueness.
Approach#4 - ID Service
Have a single source of ID generation. This service should provide ID generator tables corresponding to all the tables in your application database.
CREATE TABLE `id_svc` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, `dummy` char(1) NOT NULL DEFAULT '', PRIMARY KEY (`id`), UNIQUE KEY `dummy` (`dummy`) ) ENGINE=InnoDB
Use the “INSERT ON DUPLICATE KEY UPDATE” statement to atomically update the single ID row and get the next ID.
For the ID service to be highly available, one can setup shards for ID server and load balance between them. Within the shards, use the auto_increment_increment and auto_increment_offset approach discussed above to distribute the ID space between these shards.
Most of the approaches discussed above have merits and demerits but they all have been used in large scale production systems and it really depends on the problem at hand to choose the right one from among them.