Sunday, August 06, 2006

UUIDs as primary keys

I wrote in my last post that I preferred to let the database just store data and and implement all the business logic in my Java code but that I usually let the database handle primary key generation.

After thinking about it for a few days, I decided to experiment with using UUIDs
as primary keys. Java 5 has a convenient java.util.UUID class that I used to
replace the integer identifiers in my entity classes.

I found a bunch of blog entries about using UUID versus integer primary keys. UUIDs do have some disadvantages:



  • UUIDs are larger than integer IDs. A UUID, when stored in its most compact form, consumes 128 bits, or 16 bytes. That's four times larger than a regular 32-bit integer.

  • UUIDs are binary. While they fit in a 16-byte binary column, when expanded into a string, they typically consume 32 or 36 characters, depending on whether the string is the human-friendly 00112233-4455-6677-8899-aabbccddeeff format.

  • UUIDs are computationally harder to generate than integers.



The disadvantages really boil down to space and speed. But the more I thought about it, the more I thought that there were significant higher-level advantages:



  • UUIDs can be generated by the application itself, without having to write the object to the database or consult the database in order to determine the range of integer IDs already in use. That simplifies unit testing. And since the UUID is immediately available, persistent classes can implement equals and hashCode in terms of it, rather than falling back on object identity.

  • A UUID is unique not only within a single class but across all classes and all JVMs.


    • This allows relationships that were previously implemented as <many-to-one> to be re-implemented as <one-to-one> when that is really the nature of the relationship.

    • This also allows the construction of tables that support references to "any" object without having to include explicit type information in the table. For example, an audit log table could have one column for the UUID of the object that was changed without having to maintain track of what kind of object it was. That in turn simplifies the interface to access the audit log entries.

    • Heterogenous distributed systems can exchange information about objects without identifier conflicts.


  • Using UUIDs in URLs and in web forms makes applications more secure. If a user sees a URL that looks like http://www.something.com/editmessage?id=50 then the user may be tempted to change 50 to 51 to see if he or she can edit someone else's message. If the message identifier is 14809d2d-8c71-473f-a5cd-123649cd9624 instead of 50, then it's unlikely that the user will be able to find another identifier that identifies a real message. This is the same reason why your MasterCard or Visa has a 16-digit account number even though there are certainly less than 10,000,000,000,000,000 cardholders.



It's true that UUIDs consume more space and require more computing power to manage than integers. But while this is the kind of argument that I would have considered back in 1994, computers today are roughly one thousand times more powerful than they were then, and I don't think that slinging around 128-bit keys rather 32-bit keys is going to make a difference in the overall performance of any but the most speed-critical applications. Throughout the history of computing, we have repeatedly taken advantage of ever-increasing performance to abstract away complexity and solve higher-level problems. Java itself is an example of this trend. It is undoubtedly more efficient in terms of computing power to skip the JVM and interact with the operating system and memory directly, but we pay the price of using the JVM because we see benefit from the tradeoff.

Unfortunately Hibernate doesn't know how to persist the java.util.UUID type out of the box (though I have to believe that a future version will), so I wrote a couple of Hibernate UserType implementions. One of them persists the UUID to a 16-byte binary column, while the other persists the UUID to a 36-character char column. I used 36 characters instead of 32 because UUID has toString and fromString methods that generate and consume UUID strings in the 00112233-4455-6677-8899-aabbccddeeff format.

In my persistent classes, I'm using UUID.randomUUID to generate the UUIDs. I don't do it in the constructor, though, because then I'd be generating UUIDs even when loading objects from the database. Instead, I initialize the UUID lazily, in the getUuid method. Since I'm using property access for the UUID property, this lets the object construct a new UUID when needed but not incur the cost of creating UUIDs that are just going to be overwritten with different UUIDs loaded from the database. I benchmarked UUID.randomUUID and found it to be pretty fast--I created 100,000 UUIDs in about 750ms--but there's no need to accept the cost when the workaround is so straightforward.

On the database side, I'm using Oracle, so I created the UUID column as RAW(16). Oracle provides to functions for working with RAW columns, HEXTORAW and RAWTOHEX. You can probably guess what they do. Oracle also provides a SYS_GUID function that will create a UUID for you, so you can define the UUID columns with DEFAULT SYS_GUID() in order to avoid having to create UUIDs yourself when manipulating the database directly.

Update (May 7, 2007): Zach Nichter has done some analysis of how UUIDs (he calls them GUIDs) perform as primary keys with SQL Server.

I switched from using Oracle to PostgreSQL, and I'm now using bytea columns to hold the UUIDs. It's unfortunate that there's no UUID type in PostgreSQL, especially since the database supports a wide range of types. A UUID type would be more efficient than bytea, because bytea columns maintain the length of each value on disk, which is always 16 in the case of UUIDs.

6 Comments:

At 7:33 PM, Blogger Jeremy Watkins said...

do you still have the usertype implementation for uuids in a 16bit binary column? i plan to implement this in my application and having your usertype would certainly save me some time

 
At 11:48 AM, Blogger Dean Hiller said...

The note on JVM is a bit off due to the hotspot compiler ;). There is essentially no abstraction after going through the hotspot compiler eliminating any performance slowdown. pre-hotspot compiler, you would be correct. Of course, startup times take a bit more time as it has to run and compile to the native processor binary code.

 
At 7:38 AM, Blogger Jhon Marshal said...

It was another joy to see your post. It is such an important topic and ignored by so many, even professionals. I thank you to help making people more aware of possible issues. Great stuff as usual...
online trading

 
At 3:10 AM, Blogger umer said...

Such a nice blog and very nice you work and sharing this wonderful article about the long term forex signals well done.
long term forex signals

 
At 11:29 AM, Blogger Stewart Morgan said...

I visited your blog for the first time and just been your fan. Quite informative post. I Will be back often to check up on new stuff you post!
Earn money while backpacking

 
At 1:43 AM, Blogger Terrance McDaniel said...

Thanks for the information you share it about the online make money and I really must appreciate your work.
MLM

 

Post a Comment

<< Home