Comb ID misuse | CodeCave

How it started

I used EventFlow’s Comb IDs as my default way of generating unique identifiers. I thought they were the perfect balance: globally unique but still kind to the database index because they were sequential.

It turned out that wasn’t really true, at least not for databases like PostgreSQL which I frequently use.

Here’s what I learned about Comb GUIDs and why you shouldn’t blindly use them in PostgreSQL (and maybe most databases out there).

note: you may find similar functions in other libraries (RT.Comb, UUIDNext, NHibernate to name a few), and it applies to those as well. I use EventFlow’s function to demonstrate because I made the mistake with it myself, and saw some of my colleagues do the same.

What are comb IDs?

Comb IDs are basically GUIDs with a timestamp baked in. They’re not completely random, part of the ID changes in a predictable, sequential way over time.

EventFlow, the .NET CQRS/ES framework, has Comb ID generation as a utility. Some implementations stick the timestamp at the end of the GUID’s byte array (see here).

Why comb IDs are supposed to help

In theory, the benefit is simple: sequential inserts.

If your IDs are generated in order, your database indexes stay in order too. Less random inserts, fewer page splits, less fragmentation.

A helpful metaphor when thinking abuot index fragmentation is this: Imagine you’re in a library searching for a series of books. Each book contains a note inside that tells you exactly where to find the next book. As you follow the notes, you discover that some books point to shelves far away, scattered all over the library, instead of being neatly arranged on the same or nearby shelves.

note: Whether index fragmentation is even something worth worrying about is another story. It really depends on the situation. There are plenty of benchmarks out there if you want to dig into it.

The detail I missed

Here’s the part I never really paid attention to: databases don’t all sort or store GUIDs the same way.

MSSQL: Sorts by the last chunk of bytes (See this blog)
PostgreSQL: Stores UUIDs as a 16-byte binary value and sorts from the first byte onward
Lexical sorting: If you store UUIDs as strings, sorting goes left to right (just like text)

Why it doesn’t work in Postgres

EventFlow’s Comb IDs put the timestamp in the last bytes of the GUID. PostgreSQL sorts UUIDs by the first bytes (binary or string). That means the sequential part is ignored during sorting and indexing. Inserts are still scattered randomly throughout the index.

So yeah, no magic. Just fragmentation and the same performance trade-offs you’d get with random GUIDs.

What to use instead

UUID standards like UUIDv7 actually put the timestamp in the first bytes. This works beautifully with PostgreSQL because it sorts UUIDs from the front.

(This is essentially still Comb ID)

My quick experiment

I inserted about 100,000 rows into Postgres using EventFlow’s GuidFactories.Comb.Create() as the primary key.

Then I checked the index stats:

postgres=# SELECT COUNT(*) FROM comb_id_test;
 count
--------
 100000
(1 row)

postgres=# SELECT * FROM pgstatindex('comb_id_test_pkey');
 version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
       4 |          2 |    4530176 |           295 |              4 |        548 |           0 |             0 |            63.07 |                 50
(1 row)

Check that leaf fragmentation number, it means the logical page order doesn’t line up with the physical order. That hurts scan performance. To clean it up, you’ll need a VACUUM FULL (which means downtime) or REINDEX (which rebuilds the index from scratch).

Then I ran the same test using UUIDv7 (Guid.CreateVersion7()) (It’s built into .NET 9, so you don’t even need a separate library if all you want is database-friendly UUIDs)

And here’s the difference:

postgres=# SELECT COUNT(*) FROM uuidv7_test;
 count
--------
 100000
(1 row)

postgres=# SELECT * FROM pgstatindex('uuidv7_test_pkey');
 version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
       4 |          2 |    3178496 |           295 |              3 |        384 |           0 |             0 |            89.84 |                  0
(1 row)

Zero fragmentation.

Lessons learned

Comb IDs are not universally beneficial. Their effectiveness depends on the database’s sorting and storage logic.
For PostgreSQL, MSSQL-targeted Comb IDs don’t provide any advantage over random UUIDs.
If you have to use EventFlow’s GuidFactories.Comb, use CreateForString() instead of Create().
In certain cases where you don’t need to generate IDs upfront, let the database generate comb IDs, so you don’t have to deal with all of this.