The perils of UUID primary keys in SQLite

upvote

The perils of UUID primary keys in SQLite

(andersmurphy.com)

134 points

by emschwartz18 hours ago |

upvote

by sedatk6 minutes ago|

[-]

UUIDv7 and sequential integers are quite similar. Sequential integers disclose count and neighboring IDs while UUIDv7 discloses timestamp. Either can be a security issue in certain cases.

So, UUIDv4 as a PK on a clustered index can be perfectly feasible for cases where you want to avoid disclosing stuff and row insertion performance isn’t that important.

reply

upvote

by blopker17 hours ago|

[-]

UUIDs are way over used. There is almost always a better key to use, usually a bigint for databases. If you're making some kind of leaderless distributed data store, then maybe, but even then there are other ID sharding strategies I'd go for first depending on the constraints.

For a single database, bigints are smaller and faster, with less footguns.

UUIDs can be nice for an opaque public ID, however I'd still prefer something like a Sqid for space and usability.

reply

upvote

by Fabricio2015 hours ago|

[-]

> bigints are smaller and faster, with less footguns

But be careful!! Javascript WILL interpret your bigints as Number() and round them down because they are too big without telling you!!!

Famously seen by every snowflake user that has interacted with Javascript, quite an annoying problem.

reply

upvote

by silvestrov10 hours ago|

[-]

Good trick is to prefix all such keys with magic, i.e. a couple of letters that identify type type of key.

Then it will always be a string and you will be free to change the format/type of the key in the future to UUID or whatever you like.

reply

upvote

by zmj2 hours ago|

[-]

Rule of thumb: if you’re not doing math with a value, it’s not a number.

reply

upvote

by Piezoid7 hours ago|

[-]

Using a Feistel cipher and base 32 encoding at the boundaries of the system can help catching vibe coded edge code that attempt to decode identifiers in javascript. It also somewhat obfuscate the cardinalities and fill rate of the tables.

reply

upvote

by sheept8 hours ago|

[-]

This can be avoided by supplying a reviver:

    const json = '{ "a": 9007199254740993 }'
    JSON.parse(json, (_key, value, context) => /^\d+$/.test(context.source) ? BigInt(context.source) : value)

reply

upvote

by spiffytech15 hours ago|

[-]

Fortunately we're seeing more JS DB libraries offering to read large numbers as the BigInt type.

reply

upvote

by shakna12 hours ago|

[-]

But frustratingly, a JS BigInt is nothing like a BigInt in any other language.

In JS - BigInt is 64bit integer.

In anything else - BigInt is a arbitrarily large integer.

reply

upvote

by anematode12 hours ago|

[-]

Hm? JavaScript BigInts are arbitrary precision, and you need to use methods like BigInt.asIntN(64, a) to convert them to 64 bits

reply

upvote

by mort9610 hours ago|

[-]

I hate this so much because you can’t nicely serialise a BigInt as JSON. Using a string is nicer but it only makes sense where int64 is used as an ID, not where it’s used as a number; and you don’t wanna have to configure this per field per query.

reply

upvote

by marcosdumay7 minutes ago|

[-]

IMO, I'm tending toward thinking that having types on your readable serialization format is a mistake, and that they should be always input to the (de)serializer instead.

reply

upvote

by sheept8 hours ago|

[-]

You can serialize a BigInt by specifying a replacer:

    const obj = { a: 9007199254740993n }
    JSON.stringify(obj, (_key, value) => typeof value === 'bigint' ? JSON.rawJSON(value.toString()) : value)

reply

upvote

by mort967 hours ago|

[-]

And then you end up with strings on the other side, not numbers.

reply

upvote

by sheept7 hours ago|

[-]

No you don't? The example I gave produces

    {"a":9007199254740993}

not

    {"a":"9007199254740993"}

reply

upvote

by mort963 hours ago|

[-]

Oh, that's much worse! The JSON string `{"a":9007199254740993}` decodes to the object `{"a":9007199254740992}` with typical JSON parsers like JavaScript's `JSON.parse`.

reply

upvote

by sheept3 minutes ago|

[-]

If you're applying a replacer, then you'd supply a reviver when parsing:

    const json = '{ "a": 9007199254740993 }'
    JSON.parse(json, (_key, value, context) => /^\d+$/.test(context.source) ? BigInt(context.source) : value)

reply

upvote

by nh29 hours ago|

[-]

JSON has arbitrary length numbers in the spec only.

reply

upvote

by mort967 hours ago|

[-]

Completely and utterly irrelevant.

reply

upvote

by Etheryte11 hours ago|

[-]

This is simply not true? Or maybe I misunderstand what you mean?

reply

upvote

by paulddraper14 hours ago|

[-]

!!

Node.js drivers will correctly read int64 as string or bigint, not number.

E.g. pg for PostgreSQL

Maybe there’s a buggy driver but I don’t know it.

reply

upvote

by Fabricio2030 minutes ago|

[-]

Browser!! The browser reads it as Number. If your rest api returns {"id": 1324535222364012585} for example, javascript will try and parse that as number from the response!!!

You can of course, change the api such that it does {"id": "1324535222364012585"} instead and voila, it will no longer try parsing it as number. Or the many other workarounds people have recommended above (like appending a prefix, or using a different encoding), but why is it trying to parse a number thats too big and instead of throwing it just rounds down without telling you????!

reply

upvote

by Fire-Dragon-DoL35 minutes ago|

[-]

Providing an ID from the client is a big advantage that's missing though. Especially if you want a UI with optimistic rendering that's dealing with something async

reply

upvote

by JamesSwift15 hours ago|

[-]

UUIDs also have a nice benefit of it being impossible to query the wrong table with one if you mixup what an FK goes to

reply

upvote

by pyuser58315 hours ago|

[-]

Yeah this is nice - also helps with grepping dump files.

reply

upvote

by mamcx13 hours ago|

[-]

How is this done?

reply

upvote

by nickpeterson13 hours ago|

[-]

They just mean you catch incorrect joins more easily because there is usually no overlap in keys between unrelated tables. Using int, you’re usually going to have some shared values between two unrelated tables.

reply

upvote

by sudoshred13 hours ago|

[-]

Statistically impossible to inadvertently generate a collision using UUID keys. UUID is designed to be unique when generated across any computer system. Practically speaking if you have an exactly matching pair of UUIDs from disparate system you have found the exact record match. The name gives a hint "Universally unique identifier". -Not a cryptographer.

reply

upvote

by usrnm8 hours ago|

[-]

It definitely is possible, just very improbable

reply

upvote

by echoangle8 hours ago|

[-]

That’s probably what’s meant by statistically impossible.

reply

upvote

by 13 hours ago|

[-]

deleted

reply

upvote

by masklinn13 hours ago|

[-]

The U means if you join the wrong table your join will always come up empty.

It does not actually make it impossible to query the wrong table it just tells you quickly when you’ve done so.

reply

upvote

by chrismorgan8 hours ago|

[-]

You can achieve this with numeric sequences too, by having a consistent step and unique offset in all your sequences. For example, if you will never exceed 16 types, reserve four bits as the type discriminant. (You don’t have to use powers of two, but it may be convenient.)

All sequences use step 16.

Type A has discriminant/offset 0, yielding IDs {0, 16, 32, 48, 64, …}.

Type B has discriminant/offset 1, mapping to IDs {1, 17, 33, 49, 65, …}.

All the way up to Type P with discriminant/offset 15 and IDs {15, 31, 47, 63, 79, …}.

This is also trivially invertible so that you can determine the type from the ID.

A more common approach is to make IDs opaque strings and put a type prefix—A0, B12, P34, that kind of thing. But this way you can keep it as a number, if you wish.

reply

upvote

by PUSH_AX2 hours ago|

[-]

What are uuid foot guns?

reply

upvote

by andersmurphy11 hours ago|

[-]

Yes this matters even more if you are doing a lot of joins. Naive string UUIDs are 32 bytes (though I use binary uuid in the post which is 16) compared to 8 bytes for a 64-bit int. This matters even more with sqlite as it uses varint encoding. The upshot of all this is your indexes take up a lot less space in memory.

reply

upvote

by 1 hours ago|

[-]

deleted

reply

upvote

by bob102916 hours ago|

[-]

I am finding UUIDs help a lot if your primary schema consumer is an LLM.

Inappropriate aliasing of integer keys allows for silent errors in queries because it will actually return some result a lot of the time. A UUID is immune to this problem. The model recognizes its mistake a lot more reliably when previously non-empty tables start showing up empty after attempting a join.

reply

upvote

by crubier15 hours ago|

[-]

No one ever got fired for using UUIDs

reply

upvote

by 17 hours ago|

[-]

deleted

reply

upvote

by adityaathalye6 hours ago|

[-]

Thanks for the benching, Anders! So grateful for the stuff you've shared over the years. Invariably, every single post has been useful and/or educational to me.

I read this post more as an illustration of the *value* of UUIDv7 as primary key, over integer primary keys, in lieu of minimal loss of read/write performance, and marginally more data on disk bloat.

SQLite's automatic integer rowID primary key is a no-brainer, when the SQLite application is local-only, such as application storage format (mobile and desktop). Or is never intended to grow beyond a single server instance. Basically, where each SQLite file is private to a singular instance of the application.

However, if there is even an outside chance of needing to cooperate across application instances, e.g. the minimal limit case of a personal knowledge base that should seamlessly sync across a person's devices, as well as a hosted service, then a high-quality sequential random ID starts to make a lot more sense. (No-brainer arbitrary table merges / splits / remerges, de-duplication, etc.)

Random ID primary key is a bad idea period, whether it be the UU kind or the SQ kind, or any other kind. As far as my DB knowledge goes, this class of ID destroys all tree-algorithms, and we are stuck with the fact that there is no practically better way, than an appropriate tree-structure, to group and organise a meaningful amount of data, efficiently and effectively.

reply

upvote

by andersmurphy1 hours ago|

[-]

I've updated the article with the correct rowid alias (integer not int) so the rowid version is now 715ms. I've also added an example of rowid and a secondary index UUID4, and that also seems to be bad for performance (as although it's not a clustered index it's still random inserts into a b-tree).

reply

upvote

by adityaathalye6 hours ago|

[-]

Aside: Specific to SQLite...

Thanks to its oh so convenient automatic integer rowIDs, I believe one can amortise some of the other overheads of UUIDv7s for "in-between" queries, viz. indices, joins, ctes, virtual tables etc., with appropriate schema / query design.

reply

upvote

by jdthedisciple9 hours ago|

[-]

So UUID isn't the problem but UUID v4 is, just like any random ID-scheme, correct?

UUID v7 so far seems like the best solution if you want UUID benefits and ordering.

reply

upvote

by scotty793 hours ago|

[-]

It's " WITHOUT ROWID" problem.

Why would you force database to order rows on the drive according to random id?

reply

upvote

by chromatin46 minutes ago|

[-]

If you had read the article, you'd have seen that UUIDv4 with Rowid was slightly slower than UUIDv7

reply

upvote

by andersmurphy11 hours ago|

[-]

This is actually a draft. I Wanted to add more details about how this changes with row size etc. I might get time to update it later today.

reply

upvote

by ysleepy6 hours ago|

[-]

Maybe you could explain why one would use "without rowid" in the first place.

I get saving 8 bytes per row seems attractive, but the tradeoff is not explained.

reply

upvote

by andersmurphy2 hours ago|

[-]

Update the article there's now a section for UUID4 with rowid. It's less bad than UUID4 without rowid but it's still about 4-6x slower than UUID7 without rowid.

reply

upvote

by keynha2 hours ago|

[-]

[flagged]

reply

upvote

by bambax10 hours ago|

[-]

Why would you use UUIDs a primary keys? Let SQLite use rowids internally (which is automatic and invisible), and have a different (indexed) column with UUID if you need that for publishing the ID somewhere.

reply

upvote

by elcomet7 hours ago|

[-]

UUID as key is useful when you have a distributed system where multiple workers create items independently

reply

upvote

by victorbjorklund3 hours ago|

[-]

Because another app can then create the id and add it to the db later.

reply

upvote

by ItsBob9 hours ago|

[-]

My rule for primary keys and id's is simple: Sequential integer (or bigint) as the PK and if I need to make it public, I have a GUID (or UUID) in the row too, e.g. tbl_person would have Id (int|bigint) and person_guid as (UUID).

The Integer id is used for joins and looks ups and such but that's it. If I need to send anything to the frontend or outside of the app/DB then that's the UUID.

reply

upvote

by pyuser58315 hours ago|

[-]

Oh gosh the ints v uuids debate for pks. This is worse than vim v eMacs or brackets v braces.

reply

upvote

by ac50hz10 hours ago|

[-]

I enjoy these carefully worded posts from Anders Murphy, illustrative and informative, not opinionated and preachy. Very useful, it’s great to see the process, and ofc bookmarkeable material for sharing with others.

reply

upvote

by w10-116 hours ago|

[-]

Isn't the solution just to use the rowid (after doing the read-id-after-insert dance)?

How much trouble does SQLite reysing rowid's actually cause?

reply

upvote

by andersmurphy11 hours ago|

[-]

You don't even need to that. SQLite auto increments the ids and is a single writer (which you should be coordinating at the application level.

Regular rowids are definitely the way to go if you can use them.

reply

upvote

by kjgkjhfkjf7 hours ago|

[-]

The script to create the benchmark numbers appears to be inserting 100 batches, not 10. (The benchmark numbers in the table appear to be consistent with the text, so I guess the actual script used to create them was correct.)

reply

upvote

by andersmurphy3 hours ago|

[-]

Yeah that was just a holdover from when I was playing with smaller batch sizes. It's not in the actual linked source.

reply

upvote

by yepyoukno18 hours ago|

[-]

Perils of “UUIDv4”. Everyone knows that’s what UUIDv7 was really for, and you should always convert that to binary to optimize everything.

reply

upvote

by antihero1 hours ago|

[-]

Doesn't Postgres' UUID type just do this for you anyway?

Why would you store it as as str column and not the inbuilt type for this?

https://www.postgresql.org/docs/current/datatype-uuid.html

If you are using SQLite well I guess that doesn't work.

reply

upvote

by JSR_FDED17 hours ago|

[-]

Small nit: uuid7 is 128 bits (16 bytes) by definition. So there’s no need to convert it to binary. It already is. Unless you’re working with a stringified version of the uuid7.

reply

upvote

by yepyoukno17 hours ago|

[-]

Oh yes, I meant don’t store as an ID in its string format!

reply

upvote

by dexterdog15 hours ago|

[-]

It's just s dumb as storing dates as strings, but people still do it.

reply

upvote

by littlecranky6710 hours ago|

[-]

But SQLite does not have a native datetime type so you have to use strings

reply

upvote

by voakbasda10 hours ago|

[-]

You can use an integer

reply

upvote

by Volundr3 hours ago|

[-]

How do I know the time zone of an integer? Sure there are plenty of cases where one doesn't care, but there are also many cases where the original time zone is important.

reply

upvote

by dexterdog26 minutes ago|

[-]

The integer is a UTC time so it can be sorted. If you need the time zone you store than in a smaller field.

reply

upvote

by littlecranky673 hours ago|

[-]

other comment said it already, timezone information is not saved. Easiest is just to use a string.

reply

upvote

by cenamus11 hours ago|

[-]

But also one of the recommended ways of doing it, as it has no native Datetime type.

reply

upvote

by themafia14 hours ago|

[-]

> and you should always convert that to binary to optimize everything

I disagree. I tried this once. Now you need a client access layer to touch the DB in any context. All your console tools no longer work well or at all. If they show up in URLs you need to deoptimize them for transport.

You give up a lot of convenience for this optimization. You should be absolutely sure your design requires it before using it.

reply

upvote

by cropcirclbureau10 hours ago|

[-]

Is this relevant for other databases? For postgres for example, which supports concurrent writers, wouldn't sequential keys lead to contention on the page at the frontier?

reply

upvote

by andersmurphy10 hours ago|

[-]

That's a good question. I don't know the answer. I will say, generally you can get higher write throughput with a single writer. Even more so if you're prepared to shard along boundaries where you don't need atomic transactions.

Contention and coordination are real killers, concurrent writes (that require coordination like postgres) often underdeliver.

reply

upvote

by dumbledorf18 hours ago|

[-]

Wait how is sqlite doing a million inserts a second?

reply

upvote

by JSR_FDED17 hours ago|

[-]

In batches

reply

upvote

by kg17 hours ago|

[-]

sqlite is really fast. I'm surprised it's only a million.

reply

upvote

by 16 hours ago|

[-]

deleted

reply

upvote

by andersmurphy11 hours ago|

[-]

It's running on an M1 mac with synchronous full. Wouldn't surprise me if it's possible to get higher numbers.

reply

upvote

by smitty1e17 hours ago|

[-]

':memory:'

https://sqlite.org/inmemorydb.html

reply

upvote

by KPGv213 hours ago|

[-]

Except this source code is not using :memory: The linked source code has

    (defonce db
      (d/init-db! "db/db.db"
        {:pool-size 4 :pragma {:synchronous "FULL"}}))

That's writing to disk.

reply

upvote

by andersmurphy11 hours ago|

[-]

Yes it's writing to disk (on a M1 mac which has terribly slow fsync). But, because of the transaction the fsync dance is done once per batch. Each row is the id + a 50 byte data blob.

There's only one index so there's no real write amplification. The numbers will go down as you add more data and indexes.

reply

upvote

by michaelcampbell6 hours ago|

[-]

How much time is `(random-uuid7-bytes)` taking?

reply

upvote

by u1hcw9nx8 minutes ago|

[-]

I can't believe I had to scroll down to this far to see someone making this point.

Also INSERT speed instead of SELECT? Typically most time is spend in SELECT or UPDATE.

reply

upvote

by knightops_dev22 minutes ago|

[-]

[flagged]

reply

upvote

by 41 minutes ago|

[-]

deleted

reply

upvote

by wood_spirit12 hours ago|

[-]

If you need (or want the convenience of) a uuid and the time of creation is not secret then use ulids eg uuid v7.

reply