by matja 16 hours ago

by da_chicken 11 hours ago

Yeah, I thought it was a strange comment, too. v7 is great when you explicitly need monotonicity, but encoded timestamps can expose information about your system. v4 is still very valid.

by jandrewrogers 3 hours ago

I think "outdated" was a poor choice of words. It is a failure to meet application requirements, which has more to do with design than age. Every standardized UUID is expressly prohibited in some application contexts due to material deficiencies, including v4. That includes newer standards like v7 and v8.

In practice, most orgs with sufficiently large and complex data models use the term "UUID" to mean a pure 128-bit value that makes no reference to the UUID standard. It is not difficult to find yourself with a set of application requirements that cannot be satisfied with a standardized UUID.

The sophistication of our use case scenarios for UUIDs exceeds their original design assumptions. They don't readily support every operation you might want to do on a UUID.

by zadikian 15 hours ago

Yeah, v4 is the go-to, and you only use something else if you have a very specific reason, like needing rough ordering.

by jodleif 14 hours ago

Deterministic UUIDs are a very standard use case.

by 8organicbits 11 hours ago

You're talking about the hash-based UUIDv3/v5? I haven't found examples of those being used, but I'm curious.

Using MD5 or 122 bits of a SHA1 hash seems questionable now that both algorithms have known collisions. Using 122 bits of a SHA2/3 seems pretty limited too. Maybe if you've got trusted inputs?
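
For reference, a minimal sketch of where the "122 bits of a SHA1 hash" figure comes from: v5 hashes the namespace bytes plus the name, keeps 128 bits of the digest, and overwrites 6 of them with the version and variant. The helper name `uuid5_by_hand` and the example name are made up for illustration; only Python's standard `uuid` and `hashlib` modules are used.

```python
import hashlib
import uuid


def uuid5_by_hand(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Rebuild a UUIDv5 manually to show where the 122 hash bits come from."""
    digest = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()
    raw = bytearray(digest[:16])       # keep 128 of SHA-1's 160 bits
    raw[6] = (raw[6] & 0x0F) | 0x50    # overwrite 4 bits with version 5
    raw[8] = (raw[8] & 0x3F) | 0x80    # overwrite 2 bits with variant 0b10
    return uuid.UUID(bytes=bytes(raw))


if __name__ == "__main__":
    name = "example.org"  # illustrative input only
    print(uuid5_by_hand(uuid.NAMESPACE_DNS, name))
    print(uuid.uuid5(uuid.NAMESPACE_DNS, name))  # the standard library's result
```

The two printed values should match, since the standard library's `uuid5` follows the same construction.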

by buffalobuffalo 1 hour ago

I use these a lot. My favorite use case is templates, especially ones that were not initially planned in the architecture.

Let's say I have some entity like an "organization" that has data spanning several different tables. I want to use that organization as a "parent" that I can clone to create new "child" organizations structured the same way. I also want to be able to periodically pull changes from the parent organization down into the child organization.

If the primary keys for all tables involved are UUIDs, I can accomplish this very easily by mapping all IDs in the relevant tables `id => uuid5(id, childOrgId)`. This can be done to all join tables, foreign keys, etc. The end result is a perfect "child" clone of the organization with all data relations still in place. This data can be refreshed from the parent organization any time simply by repeating the process.
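
A rough sketch of that mapping in Python, assuming the child organization's ID acts as the v5 namespace and the parent row's ID is the name (the comment doesn't say which argument is which). The function name `child_id` and the sample rows are hypothetical.

```python
import uuid


def child_id(parent_row_id: uuid.UUID, child_org_id: uuid.UUID) -> uuid.UUID:
    """Map a parent row's ID to the matching ID inside one child organization."""
    # Assumption: child org ID is the namespace, parent row ID is the name.
    return uuid.uuid5(child_org_id, str(parent_row_id))


if __name__ == "__main__":
    child_org = uuid.uuid4()
    # Hypothetical parent rows: a team, and a member with a foreign key to it.
    team = uuid.uuid4()
    member = (uuid.uuid4(), team)

    cloned_team = child_id(team, child_org)
    cloned_member = (child_id(member[0], child_org), child_id(member[1], child_org))

    # The cloned foreign key still points at the cloned team row.
    assert cloned_member[1] == cloned_team
    # Re-running the mapping gives the same IDs, so a refresh can be an upsert.
    assert child_id(team, child_org) == cloned_team
```

Because the mapping is a pure function of the two IDs, no lookup table of parent-to-child IDs has to be stored.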

by zadikian 6 hours ago

Common one is if you want two structs deemed "equivalent" based on a few fields to get the same ID, and you're only concerned about accidental collision. There are valid use cases for that, but I've also seen it misused often.

v7's rough ordering also helps as a PK in certain sharded DBs, while others want random keys, and non-sharded ones usually just use a serial int.

by 8organicbits 4 hours ago

Have you seen UUIDv3/v5 used there though? I've seen lots of md5 historically and sha variants recently, but not the UUID approach.

by eureka7 8 hours ago

I remember using them in a massive SQL query that needed to generate a GIS data set from multiple tables with an ungodly amount of JOINs and sub-queries to achieve ID stability. Don't ask :p

For those ~~curious~~ worried, no, this was not a security sensitive context.

by gzread 12 hours ago

If you want 128 bits of randomness why not use 128 bits of randomness? A random UUID presupposes the random number has to fit in UUID format.

by da_chicken 11 hours ago

122 bits of randomness.

It's the same reason we use UTF-8. It's well supported. UUIDs are well supported by most languages and storage systems. You don't have to worry about endianness or serialization. It's not a thing you have to think about. It's already been solved and optimized.
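
To make the 122 figure concrete, here is a small sketch that stamps the v4 version and variant bits onto 16 random bytes. The helper name `uuid4_from_random_bytes` is made up; the bit positions follow RFC 9562.

```python
import secrets
import uuid


def uuid4_from_random_bytes(raw: bytes) -> uuid.UUID:
    """Stamp the version and variant bits onto 16 random bytes."""
    if len(raw) != 16:
        raise ValueError("need exactly 16 bytes")
    b = bytearray(raw)
    b[6] = (b[6] & 0x0F) | 0x40   # 4 bits fixed: version 4
    b[8] = (b[8] & 0x3F) | 0x80   # 2 bits fixed: variant 0b10
    return uuid.UUID(bytes=bytes(b))


if __name__ == "__main__":
    raw = secrets.token_bytes(16)   # 128 bits from the OS CSPRNG
    u = uuid4_from_random_bytes(raw)
    print(u, u.version)             # 6 bits are now fixed, 122 stay random
```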

by gzread 11 hours ago

byte[16] is well supported by most languages and storage systems.

by da_chicken 10 hours ago

Sure.

Now generate your random ID. Did you use a CSPRNG, or did your devs get lazy and just use a PRNG? Are you doing that every time you generate one of these IDs, in every system that might need to communicate with your API? Or maybe they just generated one random number, and now they're adding 1 every time.

Now transfer it over a wire. Are you sure the way you're serializing it is how the remote system will deserialize it? Maybe you should use a string representation, since character transmission is a solved problem with UTF-8. OK, so who decides what that canonical representation is? How do we make it recognizable as an ID without looking like something that people should do arithmetic with?

It's not like random IDs were a new idea in 2002.
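
As an illustration of the "canonical representation" point, this sketch shows the forms Python's standard `uuid` module produces and checks that each parses back to the same value. It is only an example of one library's behavior, not a claim about every implementation.

```python
import uuid

u = uuid.uuid4()

as_text = str(u)     # canonical 8-4-4-4-12 lowercase hex, 36 characters
as_bytes = u.bytes   # 16 bytes, fields in big-endian order
as_urn = u.urn       # "urn:uuid:" followed by the canonical text

# Every form parses back to the same value.
assert uuid.UUID(as_text) == u
assert uuid.UUID(bytes=as_bytes) == u
assert uuid.UUID(as_urn) == u

# bytes_le stores the first three fields little-endian. Reading those
# bytes with the wrong layout gives a different UUID, unless the swapped
# fields happen to be byte-palindromes.
assert uuid.UUID(bytes=u.bytes_le) != u or u.bytes == u.bytes_le
```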

by 10000truths 9 hours ago

None of these are rocket-science problems, they're just standardization issues. You build a library with your generate_id/serialize_id/deserialize_id functions that work with a wrapper type, and tell your devs to use that library. UUID libraries are exactly that, except backed by an RFC.
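
A minimal sketch of such a library, using the three function names from the comment. The `Id` wrapper type and the hex wire format are assumptions picked for illustration, not anything the thread specifies.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Id:
    """Opaque 128-bit identifier (hypothetical wrapper type)."""
    raw: bytes

    def __post_init__(self) -> None:
        if len(self.raw) != 16:
            raise ValueError("Id must be exactly 16 bytes")


def generate_id() -> Id:
    # secrets draws from the OS CSPRNG, unlike the random module.
    return Id(secrets.token_bytes(16))


def serialize_id(value: Id) -> str:
    # Assumed wire format: 32 lowercase hex digits, no separators.
    return value.raw.hex()


def deserialize_id(text: str) -> Id:
    if len(text) != 32:
        raise ValueError("expected 32 hex digits")
    return Id(bytes.fromhex(text))
```

Every choice made here (byte length, randomness source, text encoding) is one a UUID library has already made and documented.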

by da_chicken 5 hours ago

Of course they're not rocket science. But the question here is, "Why don't you use 16 random bytes instead of a UUIDv4?" It's not a question about rocket science. The answer is still, "Because UUIDv4 is still a better way to do it." The UUID standard solves the second- and third-tier problems and knock-on effects you don't think about until you've run a system for a while, or until you start adding multiple information systems that need to interact with the same data.

But using UUIDv4 shouldn't be rocket science, either. UUID support should be built into a language intended for web applications, database applications, or business applications. That's why you're using Go or C# instead of C. And Go is somewhat focused on microservice architectures. It's going to need to serialize and deserialize objects regularly.

by gzread 10 hours ago

How's your UUIDv4 generated?

> Are you sure the way you're serializing it is how the remote system will deserialize it?

It's 16 bytes. There's no serialization.

by wredcoll 9 hours ago

What does it look like when I put it in a URL?

by pphysch 7 hours ago

Use whatever encoding you want? Base64 is probably one of the most practical, but you're not obligated to use that.
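
For comparison, a quick sketch of the same 16 bytes in a URL, once as unpadded URL-safe Base64 and once in the canonical UUID text form. The `/items/` path is a placeholder.

```python
import base64
import uuid

u = uuid.uuid4()

canonical = str(u)  # 36 characters: hex with hyphens
b64 = base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")  # 22 characters

# Decoding has to restore the padding that was stripped for the URL.
decoded = uuid.UUID(bytes=base64.urlsafe_b64decode(b64 + "=="))
assert decoded == u

print(f"/items/{canonical}")
print(f"/items/{b64}")
```

The Base64 form is shorter, but both sides have to agree on the alphabet and the padding rule, which is exactly the kind of decision the canonical form already settles.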

by bastawhiz 5 hours ago

UUIDs don't use base64

by bastawhiz 5 hours ago

> There's no serialization.

Hex encoding with hyphens in the right spot isn't serialization?

by intelVISA 7 hours ago

Vibe endian

by da_chicken 15 minutes ago

Schrodinger's complement

by efilife 10 hours ago

You are really making it seem like a huge problem. Generate random bytes, serialize to a string and store in a db. Done

A downvote tells me nothing. Please tell me what I'm missing, maybe I could learn something

by bastawhiz 4 hours ago

> serialize to a string and store in a db

Ah, here we are. If it's just bytes, why store it as a string? Sixteen bytes is just a 128-bit integer, don't waste the space. So now the DB needs to know how to convert your string back to an integer. And back to a string when you ask for it.

"Well why not just keep it as an integer?"

Sure, in which base? With leading zeroes as padding?

But now you also need to handle this in JavaScript, where you have to know to deserialize it to a BigInt or Buffer (or Uint8Array).

UUIDs just mean you don't need to do any of this crap yourself. It's already there and it already works. Everything everywhere speaks the same UUIDs.
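
A short sketch of the base and padding questions raised above, using a hand-picked example value with many leading zero bits.

```python
import uuid

u = uuid.UUID("00000000-0000-4000-8000-00000000002a")

n = u.int                        # the same 128 bits as a single integer
decimal = str(n)                 # base 10: width varies with the value
hex_unpadded = format(n, "x")    # base 16: leading zeros are lost
hex_padded = format(n, "032x")   # base 16, zero-padded to 32 digits

assert hex_padded == u.hex
assert len(hex_unpadded) < 32
assert uuid.UUID(int=int(decimal)) == u

# The value exceeds 2**53, so it cannot round-trip through an IEEE-754
# double. That is why JavaScript needs BigInt or a byte array here.
assert n > 2**53
```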

by TomatoCo 7 hours ago

You have to generate random bytes with sufficient entropy to avoid collisions and you have to have a consistent way to serialize it to a string. There's already a standard for this, it's called UUID.
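
As a back-of-the-envelope check on "sufficient entropy", the usual birthday-bound approximation can be evaluated directly. This sketch assumes IDs are drawn uniformly at random with 122 random bits, as in v4; the function name is made up.

```python
import math


def collision_probability(n: int, random_bits: int = 122) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-n * (n - 1) / 2 ** (random_bits + 1))


if __name__ == "__main__":
    for n in (10**9, 10**12, 10**15):
        print(f"{n:.0e} IDs -> p ~ {collision_probability(n):.3g}")
```

The estimate only holds if the generator really is uniform, which is the part a lazy PRNG or a counter quietly breaks.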

by hamburglar 2 hours ago

It’s really not that complicated a problem. Don’t worry, you’ll certainly be able to solve all the problems yourself as you encounter them. What you end up with will be functionally equivalent to a proper UUID and will only have cost you man-months of pain, but then you will be able to truly understand the benefit of not spending your effort on easy problems that someone solved before you.

by zadikian 6 hours ago

It's not a huge problem. UUID adds convenience over reinventing that wheel everywhere. And some of those wheels would use the wrong random source, hash, or encoding.

(Downvote wasn't me)

by bootsmann 15 hours ago

Really? Doesn't v4 locally make the inserts into the B-tree pretty messy? I was taught to use v7 because it allows writes to be a lot faster due to memory-efficient paging by the kernel (something you lose with v4 because the page of a subsequent write is entirely random).

by sintax 14 hours ago

https://www.thenile.dev/blog/uuidv7#why-uuidv7 has some details: "UUID versions that are not time ordered, such as UUIDv4 (described in Section 5.4), have poor database-index locality. This means that new values created in succession are not close to each other in the index; thus, they require inserts to be performed at random locations. The resulting negative performance effects on the common structures used for this (B-tree and its variants) can be dramatic."

Also mentioned on HN https://news.ycombinator.com/item?id=45323008
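
To illustrate the locality difference, here is a hand-rolled sketch of the v7 layout (48-bit millisecond timestamp, then version, variant, and random bits). The name `uuid7_sketch` is made up, and the sketch does not attempt the optional monotonicity guarantees within a single millisecond.

```python
import secrets
import time
import uuid
from typing import Optional


def uuid7_sketch(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUIDv7-shaped value: 48-bit ms timestamp, then random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    value = (unix_ms & 0xFFFFFFFFFFFF) << 80   # bits 127..80: timestamp
    value |= 0x7 << 76                         # bits 79..76: version 7
    value |= secrets.randbits(12) << 64        # bits 75..64: random
    value |= 0b10 << 62                        # bits 63..62: variant
    value |= secrets.randbits(62)              # bits 61..0: random
    return uuid.UUID(int=value)


if __name__ == "__main__":
    v7 = [uuid7_sketch(1_700_000_000_000 + i) for i in range(5)]
    v4 = [uuid.uuid4() for _ in range(5)]
    print(v7 == sorted(v7))   # True: later timestamps sort later
    print(v4 == sorted(v4))   # almost always False: no ordering at all
```

With increasing timestamps, new keys land at the right-hand edge of an ordered index, while v4 keys land at arbitrary positions.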

by ownagefool 12 hours ago

In more practical terms:

1. Users - Your users table may not benefit from being ordered by a created_at (or UUIDv7) index, because whether or not you need to query that data is tied to the user's activity rather than to when they first onboarded.

2. Orders - The majority of your queries are on recent orders or are historical-reporting-type queries, which should benefit from a created_at (or UUIDv7) index.

Obviously the argument is then that you're leaking data in the key, but my personal take is that this is overstated. You might not want to tell people how old a User is, but you're pretty much always going to tell them how old an Order is.

by da_chicken 11 hours ago

It's memory and disk paging both.

There's also a hot spot problem with databases. That's the performance problem with autoincrement integers. If you are always writing to the same page on disk, then every write has to lock the same page.

UUIDv7 is a trade-off between a messy B-tree (page splits) and a write-page hot spot (latch contention). It's always on the right side of the B-tree, but it's spread out more to avoid hot spots.

That still doesn't mean you should always use v7. It does reversibly encode a timestamp, and it could be used to determine the rate at which IDs are generated (analogous to the German tank problem). If the UUIDv7 is monotonic, then it's worse for this issue.
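
To show how little work "reversibly" takes, this sketch reads the creation time back out of a v7 value. The function name is made up; the sample is the v7 test vector from RFC 9562.

```python
import uuid
from datetime import datetime, timezone


def uuid7_timestamp(u: uuid.UUID) -> datetime:
    """Recover the creation time embedded in a UUIDv7."""
    if (u.int >> 76) & 0xF != 7:
        raise ValueError("not a version 7 UUID")
    unix_ms = u.int >> 80   # top 48 bits: milliseconds since the Unix epoch
    return datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # Example value from the test vectors in RFC 9562.
    sample = uuid.UUID("017f22e2-79b0-7cc3-98c4-dc0c0c07398f")
    print(uuid7_timestamp(sample).isoformat())
```

Anyone who sees the ID can do the same, with no key or secret involved.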

by out_of_protocol 14 hours ago

v7 exposes creation date, and maybe you don't want that. So, depends on use-case

by 1f60c 12 hours ago

I think I read something once about using v7 internally and exposing v4 in your API.

by talkin 8 hours ago

Or even an autoincrement int primary key internally. It depends on your scale, environment, etc., but that still fits plenty of use cases.

by matja 14 hours ago

In distributed databases I've worked with, there's usually something like a B-tree per key range, but there can be thousands of key ranges distributed over all the nodes in the cluster in parallel, each handling modifications in an LSM. The goal there is to distribute the storage and processing equally over all nodes, which is why predictable/clustered IDs work poorly there. That's different from the Postgres/MySQL scenario, where you have one large B-tree per index.

by lijok 9 hours ago

Have you considered using two UUIDs for more randomness?

by pclmulqdq 12 hours ago

I believe current official guidance if you want a lot of random data is to use v8, the "user-defined" UUID. The use of v4 is strictly less flexible here.

by 8organicbits 11 hours ago

No, UUIDv8 offers 122 bits for vendor-specific or experimental use cases. If you fill those bits randomly, you get the same amount of randomness as a v4. The spec is explicit that it does not replace v4 for the random-data use case.

> To be clear, UUIDv8 is not a replacement for UUIDv4 (Section 5.4) where all 122 extra bits are filled with random data.

https://www.rfc-editor.org/rfc/rfc9562.html#section-5.8-2

by pclmulqdq 8 hours ago

Yes, vendor-specific data can be 100% random.

by 8organicbits 6 hours ago

It can be, but you should prefer UUIDv4 if you do that. One problem is that UUIDv8 does not promise uniqueness.

> UUIDv8's uniqueness will be implementation specific and MUST NOT be assumed.

Here's a spec compliant UUIDv8 implementation I made that doesn't produce unique IDs: https://github.com/robalexdev/uuidv8-xkcd-221

So, given a spec-compliant UUIDv4 you can assume it is unique, but you'd need out-of-band information to make the same assumption about a UUIDv8.

I wrote much more in a blog post: https://alexsci.com/blog/uuid-oops/
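
As a standalone illustration of the uniqueness point (not the linked implementation), this sketch builds a well-formed v8 value whose payload is a fixed, made-up constant. The function name and payload are hypothetical.

```python
import uuid


def uuid8_constant() -> uuid.UUID:
    """A well-formed UUIDv8 whose payload is a fixed, made-up constant."""
    payload = 0x04            # hypothetical "vendor-specific" data; never changes
    value = payload
    value |= 0x8 << 76        # version 8
    value |= 0b10 << 62       # variant 0b10
    return uuid.UUID(int=value)


if __name__ == "__main__":
    a, b = uuid8_constant(), uuid8_constant()
    print(a, a.version)   # version and variant bits are set as v8 requires
    print(a == b)         # True: correct format, zero uniqueness
```

The version and variant bits are all a reader can verify from the value itself, so uniqueness has to be promised by whoever generated it.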

by arccy 14 hours ago

[flagged]
