
RFC 9562: Universally Unique IDentifiers (May 2024)

48 points by htunnicliff 1 year ago | 43 comments

londons_explore 1 year ago
> UUIDv7 features a time-ordered value field derived from the widely implemented and well-known Unix Epoch timestamp source, the number of milliseconds
This just seems to be a way of creating a huge class of subtle bugs. Now, when two things happen to be created in the same millisecond, they may or may not be monotonically increasing.
Plenty of systems will end up accidentally depending on the ordering of the UUIDs being the same order the UUIDs were generated in. And that will hold true till the system hits production and suddenly there is enough load for that not to be true for a handful of records and the whole system fails.
- vhcr 1 year ago
  Monotonicity is addressed in section 6.2, but it's optional.
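A minimal sketch of the optional monotonicity handling mentioned above, using a counter in the 12 bits after the version field (one of the counter approaches described in Section 6.2). The class name `MonotonicUUIDv7` and its `next` method are hypothetical, the generator is assumed to be single-threaded and single-process, and starting the counter at 0 is a simplification. It is an illustration, not a reference implementation.

```python
import os
import time
import uuid


class MonotonicUUIDv7:
    """Hypothetical UUIDv7 generator: 48-bit ms timestamp, 12-bit counter, 62 random bits."""

    def __init__(self):
        self._last_ms = -1
        self._counter = 0

    def next(self) -> uuid.UUID:
        ms = time.time_ns() // 1_000_000
        if ms <= self._last_ms:
            # Same millisecond (or the clock stepped backwards):
            # reuse the last timestamp and bump the counter instead.
            ms = self._last_ms
            self._counter += 1
            if self._counter > 0xFFF:
                # Counter exhausted: borrow the next millisecond.
                ms += 1
                self._counter = 0
        else:
            self._counter = 0
        self._last_ms = ms

        rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)
        value = (ms & ((1 << 48) - 1)) << 80   # unix_ts_ms
        value |= 0x7 << 76                     # version 7
        value |= self._counter << 64           # counter in the rand_a field
        value |= 0b10 << 62                    # variant bits
        value |= rand_b                        # remaining random bits
        return uuid.UUID(int=value)


gen = MonotonicUUIDv7()
ids = [gen.next() for _ in range(10_000)]
print(ids[0], ids[-1])
```

Without the counter, two UUIDs created in the same millisecond are ordered only by their random bits, which is the failure mode described at the top of this thread.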
swyx 1 year ago
i collect a list of UUID implementations and concerns to think thru here https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...
htunnicliff 1 year ago
TL;DR: Several new UUID versions have been standardized
UUIDv5 is meant for generating UUIDs from "names" that are drawn from, and unique within, some "namespace" as per Section 6.5.
UUIDv6 is a field-compatible version of UUIDv1 (Section 5.1), reordered for improved DB locality. It is expected that UUIDv6 will primarily be implemented in contexts where UUIDv1 is used.
UUIDv7 features a time-ordered value field derived from the widely implemented and well-known Unix Epoch timestamp source, the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded. Generally, UUIDv7 has improved entropy characteristics over UUIDv1 (Section 5.1) or UUIDv6 (Section 5.6).
UUIDv8 provides a format for experimental or vendor-specific use cases. The only requirement is that the variant and version bits MUST be set as defined in Sections 4.1 and 4.2. UUIDv8's uniqueness will be implementation specific and MUST NOT be assumed.
The only explicitly defined bits are those of the version and variant fields, leaving 122 bits for implementation-specific UUIDs. To be clear, UUIDv8 is not a replacement for UUIDv4 (Section 5.4) where all 122 extra bits are filled with random data.
Background for the changes:
Many things have changed in the time since UUIDs were originally created. Modern applications have a need to create and utilize UUIDs as the primary identifier for a variety of different items in complex computational systems, including but not limited to database keys, file names, machine or system names, and identifiers for event-driven transactions.
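As a rough illustration of two points in the summary above: UUIDv5 is deterministic for a given namespace and name, and UUIDv8 only fixes the version and variant bits. The helper `make_uuid8` is a hypothetical name, not part of Python's `uuid` module, and the payload value is arbitrary.

```python
import uuid

# Name-based UUIDv5: the same namespace and name always give the same UUID.
a = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
b = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")


def make_uuid8(payload: int) -> uuid.UUID:
    """Hypothetical helper: stamp version 8 and the RFC variant onto caller-supplied bits."""
    value = payload & ((1 << 128) - 1)
    value &= ~(0xF << 76)    # clear the 4-bit version field
    value |= 0x8 << 76       # version 8
    value &= ~(0b11 << 62)   # clear the 2-bit variant field
    value |= 0b10 << 62      # RFC variant
    return uuid.UUID(int=value)


payload = 0x0123456789ABCDEF0123456789ABCDEF
custom = make_uuid8(payload)

print(a, a.version)
print(custom, custom.version)
```

The other 122 bits pass through unchanged, so any uniqueness guarantee depends entirely on what the caller puts in them.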
pspeter3 1 year ago
I'm curious why they specify the UUID must have dashes in string format. It makes the UUID difficult to select with a double click.
- Two4 1 year ago
  As with IP addresses, UX/DX is not the primary concern
- shrimp_emoji 1 year ago
  Try a triple-click.
- azulster 1 year ago
  probably because the dashes have semantic meaning
- newprint 1 year ago
  you do understand that they existed way before the mouse and button became the norm?
  - SahAssar 1 year ago
    I think widespread mouse usage and early uuid usage was similar in time, 1980's to early 1990's.
    Not sure when the "double-click to select" UI paradigm became common though.
deathanatos 1 year ago
> Some UUID implementations, such as those found in Python and Microsoft, will output UUID with the string format, including dashes, enclosed in curly braces.
No … Python doesn't emit them enclosed in curly braces?
```
  >>> import uuid
  >>> str(uuid.uuid4())
  '593a2ffb-eafc-484a-9a90-93bc91805651'
```
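For reference, a short sketch of the string forms Python's `uuid` module emits and accepts, reusing the value from the snippet above. The output forms never include curly braces, but the constructor does accept them on input, along with hyphen-free hex.

```python
import uuid

u = uuid.UUID("593a2ffb-eafc-484a-9a90-93bc91805651")

print(str(u))  # hyphenated 8-4-4-4-12 form, no braces
print(u.hex)   # 32 hex digits, no hyphens
print(u.urn)   # same value with a urn:uuid: prefix

# Parsing is more lenient than printing: braces and missing hyphens are accepted.
braced = uuid.UUID("{593a2ffb-eafc-484a-9a90-93bc91805651}")
bare = uuid.UUID("593a2ffbeafc484a9a9093bc91805651")
print(braced == u, bare == u)
```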
LegionMammal978 1 year ago
> UUIDv7 features a time-ordered value field derived from the widely implemented and well-known Unix Epoch timestamp source, the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded.
That seems like a rather vague way of addressing leap seconds for UUIDv7. For positive leap seconds, an 'exclusion' of that second would suggest that the millisecond counter is halted until the leap second is over, which doesn't seem ideal for monotonicity. And an 'exclusion' of a negative leap second hardly makes any conventional sense at all, with regard to the millisecond counter.
Contrast with the timestamp of UUIDv1/v6, where positive leap seconds can just be handled by incrementing the clock sequence.
- fanf2 1 year ago
  That’s the normal way IETF RFCs describe unix seconds since the epoch, though there ought to be a normative reference to https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
  - LegionMammal978 1 year ago
    The problem with "seconds since the Epoch" is that if you naively add milliseconds, it is no longer monotonic.
    Since 2016-12-31T23:59:60Z is 1483228800 seconds since the Epoch, and 2017-01-01T00:00:00Z is also 1483228800 seconds since the Epoch, that means that 2016-12-31T23:59:60.xxxZ would have the same timestamp as 2017-01-01T00:00:00.xxxZ, for all xxx.
    This corresponds to the counter jumping backward 1000 milliseconds at 2017-01-01T00:00:00.000Z.
    - cfreksen 1 year ago
      I would say that the instant (or second-long interval) in time that we name 2016-12-31T23:59:60Z is 1483228836 seconds after 1970-01-01T00:00:00Z, and that 2017-01-01T00:00:00Z is 1483228837 seconds after 1970-01-01T00:00:00Z [0].
      The key here is that I use "seconds" to mean a fixed duration of time, like how long it takes light to travel 299792458 meters in a vacuum[1], and this version of seconds is independent of Earth orbiting the Sun, or the Earth spinning or anything like that[2]. If I understand you correctly, you use "seconds" more akin to how I use "days in a year": Most years have 365 days, but when certain dates start to drift too far from some astronomical phenomenon we like to be aligned with (e.g. that the Northern Hemisphere has Summer Solstice around the 21st of June) we insert an extra day in some years (about every 4th year).
      I haven't read RFC 9562 in detail, but if you use my version of "seconds" then "seconds since the Epoch" is a meaningful and monotonically increasing sequence. I suspect that some of the other commenters in this thread use this version of "seconds" and that some of the confusion/disagreement stems from this difference in definition.
      The paragraph in Section 6.1 titled "Altering, Fuzzing, or Smearing" also seems relevant:
      > Implementations MAY alter the actual timestamp. Some examples include ..., 2) handle leap seconds ...
      > This specification makes no requirement or guarantee about how close the clock value needs to be to the actual time.
      [0] Please forgive any off-by-one errors I might have made.
      [1] I know that the SI definition between meters and seconds is the other way around, but I think my point is clearer this way.
      [2] I ignore relativity as I don't think it is relevant here.
- anamexis 1 year ago
  There will not be any leap seconds after 2035, and very likely there will never be any negative leap seconds.
  - LegionMammal978 1 year ago
    That's plenty of time for the CGPM to change its mind, or to implement some other mechanism to bound the UT1 − UTC difference. It will eventually be an issue in any case, since it's not like they decided to let the difference grow without bound.
- wrs 1 year ago
  I interpreted it to mean the timer is monotonic and ignores leap seconds completely. It does make it easy to implement wrong if your most convenient time API does implement leap seconds. (I don’t see why this would have anything to do with the millisecond timer? Leap seconds happen on the second.)
  - LegionMammal978 1 year ago
    Unix timestamps are not monotonic when a positive leap second is applied: the next day must always start at a multiple of 86400 seconds, even if the UTC day is 86401 seconds long. Unless some part of the day is smeared, the timestamp must be set back at some point. So either the UUIDv7 timer is not monotonic, or it does not align with Unix timestamps.
    As for the millisecond timer, recall that a positive leap second lasts for 1000 milliseconds. So to 'exclude' the leap second, by one interpretation, would be to exclude each of those milliseconds individually as they arise; in other words, to halt the timer during the leap second.
    - anamexis 1 year ago
      The way I read it, they don't claim to align with Unix timestamps. They claim being aligned with the same source time.
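The repeated timestamp discussed in this thread can be checked against the POSIX "seconds since the Epoch" formula. The sketch below transcribes that formula directly; `posix_seconds` is a hypothetical helper name, and it accepts a seconds value of 60 so the leap second itself can be fed in.

```python
from datetime import date


def posix_seconds(year, month, day, hour, minute, second):
    """Seconds since the Epoch per the POSIX formula; `second` may be 60."""
    tm_year = year - 1900
    tm_yday = date(year, month, day).timetuple().tm_yday - 1  # zero-based day of year
    return (
        second
        + minute * 60
        + hour * 3600
        + tm_yday * 86400
        + (tm_year - 70) * 31536000
        + ((tm_year - 69) // 4) * 86400
        - ((tm_year - 1) // 100) * 86400
        + ((tm_year + 299) // 400) * 86400
    )


before = posix_seconds(2016, 12, 31, 23, 59, 59)
leap = posix_seconds(2016, 12, 31, 23, 59, 60)
after = posix_seconds(2017, 1, 1, 0, 0, 0)
print(before, leap, after)
```

The leap second and the first second of 2017 map to the same value, so a millisecond counter built naively on top of this formula repeats 1000 values unless the clock is smeared or otherwise adjusted.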
ComplexSystems 1 year ago
Surprising we're using 128 bits - some back of the napkin math tells me that may not be enough to avoid collisions...
- Spivak 1 year ago
  Depends on your problem domain. You can be Twitter/Discord sized and get away with 64 bits. When you start dedicating parts of your UUID to a timestamp the possibility of collisions does go way up since now a significant chunk of the UUID will be the same for everyone. But when you deploy this variant you aren't trying to make globally unique ids anymore, you're trying to make application unique ids. You are still very unlikely to not also have a globally unique id because 128 bits gives a lot of room to play around.
- WorldMaker 1 year ago
  For hash functions, maybe not anymore, given the birthday paradox/pigeon-hole principle and other math problems in bucketing inputs versus the attack patterns for breaking hash functions and causing intentional collisions. For mostly purely random entropy in uses like UUID (and IPv6) the classic answer is that it is still more overall space than "atoms in the visible universe".
- AaronFriel 1 year ago
  Care to share your math? My understanding of the birthday paradox is that it is astoundingly unlikely.
  - ComplexSystems 1 year ago
    It's just about on the cusp. We would need to generate 1.1774 × sqrt(2^128) UUIDs before getting a collision with 51% probability. That's about 2.17 × 10^19 total UUIDs.
    The real question is how many UUIDs are generated per second around the world. This RFC suggests using them for automated processes, transactions, etc and generally seems to view them as an inexhaustible resource. If humanity collectively generates 1 trillion per second we can expect to see a 51% chance of collision in 8 months; if it's 100 billion it'd be 10 years, and if it's only 10 billion it'd be 100 years. I would expect even just one single computer with a modest GPU could get in the ballpark of these numbers if it wanted to just spawn UUIDs all day, let alone a huge server farm using them as part of some automated process.
    - AaronFriel 1 year ago
      True, as universally unique identifiers, 128 (less a few) bits is not enough. You're talking about humanity generating 505 exabytes per year of just UUIDs. That won't happen any time soon.
    - t0mas88 1 year ago
      But not all UUIDs are going into the same "pool". There is no problem if your GPU generates a collision with one of my database identifiers, since I only care about my identifiers being unique in my system.
      For comparison, before UUIDs most databases were using auto increment integers. That means nearly everyone had id 1, 2 etc in use. Still not a problem.
- free_bip 1 year ago
  The time field ensures that collisions cannot occur until at minimum the time field rolls over.
  - debatem1 1 year ago
    Having been badly bitten by the timestamp equivalent of eating gum you scraped off the bottom of a desk, just let me add that there is an implicit "if your time source is good/non-adversarial" assumption here.
    Friends don't let friends use malicious clocks!
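The birthday-bound arithmetic from this thread can be reproduced with the usual approximation n ≈ sqrt(2 · N · ln(1 / (1 − p))), assuming every ID is drawn uniformly at random from a single shared pool of N values. The function name `birthday_bound` and the generation rates are illustrative only; a UUIDv4 has 122 random bits rather than 128, so both cases are shown.

```python
import math


def birthday_bound(bits: int, p: float = 0.5) -> float:
    """Approximate number of uniformly random IDs drawn before a collision has probability p."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))


SECONDS_PER_YEAR = 365.25 * 86400

n128 = birthday_bound(128)  # all 128 bits random (upper bound)
n122 = birthday_bound(122)  # 122 random bits, as in UUIDv4

# Years needed to reach the 128-bit bound at a given worldwide generation rate.
years = {rate: n128 / rate / SECONDS_PER_YEAR for rate in (1e12, 1e11, 1e10)}

print(f"128 bits: {n128:.3e} IDs, 122 bits: {n122:.3e} IDs")
for rate, y in years.items():
    print(f"{rate:.0e} IDs/s -> {y:.1f} years")
```

Under these assumptions the 128-bit figure matches the roughly 2.17 × 10^19 quoted above, and the time estimates come out at roughly 0.7, 7, and 69 years for the three rates.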
cachvico 1 year ago
HNGPT, please summarize the important changes?
- jcrites 1 year ago
  They seem better geared for usage in databases as primary keys, specifically UUID versions 6 and onwards:
  > Motivation. One area in which UUIDs have gained popularity is database keys. This stems from the increasingly distributed nature of modern applications. In such cases, "auto-increment" schemes that are often used by databases do not work well: the effort required to coordinate sequential numeric identifiers across a network can easily become a burden. The fact that UUIDs can be used to create unique, reasonably short values in distributed systems without requiring coordination makes them a good alternative, but UUID versions 1-5, which were originally defined by [RFC4122], lack certain other desirable characteristics [...]
  > UUIDv6 is a field-compatible version of UUIDv1 (Section 5.1), reordered for improved DB locality. It is expected that UUIDv6 will primarily be implemented in contexts where UUIDv1 is used. Systems that do not involve legacy UUIDv1 SHOULD use UUIDv7 (Section 5.7) instead.
  > Instead of splitting the timestamp into the low, mid, and high sections from UUIDv1, UUIDv6 changes this sequence so timestamp bytes are stored from most to least significant. That is, given a 60-bit timestamp value as specified for UUIDv1 in Section 5.1, for UUIDv6 the first 48 most significant bits are stored first, followed by the 4-bit version (same position), followed by the remaining 12 bits of the original 60-bit timestamp. [...]
  > UUIDv7 features a time-ordered value field derived from the widely implemented and well-known Unix Epoch timestamp source, the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded. Generally, UUIDv7 has improved entropy characteristics over UUIDv1 (Section 5.1) or UUIDv6 (Section 5.6).
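The UUIDv6 reordering quoted above can be sketched by converting an existing UUIDv1. The function name `uuid1_to_uuid6` is hypothetical, and the bit arithmetic is done by hand rather than relying on any built-in UUIDv6 support.

```python
import uuid


def uuid1_to_uuid6(u1: uuid.UUID) -> uuid.UUID:
    """Hypothetical helper: reorder a UUIDv1 timestamp into the UUIDv6 layout."""
    if u1.version != 1:
        raise ValueError("expected a version 1 UUID")
    ts = u1.time                        # 60-bit timestamp from the v1 fields
    value = (ts >> 12) << 80            # 48 most significant timestamp bits first
    value |= 0x6 << 76                  # 4-bit version, same position as in v1
    value |= (ts & 0xFFF) << 64         # remaining 12 timestamp bits
    value |= u1.int & ((1 << 64) - 1)   # variant, clock sequence, node unchanged
    return uuid.UUID(int=value)


u1 = uuid.uuid1()
u6 = uuid1_to_uuid6(u1)
print(u1)
print(u6)
```

Because the most significant timestamp bits now lead, sorting the resulting values as bytes or hex strings follows creation time, which is the database locality benefit the RFC text describes.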
posting_mess 1 year ago
[flagged]
- jcrites 1 year ago
  The problem that this standard solves isn't a math problem. It's an engineering problem of defining (adding) UUID formats that are suitable for use in database keys (and some other things). Previous proposals had disadvantages for the use-case.
  This is discussed in the "Update Motivation" section of the document: https://www.rfc-editor.org/rfc/rfc9562.html#name-update-moti...
- sedatk 1 year ago
  > but we cant come up with a decent UUID scheme
  maybe because we can’t come up with an unambiguous definition of “decent.”
  - posting_mess 1 year ago
    We do and we don't frequently agree on what's "decent".
    Most routers implement a set of security standards/protocols for VPNs that are "decent" and make them play nicely with each other.
    The "Redis protocol" gets re-implemented frequently because it's "decent" and useful to many vendors.
    I can't speak for "encryption" but there has to be numerous implementations of various algorithms.
    And this is true for many other protocols.
    UUID seems mathematically "provable" or "verifiable", why are we wasting time on needless "wrong" / "non decent" implementations?