Hacker News

This is actually a very deep and interesting topic. Stripping information from an identifier disconnects a piece of data from the real world which means we no longer can match them. But such connection is the sole purpose of keeping the data in the first place. So, what happens next is that the real world tries to adjust and the "data-less" identifier becomes a real world artifact. The situation becomes the same but worse (eg. you don't exist if you don't remember your social security id). In extreme cases people are tattooed with their numbers.

The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.





> Stripping information from an identifier disconnects a piece of data from the real world which means we no longer can match them. But such connection is the sole purpose of keeping the data in the first place.

The identifier is still connected to the user's data, just through the appropriate other fields in the table as opposed to embedded into the identifier itself.

> So, what happens next is that the real world tries to adjust and the "data-less" identifier becomes a real world artifact. The situation becomes the same but worse (eg. you don't exist if you don't remember your social security id). In extreme cases people are tattooed with their numbers.

Using a random UUID as primary key does not mean users have to memorize that UUID. In fact in most cases I don't think there's much reason for it to even be exposed to the user at all.

You can still look up their data from their current email or phone number, for instance. Indexes are not limited to the primary key.
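To make that concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table, its columns and the helper names are made up for illustration; the only point is that the primary key is a random UUID the user never sees, while lookups go through a separate unique index on email.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id    TEXT PRIMARY KEY,   -- random UUID, never shown to the user
    email TEXT NOT NULL,
    phone TEXT
);
-- Secondary index: lookups by email do not need the primary key.
CREATE UNIQUE INDEX users_email_idx ON users (email);
""")


def create_user(email, phone):
    """Insert a user under a freshly generated random key and return that key."""
    user_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO users (id, email, phone) VALUES (?, ?, ?)",
        (user_id, email, phone),
    )
    return user_id


def find_user_id_by_email(email):
    """Resolve something the user knows (their email) to the internal key."""
    row = conn.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchone()
    return row[0] if row else None


def change_email(user_id, new_email):
    """The key stays the same; only the searchable attribute changes."""
    conn.execute("UPDATE users SET email = ? WHERE id = ?", (new_email, user_id))
```

After `change_email`, the old address no longer resolves and the new one resolves to the same `id`, so nothing that references the key has to be touched.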

> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

A fully random primary key takes into account that things change - since it's not embedding any real-world information. That said I also don't think there's much issue with embedding creation time in the UUID for performance reasons, as the article is suggesting.


> You can still look up their data from their current email or phone number, for instance. Indexes are not limited to the primary key.

This is the key point, I think. Searching is not the same as identifying.


> Using a random UUID as primary key does not mean users have to memorize that UUID. In fact in most cases I don't think there's much reason for it to even be exposed to the user at all.

So what is such an identifier for? Is it only for some technical purposes (like replication etc.)?

Why bother with UUID at all then for internal identifiers? Sequence number should be enough.


"Internal" is a blurry boundary, though - you pick integer sequence numbers and then years on an API gets bolted on to your purely internal database and now your system is vulnerable to enumeration attacks. Does a vendor system where you reference some of your internal data count as "internal"? Is UID 1 the system user that was originally used to provision the system? Better try and attack that one specifically... the list goes on.

UUIDs or other similarly randomized IDs are useful because they don't include any ordering information or imply anything about significance, which is a very safe default despite the performance hits.

There certainly are reasons to avoid them and the article we're commenting on names some good ones, at scale. But I'd argue that if you have those problems you likely have the resources and experience to mitigate the risks, and that true randomly-derived IDs are a safer default for most new systems if you don't have one of the very specific reasons to avoid them.


> "Internal" is a blurry boundary, though

Not for me :)

"Internal" means "not exposed outside the database" (that includes applications and any other external systems)


Internal means "not exposed outside some boundary". For most people, this boundary encompasses something larger than a single database, and this boundary can change.

UUIDs are good for creating entries concurrently where coordinating between distributed systems may be difficult.

May also be that you don't want to leak information like how many orders are being made, as could be inferred from a `/fetch_order?id=123` API with sequential IDs.

Sequential primary keys are still commonly used though - it's a scenario-dependent trade-off.
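A rough illustration of the `/fetch_order?id=123` leak, assuming IDs come from a plain incrementing sequence (the function names here are hypothetical, not from any real API):

```python
import uuid


def max_orders_between(earlier_id, later_id):
    """With sequential IDs, two observed IDs bound how many rows were created in between.

    It is an upper bound because sequences can have gaps.
    """
    return later_id - earlier_id


def guess_neighbours(known_id, count=3):
    """Sequential IDs also make adjacent records trivially guessable."""
    return [known_id + offset for offset in range(1, count + 1)]


def random_order_id():
    """A random UUID carries no ordering and no hint about neighbouring records."""
    return str(uuid.uuid4())
```

Seeing order 123 on Monday and order 150 on Tuesday tells an outsider there were at most 27 orders in between; two values from `random_order_id()` tell them nothing of the sort.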


If you expose the identifier outside the database, it is no longer "internal".

Given the chain was:

> > Using a random UUID as primary key does not mean users have to memorize that UUID. [...]

> So what is such an identifier for? [...] Why bother with UUID at all then for internal identifiers?

The context, that you're questioning what they're useful for if not for use by the user, suggests that "internal" means the complement. That is, IDs used by your company and software, and maybe even API calls the website makes, but not anything the user has to know.

Otherwise, if "internal" was intended to mean something stricter (only used by a single non-distributed database, not accessed by any applications using the database, and never will be in the future), then my response is just that many IDs are neither internal in this sense nor intended to be memorized/saved by the user.


> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.

E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.

It's much cleaner and easier to adapt if each person gets an internal context-less identifier and you use their phone number to convert from their external ID/phone number to an internal ID. The old account still has an identifier, there's just no external identifier that translates to it. Likewise if you have to change your identifier scheme, you can have multiple external IDs that translate to the same internal ID (i.e. you can resolve both their old ID and their new ID to the same internal ID without insanity in the schema).
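A small sketch of that translation layer, again with sqlite3. The table and column names (`accounts`, `external_ids`, `kind`, `value`) are my own assumptions for the example, not anything from the article.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id TEXT PRIMARY KEY);   -- internal, context-free ID
CREATE TABLE external_ids (
    kind       TEXT NOT NULL,                  -- e.g. 'phone' or 'legacy_id'
    value      TEXT NOT NULL,
    account_id TEXT NOT NULL REFERENCES accounts (id),
    PRIMARY KEY (kind, value)
);
""")


def create_account():
    account_id = str(uuid.uuid4())
    conn.execute("INSERT INTO accounts (id) VALUES (?)", (account_id,))
    return account_id


def attach(kind, value, account_id):
    """Point an external ID at an account, re-pointing it if it was already taken."""
    conn.execute(
        "INSERT OR REPLACE INTO external_ids (kind, value, account_id) VALUES (?, ?, ?)",
        (kind, value, account_id),
    )


def resolve(kind, value):
    """Translate an external ID into the internal one, or None if unknown."""
    row = conn.execute(
        "SELECT account_id FROM external_ids WHERE kind = ? AND value = ?",
        (kind, value),
    ).fetchone()
    return row[0] if row else None
```

When a phone number moves to someone else, `attach` re-points it at the new account. The old account row is untouched and still reachable through any other external ID attached to it.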


> I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.

If the only reason you need a surrogate key is to introduce indirection in your internal database design then sequence numbers are enough. There is no need to use UUIDs.

The whole discussion is about externally visible identifiers (ie. identifiers visible to external software, potentially used as a persistent long-term reference to your data).

> E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.

Introducing surrogate keys (regardless of whether UUIDs or anything else) does not solve any problem in reality. When I come to you and say "My name is X, this is my phone number, this is my e-mail, I want my GDPR records deleted", you still need to be able to find all data that is related to me. Surrogate keys don't help here at all. You either have to be able to solve this issue in the database or you need to have an oracle (ie. a person) that must decide ad-hoc what piece of data is identified by the information I provided.

The key issue here is that you try to model identifiable "entities" in your data model, while it is much better to model "captured information".

So in your example there is no "person" identified by "phone number" but rather "at timestamp X we captured information about a person at the time named Y and using phone number Z". Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.


> So in your example there is no "person" identified by "phone number" but rather "at timestamp X we captured information about a person at the time named Y and using phone number Z". Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.

This is so needlessly complex that you contradicted yourself immediately. You claim there is no “person” identified but immediately say you have information “about a person”. The fact that you can assert that the information is about a person means that you have identified a person.

Clearly tying data to the person makes things so much easier. I feel like attempting to do what you propose is begging to mess up GDPR erasure.

> “So I got a request from a John Doe to erase all data we recorded for them. They identified themselves by mailing address and current phone number. So we deleted all data we recorded for that phone number.”

> “Did you delete data recorded for their previous phone number?”

> “Uh, what?”

The stubborn refusal to create a persistent identifier makes your job harder, not easier.


> If the only reason you need a surrogate key is to introduce indirection in your internal database design then sequence numbers are enough. There is no need to use UUIDs.

The UUID would be an example of an external key (e.g. to make it harder to crawl keys). This article mentions a few reasons why you may later decide there are better external keys.

> When I come to you and say "My name is X, this is my phone number, this is my e-mail, I want my GDPR records deleted", you still need to be able to find all data that is related to me.

How are you going to trace all those records if the requester has changed their name, phone number and email since they signed up if you don't have a surrogate key? All 3 of those are pretty routine to change. I've changed my email and phone number a few times, and if I got married my name might change as well.

> Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.

I think that spirals into way more complexity than you're thinking. You get those timestamped records about "we got info about person named Y with phone number Z", and then person Y changes their phone number. Now you're going to start getting records from person named Y with phone number A, but it's the same account. You can record "person named Y changed their phone number from Z to A", and now your queries have to be temporal (i.e. know when that person had what phone number). You could back-update all the records to change Z to A, but that breaks some things (e.g. SMS logs will show that you sent a text to a number that you didn't send it to).

Worse yet, neither names nor phone numbers uniquely identify a person, so it's entirely possible to have records saying "person named Y and phone number Z" that refer to different people if a phone number transfers from a John Doe to a different person named John Doe.

I don't doubt you could do it, but I can't imagine it being worth it. I can't imagine a way to do it that doesn't either a) break records by backdating information that wasn't true back then, or b) require repeated/recursive querying that will hammer the DB (e.g. if someone has had 5 phone numbers, how do you get all the numbers they've had without pulling the latest one to find the last change, and then the one before that, and etc). Those queries are incredibly simple with surrogate keys: "SELECT * FROM phone_number_changes WHERE user_id = blah".


> The UUID would be an example of an external key (for e.g. preventing crawling keys being easy). This article mentions a few reasons why you may later decide there are better external keys.

So we are talking about "external" keys (ie. visible outside the database). We are back to square one: externally visible surrogate keys are problematic because they are detached from real world information they are supposed to identify and hence don't really identify anything (see my example about GDPR).

It does not matter if they are random or not.

> How are you going to trace all those records if the requester has changed their name, phone number and email since they signed up if you don't have a surrogate key?

And how does surrogate key help? I don't know the surrogate key that identifies my records in your database. Even if you use them internally it is an implementation detail.

If you keep information about the time information was captured, you can at least ask me "what was your phone number last time we've interacted and when was it?"

> I think that spirals into way more complexity than you're thinking.

This complexity is there whether you want it or not and you're not going to eliminate it with surrogate keys. It has to be explicitly taken care of.

DBMSes provide means to tackle this essential complexity: bi-temporal extensions, views, materialized views etc.

Event sourcing is a somewhat convoluted way to attack this problem as well.

> Those queries are incredibly simple with surrogate keys: "SELECT * FROM phone_number_changes WHERE user_id = blah".

Sure, but those queries are useless if you just don't know user_id.


> externally visible surrogate keys are problematic because they are detached from real world information they are supposed to identify and hence don't really identify anything (see my example about GDPR).

All IDs are detached from the real world. That’s the core premise of an ID. It’s a bit of information that is unique to someone or something, but it is not that person or thing.

Your phone number is a random number that the phone company points to your phone. Your house has a street name and number that someone decided to assign to it. Your email is an arbitrary label that is used to route mail to some server. Your social security number is some arbitrary id the government assigned you. Even your name is an arbitrary label that your parents assigned to you.

Fundamentally your notion that there is some “real world” identifier is not true. No identifiers are real. They are all abstractions and the question is not whether the “real” identifier is better than a “fake” one, but whether an existing identifier is better than one you create for your system.

I would argue that in most cases, creating your own ID is going to save you headaches in the long term. If you bake SSN or Email or Phone Number throughout your system, you will make it a pain for yourself when inevitably someone needs to change their ID and you have cascading updates needed throughout your entire system.


In my country, citizens have an "ID" (a UUID, which most people don't know the value of!) and a social security number which they do know (and which has all the problems described above). While the social security number may indeed change (doubly assigned numbers, gender reassignment, etc.), the ID needn't change, since it's the same physical person.

Public sector it-systems may use the ID and rely on it not changing.

Private sector it-systems can't look up people by their ID, but only use the social security number for comparisons and lookups, e.g. for wiping records in GDPR "right to be forgotten" situations. Social security numbers are sort of useful for that purpose because they are printed on passports, driver's licenses and the like. And they are a problem w.r.t. identity theft, and shouldn't ever be used as an authenticator (we have better methods for that). The person ID isn't useful for identity theft, since it's only used between authorized contexts (disregarding Byzantine scenarios with rogue public-sector actors!). You can't social engineer your way to personal data using that ID (save for a few movie-plot scenarios).

So what is internal in this case? The person id is indeed internal to the public sector's it-systems, and useful for tracking information between agencies. They're not useful for Bob or Alice. (They ARE useful for Eve, or other malicious inside actors, but that's a different story, which realistically does require a much higher level of digital maturity across the entire society)


> It does not matter if they are random or not.

Again, sometimes it does; the article lists a few reasons: making it harder to scrape, unifying across databases that share a keyspace, etc.

> And how does surrogate key help? I don't know the surrogate key that identifies my records in your database. Even if you use them internally it is an implementation detail.

That surrogate key is linked to literally every other record in the database I have for you. There are near infinite ways for me to convert something you know to that surrogate key. Give me a transaction ID, give me a phone number/email and the rough date you signed up, hell give me your IP address and I can probably work back to a user ID from auth logs.

The point isn't that you know the surrogate key, it's that _everything_ is linked to that surrogate key so if you can give me literally any info you know I can work back to the internal ID.

> This complexity is there whether you want it or not and you're not going to eliminate it with surrogate keys. It has to be explicitly taken care of.

Okay, then let's do an exercise here. A user gives you a transaction ID, and you have to tell them the date they signed up and the date you first billed them. I think yours is going to be way more complicated.

Mine is just something like:

SELECT user_id FROM transactions WHERE transaction_id=X; SELECT transaction_date FROM transactions WHERE user_id=Y ORDER BY transaction_date ASC LIMIT 1; SELECT signup_date FROM users WHERE user_id=Y;

Could be a single query, but you get the idea.
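For what it's worth, the single-query version might look roughly like this. It's a sketch against sqlite3 with a schema I'm assuming from the column names above (`users`, `transactions`, dates stored as ISO strings so `MIN` orders them correctly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id     INTEGER PRIMARY KEY,
    signup_date TEXT NOT NULL
);
CREATE TABLE transactions (
    transaction_id   INTEGER PRIMARY KEY,
    user_id          INTEGER NOT NULL REFERENCES users (user_id),
    transaction_date TEXT NOT NULL
);
""")


def signup_and_first_billing(transaction_id):
    """Return (signup_date, first_transaction_date) for the owner of a transaction."""
    return conn.execute(
        """
        SELECT u.signup_date, MIN(all_t.transaction_date)
        FROM transactions AS t
        JOIN users AS u ON u.user_id = t.user_id                 -- who owns this transaction
        JOIN transactions AS all_t ON all_t.user_id = t.user_id  -- all of that user's transactions
        WHERE t.transaction_id = ?
        GROUP BY u.user_id, u.signup_date
        """,
        (transaction_id,),
    ).fetchone()
```

Every join goes through `user_id`, which is the whole argument: the user never needs to know that key, they just hand over any transaction ID and the rest follows.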

> DBMSes provide means to tackle this essential complexity: bi-temporal extensions, views, materialized views etc.

This kind of proves my point. If you need bi-temporal extensions and materialized views to tell a user what their email address is from a transaction ID, I cannot imagine the absolute mountain of SQL it takes to do something more complicated like calculating revenue per user.


I am not sure whether you are arguing against my claims or not :)

I am not arguing against surrogate keys in general. They are obviously very useful _internally_ to introduce a level of indirection. But if they are used _internally_ then it doesn't really matter if they are UUIDs or sequence numbers or whatever - it is just an implementation detail.

What I claim is that surrogate keys are problematic as _externally visible_ identifiers.

> Okay, then let's do an exercise here. A user gives you a transaction ID, and you have to tell them the date they signed up and the date you first billed them. I think yours is going to be way more complicated.

> Mine is just something like:

> SELECT user_id FROM transactions WHERE transaction_id=X; SELECT transaction_date FROM transactions WHERE user_id=Y ORDER BY transaction_date ASC LIMIT 1; SELECT signup_date FROM users WHERE user_id=Y;

I think you are missing the actual problem I am talking about: where does the user take the transaction ID from? Do you expect the users to remember all transaction IDs your system ever generated for them? How would they know which transaction ID to ask about? Are they expected to keep some metadata that would allow them to identify transaction IDs? But if there is metadata that enables identification of transaction IDs then why not use it instead of transaction ID in the first place?


> I think you are missing the actual problem I am talking about: where does the user take the transaction ID from? Do you expect the users to remember all transaction IDs your system ever generated for them? How would they know which transaction ID to ask about? Are they expected to keep some metadata that would allow them to identify transaction IDs? But if there is metadata that enables identification of transaction IDs then why not use it instead of transaction ID in the first place?

Your notion that you can avoid sharing internal ids is technically true, but that doesn’t mean it’s a good idea. You’re trying to force a philosophical viewpoint and disregarding practical concerns, many of which people have already pointed out.

But to answer your question, yes, your customer will probably have some notion of a transaction id. This is why everyone gives you invoice numbers or order numbers. These are indexes back into some system. Because the alternative is that your customer calls you up and says “so I bought this thing last week, maybe on Tuesday?” And it’s most likely possible to eventually find the transaction this way, but it’s a pain and usually requires human investigation to find the right transaction. It’s wasteful for you and the customer to do business this way if you don’t have to.


> Your notion that you can avoid sharing internal ids is technically true, but that doesn’t mean it’s a good idea. You’re trying to force a philosophical viewpoint and disregarding practical concerns, many of which people have already pointed out.

What some call "philosophical viewpoint" I call "essential complexity" :)

> But to answer your question, yes, your customer will probably have some notion of a transaction id. This is why everyone gives you invoice numbers or order numbers.

We are in agreement here: externally visible identifiers are needed for many reasons (mostly technical). The discussion is not about that though but about what information should be included in these identifiers.

> This is why everyone gives you invoice numbers or order numbers.

And there are good reasons why invoice or order numbers are not randomly generated strings but contain information about the invoices and orders they identify.

My claim is that externally visible identifiers should possess a few characteristics:

* should be based on the data they identify (not detached from it)

* should be easy to remember (and that means they should be as short as possible, they should be easy to construct by a human from the data itself - so they cannot be hashes of data)

* should be versioned (ie. they should contain information somehow identifying the actual algorithm used to construct them)

* should be easy to index by database engines (that is highly db implementation dependent unfortunately)

* can be meaningfully sortable (that is not strictly a requirement but nice to have)

Coming up with an identifier having these characteristics is not trivial but is going to pay off in the long run (ie. is essential complexity).


Much of this is not essential complexity, but accidental complexity.

* Based on the data they identify - This is a minefield of accidental complexity. Data changes and needs to be redacted for GDPR and other data laws. What do you do when someone demands you delete all personally identifiable data but you’ve burned it into invoice ids that you need to retain for other legal reasons? This is also begging for collisions and very much at odds with making IDs short.

* easy to remember - This is a nice to have. Short is convenient for sharing on the phone. Memorable doesn’t matter much. I don’t remember any invoice number I’ve ever received.

* versioned - Versioning is only interesting because you’re trying to derive from real data. Again, accidental complexity.

* easy to index - Sure.

* sortable - Nice to have at best.


> * Based on the data they identify

> * easy to remember

(which means human readable and related to the actual information which makes them easier to remember)

These actually are the most important features.

Example: transaction references not related to the actual subject of the transaction (ie. what is being paid for) are an enabler for MITM scam schemes.

> Short is convenient

Nah. Short is crucial for identifiers to be effective for computers to handle (memory and CPU efficiency). Otherwise we wouldn't need any identifiers and would just pass raw data around.

> * versioned - Versioning is only interesting because you’re trying to derive from real data.

Nah. Even UUID formats contain version information.
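You can check this with Python's standard `uuid` module. The `version_nibble` helper below is just an illustration I wrote to show where the bits sit; it isn't part of the module.

```python
import uuid

random_id = uuid.uuid4()
name_based_id = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")


def version_nibble(u):
    """Read the 4-bit version field directly (bits 76..79 of the 128-bit value)."""
    return (u.int >> 76) & 0xF


# The same field is visible in the text form: it is the first hex digit
# of the third group, e.g. xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx for v4.
version_char = str(random_id)[14]
```

Here `random_id.version` is 4 and `name_based_id.version` is 5, so a reader of the ID can tell which construction algorithm produced it without any outside context.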

> * easy to index - Sure.

> * sortable - Nice to have at best.

These are directly related (and in the context of UUIDv4 vs UUIDv7 discussion sortable is not enough - we also want them to be "close" to each other when generating so that they can be indexed efficiently)
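To show what "close to each other" means, here is a hand-rolled sketch of a UUIDv7-style value following the layout I understand from RFC 9562: a 48-bit Unix millisecond timestamp up front, then version and variant bits, with the rest random. The name `uuid7_sketch` is mine; this is an illustration of the bit layout, not a replacement for a library implementation.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUIDv7-like value: timestamp in the high bits, randomness in the low bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits
    rand_a = rand >> 68                            # top 12 of them
    rand_b = rand & ((1 << 62) - 1)                # low 62 of them

    value = (unix_ms & ((1 << 48) - 1)) << 80      # 48-bit millisecond timestamp
    value |= 0x7 << 76                             # version field = 7
    value |= rand_a << 64
    value |= 0b10 << 62                            # RFC variant bits
    value |= rand_b
    return uuid.UUID(int=value)
```

IDs generated in different milliseconds sort by creation time, both as integers and as strings, so new rows land near each other at the right-hand edge of a B-tree index instead of at random positions as with v4. Within the same millisecond this sketch makes no ordering promise.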


> These actually are the most important features.

You keep saying that but you have provided virtually no evidence in support of this. This is why I called your claim philosophical. You are asserting this as fact and arguing from that standpoint rather than considering what is the best based on actual requirements and trade offs.

> Example: transaction references not related to the actual subject of the transaction (ie. what is being paid for) are an enabler for MITM scam schemes.

I don’t see how this is true. If anything transaction references based on the actual subject would make scamming slightly easier because a scammer can glean information from the reference.

I’m going to stop here, though. I don’t see that this is going to converge on any shared agreement.

Take care. And if you celebrate the holidays, happy holidays, too.


> I don’t see how this is true.

There is a Bitcoin seller B, a thief T and a victim V.

T proposes to buy Bitcoin from B. T offers a new iPhone for a very low price to unsuspecting V. V agrees to buy it. B gives T account details and transaction reference so that T can transfer money to B's account. T gives these details to V. V transfers the money. B transfers Bitcoin to T. T disappears.

If only transaction reference contained information that the transfer is about buying Bitcoin, V would have never paid the money.

The scheme was quite common in the UK because banks did not like Bitcoin, so Bitcoin sellers and buyers avoided referencing it in bank transfers.


You’re arguing that in this circumstance, the bitcoin seller should produce an ID for the payer that exposes the purchase contents.

Firstly, I am extremely doubtful that this would actually prevent the issue. A wary buyer would not agree to transfer money account to account like this to pay for a cell phone in the first place. Only gullible people would engage in this scam, and I am doubtful that they would question the transaction ID deeply. “Hey what is this bitcoin thing?” “Oh, don’t worry about it. That’s just internal for our tracking purposes. Do you want me to throw in a free phone case too?!”

Secondly, this seems like a massive privacy concern. Is someone purchasing sex toys supposed to use a transaction ID like purpledildo656 and expose what they are buying to the bank?

I’m sympathetic to people who get scammed, but I don’t think your transaction IDs solve this problem. People have been getting scammed like this forever. “Hey, just send the $600 via Western Union and I’ll totally put your phone in the mail for you tomorrow.”

This isn’t an ID anyway. What you are really asking for is to mandate that the contents of the purchase be burned into the transaction from the seller all the way to the buyer through the bank. I think that’s a terrible idea because of privacy concerns, but regardless, it’s not an ID. This would be much better expressed as a different form of metadata.


> Stripping information from an identifier disconnects a piece of data from the real world which means we no longer can match them. But such connection is the sole purpose of keeping the data in the first place.

The surrogate key's purpose isn't to directly store the natural key's information, rather, it's to provide an index to it.

> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

There isn't 'another' - there's just one. The surrogate key. The other pieces of information you're describing are not the means of indexing the data. They are the pieces of data you wish to retrieve.


Any piece of information that can be used to retrieve something using this index has to be available "outside" your database - ie. to issue a query "give me piece of information identified by X" you have to know X first. If X is only available in your index then you must have another index to retrieve X based on some externally available piece of information Y. And then X becomes useless as an identifier - it just adds a level of indirection that does not solve any information retrieval problem.

That's my whole point: either X becomes a "real world artifact" or it is useless as identifier.


That's not really how data is requested. Most of these identifiers are foreign keys - they exist in a larger object graph. Most systems of records are too large for people to associate surrogate keys to anything meaningful - they can easily have hundreds of billions of records.

Rather, users traverse that object graph, narrowing a range of keys of interest.

This hacker news article was given a surrogate key, 46272487. From that, you can determine what it links to, the name/date/author of the submission, comments, etc.

46272487 means absolutely nothing to anybody involved. But if you wanted to see submissions from user pil0u, or submissions on 2025-12-15, or submissions pertaining to UUID, 46272487 would be in that result set. Once 46272487 joins out to all of its other tables, you can populate a list that includes their user name, title, domain, etc.

Do not encode identifying information in unique identifiers! The entire world of software is built on surrogate keys and they work wonderfully.


> This hacker news article was given a surrogate key, 46272487. From that, you can determine what it links to, the name/date/author of the submission, comments, etc.

> Do not encode identifying information in unique identifiers! The entire world of software is built on surrogate keys and they work wonderfully.

The amount of manual work required to manage duplicates is in no small part the result of not thinking enough about the identifiers and simply slapping surrogate keys on the data.


An identifier is just "a common token that systems can use to operate on the same entity".

You need it, because it may be the one unchangeable thing. Taking a person as an example:

* date of birth can be changed, if there was an error and a correction in documents

* nearly all existing physical characteristics can change over time, either due to brain things (deciding to change gender), aging, or accidents (fingerprints no longer apply if you burn your skin enough)

* DNA might be good enough, but that's one fucking long identifier to share and one that's hard to validate in the field.

So a unique ID attached to a few other parts to identify the current iteration of an individual is the best we have, and the best we will get.


You can't take into account the fact that things change when you don't know what those changes might be. You might end up needing to either rebuild a new database, have some painful migration, or support two codepaths to work with both types of keys.

Network protocol designers know better and by default embed a protocol version number in the message format spec.

I guess you can assign 3-4 bits for identifier version number as well.

And yes - for long living data dealing with compatibility issues is inevitable so you have to take that into account from the very beginning.
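As a rough sketch of the "3-4 bits for a version number" idea, assuming a 64-bit integer identifier where the top 4 bits carry the scheme version and the remaining 60 bits carry the payload (the split and the names are arbitrary choices for the example):

```python
VERSION_BITS = 4
PAYLOAD_BITS = 60


def pack_id(version, payload):
    """Combine a scheme version and a payload into one 64-bit identifier."""
    if not 0 <= version < (1 << VERSION_BITS):
        raise ValueError("version does not fit in the version field")
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload does not fit in the payload field")
    return (version << PAYLOAD_BITS) | payload


def unpack_id(identifier):
    """Split an identifier back into (version, payload)."""
    return identifier >> PAYLOAD_BITS, identifier & ((1 << PAYLOAD_BITS) - 1)
```

Code that reads an ID can branch on the version from `unpack_id` before interpreting the payload, which is what lets an old and a new scheme coexist in the same column.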


When I designed network protocols this is exactly what I did. I also did so in file formats I had to create. But a database primary key is not somewhere where that can be easily done.

You can’t design something by trying to anticipate all future changes. Things will change and break.

In my personal design sense, I have found that keeping away from generality actually helps my code last longer (since it is based on more concrete ideas) and makes it easier to change when those days come.


In my experience, virtually every time I bake concrete data into identifiers I end up regretting it. This isn’t a case of trying to predict all possible future changes. It’s a case of trying to not repeat the exact same mistake again.

I don’t disagree with that, I’m disagreeing with this comment that we can’t make protocol or data decisions that might change.

I misunderstood then. I interpreted your comment to say that you eschew generalization (e.g. uuids) in favor of concrete data (e.g. names, email addresses) for ids in your designs.


