About a month ago I gave a talk at a software development conference and made a serious mistake when answering one of the follow-up questions. The question was: “What problems do you see with client-generated IDs?”
To understand the entire context of this question and see my (wrong) answer, you can watch the full talk here. In short: I mentioned a post by Dan Lew where he discussed the challenges involved in managing two sets of identifiers (client-side and server-side) and swapping between them. I stated that this is an intrinsic challenge of fully offline-capable applications, but one of the attendees disagreed and asked me to clarify my thoughts on this subject. And then I gave a completely wrong answer.
Now, I don’t mind being wrong. In fact, I like being wrong because then I can learn something new. Usually it’s enough to just be open-minded and ready to admit that you were wrong, but in this case my mistake was captured on video and can potentially mislead others. Therefore, I’d like to go the extra mile and try to compensate for that with this article.
So, in this post I’m going to go as deep as I can manage into the world of identifiers, unique identifiers and the trade-offs between client-generated and server-generated identifiers.
The Importance of Identifiers
The concept of an identifier is probably one of the oldest and most important abstractions ever invented by humankind. This abstraction allows you to say “John is a great guy” and be understood, even if John isn’t around and you can’t point your finger at him. “John” is an identifier assigned to a specific person and you can use this identifier to praise John.
In software, identifiers are paramount. Even pointers, which at first sight seem to be analogous to finger pointing, are identifiers. And that’s only logical, because identifiers pervade the entire technology stack, all the way down to the hardware level, where identifiers designate registers and memory locations.
Therefore, identifiers are at least as important in software as our names are in the real world.
The Importance of Unique Identifiers
The issue with identifiers is that they can be tricky to interpret.
For example, there might be several Johns around, and the person you talk to won’t understand which one of them you refer to when you say “John”. This might not be that important if you just want to praise John. However, this complication might become crucial if, for example, you want to accuse John of a crime. In the latter case, it’s important to make sure that all involved parties agree on the meaning of the identifier “John”, or else serious harm can be done.
When there is a need to minimize the chance of mistakes in identification, you generally add more information and identifiers to narrow down the domain of possible interpretations. When the stakes are really high, you might even be required to meet the person, point a finger at them and explicitly state that that’s the “John” you refer to.
However, such a process of precise identification is very cumbersome, error-prone and doesn’t scale.
Furthermore, John can move to a different location, change his name, his appearance and even his gender, but we, as a society, still want to be able to track that person over the course of their entire life. And I’m not talking about some conspiratorial tracking, but about the mundane mechanics of our system. For example, if John was a careless driver in the past and lost his license, we want to make sure that he won’t be able to get a new one any time soon.
That’s when a need for a unique identifier arises. We want to have an identifier that unambiguously refers to one specific John. Such an identifier won’t be subject to multiple interpretations and any party that reads it will attribute it to the same person.
As you probably know, practically all the information is stored in computers nowadays. Therefore, the need for unique identifiers translates naturally into the software world. Whenever the identity of a piece of data should be derived from its history and not from its current properties, that piece of data must have a unique identifier.
What’s the most important property of unique identifiers? Well, it’s the fact that they are unique, because we build our systems under the assumption that we can unambiguously resolve each specific identifier to a single piece of data. Violate this assumption and all bets about the functionality and reliability of software systems are off.
Therefore, generation of identifiers is a very delicate and important process that requires great care and attention to detail. That’s what we are going to discuss in the rest of this post.
In the following sections, whenever I say identifier or ID I refer to unique identifiers.
Now, identifiers don’t appear out of thin air, but need to be generated somewhere. Most commonly, they are generated on applications’ backend servers. These identifiers are called “server-generated” IDs.
Advantages of Server-Generated IDs
The biggest advantage in generating IDs on the server is that it’s relatively simple to ensure that these IDs will be unique.
In simple cases, a single physical server can generate all IDs in the application. In more complex systems, multiple servers might need to handle this task, but, since all of them are under your full control, you can be relatively confident that you know what’s going on and there will be no surprises.
I say “relatively” because server-side generation of IDs can still be a very complex task on its own. It’s only when you compare it with other options that it becomes “relatively” simple.
The concept of Universally Unique Identifier (UUID) comes in very handy when multiple servers need to generate IDs.
UUIDs are designed to be unique not just in the scope of one system, but universally. Strictly speaking, this uniqueness is probabilistic rather than guaranteed, but for properly generated UUIDs the chance of a collision is so small that it can be ignored in practice. In other words, you can be confident that if you generate a proper UUID, then, irrespective of all other details, this identifier won’t be used anywhere else in the world. This property allows you to generate identifiers on multiple servers even if they’re completely independent.
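To make this more concrete, here is a minimal sketch in Python using the standard library’s uuid module. The two “servers” are just two independent sets in one process and the helper name new_id is made up for illustration; the point is only that neither side needs to coordinate with the other.

```python
import uuid


def new_id() -> str:
    """Return a random (version 4) UUID as a 36-character string."""
    return str(uuid.uuid4())


# Simulate two independent servers generating IDs without any coordination.
server_a_ids = {new_id() for _ in range(10_000)}
server_b_ids = {new_id() for _ in range(10_000)}

# Nothing was synchronized, yet an overlap is astronomically unlikely.
overlap = server_a_ids & server_b_ids
```

A version 4 UUID carries 122 random bits, which is why the overlap set is expected to stay empty even though nothing enforces it.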
Downsides of Server-Generated IDs
The main disadvantage of server-generated IDs is that they are server-generated. I know that it sounds a bit silly, so let me explain.
Imagine that you have a client application that communicates with the server over the internet. As long as the internet connection is live, the client can communicate all of the user’s actions to the server immediately. This way the server has an opportunity to generate identifiers for the new data (if required) and return them to the client. The client then uses the new data with these identifiers and everything is good.
However, connectivity can be lost. There are many reasons for that, but the details aren’t important in the context of our discussion here. So, what happens when the client loses internet connection?
Well, one option would be to notify the user about the problem and disallow any operations on the client. That’s the way most websites operate. However, this approach results in a suboptimal user experience, and not all platforms are as “forgiving” as web browsers.
For example, if I write an email on my mobile phone and tap the send button, the last thing I want to see is an error notification. I expect my email client to cache the email and deliver it once connectivity is restored. This kind of offline work is basic functionality expected from many applications on mobile devices today.
But the above use case would be problematic to support because I want to create a new piece of data (email) that must have an identifier, but I can’t reach out to the server to get one. That’s not good.
Looks like there are cases where server-side generation of IDs is simply insufficient.
Client-Generated IDs
So, what do I do when I need to generate new IDs, but can’t communicate with the server? I generate these IDs on the client. These are called “client-generated” IDs.
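As a rough illustration of the email example, here is a sketch of a client that assigns the ID at creation time and queues the message while offline. The Email and Outbox classes, the online flag and the deliver callback are all hypothetical names invented for this sketch, not part of any real email client.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Email:
    to: str
    body: str
    # The ID is assigned on the client at creation time; no server round trip.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Outbox:
    """Hypothetical client-side queue for emails written while offline."""

    def __init__(self):
        self.pending = []

    def send(self, email, online, deliver):
        # Deliver right away when connected, otherwise keep the email locally.
        if online:
            deliver(email)
        else:
            self.pending.append(email)

    def flush(self, deliver):
        """Deliver everything queued, in order, once connectivity is back."""
        while self.pending:
            deliver(self.pending.pop(0))


# Usage: the email already has its identifier before it ever reaches a server.
delivered = []
outbox = Outbox()
draft = Email(to="john@example.com", body="Hi John")
outbox.send(draft, online=False, deliver=delivered.append)
outbox.flush(deliver=delivered.append)
```

The email keeps the same ID from the moment it is written until it is delivered, so nothing has to be renamed after the fact.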
Advantages of Client-Generated IDs
The main advantage of client-generated IDs is that you don’t need the server to generate them.
Downsides of Using Client-Generated IDs
Now, finally, I get to the original question that motivated this post: “What problems do you see with client-generated IDs?”
Originally I said that it’s not about the risk of making the identifiers non-unique, but about functional concerns. I was wrong. From a purely functional point of view, there is no issue with client-generated IDs. You can even build applications where almost all IDs will be client-generated and these applications will work.
However, after giving this concept some serious thought, I believe that the real question we should ask is this: even assuming that all IDs are UUIDs, can you really trust that client-generated IDs will be unique? My initial feeling was that you can, but now I see some nuances.
First, you should probably take into account that there might be bugs that will lead to non-unique IDs being sent from the clients to the server. Sure, such bugs can happen on the server as well, but each additional client type that you add increases this risk.
Second, you should also take into account security risks. If malicious actors manage to feed non-unique IDs into your system, it can lead to serious trouble.
Now, there are approaches to mitigate these risks. For example, you can prepend server-generated “client IDs” to client-generated IDs. The resulting ID will look like this: [client-ID][UUID]. This way, the server effectively reserves a “range” of IDs for each client and one client can’t interfere with IDs generated by other clients. But that’s just a mitigation strategy which doesn’t address the root of the problem.
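Here is one way the [client-ID][UUID] idea could look in code. The “:” separator, the function names and the assumption that server-issued client IDs never contain “:” are my own choices for this sketch; the scheme itself doesn’t prescribe any of them.

```python
import uuid


def make_composite_id(client_id: str) -> str:
    """Client side: build an ID of the form [client-ID][UUID], joined with ':'."""
    return f"{client_id}:{uuid.uuid4()}"


def belongs_to(composite_id: str, client_id: str) -> bool:
    """Server side: is this ID inside the range reserved for this client?"""
    prefix, _, rest = composite_id.partition(":")
    # Compare the whole prefix so that "c4" can never pass for "c42".
    return prefix == client_id and bool(rest)
```

The server would call belongs_to with the client ID it derived from the authenticated session, not with anything the client claims about itself. That keeps clients out of each other’s ranges, but a single client can still send the same ID twice.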
As long as the server persists client-generated IDs, the only way to ensure that there will be no duplicates in the system is to validate the uniqueness of all new IDs received from clients. It means that each time new data gets created, the server will compare its ID to all other IDs in the system. This might be a problem.
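A minimal sketch of such validation, assuming the server stores records in a relational database and lets a primary key constraint do the comparison. I’m using an in-memory SQLite database from Python’s standard library here; the table name items and the store function are hypothetical.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id TEXT PRIMARY KEY, payload TEXT)")


def store(item_id: str, payload: str) -> bool:
    """Insert a record; return False if the ID is already taken."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO items (id, payload) VALUES (?, ?)",
                (item_id, payload),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key constraint rejected a duplicate ID.
        return False


# Usage: the second write with the same client-generated ID is refused.
client_id = str(uuid.uuid4())
first = store(client_id, "original")
second = store(client_id, "duplicate")
```

Note that the database doesn’t literally scan all existing IDs: the constraint is backed by an index, so each check is an index lookup. Whether that remains cheap enough at scale is exactly the open question.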
Because UUIDs are relatively large and, in their random variants, unordered, indexing them and checking their uniqueness might introduce performance issues even in a single database. If the database becomes distributed, potentially across geographical regions, I’d expect the performance penalty to become a limiting factor.
So, the potential problems with client-generated IDs are: reliability, security and performance.
Unfortunately, I couldn’t arrive at a single conclusion.
So far, I’ve always used server-generated IDs in my applications. Whenever that was insufficient, we used client-generated IDs and then implemented a sophisticated scheme for exchanging these temporary IDs for server-generated permanent ones. This required a lot of work and thought, but it felt safe and reliable.
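To give a feel for what such an exchange involves, here is a heavily simplified sketch. It is not the scheme we actually built; server_create, LocalStore and the “tmp-” prefix are all made up for illustration, and the server is replaced by a local counter.

```python
import itertools
import uuid

_server_counter = itertools.count(1)  # stand-in for a server-side sequence


def server_create(temp_id: str) -> dict:
    """Hypothetical server endpoint: echo the temp ID, assign a permanent one."""
    return {"temp_id": temp_id, "id": next(_server_counter)}


class LocalStore:
    """Hypothetical client store keyed by whichever ID a record currently has."""

    def __init__(self):
        self.records = {}

    def create_offline(self, payload: str) -> str:
        # Temporary IDs are marked with a prefix so they can be found later.
        temp_id = f"tmp-{uuid.uuid4()}"
        self.records[temp_id] = payload
        return temp_id

    def sync(self) -> dict:
        """Swap every temporary ID for a server-generated one; return the mapping."""
        mapping = {}
        temp_ids = [k for k in self.records if str(k).startswith("tmp-")]
        for temp_id in temp_ids:
            permanent_id = server_create(temp_id)["id"]
            self.records[permanent_id] = self.records.pop(temp_id)
            mapping[temp_id] = permanent_id
        return mapping
```

Even in this toy version you can see where the work comes from: every other place that stored the temporary ID (references between records, pending requests, UI state) has to be rewritten using the returned mapping.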
Now I know that going all the way with client-generated IDs is also an option. However, I don’t know for sure what trade-offs are involved in this decision. Using client-generated IDs looks like a simple and universally applicable solution, but I’ve never seen it implemented in practice. So, either there is a huge global over-engineering going on, or client-generated IDs have some non-obvious drawbacks.
I listed some of the points I’d consider before adopting client-generated IDs, but I’m not claiming that they are deal breakers or even real. For instance, I’m not sure that it’s really necessary to ensure the uniqueness of IDs on the server if you go with client-generated IDs. I don’t know whether injection of malicious IDs is a viable attack vector in this case. I don’t know whether the performance issue is real, or whether that’s just Knuth’s premature optimization all over again.
In some sense, I’m left with more questions than I started my research with.
Hopefully, developers with more experience in this area will read this post and help us get to the bottom of it in the comments section.