Collision Risk in Hash-Based Surrogate Keys

One method for generating Surrogate Keys in databases (particularly in Data Warehouses and Lakehouses) relies on a hash function to compute hash keys from Natural Keys. This method has many advantages, but it also comes with a significant risk: hash functions do not guarantee unique outputs, which leads to the possibility of hash key collisions.
A hash collision occurs when two different input values passed through a hash function return the same output value (i.e., the same hash). The probability of such an event largely depends on the length of the hash produced by the specific hash function used: the longer the hash, the lower the risk of collision.
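You can see a collision with your own eyes by brute-forcing one on a deliberately truncated hash. A minimal Python sketch (the 6-hex-character truncation is purely for demonstration; real Surrogate Keys use the full hash, where this loop would run for eons):

```python
import hashlib

def tiny_hash(value: str) -> str:
    """First 6 hex characters of an MD5 hash: a toy 24-bit hash."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:6]

seen = {}
for i in range(10**7):
    key = f"customer-{i}"
    h = tiny_hash(key)
    if h in seen:  # by the birthday bound, ~2^12 = 4,096 tries give a ~50% chance
        print(f"Collision: {seen[h]!r} and {key!r} both hash to {h!r}")
        break
    seen[h] = key
```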
Hash functions
The three most popular hashing functions used nowadays are:
- MD5 (Message Digest Algorithm 5) – Developed by Ronald Rivest in 1991, it is a widely known hashing function that produces a 128-bit (16-byte) hash. Originally designed for data integrity and authentication, MD5 quickly became popular due to its simplicity and speed.
- SHA-1 (Secure Hash Algorithm 1) – Created by the National Security Agency (NSA) and published by the National Institute of Standards and Technology (NIST) in 1995 for use with the Digital Signature Algorithm (DSA). It generates a 160-bit (20-byte) hash, which offers better collision resistance than MD5.
- SHA-256 (Secure Hash Algorithm 256-bit) – Also developed by the NSA and published in 2001. It produces a 256-bit (32-byte) hash, offering significantly improved collision resistance and security over SHA-1. SHA-256 is commonly used in cryptographic applications, digital certificates, and data integrity verification. It remains a popular and trusted hashing algorithm due to its robust security properties and widespread support across cryptographic libraries and platforms.
All of them are natively supported by the most popular database engines, so they are good candidates for computing hash-based Surrogate Keys.
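For illustration, here is a minimal Python sketch of how a hash-based Surrogate Key could be derived from a Natural Key using the standard hashlib module; the delimiter and column values are illustrative assumptions, not a prescription:

```python
import hashlib

def surrogate_key(natural_key_parts: list, algorithm: str = "md5") -> str:
    """Compute a hex-encoded, hash-based Surrogate Key from Natural Key parts.

    The delimiter guards against ambiguous concatenations,
    e.g. ("ab", "c") vs ("a", "bc").
    """
    raw = "||".join(str(part) for part in natural_key_parts)
    return hashlib.new(algorithm, raw.encode("utf-8")).hexdigest()

print(surrogate_key(["CUST-001", "2024-01-31"]))            # MD5: 32 hex chars
print(surrogate_key(["CUST-001", "2024-01-31"], "sha256"))  # SHA-256: 64 hex chars
```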
The table below summarizes the number of unique values these three functions can generate and the corresponding theoretical probability of a hash collision when hashing two random values.
| Hash function | Hash length | Unique values | Collision probability (two random values) |
|---|---|---|---|
| MD5 | 128 bits | 2^128 ≈ 3.4 × 10^38 | ≈ 2.9 × 10^-39 |
| SHA-1 | 160 bits | 2^160 ≈ 1.5 × 10^48 | ≈ 6.8 × 10^-49 |
| SHA-256 | 256 bits | 2^256 ≈ 1.2 × 10^77 | ≈ 8.6 × 10^-78 |
While the collision probability for two random values is easy to compute, it's not very useful when considering these functions for SK generation in databases. In practice, we deal with database tables storing massive numbers of these hash values as SKs, in particular as Primary Keys (which must be unique for each row in a given table). So how can we compute the collision probability in a useful way? This is where the so-called "birthday paradox" comes in handy.
Birthday paradox
The "birthday paradox" is an interesting concept in probability theory that explores the likelihood of at least two people in a group sharing the same birthday. Despite its name, it is not a true paradox but rather a surprising result that defies our intuition. Most people expect a group to be very large for shared birthdays to be likely. In reality, with just 23 people in a room, there is a greater than 50% chance that two people will share the same birthday.
If you want to learn more about the paradox, here's a good read: Understanding the Birthday Paradox – BetterExplained
For the sake of this article, what matters is the formula the paradox gives us. The probability of a hash collision in a table with n rows, each with a unique Natural Key (from which the hash is computed), using a hash function that returns b-bit hashes, is:
$$p(n) = 1 - \prod_{k=1}^{n-1}\left(1 - \frac{k}{2^{b}}\right)$$
For large numbers of possible hash values (which, with hashes 128 bits and longer, is definitely our case), the exact formula can be simplified to this form:
$$p(n) \approx 1 - e^{-\frac{n(n-1)}{2^{b+1}}} \approx \frac{n^{2}}{2^{b+1}}$$
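Translated into Python (a sketch; math.expm1 keeps the result accurate when the probability is tiny):

```python
import math

def collision_probability(n: int, bits: int) -> float:
    """Birthday-paradox approximation of the probability of at least one
    collision among n uniformly distributed `bits`-bit hashes."""
    # -expm1(x) computes 1 - e^x without losing precision for tiny results.
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))

# Sanity check against the classic birthday setting (365 "hash values"):
print(1 - math.exp(-23 * 22 / (2 * 365)))  # ~0.500 for 23 people
print(collision_probability(10**9, 128))   # ~1.5e-21 for a billion MD5 keys
```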
Collision probability for Surrogate Keys
Having the math formula, we can calculate the risk (i.e., probability) of hash collisions for different hash functions (generating different lengths of hash keys) and different table sizes.
Using this formula, we can compute the probability of an SK hash collision for the MD5, SHA-1, and SHA-256 functions when inserting the n-th record into a table. Displayed with a precision of 12 fractional digits, these probabilities remain at zero until the table holds tens of trillions of rows (for MD5; vastly more for SHA-1 and SHA-256).

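You can reproduce such numbers with the function above; the row counts below are illustrative choices, not canonical ones:

```python
import math

def p_collision(n: int, bits: int) -> float:
    # Same birthday-paradox approximation as in the previous sketch.
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))

for n in (10**9, 10**12, 10**15):
    cells = [f"{p_collision(n, bits):.12f}" for bits in (128, 160, 256)]
    print(f"n = {n:>18,} | MD5 {cells[0]} | SHA-1 {cells[1]} | SHA-256 {cells[2]}")
```

Even at a quadrillion rows, MD5's collision probability only reaches the ninth fractional digit.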
But how should we interpret these probabilities? Is a probability of 0.000000000001 an acceptable risk? How many records can you load daily without exceeding it? Let's try to visualize this using different real-world analogies.
Visualizing the SK collision risk
To build an easy-to-comprehend analogy for these very low probabilities, we should simplify our problem space, i.e., reduce the number of variables we have. Let's start with the assumption that we work with the MD5 algorithm, which generates 128-bit hashes; therefore our b (the binary length of the hash values) is 128. We will make additional assumptions further on.
EuroMillions and Powerball analogy
Have you ever played large-scale lottery games like EuroMillions or Powerball, where you can win millions of EUR or USD if you correctly predict the winning numbers? I haven't, but someone in my family regularly plays Lotto (the most popular national lottery in Poland). He always bets on his "lucky" numbers. It's been over 30 years, and he still hasn't hit the jackpot.
Let's focus on EuroMillions and Powerball. The chances of winning the jackpot are:
- EuroMillions: 1 in 139,838,160
- Powerball: 1 in 292,201,338
These are, respectively, ~7,000 and ~3,500 times more likely than a hash collision with a 1-in-a-trillion chance.
Decreasing the borderline hash collision risk 1,000,000-fold, to 1 in a quintillion (i.e., 0.000000000000000001), makes it so small that even winning the jackpot twice in EuroMillions or Powerball is more probable. And note that there's no evidence of any individual winning the jackpot twice in a regular edition of these games.
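A quick check of these claims (jackpot odds divided by the 10^-12 risk, and the squared odds of a double jackpot against 10^-18):

$$\frac{1/139{,}838{,}160}{10^{-12}} \approx 7{,}151, \qquad \frac{1/292{,}201{,}338}{10^{-12}} \approx 3{,}422, \qquad \left(\frac{1}{139{,}838{,}160}\right)^{2} \approx 5.1 \times 10^{-17} > 10^{-18}$$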
Data ingestion analogy
Let's assume you ingest a very intense volume of records into your table, and you need to make sure you will not exceed the 1-in-a-quintillion probability of a hash collision over the next 20 years. How many records can you load into your table?
The table below summarizes the results, computed using the approximate formula for the hash collision probability.
| Hash function | Max records per day (to stay below 1 in a quintillion over 20 years) | Max records per second |
|---|---|---|
| MD5 (128-bit) | ~3.6 million | ~41 |
| SHA-1 (160-bit) | ~230 billion | ~2.7 million |
| SHA-256 (256-bit) | ~6.6 × 10^25 | ~7.6 × 10^20 |
Conclusions:
- To exceed a 1 in a quintillion probability of an MD5 hash collision over a 20-year timespan, you must insert more than 3.6 million records daily (or more than 41 records per second).
- For the SHA-1 function, you would need to insert approximately 2.7 million records per second, a rate feasible for a very large database cluster but realistic in only a handful of applications.
- With the SHA-256 algorithm, the required rate is about 760 quintillion records per second, far beyond current technological capabilities.
So, if you are dealing with a data volume in the range of a few million records daily (per table) or below, you are good to go with MD5. Above that, you might consider relying on SHA-1.
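These thresholds are easy to verify by inverting the approximate formula, n ≈ √(p · 2^(b+1)); a sketch:

```python
import math

SECONDS_20Y = 20 * 365.25 * 24 * 3600  # ~6.31e8 seconds
DAYS_20Y = SECONDS_20Y / 86400

def max_rows(p: float, bits: int) -> float:
    """Largest row count that keeps collision probability below p."""
    return math.sqrt(p * 2 ** (bits + 1))

for bits, name in ((128, "MD5"), (160, "SHA-1"), (256, "SHA-256")):
    n = max_rows(1e-18, bits)
    print(f"{name:>8}: {n / DAYS_20Y:.3g} rows/day, {n / SECONDS_20Y:.3g} rows/sec")
```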
Let's take yet another perspective. Were you considering using an incremental SK with a bigint data type? If so, you were probably not afraid of reaching the data type's value limit (an enormous 9,223,372,036,854,775,807 for a signed bigint). Your potential maximum record count would never exceed 1/10,000th of that limit, right (so you have a few orders of magnitude of safety buffer)? That still leaves about 9.2 × 10^14 records. If so, when using MD5 hashes, even after reaching that maximum record count, the collision probability would be 0.00000000125 (roughly 1 in 800 million, so less likely than hitting a jackpot in EuroMillions or Powerball). Is that an acceptable risk?
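The arithmetic behind that probability:

$$n \approx \frac{9{,}223{,}372{,}036{,}854{,}775{,}807}{10{,}000} \approx 9.2 \times 10^{14}, \qquad p \approx \frac{n^{2}}{2^{129}} \approx \frac{8.5 \times 10^{29}}{6.8 \times 10^{38}} \approx 1.25 \times 10^{-9}$$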
Let the money talk
When assessing risks and their various avoidance/mitigation plans in quantitative risk analysis, it's common to compute the so-called Expected Monetary Value (EMV) of a risk using the following formula:
$$EMV = P \times C$$
where P is the probability of a risk and C is its impact (i.e., cost).
Then, the EMV of the risk materialization can be compared to the cost of mitigation strategies.
Let's consider the following simplified example: a company uses a single spindle disk to store critical database data. There's a risk of hard drive failure, which could result in data loss and operational downtime. Industry data suggests that the annual failure rate for spindle disks is approximately 2%. In the case of a spindle failure, the cost of data recovery, downtime losses, and potential reputational damage is estimated at 500,000 EUR. The EMV of the risk is thus 0.02 × 500,000 = 10,000 EUR.
A sample mitigation strategy is to set up a RAID system to reduce the risk of data loss due to hard drive failure. Its cost is estimated at 5,000 EUR, which is half the EMV of the risk. Therefore, it makes perfect sense to invest in the RAID system.
Let's now apply the same approach to the risk of hash-based Surrogate Key collisions and reuse some of the numbers from our earlier calculations. Here's what we have:
- hash algorithm used: MD5
- record ingestion speed: 41 records per second
- hash collision risk probability: 1 in a quintillion
- estimated cost of dealing with the aftermath of the hash collision after it occurred: ~100M EUR (probably vastly exaggerated, but that's OK)
For the numbers above, the EMV is way below 0.01 EUR. With an EMV that close to zero, it's not reasonable to invest in any avoidance strategies (like a sophisticated solution to avoid hash collisions).
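For completeness, the arithmetic (the 1-in-a-quintillion probability applies to the whole 20-year horizon):

$$EMV = 10^{-18} \times 100{,}000{,}000\,\text{EUR} = 10^{-10}\,\text{EUR}$$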
Dealing with SK collision risk
Preventing the risk
Preventing the risk of hash collisions is an idealistic approach. Apart from not being justified from a cost-of-risk perspective, it is also challenging to implement. Preventing the risk means checking on the fly whether any newly computed hash collides with an existing one. For that to happen, you must check each pair of a newly computed hash and its originating Natural Key against all previously generated pairs of hashes and their origin NKs (a minimal sketch of such a check closes this subsection). The more hashes you have already generated, the more costly this operation becomes. You may incorporate advanced techniques to optimize the checks (like Bloom filters), but you must still consider that:
- there is a significant additional cost of implementing such checks,
- it will delay the delivery date of the overall solution,
- it will slow down data processing from the very beginning (and the more data has already been processed, the bigger the performance impact will be).
So, considering your data characteristics and your business context, you must answer for yourself whether prevention is worth all the hassle.
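For illustration, a minimal in-memory sketch of such an on-the-fly check, assuming the hash-to-Natural-Key pairs fit in a dictionary (in a real warehouse this lookup would be a table or a Bloom filter; all names here are hypothetical):

```python
import hashlib

hash_registry: dict = {}  # hash -> originating Natural Key

def insert_with_collision_check(natural_key: str) -> str:
    """Compute an MD5-based Surrogate Key, failing fast on a collision.

    A collision means the same hash arising from a *different* Natural Key;
    re-seeing the same NK is just an idempotent re-insert.
    """
    sk = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    existing_nk = hash_registry.get(sk)
    if existing_nk is not None and existing_nk != natural_key:
        raise ValueError(f"Hash collision: {natural_key!r} vs {existing_nk!r}")
    hash_registry[sk] = natural_key
    return sk
```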
Accepting the risk
Considering the extremely low risk of a collision happening, my go-to approach is to accept the risk. But that doesn't mean doing nothing. It's very important to be able to react when the risk materializes, and to do that, you still need to monitor the tables for potential hash collisions.
Unfortunately, there's no universal method that achieves this while keeping the mechanism simple and the performance and cost impact minimal. Different techniques can be used depending on the actual use case, and the choice depends on:
- the data characteristics – e.g., whether you expect updates to happen, and if so whether you track the history of changes (like SCD1 vs SCD2),
- technology capabilities – e.g., whether it can generate an event or a notification when a specific condition happens while inserting/updating records in a target table; or maybe it can redirect records meeting specific criteria to a dedicated table while inserting/updating records in a target table,
- business context – e.g., whether you need to react to potential duplicates as soon as they happen (so you need to detect collisions on the fly), or whether it's OK to react with some delay (so an asynchronous process, checking the data after they have been written and possibly used, would suffice).
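As an example of the asynchronous variant, a periodic job could look for Surrogate Keys that map to more than one Natural Key. A sketch (table and column names are hypothetical, and conn stands for any DB-API connection):

```python
# Table and column names are hypothetical; `conn` is any PEP 249 connection.
COLLISION_CHECK = """
    SELECT sk, COUNT(DISTINCT natural_key) AS nk_count
    FROM dim_customer
    GROUP BY sk
    HAVING COUNT(DISTINCT natural_key) > 1
"""

def find_collisions(conn):
    """Return (sk, nk_count) rows where one Surrogate Key covers more than
    one Natural Key, i.e., a hash collision has materialized."""
    cursor = conn.cursor()
    cursor.execute(COLLISION_CHECK)
    return cursor.fetchall()
```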
When designing the monitoring mechanism for hash collisions, keep in mind how extremely low the risk of a collision is: make it simple and don't overengineer it. If the business insists on synchronous detection capabilities (which is effectively equivalent to a prevention mechanism) while an asynchronous process would be much simpler to implement, approach it from a money perspective: calculate the Expected Monetary Value for both scenarios and let the money decide.
Reacting to collisions after they happen
Alright! Let's finally discuss what can be done when the risk eventually materializes and a hash collision happens. Our monitoring system has informed us about it, and we need to react.
Life-hack: if you are so prone to extremely rare events that a hash collision actually happens to you, consider also playing EuroMillions or Powerball; statistically, you are on quite a lucky streak.