Counting hash collisions with the birthday paradox

[article index] [] [@mattmight] [+mattmight] [rss]

The birthday paradox observes that in a room of 23 people, the odds that at least two people share a birthday is 50%

The same logic that drives matching birthdays also drives the probability that one can find collisions with a hash function.

In other words, if you have a uniform hashing function that outputs a value between 1 and 365 for any input, the probability that two hashes would collide in a set of 23 values is also 50%

Another useful calculation is the expected number of collisions for a sequence of $n$ values when the range of the hash function contains $D$ hashes.

The closed form solution is:

$n - D + D \left( \frac{D-1}{D} \right)^n$

There are a few places online that have this (or an equivalent) closed form solution listed, but I couldn’t find anywhere that included the derivation of this form, so I’ve rederived it and posted it here as a technical note.

(My present interest in this calculation comes from the number of matches that will happen in a patient-matching network that attempts to match patients having the same disease, assuming the are $D$ total diseases possible and $n$ patients in the network.)

Read below for the derivation in terms of generalized birthdays.

Let $D$ be the number of days in the year.

For real birthdays, $D$ would be 365, but for hashing, $D$ is the size of the range of the hash function.

In a room with $k-1$ people, the probability that the next ($k$th person) will not have a matching birthday to someone already in the room is:

$\left(\frac{D-1}{D}\right)^{k-1}$

Hence, the probability that the $k$th person will have a match is:

$1 - \left(\frac{D-1}{D}\right)^{k-1}$

To compute the expected number of matches in a room of $n$ people, we have to calculate:

• the additional matches added by the $k$th person plus
• the additional matches added by the $n$th person.

since each additional person will either match or not match, the expected number of additional matches added by the $k$th person is:

$1 - \left(\frac{D-1}{D}\right)^{k-1}$

So, the number of expected matches for $n$ people is:

$\sum_{k=1}^n \left[ 1 - \left(\frac{D-1}{D}\right)^{k-1} \right]$

which simplifies to:

$n - \sum_{k=1}^n \left(\frac{D-1}{D}\right)^{k-1}$

Using the closed form solution for a geometric series, wherein: $\sum_{k=0}^m r^k = \frac{1 - r^{m+1}}{1 - r}$ we can compute a closed form for the right-hand side:

$\sum_{k=1}^n \left(\frac{D-1}{D}\right)^{k-1}$

$= \sum_{k=0}^{n-1} \left(\frac{D-1}{D}\right)^{k}$

$= \frac{1 - \left(\frac{D-1}{D}\right)^n}{1-\frac{D-1}{D}}$

$= \frac{1 - \left(\frac{D-1}{D}\right)^n}{\frac{D}{D}-\frac{D-1}{D}}$

$= \frac{1 - \left(\frac{D-1}{D}\right)^n}{\frac{D - D + 1}{D}}$

$= \frac{1 - \left(\frac{D-1}{D}\right)^n}{\frac{1}{D}}$

$= D \left( 1 - \left(\frac{D-1}{D}\right)^n \right)$

$= D - D \left(\frac{D-1}{D}\right)^n$

Substituting back into the original formula yields: $n - D + D \left( \frac{D-1}{D} \right)^n$ for the number of expected matches in a room of $n$ people.

Notes

The derivation here sheds some light on the applicability of this commonly cited formula.

For instance, it does not count the expected number of unique hits or the expected number of pairings made.

Rather, it counts the number of times that the next person to enter the room has a match.

So, for example, if three people walk in all born on July 1st, the first person does not match (and cannot match), but the second matches and the third matches, for a total of two matches, despite the fact that there is one unique match and three matching pairs.