Let \(D\) be the number of days in the year.
For real birthdays, \(D\) would be 365, but for hashing, \(D\) is the size of the range of the hash function.
In a room with \(k-1\) people, the probability that the next ($k$th person) will not have a matching birthday to someone already in the room is:
\[ \left(\frac{D-1}{D}\right)^{k-1} \]
Hence, the probability that the $k$th person will have a match is:
\[ 1 - \left(\frac{D-1}{D}\right)^{k-1} \]
To compute the expected number of matches in a room of \(n\) people, we have to calculate:
- the additional matches added by the 1st person plus
- the additional matches added by the 2nd person plus
- the additional matches added by the 3rd person plus
- …
- the additional matches added by the $k$th person plus
- …
- the additional matches added by the $n$th person.
since each additional person will either match or not match, the expected number of additional matches added by the $k$th person is:
\[ 1 - \left(\frac{D-1}{D}\right)^{k-1} \]
So, the number of expected matches for \(n\) people is:
\[ \sum_{k=1}^n \left[ 1 - \left(\frac{D-1}{D}\right)^{k-1} \right] \]
which simplifies to:
\[ n - \sum_{k=1}^n \left(\frac{D-1}{D}\right)^{k-1} \]
Using the closed form solution for a geometric series, wherein: \[ \sum_{k=0}^m r^k = \frac{1 - r^{m+1}}{1 - r} \] we can compute a closed form for the right-hand side:
\[ \sum_{k=1}^n \left(\frac{D-1}{D}\right)^{k-1} \]
\[ = \sum_{k=0}^{n-1} \left(\frac{D-1}{D}\right)^{k} \]
\[ = \frac{1 - \left(\frac{D-1}{D}\right)^n}{1-\frac{D-1}{D}} \]
\[ = \frac{1 - \left(\frac{D-1}{D}\right)^n}{\frac{D}{D}-\frac{D-1}{D}} \]
\[ = \frac{1 - \left(\frac{D-1}{D}\right)^n}{\frac{D - D + 1}{D}} \]
\[ = \frac{1 - \left(\frac{D-1}{D}\right)^n}{\frac{1}{D}} \]
\[ = D \left( 1 - \left(\frac{D-1}{D}\right)^n \right) \]
\[ = D - D \left(\frac{D-1}{D}\right)^n \]
Substituting back into the original formula yields: \[ n - D + D \left( \frac{D-1}{D} \right)^n \] for the number of expected matches in a room of \(n\) people.
Notes
The derivation here sheds some light on the applicability of this commonly cited formula.
For instance, it does not count the expected number of unique hits or the expected number of pairings made.
Rather, it counts the number of times that the next person to enter the room has a match.
So, for example, if three people walk in all born on July 1st, the first person does not match (and cannot match), but the second matches and the third matches, for a total of two matches, despite the fact that there is one unique match and three matching pairs.