Ask a statistician: A variation of the birthday problem, part 2
Alec's question, again, is: "I’ve read about the birthday problem, and how you only need 23 randomly chosen people for there to be a 50% chance that two people share a birthday. But how many people would you need for there to be a 50% chance that every possible birthday is represented by at least one person?"
Mario Cortina Borja replies: More than 365 people, clearly. But how many more? According to my estimates, you would need to gather together 2285 people for there to be a greater than 50% chance that all birthdays (excluding the leap year day of 29 February) are taken by at least one person, and more than 2980 for there to be a greater than 90% chance.
In his email to Significance, Alec says he became interested in this question when he noticed that “among our 3000 or so graduates last year only one birthday was not taken”. Based on my estimates, there was less than a 10% chance that this one birthday would go unclaimed.
I used simulation in the statistical software R to estimate these numbers. In general terms, I was looking to work out the probability of observing all possible birth dates among a sample of people, using the simplifying assumption that births within the population are uniformly distributed over all possible days.
Mathematically, we express this as estimating p(n, M), which is the probability of observing all the elements of the set D_{n} = {1, 2, …, n} in a sample of M subjects, assuming a uniform distribution over D_{n}. For birthdays, excluding 29 February, we have n = 365. To estimate p(n, M), I simulated B samples of size M using the R function p_hat (see box). I quickly found that M ≈ 3000 was an approximate solution, so I simulated B = 10 000 samples each for sizes 1200 ≤ M ≤ 5000 in increments of 10.
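Under the uniformity assumption, p(n, M) also has a closed form by inclusion-exclusion over the labels that could be missing: p(n, M) is the sum over k = 0, …, n of (−1)^k C(n, k) (1 − k/n)^M. The sketch below is not part of the original analysis; the function name p_exact is hypothetical, and it is offered only as a cross-check on the simulated values.

```r
# Exact coverage probability under a uniform distribution, by inclusion-exclusion.
# Hypothetical helper, not from the article. Terms are built on the log scale so
# that large binomial coefficients do not overflow.
p_exact <- function(n = 365, M = 2285) {
  k <- 0:(n - 1)                                 # the k = n term is zero for M >= 1
  log_term <- lchoose(n, k) + M * log1p(-k / n)  # log of C(n, k) * (1 - k/n)^M
  sum((-1)^k * exp(log_term))                    # alternating sum
}

# Values of M quoted in the text
p_exact(365, 2285)
p_exact(365, 2980)
p_exact(365, 3000)
```

If the simulated estimates are sound, the first two values should land close to 0.5 and 0.9. The alternating sum is numerically reliable only when M is comfortably larger than n, because for small M the individual terms are large and cancel.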
The row marked uniform365d in Table 1 shows values for selected quantiles resulting from these simulations; the values were obtained as predictions from a smoothing spline model. I do not include the confidence intervals for these estimates, but they are quite tight. The median (0.5) is 2285 people, and the 0.9 quantile is 2980.
Table 1. Estimated quantiles for the modified birthday problem, using one uniform and two empirical distributions based on live births from England and Wales, 1979–2014
                 Probabilities
Distribution     .005   .01  .025   .05    .1    .5    .9   .95  .975   .99  .995
uniform365d      1561  1610  1686  1756  1858  2285  2980  3226  3502  3794  4050
empirical366d    1555  1657  1737  1812  1916  2435  3642  4456  5391  6758  7639
empirical365d    1553  1603  1694  1764  1862  2296  3002  3265  3531  3849  4112
The R code
p_hat <- function(n=365, M=3000, B=10000, emp.prob=rep(1,n)/n)
{
  ### Returns the estimated probability of covering all labels
  ### D_n = {1,2,...,n}
  ###
  ### It generates B simulations of extracting M samples
  ### from D_n with replacement using the
  ### probability distribution specified by weights emp.prob
  ### MCB, London, 02.11.16
  ###
  ### The result is returned invisibly: assign it or print() it to see it
  invisible(
    sum(
      apply(
        ## one row per simulated group of M people
        matrix(
          sample(1:n, M*B, replace=TRUE, prob=emp.prob),
          nrow=B),
        ## TRUE when a group covers all n labels
        1, function(x){length(unique(x))==n})
    )/B
  )
}
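The text describes turning simulated probabilities over a grid of M into quantiles by means of a smoothing spline. A minimal sketch of that idea follows. It is not the author's exact procedure: the helper coverage_prob and the objects M_grid, p_grid and median_M are hypothetical names, and the grid is coarser and B much smaller than in the article so that it runs in a few seconds.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical helper: share of B simulated groups of size M in which
# every one of the n labels appears at least once.
coverage_prob <- function(M, n = 365, B = 500, prob = rep(1, n) / n) {
  mean(replicate(B, {
    x <- sample.int(n, M, replace = TRUE, prob = prob)
    length(unique(x)) == n
  }))
}

M_grid <- seq(1500, 3500, by = 100)   # coarser than the article's increments of 10
p_grid <- vapply(M_grid, coverage_prob, numeric(1))

# Smooth the noisy estimates, then find the M at which the curve crosses 0.5
fit <- smooth.spline(M_grid, p_grid)
median_M <- uniroot(function(m) predict(fit, m)$y - 0.5,
                    interval = range(M_grid))$root
median_M
```

Replacing 0.5 with another probability gives the corresponding quantile, provided the grid spans it. With so few replicates the result is noisy, so it should only be expected to fall in the neighbourhood of the tabulated median.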

What would happen if we relaxed the assumptions of uniformly distributed birthdays and a 365-day year?
Using data provided by the Office for National Statistics, I considered the birthdays of the 23 872 409 live births registered in England and Wales between 1979 and 2014. This adjusts for (i) leap year births on 29 February, which constitute just 0.068% of all births; (ii) the excess of births in the last week of September, corresponding to conceptions in the Christmas holidays, and the deficit of births in the Christmas holidays, reflecting health services management policies; and (iii) the marked dependence on day of the week of birth, which is integrated out by accumulating the live birth frequencies by day of the year.
Clearly birth dates now vary in frequency, but how does this affect the distribution quantiles? The row in Table 1 marked empirical366d is based on the frequencies of live births including 29 February. The median of 2435 is 6.6% higher than that based on the uniform distribution, while the 0.9 quantile is 22% higher. To clarify this “leap day” effect, I omitted births on 29 February and re-estimated the empirical quantiles. Results in the row marked empirical365d show that the median and 0.9 quantile are now 2296 and 3002, each less than 1% greater than the corresponding uniform distribution quantile.
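The leap day effect can be illustrated without the ONS data. The sketch below is an assumption-laden stand-in, not a reproduction of the empirical rows: it uses made-up weights in which 365 days are equally likely and 29 February has a quarter of the weight of any other day (roughly in line with the 0.068% share quoted above), so it ignores seasonality entirely. Instead of fixing M, it simulates how many people must be drawn before every day has appeared, and reads the quantiles off those waiting times. The names collect_time, q_unif and q_leap are hypothetical.

```r
set.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical helper: number of draws needed until every label has appeared.
# Draws are generated in chunks; in each chunk we look up the first position
# of every label that is still missing.
collect_time <- function(prob, chunk = 5000L) {
  n <- length(prob)
  seen <- logical(n)
  draws <- 0L
  repeat {
    x <- sample.int(n, chunk, replace = TRUE, prob = prob)
    first <- match(seq_len(n), x)        # first position of each label, NA if absent
    missing_first <- first[!seen]
    if (!anyNA(missing_first)) {
      return(draws + max(missing_first)) # last missing label arrives here
    }
    seen <- seen | !is.na(first)
    draws <- draws + chunk
  }
}

B <- 2000
w_unif <- rep(1, 365)            # uniform, 365 days
w_leap <- c(rep(1, 365), 0.25)   # assumed weights: 29 February a quarter as likely

q_unif <- quantile(replicate(B, collect_time(w_unif)), c(0.5, 0.9))
q_leap <- quantile(replicate(B, collect_time(w_leap)), c(0.5, 0.9))
rbind(uniform365d = q_unif, leapday366d = q_leap)
```

Under these assumed weights the rare day should push both quantiles upwards, and the 0.9 quantile by more than the median, which is the qualitative pattern in the empirical366d row.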
 Mario Cortina Borja is chairman of the Significance editorial board, and professor of biostatistics in the Population Policy and Practice Programme, Institute of Child Health, University College London.
 Our next question is: What are the odds of a person becoming a statistician?, as suggested by @BobOHara, via Twitter.