]> The Birthday Problem
  1. Virtual Laboratories
  2. 12. Finite Sampling Models
  3. 1
  4. 2
  5. 3
  6. 4
  7. 5
  8. 6
  9. 7
  10. 8
  11. 9

6. The Birthday Problem

Introduction

The Sampling Model

As in the basic sampling model, suppose that we select n numbers at random, with replacement, from the population D 1 2 m . Thus, our outcome vector is

X X 1 X 2 X n

where X i D is the i number chosen. Recall that our basic modeling assumption is that X is uniformly distributed on the sample space

S D n 1 2 m n

In this section, we are interested in the number of population values missing from the sample, and the number of (distinct) population values in the sample. The computation of probabilities related to these random variables are generally referred to as birthday problems. Often, we will interpret the sampling experiment as a distribution of n balls into m cells; X i is the cell number of ball i . In this interpretation, our interest is in the number of empty cells and the number of occupied cells.

Multinomial Distribution

For i D , let Y n i denote the number of times that i occurs in the sample:

Y n i j 1 2 n X j i

Show that Y n Y n 1 Y n 2 Y n m has the multinomial distribution with parameters n and 1 m 1 m 1 m :

Y n 1 k 1 Y n 2 k 2 Y n m k m n k 1 k 2 k m 1 m n  for  k 1 k 2 k m m  with  i 1 m k i n

Random Variables

We will now define the main random variables of interest: the number of population values missing in the sample:

U m n j 1 2 m Y n j 0

and the number of (distinct) population values that occur in the sample:

V m n j 1 2 m Y n j 0

Clearly we must have U m n V m n m so once we have the probability distribution and moments of one variable, we can easily find them for the other variable. However, we will first solve the simplest version of the birthday problem.

The Simple Birthday Problem

The event that there is at least one duplication in the sample can be written as

B m n V m n n U m n m n

The (simple) birthday problem is to compute the probability of this event. For example, suppose that we choose n people at random and note their birthdays. If we ignore leap years and assume that birthdays are uniformly distributed throughout the year, then our sampling model applies with m 365 . In this setting, the birthday problem is to compute the probability that at least two people have the same birthday (this special case is the origin of the name).

The solution of the birthday problem is an easy exercise in combinatorial probability.

Use the multiplication rule of combinatorics to show that

B m n 1 m n m n n m 1 n m

Hint: The complementary event S B m n occurs if and only if the outcome vector X forms a permutation of size n from D 1 2 m .

The fact that the probability is 1 for n m is sometimes referred to as the pigeonhole principle: if more than m pigeons are placed into m holes then at least one hole has 2 or more pigeons.

A Recurrence Relation

Let p m n denote the probability of the complementary event, S B m n , that the sample variables are distinct. Prove the following recursion relation in two ways: first starting with the result in Exercise 2, and then by using a conditional probability argument.

  1. p m 1 1
  2. p m n 1 m n m p m n

Examples

Let m 365 (the standard birthday problem). Verify the following birthday probabilities:

  1. B 365 10 0.117
  2. B 365 20 0.411
  3. B 365 30 0.706
  4. B 365 40 0.891
  5. B 365 50 0.970
  6. B 365 60 0.994

Graph the values in the previous exercise as a function of n . When smoothed (for the sake of appearance), your curve should look like the graph below.

Probability of the birthday event

In the birthday experiment, set m 365 . and select the indicator variable I . For n 10 20 30 40 50 60 run the experiment 1000 times each and compute the relative frequency of the event that the sample contains a duplication. Compare the relative frequencies with the probabilities computed in the previous exercise.

In spite of its easy solution, the birthday problem is famous because, numerically, the probabilities can be a bit surprising. Note that with a just 60 people, the event is almost certain! Mathematically, the rapid increase in the birthday probability, as n increases, is due to the fact that m n grows much faster than m n .

Four fair, standard dice are rolled. Find the probability that the scores are distinct.

In the birthday experiment, set m 6 and select the indicator variable I . Vary n with the scrollbar and note graphically how the probabilities change. Now with n 4 , run the experiment 1000 times with an update frequency of 10. Note the apparent convergence of the relative frequency of the event to the corresponding probability.

Five persons are chosen at random.

  1. Find the probability that at least 2 have the same birth month.
  2. Criticize the sampling model in this setting

In the birthday experiment, set m 12 and select the indicator variable I . Vary n with the scrollbar and note graphically how the probabilities change. Now with n 5 , run the experiment 1000 times with an update frequency of 10. Note the apparent convergence of the relative frequency of the event to the corresponding probability.

A fast-food restaurant gives away one of 10 different toys with the purchase of a kid's meal. A family with 5 children buys 5 kid's meals. Find the probability that the 5 toys are different.

In the birthday experiment, set m 10 and select the indicator variable I . Vary n with the scrollbar and note graphically how the probabilities change. Now with n 5 , run the experiment 1000 times with an update frequency of 10. Note the apparent convergence of the relative frequency of the event to the corresponding probability.

Let m 52 . Find the smallest value of n such that the probability of a duplication is at least 12 .

The General Birthday Problem

We now return to the more general problem of finding the distribution of the number of distinct sample values and the distribution of the number of excluded sample values.

The Probability Density Function

For j D , consider the event that j does not occur in the sample: A n j Y n j 0 . Now let K D with K k . Using the multiplication rule of combinatorics, it is easy to count the number of samples that do not contain any elements of K :

Show that

j K A n j m k n

Now the inclusion-exclusion rule of combinatorics can be used to count the number samples that are missing at least one population value:

Show that

j 1 m A n j k 1 m 1 k 1 m k m k n

Once we have this, it's simple to count the number samples that contain all population values:

Show that

j 1 m A n j c k 0 m 1 k m k m k n

Now we can use a two-step procedure to generate all samples that exclude exactly j population values:

  1. First, choose the j values that are to be excluded.
  2. Then select a sample of size n from the remaining population values so that none are excluded.

Thus, we can use the multiplication principle of combinatorics to count the number of samples that exclude j population values.

Show that

U m n j n j k 0 m j 1 k m j k m j k n

Finally, since the probability distribution of X on the sample space S is uniform, we can find the probability density function of the number of excluded values:

Show that

U m n j n j k 0 m j 1 k m j k 1 j k m n ,  j m n 0 m 1

We can now easily find the probability density function of the number distinct values in the sample:

Show that

V m n j n j k 0 j 1 k j k j k m n ,  j 1 2 m n

In the birthday experiment, select the number of distinct sample values. Vary the parameters and note the shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times updating every 10 runs and note the apparent convergence of the relative frequency function to the probability density function.

A Recurrence Relation

The distribution of the number of excluded values can also be obtained by a recursion argument.

Let p m n j U m n j for j m n 0 m 1 . Use probability arguments to show that

  1. p m 1 m 1 1
  2. p m n 1 j m j m p m n j j 1 m p m n j 1

Moments

Now we will find the means and variances. The number of excluded values and the number of distinct values are counting variables and hence can be written as sums of indicator variables. As we have seen in many other models, this representation is frequently the best for computing moments.

Let I n j A n j . Thus, I n j 1 if A n j occurs, which means that j is not in the sample, and I n j 0 otherwise. Note that the number of population values missing in the sample can be written as the sum of the indicator variables:

U m n j 1 m I n j

Show that

  1. I n j 1 1 m n for j 1 2 m
  2. I n i I n j 1 2 m n for i j 1 2 m 2 with i j

Use the results of Exercise 22 to show that

  1. U m n m 1 1 m n
  2. V m n m 1 1 1 m n

Use the result of Exercise 22 to show that

  1. I n j 1 1 m n 1 1 m 2 n for j 1 2 m
  2. I n i I n j 1 2 m n 1 1 m 2 n for i j 1 2 m 2 with i j

Use the results of the Exercise 24, and basic properties of variance to show that

U m n V m n m m 1 1 2 m n m 1 1 m n m 2 1 1 m 2 n

In the birthday experiment, select the number of distinct sample values. Vary the parameters and note the size and location of the mean/standard-deviation bar. For selected values of the parameters, run the simulation 1000 times updating every 10 runs and note the apparent convergence of the sample mean and variance to the distribution mean and variance.

Examples and Applications

Suppose that 30 persons are chosen at random.

  1. Find the probability density function of the number of distinct birthdays.
  2. Find the mean of the number of distinct birthdays.
  3. Find the variance of the number of distinct birthdays.
  4. Find the probability that there are at least 28 different birthdays represented.

In the birthday experiment, set m 365 . and n 30 , run the experiment 1000 times with an update frequency of 10 and compute the relative frequency of the event in part (d) of the last exercise.

Suppose that 10 fair dice are rolled.

  1. Find the probability density function of the number of distinct scores.
  2. Find the mean of the number of distinct scores.
  3. Find the variance of the number of distinct scores.
  4. Find the probability that there will 4 or fewer distinct scores.

In the birthday experiment, set m 6 and n 10 , run the experiment 1000 times with an update frequency of 10 and compute the relative frequency of the event in part (d) of the last exercise.

A fast food restaurant gives away one of 10 different types of toy with the purchase of each kid's meal. A family buys 15 kid's meals.

  1. Find the probability density function of the number of toy types that are missing.
  2. Find the mean of the number of toy types that are missing.
  3. Find the variance of the number of toy types that are missing.
  4. Find the probability that at least 3 toy types are missing.

In the birthday experiment, set m 10 and n 15 , run the experiment 1000 times with an update frequency of 10 and compute the relative frequency of the event in part (d).

The lying students problem. Suppose that 3 students, who ride together, miss a mathematics exam. They decide to lie to the instructor by saying that the car had a flat tire. The instructor separates the students and asks each of them which tire was flat. The students, who did not anticipate this, select their answers independently and at random.

  1. Give the probability density function of the number of distinct answers.
  2. In particular, give the probability that the students get away with their deception.
  3. Give mean of the number of distinct answers.
  4. Give the standard deviation of the number of distinct answers.

The duck hunter problem. Suppose that there are 5 duck hunters, each a perfect shot. A flock of 10 ducks fly over, and each hunter selects one duck at random and shoots.

  1. Give the probability density function of the number of ducks that are killed.
  2. Give mean of the number of ducks that are killed.
  3. Give the standard deviation of the number of ducks that are killed..