]>
As in the basic sampling model, suppose that we select numbers at random, with replacement, from the population . Thus, our outcome vector is
where is the number chosen. Recall that our basic modeling assumption is that is uniformly distributed on the sample space
In this section, we are interested in the number of population values missing from the sample, and the number of (distinct) population values in the sample. The computation of probabilities related to these random variables are generally referred to as birthday problems. Often, we will interpret the sampling experiment as a distribution of balls into cells; is the cell number of ball . In this interpretation, our interest is in the number of empty cells and the number of occupied cells.
For , let denote the number of times that occurs in the sample:
Show that has the multinomial distribution with parameters and :
We will now define the main random variables of interest: the number of population values missing in the sample:
and the number of (distinct) population values that occur in the sample:
Clearly we must have so once we have the probability distribution and moments of one variable, we can easily find them for the other variable. However, we will first solve the simplest version of the birthday problem.
The event that there is at least one duplication in the sample can be written as
The (simple) birthday problem is to compute the probability of this event. For example, suppose that we choose people at random and note their birthdays. If we ignore leap years and assume that birthdays are uniformly distributed throughout the year, then our sampling model applies with . In this setting, the birthday problem is to compute the probability that at least two people have the same birthday (this special case is the origin of the name).
The solution of the birthday problem is an easy exercise in combinatorial probability.
Use the multiplication rule of combinatorics to show that
Hint: The complementary event occurs if and only if the outcome vector forms a permutation of size from .
The fact that the probability is 1 for is sometimes referred to as the pigeonhole principle: if more than pigeons are placed into holes then at least one hole has 2 or more pigeons.
Let denote the probability of the complementary event, , that the sample variables are distinct. Prove the following recursion relation in two ways: first starting with the result in Exercise 2, and then by using a conditional probability argument.
Let (the standard birthday problem). Verify the following birthday probabilities:
Graph the values in the previous exercise as a function of . When smoothed (for the sake of appearance), your curve should look like the graph below.
In the birthday experiment, set . and select the indicator variable . For run the experiment 1000 times each and compute the relative frequency of the event that the sample contains a duplication. Compare the relative frequencies with the probabilities computed in the previous exercise.
In spite of its easy solution, the birthday problem is famous because, numerically, the probabilities can be a bit surprising. Note that with a just 60 people, the event is almost certain! Mathematically, the rapid increase in the birthday probability, as increases, is due to the fact that grows much faster than .
In the birthday experiment, set and select the indicator variable . Vary with the scrollbar and note graphically how the probabilities change. Now with , run the experiment 1000 times with an update frequency of 10. Note the apparent convergence of the relative frequency of the event to the corresponding probability.
Five persons are chosen at random.
In the birthday experiment, set and select the indicator variable . Vary with the scrollbar and note graphically how the probabilities change. Now with , run the experiment 1000 times with an update frequency of 10. Note the apparent convergence of the relative frequency of the event to the corresponding probability.
A fast-food restaurant gives away one of 10 different toys with the purchase of a kid's meal. A family with 5 children buys 5 kid's meals. Find the probability that the 5 toys are different.
In the birthday experiment, set and select the indicator variable . Vary with the scrollbar and note graphically how the probabilities change. Now with , run the experiment 1000 times with an update frequency of 10. Note the apparent convergence of the relative frequency of the event to the corresponding probability.
We now return to the more general problem of finding the distribution of the number of distinct sample values and the distribution of the number of excluded sample values.
For , consider the event that does not occur in the sample: . Now let with . Using the multiplication rule of combinatorics, it is easy to count the number of samples that do not contain any elements of :
Show that
Now the inclusion-exclusion rule of combinatorics can be used to count the number samples that are missing at least one population value:
Show that
Once we have this, it's simple to count the number samples that contain all population values:
Show that
Now we can use a two-step procedure to generate all samples that exclude exactly population values:
Thus, we can use the multiplication principle of combinatorics to count the number of samples that exclude population values.
Show that
Finally, since the probability distribution of on the sample space is uniform, we can find the probability density function of the number of excluded values:
Show that
We can now easily find the probability density function of the number distinct values in the sample:
Show that
In the birthday experiment, select the number of distinct sample values. Vary the parameters and note the shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times updating every 10 runs and note the apparent convergence of the relative frequency function to the probability density function.
The distribution of the number of excluded values can also be obtained by a recursion argument.
Let for . Use probability arguments to show that
Now we will find the means and variances. The number of excluded values and the number of distinct values are counting variables and hence can be written as sums of indicator variables. As we have seen in many other models, this representation is frequently the best for computing moments.
Let . Thus, if occurs, which means that is not in the sample, and otherwise. Note that the number of population values missing in the sample can be written as the sum of the indicator variables:
Show that
Use the results of Exercise 22 to show that
Use the result of Exercise 22 to show that
Use the results of the Exercise 24, and basic properties of variance to show that
In the birthday experiment, select the number of distinct sample values. Vary the parameters and note the size and location of the mean/standard-deviation bar. For selected values of the parameters, run the simulation 1000 times updating every 10 runs and note the apparent convergence of the sample mean and variance to the distribution mean and variance.
Suppose that 30 persons are chosen at random.
In the birthday experiment, set . and , run the experiment 1000 times with an update frequency of 10 and compute the relative frequency of the event in part (d) of the last exercise.
Suppose that 10 fair dice are rolled.
In the birthday experiment, set and , run the experiment 1000 times with an update frequency of 10 and compute the relative frequency of the event in part (d) of the last exercise.
A fast food restaurant gives away one of 10 different types of toy with the purchase of each kid's meal. A family buys 15 kid's meals.
In the birthday experiment, set and , run the experiment 1000 times with an update frequency of 10 and compute the relative frequency of the event in part (d).
The lying students problem. Suppose that 3 students, who ride together, miss a mathematics exam. They decide to lie to the instructor by saying that the car had a flat tire. The instructor separates the students and asks each of them which tire was flat. The students, who did not anticipate this, select their answers independently and at random.
The duck hunter problem. Suppose that there are 5 duck hunters, each a perfect shot. A flock of 10 ducks fly over, and each hunter selects one duck at random and shoots.