Suppose that we have a dichotomous population \(D\). That is, a population that consists of two types of objects, which we will refer to as type 1 and type 0. For example, we could have
Let \(R\) denote the subset of \(D\) consisting of the type 1 objects, and suppose that \(\#(D) = m\) and \(\#(R) = r\). As in the basic sampling model, we sample \(n\) objects at random from \(D\). In this section, our only concern is the types of the objects, so let \(X_i\) denote the type of the \(i\)th object chosen (1 or 0). The random vector of types is \(\boldsymbol{X} = (X_1, X_2, \ldots, X_n)\).
Our main interest is the random variable \(Y\) that gives the number of type 1 objects in the sample. Note that \(Y\) is a counting variable, and thus, like all counting variables, can be written as a sum of indicator variables, in this case the type variables: \(Y = \sum_{i=1}^n X_i\).
We will assume initially that the sampling is without replacement, which is usually the realistic setting with dichotomous populations.
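Recall that since the sampling is without replacement, the unordered sample is uniformly distributed over the set of all combinations of size \(n\) chosen from \(D\). This observation leads to a simple combinatorial derivation of the probability density function of \(Y\).
Show that \(P(Y = k) = \binom{r}{k} \binom{m - r}{n - k} \big/ \binom{m}{n}\) for \(k \in \{\max\{0, n - (m - r)\}, \ldots, \min\{n, r\}\}\).
This is known as the hypergeometric distribution with parameters \(m\) (the population size), \(r\) (the number of type 1 objects), and \(n\) (the sample size).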
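Show the following alternative form of the hypergeometric probability density function in two ways: combinatorially, by treating the outcome as a permutation of size \(n\) chosen from the population of \(m\) balls, and algebraically, starting from the result in Exercise 1: \(P(Y = k) = \binom{n}{k} \frac{r^{(k)} (m - r)^{(n - k)}}{m^{(n)}}\), where \(j^{(i)} = j (j - 1) \cdots (j - i + 1)\) is the number of permutations of size \(i\) chosen from \(j\) objects.
Recall our convention that \(j^{(i)} = \binom{j}{i} = 0\) for \(i \gt j\). With this convention, the formulas for the probability density function in Exercise 1 and Exercise 2 are correct for \(k \in \{0, 1, \ldots, n\}\). We usually use this simpler set as the set of values for the hypergeometric distribution.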
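The two forms of the probability density function can be compared numerically. The following is a minimal sketch, not part of the original exercises: the function names are hypothetical, and the labels `m`, `r`, `n` are assumed to stand for the population size, the number of type 1 objects, and the sample size. It relies on the fact that Python's `math.comb` and `math.perm` return 0 when the second argument exceeds the first, which mirrors the zero convention above.

```python
from math import comb, perm

def hypergeom_pmf(k, m, r, n):
    """P(Y = k) in the combinations form: C(r, k) C(m - r, n - k) / C(m, n)."""
    if k < 0 or k > n:
        return 0.0
    # math.comb(a, b) is 0 when b > a, matching the zero convention.
    return comb(r, k) * comb(m - r, n - k) / comb(m, n)

def hypergeom_pmf_perm(k, m, r, n):
    """The same probability in the permutations form."""
    if k < 0 or k > n:
        return 0.0
    # math.perm(a, b) = a (a - 1) ... (a - b + 1), and is 0 when b > a.
    return comb(n, k) * perm(r, k) * perm(m - r, n - k) / perm(m, n)

# Illustrative parameters (assumed): 100 objects, 10 of type 1, sample of 5.
m, r, n = 100, 10, 5
for k in range(n + 1):
    print(k, hypergeom_pmf(k, m, r, n), hypergeom_pmf_perm(k, m, r, n))
```

The two columns should agree up to floating-point rounding, and each should sum to 1 over \(k \in \{0, 1, \ldots, n\}\).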
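Let \(v = \frac{(r + 1)(n + 1)}{m + 2}\). Show that \(P(Y = k) \gt P(Y = k - 1)\) if and only if \(k \lt v\), so that the mode occurs at \(\lfloor v \rfloor\) if \(v\) is not an integer, and at \(v - 1\) and \(v\) if \(v\) is a positive integer.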
In the ball and urn experiment, select sampling without replacement. Vary the parameters and note the shape of the probability density function. For selected values of the parameters, run the experiment 1000 times with an update frequency of 10 and watch the apparent convergence of the relative frequency function to the probability density function.
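For readers without access to the applet, the experiment can be imitated in a few lines. This is only a sketch under stated assumptions: the function names are hypothetical, the labels `m`, `r`, `n` are assumed to mean population size, number of type 1 objects, and sample size, and `random.sample` is used to draw without replacement. It is not the applet's own algorithm.

```python
import random
from collections import Counter
from math import comb

def hypergeom_pmf(k, m, r, n):
    """P(Y = k) for sampling without replacement."""
    return comb(r, k) * comb(m - r, n - k) / comb(m, n)

def simulate_type1_counts(m, r, n, runs, seed=0):
    """Relative frequencies of Y = 0, 1, ..., n over repeated samples."""
    rng = random.Random(seed)
    population = [1] * r + [0] * (m - r)   # 1 = type 1, 0 = type 0
    counts = Counter(sum(rng.sample(population, n)) for _ in range(runs))
    return [counts[k] / runs for k in range(n + 1)]

# Illustrative parameters (assumed): 50 balls, 20 of type 1, sample of 10.
m, r, n = 50, 20, 10
for runs in (100, 1000, 10000):
    freq = simulate_type1_counts(m, r, n, runs)
    gap = max(abs(freq[k] - hypergeom_pmf(k, m, r, n)) for k in range(n + 1))
    print(runs, round(gap, 4))
```

The printed gap is the largest difference between the relative frequency function and the probability density function; it will typically shrink as the number of runs grows, although not necessarily monotonically.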
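In the following exercises, we will derive the mean and variance of \(Y\), the number of type 1 objects in the sample. The exchangeable property of the indicator variables, and properties of covariance and correlation, will play a key role.
Show that \(E(X_i) = \frac{r}{m}\) for each \(i\).
Show that \(E(Y) = n \frac{r}{m}\).
Show that \(\text{var}(X_i) = \frac{r}{m} \left(1 - \frac{r}{m}\right)\) for any \(i\).
Show that for distinct \(i\) and \(j\), (a) \(\text{cov}(X_i, X_j) = -\frac{r}{m} \left(1 - \frac{r}{m}\right) \frac{1}{m - 1}\) and (b) \(\text{cor}(X_i, X_j) = -\frac{1}{m - 1}\).
Note from Exercise 8 that the event of a type 1 object on draw \(i\) and the event of a type 1 object on draw \(j\) are negatively correlated, but the correlation depends only on the population size \(m\) and not on the number of type 1 objects \(r\). Note also that the correlation is perfect if \(m = 2\). Think about these results intuitively.
Use the results of Exercise 7 and Exercise 8 to show that \(\text{var}(Y) = n \frac{r}{m} \left(1 - \frac{r}{m}\right) \frac{m - n}{m - 1}\).
Note that \(\text{var}(Y) = 0\) if \(r = 0\) or \(r = m\) or \(n = m\). Think about these results.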
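The moment formulas can be checked against a direct computation from the probability density function. The sketch below is illustrative only: the function names are hypothetical, and `m`, `r`, `n` are assumed to be the population size, the number of type 1 objects, and the sample size (with `m > 1`).

```python
from math import comb

def hypergeom_pmf(k, m, r, n):
    """P(Y = k) for sampling without replacement."""
    return comb(r, k) * comb(m - r, n - k) / comb(m, n)

def moments_from_pmf(m, r, n):
    """Mean and variance of Y, summed directly over k = 0, ..., n."""
    pmf = [hypergeom_pmf(k, m, r, n) for k in range(n + 1)]
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var

def moments_closed_form(m, r, n):
    """Mean n r/m and variance n (r/m)(1 - r/m)(m - n)/(m - 1)."""
    q = r / m
    return n * q, n * q * (1 - q) * (m - n) / (m - 1)

# Illustrative parameters (assumed).
print(moments_from_pmf(50, 20, 10))
print(moments_closed_form(50, 20, 10))
# The variance vanishes when the whole population is sampled (n = m).
print(moments_closed_form(50, 20, 50))
```

The first two printed pairs should agree up to rounding, and the last variance is 0 because sampling the entire population leaves no randomness in the count.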
In the ball and urn experiment, select sampling without replacement. Vary the parameters and note the size and location of the mean/standard deviation bar. For selected values of the parameters, run the experiment 1000 times updating every 10 runs and watch the apparent convergence of the empirical moments to the true moments.
Suppose now that the sampling is with replacement, even though this is usually not realistic in applications.
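Show that \(\boldsymbol{X} = (X_1, X_2, \ldots, X_n)\) is a sequence of \(n\) Bernoulli trials with success parameter \(\frac{r}{m}\).
The following results now follow immediately from the general theory of Bernoulli trials, although modifications of the arguments above could also be used.
Show that \(Y\) has the binomial distribution with parameters \(n\) and \(\frac{r}{m}\): \(P(Y = k) = \binom{n}{k} \left(\frac{r}{m}\right)^k \left(1 - \frac{r}{m}\right)^{n - k}\) for \(k \in \{0, 1, \ldots, n\}\).
Show that (a) \(E(Y) = n \frac{r}{m}\) and (b) \(\text{var}(Y) = n \frac{r}{m} \left(1 - \frac{r}{m}\right)\).
Note that for any values of the parameters, the mean of \(Y\) is the same, whether the sampling is with or without replacement. On the other hand, the variance of \(Y\) is smaller, by a factor of \(\frac{m - n}{m - 1}\), when the sampling is without replacement than with replacement. Think about these results. The factor \(\frac{m - n}{m - 1}\) is sometimes called the finite population correction factor.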
In the ball and urn experiment, vary the parameters and switch between sampling without replacement and sampling with replacement. Note the difference between the graphs of the hypergeometric probability density function and the binomial probability density function. Note also the difference between the mean/standard deviation bars. For selected values of the parameters and for the two different sampling modes, run the simulation 1000 times, updating every 10 runs.
Suppose that the population size \(m\) is very large compared to the sample size \(n\). In this case, it seems reasonable that sampling without replacement is not too much different from sampling with replacement, and hence the hypergeometric distribution should be well approximated by the binomial. The following exercise makes this observation precise. Practically, it is a valuable result, since the binomial distribution has fewer parameters. More specifically, we do not need to know the population size \(m\) and the number of type 1 objects \(r\) individually, but only in the ratio \(\frac{r}{m}\).
Suppose that \(r = r_m\) depends on \(m\) and that \(\frac{r_m}{m} \to p\) as \(m \to \infty\). Show that for fixed \(n\), the hypergeometric probability density function with parameters \(m\), \(r_m\), and \(n\) converges to the binomial probability density function with parameters \(n\) and \(p\). Hint: Use the representation in Exercise 2.
The type of convergence in the previous exercise is known as convergence in distribution.
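The convergence can be observed numerically. In the sketch below (an illustration under stated assumptions, with hypothetical function names), the sample size `n` and the limiting proportion `p` are held fixed, the number of type 1 objects is taken to be `round(p * m)`, and the largest pointwise gap between the two probability density functions is reported as the population size `m` grows.

```python
from math import comb

def hypergeom_pmf(k, m, r, n):
    """P(Y = k) for sampling without replacement."""
    return comb(r, k) * comb(m - r, n - k) / comb(m, n)

def binom_pmf(k, n, p):
    """P(Y = k) for sampling with replacement, success parameter p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def max_gap(m, n, p):
    """Largest pointwise difference between the two density functions."""
    r = round(p * m)  # assumed choice of r_m, so that r_m / m -> p
    return max(abs(hypergeom_pmf(k, m, r, n) - binom_pmf(k, n, p))
               for k in range(n + 1))

# Illustrative values (assumed): n = 5, p = 0.4.
for m in (50, 500, 5000, 50000):
    print(m, max_gap(m, 5, 0.4))
```

This is only a numerical illustration for one choice of parameters, not a proof; the exercise asks for the limit to be established from the representation in Exercise 2.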
In the ball and urn experiment, vary the parameters and switch between sampling without replacement and sampling with replacement. Note the difference between the graphs of the hypergeometric probability density function and the binomial probability density function. In particular, note the similarity when the population size \(m\) is large and the sample size \(n\) is small. For selected values of the parameters, and for both sampling modes, run the experiment 1000 times updating every 10 runs.
In the setting of Exercise 15, show that the mean and variance of the hypergeometric distribution converge to the mean and variance of the binomial distribution as \(m \to \infty\).
In many real problems, the parameters \(r\) or \(m\) (or both) may be unknown. In this case we are interested in drawing inferences about the unknown parameters based on our observation of \(Y\), the number of type 1 objects in the sample. We will assume initially that the sampling is without replacement, the realistic setting in most applications.
Suppose that the size of the population \(m\) is known but that the number of type 1 objects \(r\) is unknown. This type of problem could arise, for example, if we had a batch of \(m\) manufactured items containing an unknown number \(r\) of defective items. It would be too costly and perhaps destructive to test all \(m\) items, so we might instead select \(n\) items at random and test those.
A simple estimator of \(r\) can be derived by hoping that the sample proportion of type 1 objects is close to the population proportion of type 1 objects. That is, \(\frac{Y}{n} \approx \frac{r}{m}\), so that \(r \approx \frac{m}{n} Y\).
Show that \(E\left(\frac{m}{n} Y\right) = r\).
The result in the previous exercise means that \(\frac{m}{n} Y\) is an unbiased estimator of \(r\). Hence the variance is a measure of the quality of the estimator, in the mean square sense.
Show that \(\text{var}\left(\frac{m}{n} Y\right) = (m - r) \frac{r}{n} \frac{m - n}{m - 1}\).
Show that for fixed \(m\) and \(r\), \(\text{var}\left(\frac{m}{n} Y\right) \downarrow 0\) as \(n \uparrow m\).
Thus, the estimator improves as the sample size increases; this property is known as consistency.
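The behavior of this estimator can be examined exactly by summing over the hypergeometric probability density function. The following sketch is illustrative only: the function names are hypothetical, and `m`, `r`, `n`, `y` are assumed to be the population size, the true number of type 1 objects, the sample size, and the observed count.

```python
from math import comb

def hypergeom_pmf(k, m, r, n):
    """P(Y = k) for sampling without replacement."""
    return comb(r, k) * comb(m - r, n - k) / comb(m, n)

def estimate_r(y, m, n):
    """Scale the sample proportion y/n up to the population size m."""
    return m * y / n

def estimator_moments(m, r, n):
    """Exact mean and variance of the estimator (m/n) Y."""
    pmf = [hypergeom_pmf(k, m, r, n) for k in range(n + 1)]
    mean = sum(estimate_r(k, m, n) * p for k, p in enumerate(pmf))
    var = sum((estimate_r(k, m, n) - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var

# A batch of 100 items, 10 sampled, 2 found defective.
print(estimate_r(2, 100, 10))  # 20.0

# Assumed true value r = 20: the mean stays at r while the variance shrinks with n.
for n in (10, 50, 100):
    print(n, estimator_moments(100, 20, n))
```

In the loop, the mean should equal the assumed true value for every sample size, and the variance should fall to 0 when the sample is the whole population.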
In the ball and urn experiment, select sampling without replacement. For selected values of the parameters, run the experiment 100 times, updating after each run.
Suppose now that the number of type 1 objects \(r\) is known, but the population size \(m\) is unknown. As an example of this type of problem, suppose that we have a lake containing \(m\) fish where \(m\) is unknown. We capture \(r\) of the fish, tag them, and return them to the lake. Next we capture \(n\) of the fish and observe \(Y\), the number of tagged fish in the sample. We wish to estimate \(m\) from this data. In this context, the estimation problem is sometimes called the capture-recapture problem.
Do you think that the main assumption of the sampling model, namely equally likely samples, would be satisfied for a real capture-recapture problem? Explain.
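Once again, we can derive a simple estimate of the population size \(m\) by hoping that the sample proportion of type 1 objects is close to the population proportion of type 1 objects. That is, \(\frac{Y}{n} \approx \frac{r}{m}\), so that \(m \approx \frac{n r}{Y}\).
Thus, our estimator of \(m\) is \(\frac{n r}{Y}\) if \(Y \gt 0\) and is undefined if \(Y = 0\).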
In the ball and urn experiment, select sampling without replacement. For selected values of the parameters, run the experiment 100 times, updating after each run.
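Show that if \(k \gt 0\) then \(\frac{n r}{k}\) maximizes \(P(Y = k)\) as a function of \(m\) for fixed \(r\) and \(n\) (more precisely, the integer part \(\lfloor n r / k \rfloor\), since \(m\) must be an integer). This means that \(\frac{n r}{Y}\) is a maximum likelihood estimator of \(m\).
Use Jensen's inequality to show that \(E\left(\frac{n r}{Y}\right) \ge m\).
Thus, the estimator is biased and tends to over-estimate \(m\). Indeed, if \(n \le m - r\), so that \(P(Y = 0) \gt 0\), then \(E\left(\frac{n r}{Y}\right) = \infty\).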
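The maximum likelihood property can be explored by brute force. The sketch below is an illustration under stated assumptions, not a prescribed method: the function names are hypothetical, `r`, `n`, `k` are assumed to be the number of tagged objects, the sample size, and the observed number of tagged objects in the sample, and the search over candidate population sizes is capped at an arbitrary `m_max`.

```python
from math import comb

def likelihood(m, r, n, k):
    """P(Y = k) when the population size is m."""
    return comb(r, k) * comb(m - r, n - k) / comb(m, n)

def point_estimate(r, n, k):
    """The simple estimate n r / k (undefined when k = 0)."""
    return n * r / k

def mle_by_search(r, n, k, m_max):
    """Population size in [r + n - k, m_max] with the largest likelihood."""
    m_min = r + n - k  # r tagged plus n - k untagged objects were seen
    return max(range(m_min, m_max + 1), key=lambda m: likelihood(m, r, n, k))

# 200 fish tagged, 100 recaptured, 10 of them tagged.
print(point_estimate(200, 100, 10))  # 2000.0

# Small assumed example where n r / k = 17.5 is not an integer.
print(mle_by_search(7, 5, 2, 200), (5 * 7) // 2)
```

When \(n r / k\) is itself an integer, the likelihoods at that value and at the value one less are equal, so a numerical search may return either one.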
For another approach to estimating , see the section on Order Statistics.
Suppose now that the sampling is with replacement, even though this is unrealistic in most applications. In this case, \(Y\) has the binomial distribution with parameters \(n\) and \(\frac{r}{m}\).
Show that (a) \(E\left(\frac{m}{n} Y\right) = r\) and (b) \(\text{var}\left(\frac{m}{n} Y\right) = \frac{r (m - r)}{n}\).
Thus, the estimator of \(r\) with \(m\) known is still unbiased, but has larger mean square error. Hence sampling without replacement works better, for any values of the parameters, than sampling with replacement.
In the ball and urn experiment, select sampling with replacement. For selected values of the parameters, run the experiment 100 times, updating after each run.
A batch of 100 computer chips contains 10 defective chips. Five chips are chosen at random, without replacement.
A club contains 50 members; 20 are men and 30 are women. A committee of 10 members is chosen at random.
A small pond contains 1000 fish; 100 are tagged. Suppose that 20 fish are caught.
Forty percent of the registered voters in a certain district prefer candidate \(A\). Suppose that 10 voters are chosen at random.
Suppose that 10 memory chips are sampled at random and without replacement from a batch of 100 chips. The chips are tested and 2 are defective. Estimate the number of defective chips in the entire batch.
A voting district has 5000 registered voters. Suppose that 100 voters are selected at random and polled, and that 40 prefer candidate \(A\). Estimate the number of voters in the district who prefer candidate \(A\).
From a certain lake, 200 fish are caught, tagged and returned to the lake. Then 100 fish are caught and it turns out that 10 are tagged. Estimate the population of fish in the lake.
Recall that the general card experiment is to select \(n\) cards at random and without replacement from a standard deck of 52 cards. The special case \(n = 5\) is the poker experiment and the special case \(n = 13\) is the bridge experiment.
In a poker hand, find the probability density function, mean, and variance of the following random variables:
In a bridge hand, find the probability density function, mean, and variance of the following random variables:
An interesting thing to do in almost any parametric probability model is to randomize one or more of the parameters. Done in the right way, this often leads to an interesting new parametric model, since the distribution of the randomized parameter will often itself belong to a parametric family. This is also the natural setting to apply Bayes' theorem.
In this section, we will randomize the number of type 1 objects in the basic hypergeometric model. Specifically, we assume that we have \(m\) objects in the population, as before. However, instead of a fixed number \(r\) of type 1 objects, we assume that each of the \(m\) objects in the population, independently of the others, is type 1 with probability \(p\) and type 0 with probability \(1 - p\). We have eliminated one parameter, \(r\), in favor of a new parameter \(p\) with values in the interval \([0, 1]\). Let \(U_i\) denote the type of the \(i\)th object in the population, so that \(\boldsymbol{U} = (U_1, U_2, \ldots, U_m)\) is a sequence of Bernoulli trials with success parameter \(p\). Let \(V = \sum_{i=1}^m U_i\) denote the number of type 1 objects in the population, so that \(V\) has the binomial distribution with parameters \(m\) and \(p\).
As before, we sample \(n\) objects from the population. Again we let \(X_i\) denote the type of the \(i\)th object sampled, and we let \(Y = \sum_{i=1}^n X_i\) denote the number of type 1 objects in the sample. We will consider sampling with and without replacement. In the first case, the sample size can be any positive integer, but in the second case, the sample size cannot exceed the population size. The key technique in the analysis of the randomized urn is to condition on \(V\). If we know that \(V = r\), then the model reduces to the model studied above: a population of size \(m\) with \(r\) type 1 objects, and a sample of size \(n\).
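Show that with either type of sampling, \(P(X_i = 1) = E\left(\frac{V}{m}\right) = p\), where \(V\) is the random number of type 1 objects in the population of size \(m\).
Thus, in either model, \(\boldsymbol{X} = (X_1, X_2, \ldots, X_n)\) is a sequence of identically distributed indicator variables. Ah, but what about dependence?
Suppose that the sampling is without replacement. Let \((x_1, x_2, \ldots, x_n) \in \{0, 1\}^n\) and let \(y = x_1 + x_2 + \cdots + x_n\). Show that \(P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = E\left[\frac{V^{(y)} (m - V)^{(n - y)}}{m^{(n)}}\right] = p^y (1 - p)^{n - y}\), where \(j^{(i)} = j (j - 1) \cdots (j - i + 1)\).
From the joint distribution in the previous exercise, we see that \(\boldsymbol{X}\) is a sequence of Bernoulli trials with success parameter \(p\), and hence \(Y\) has the binomial distribution with parameters \(n\) and \(p\). We could also argue that \(\boldsymbol{X}\) is a Bernoulli trials sequence directly, by noting that \(\{X_1, X_2, \ldots, X_n\}\) is a randomly chosen subset of \(\{U_1, U_2, \ldots, U_m\}\).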
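Suppose now that the sampling is with replacement. Again, let \((x_1, x_2, \ldots, x_n) \in \{0, 1\}^n\) and let \(y = x_1 + x_2 + \cdots + x_n\). Show that \(P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = E\left[\frac{V^y (m - V)^{n - y}}{m^n}\right]\).
A closed form expression for the joint distribution of \(\boldsymbol{X}\), in terms of the parameters \(m\), \(n\), and \(p\), is not easy, but it is at least clear that the joint distribution will not be the same as the one when the sampling is without replacement. Thus, \(\boldsymbol{X}\) is a dependent sequence. Note however that \(\boldsymbol{X}\) is an exchangeable sequence, since the joint distribution is invariant under a permutation of the coordinates (this is a simple consequence of the fact that the joint distribution depends only on the sum \(y\)).
Note that \(P(Y = k) = \binom{n}{k} E\left[\frac{V^k (m - V)^{n - k}}{m^n}\right]\) for \(k \in \{0, 1, \ldots, n\}\).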
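Let's compute the covariance and correlation of a pair of type variables when the sampling is with replacement. Suppose that \(i\) and \(j\) are distinct indices. Show that (a) \(\text{cov}(X_i, X_j) = \frac{p (1 - p)}{m}\) and (b) \(\text{cor}(X_i, X_j) = \frac{1}{m}\).
Now we can get the mean and variance of \(Y\). Show that (a) \(E(Y) = n p\) and (b) \(\text{var}(Y) = n p (1 - p) \frac{m + n - 1}{m}\).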
Let's conclude with an interesting observation: For the randomized urn, the sequence of type variables \(\boldsymbol{X}\) is a sequence of independent variables when the sampling is without replacement but a sequence of dependent variables when the sampling is with replacement, just the opposite of the situation for the deterministic urn with a fixed number of type 1 objects.
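Both claims about the randomized urn can be checked exactly by conditioning on the number of type 1 objects in the population. The sketch below is illustrative only: the function names are hypothetical, `m`, `n`, `p` are assumed to be the population size, the sample size, and the probability that an object is type 1, and `v` ranges over the possible values of the random number of type 1 objects.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability with parameters n and p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def hypergeom_pmf(k, m, r, n):
    """P(Y = k) for a fixed urn, sampling without replacement."""
    return comb(r, k) * comb(m - r, n - k) / comb(m, n)

def pmf_without_replacement(k, m, n, p):
    """P(Y = k) for the randomized urn, averaging over V = v (needs n <= m)."""
    return sum(binom_pmf(v, m, p) * hypergeom_pmf(k, m, v, n)
               for v in range(m + 1))

def variance_with_replacement(m, n, p):
    """var(Y) for the randomized urn; given V = v, Y is binomial(n, v/m)."""
    e_y = e_y2 = 0.0
    for v in range(m + 1):
        q, w = v / m, binom_pmf(v, m, p)
        e_y += w * n * q
        e_y2 += w * (n * q * (1 - q) + (n * q) ** 2)
    return e_y2 - e_y ** 2

# Illustrative parameters (assumed).
m, n, p = 12, 5, 0.3
for k in range(n + 1):
    print(k, pmf_without_replacement(k, m, n, p), binom_pmf(k, n, p))
print(variance_with_replacement(m, n, p), n * p * (1 - p) * (m + n - 1) / m)
```

Without replacement, the two columns should coincide, consistent with the binomial distribution with parameters \(n\) and \(p\). With replacement, the variance exceeds \(n p (1 - p)\) by the factor \((m + n - 1) / m\), which reflects the positive correlation between draws.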