
3. The Multivariate Hypergeometric Distribution

Basic Theory

As in the basic sampling model, we start with a finite population $D$ consisting of $m$ objects. In this section, we suppose in addition that each object is one of $k$ types; that is, we have a multi-type population. For example, we could have an urn with balls of several different colors, or a population of voters who are either democrat, republican, or independent. Let $D_i$ denote the subset of all type $i$ objects and let $m_i = \#(D_i)$ for $i \in \{1, 2, \ldots, k\}$. Thus

$$D = \bigcup_{i=1}^k D_i, \quad m = \sum_{i=1}^k m_i$$

The dichotomous model considered earlier is clearly a special case, with $k = 2$. As in the basic sampling model, we sample $n$ objects at random from $D$:

$$\boldsymbol{X} = (X_1, X_2, \ldots, X_n)$$

where $X_i \in D$ is the $i$th object chosen. Now let $Y_i$ denote the number of type $i$ objects in the sample, for $i \in \{1, 2, \ldots, k\}$. Note that

$$\sum_{i=1}^k Y_i = n$$

so if we know the values of $k - 1$ of the counting variables, we can find the value of the remaining counting variable. As with any counting variable, we can express $Y_i$ as a sum of indicator variables:

Show that

$$Y_i = \sum_{j=1}^n I_{i j} \text{ where } I_{i j} = \begin{cases} 1, & X_j \in D_i \\ 0, & X_j \notin D_i \end{cases}$$

We assume initially that the sampling is without replacement, since this is the realistic case in most applications.

The Joint Distribution

Basic combinatorial arguments can be used to derive the probability density function of the random vector of counting variables. Recall that since the sampling is without replacement, the unordered sample is uniformly distributed over the combinations of size $n$ chosen from $D$.

Show that

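$$\mathbb{P}(Y_1 = j_1, Y_2 = j_2, \ldots, Y_k = j_k) = \frac{\binom{m_1}{j_1} \binom{m_2}{j_2} \cdots \binom{m_k}{j_k}}{\binom{m}{n}} \text{ for } (j_1, j_2, \ldots, j_k) \in \mathbb{N}^k \text{ with } \sum_{i=1}^k j_i = n$$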

The distribution of $(Y_1, Y_2, \ldots, Y_k)$ is called the multivariate hypergeometric distribution with parameters $m$, $(m_1, m_2, \ldots, m_k)$, and $n$. We also say that $(Y_1, Y_2, \ldots, Y_{k-1})$ has this distribution (recall again that the values of any $k - 1$ of the variables determine the value of the remaining variable). Usually it is clear from context which meaning is intended. The ordinary hypergeometric distribution corresponds to $k = 2$.
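As a quick numerical companion to the combinatorial formula for the joint density, here is a minimal Python sketch. The function name `mvhg_pmf` and the urn with 5, 4, and 3 balls of three colors are illustrative assumptions, not part of the text. Exact rational arithmetic is used so that the probabilities can be checked to sum to 1.

```python
from fractions import Fraction
from itertools import product
from math import comb


def mvhg_pmf(counts, type_sizes, n):
    """P(Y_1 = j_1, ..., Y_k = j_k) when n objects are sampled without replacement."""
    if sum(counts) != n or any(j < 0 for j in counts):
        return Fraction(0)
    numerator = 1
    for m_i, j_i in zip(type_sizes, counts):
        numerator *= comb(m_i, j_i)  # comb returns 0 when j_i > m_i
    return Fraction(numerator, comb(sum(type_sizes), n))


# Hypothetical urn: 5 red, 4 green, 3 blue balls; draw n = 4 without replacement.
type_sizes, n = (5, 4, 3), 4
support = [j for j in product(range(n + 1), repeat=len(type_sizes)) if sum(j) == n]
total = sum(mvhg_pmf(j, type_sizes, n) for j in support)
print(total)  # 1
```

Summing over all count vectors with total $n$ is a cheap sanity check on any implementation of the density.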

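Show the following alternate form of the multivariate hypergeometric probability density function in two ways: combinatorially, by considering the ordered sample uniformly distributed over the permutations of size $n$ chosen from $D$, and algebraically, starting with the result in Exercise 2.

$$\mathbb{P}(Y_1 = j_1, Y_2 = j_2, \ldots, Y_k = j_k) = \binom{n}{j_1, j_2, \ldots, j_k} \frac{m_1^{(j_1)} m_2^{(j_2)} \cdots m_k^{(j_k)}}{m^{(n)}} \text{ for } (j_1, j_2, \ldots, j_k) \in \mathbb{N}^k \text{ with } \sum_{i=1}^k j_i = n$$

Here $\binom{n}{j_1, j_2, \ldots, j_k} = n! / (j_1! \, j_2! \cdots j_k!)$ is the multinomial coefficient and $a^{(j)} = a (a - 1) \cdots (a - j + 1)$ is the falling power.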
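The equivalence of the two forms of the density can be spot-checked numerically. The sketch below is only an illustration under assumed names (`falling`, `multinomial_coeff`, `pmf_ordered_form`, `pmf_combination_form`) and an assumed population with type sizes 5, 4, 3; it does not replace the two proofs asked for.

```python
from fractions import Fraction
from math import comb, factorial, prod


def falling(a, j):
    """Falling power a (a - 1) ... (a - j + 1); equals 1 when j = 0."""
    return prod(a - t for t in range(j))


def multinomial_coeff(counts):
    """n! / (j_1! j_2! ... j_k!) with n = sum(counts)."""
    result = factorial(sum(counts))
    for j in counts:
        result //= factorial(j)
    return result


def pmf_ordered_form(counts, type_sizes):
    """Density written with falling powers (ordered-sample argument)."""
    n, m = sum(counts), sum(type_sizes)
    ways = multinomial_coeff(counts) * prod(
        falling(m_i, j) for m_i, j in zip(type_sizes, counts)
    )
    return Fraction(ways, falling(m, n))


def pmf_combination_form(counts, type_sizes):
    """Density written with binomial coefficients (unordered-sample argument)."""
    n, m = sum(counts), sum(type_sizes)
    ways = prod(comb(m_i, j) for m_i, j in zip(type_sizes, counts))
    return Fraction(ways, comb(m, n))


# Hypothetical population with type sizes (5, 4, 3).
print(pmf_ordered_form((2, 1, 1), (5, 4, 3)) == pmf_combination_form((2, 1, 1), (5, 4, 3)))
```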

The Marginal Distributions

Show that $Y_i$ has the hypergeometric distribution with parameters $m$, $m_i$, and $n$. Give both a probabilistic argument, based on the sampling model, and an analytic derivation, based on the joint probability density function in Exercise 2.

$$\mathbb{P}(Y_i = j) = \frac{\binom{m_i}{j} \binom{m - m_i}{n - j}}{\binom{m}{n}}, \quad j \in \{0, 1, \ldots, n\}$$
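The analytic derivation amounts to summing the joint density over the other counting variables. A brute-force version of that sum, under assumed names (`joint_pmf`, `marginal_from_joint`, `hypergeometric_pmf`) and a hypothetical population with type sizes 5, 4, 3, might look like this:

```python
from fractions import Fraction
from itertools import product
from math import comb, prod


def joint_pmf(counts, type_sizes, n):
    """Multivariate hypergeometric density in its binomial-coefficient form."""
    if sum(counts) != n:
        return Fraction(0)
    ways = prod(comb(m_i, j) for m_i, j in zip(type_sizes, counts))
    return Fraction(ways, comb(sum(type_sizes), n))


def marginal_from_joint(i, j, type_sizes, n):
    """P(Y_i = j), obtained by summing the joint density over the other counts."""
    k = len(type_sizes)
    return sum(
        joint_pmf(c, type_sizes, n)
        for c in product(range(n + 1), repeat=k)
        if sum(c) == n and c[i] == j
    )


def hypergeometric_pmf(j, m, m_i, n):
    """Ordinary hypergeometric density with parameters m, m_i, n."""
    return Fraction(comb(m_i, j) * comb(m - m_i, n - j), comb(m, n))


type_sizes, n = (5, 4, 3), 4  # hypothetical example
m = sum(type_sizes)
marginals_agree = all(
    marginal_from_joint(0, j, type_sizes, n) == hypergeometric_pmf(j, m, type_sizes[0], n)
    for j in range(n + 1)
)
print(marginals_agree)
```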

Grouping

The multivariate hypergeometric distribution is preserved when the counting variables are combined. Specifically, suppose that $(A_1, A_2, \ldots, A_l)$ is a partition of the index set $\{1, 2, \ldots, k\}$ into nonempty, disjoint subsets. Let

$$W_j = \sum_{i \in A_j} Y_i, \quad r_j = \sum_{i \in A_j} m_i \text{ for } j \in \{1, 2, \ldots, l\}$$

Show that $(W_1, W_2, \ldots, W_l)$ has the multivariate hypergeometric distribution with parameters $m$, $(r_1, r_2, \ldots, r_l)$, and $n$.
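The grouping property can be checked by exhaustive enumeration on a small case. In the sketch below the four type sizes, the partition (written with 0-based indices), and the function names are all illustrative assumptions.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product
from math import comb, prod


def joint_pmf(counts, type_sizes, n):
    """Multivariate hypergeometric density in its binomial-coefficient form."""
    if sum(counts) != n:
        return Fraction(0)
    ways = prod(comb(m_i, j) for m_i, j in zip(type_sizes, counts))
    return Fraction(ways, comb(sum(type_sizes), n))


def grouped_distribution(type_sizes, n, partition):
    """Distribution of (W_1, ..., W_l), accumulated directly from the joint density."""
    dist = defaultdict(Fraction)
    for c in product(range(n + 1), repeat=len(type_sizes)):
        if sum(c) == n:
            w = tuple(sum(c[i] for i in block) for block in partition)
            dist[w] += joint_pmf(c, type_sizes, n)
    return dist


# Hypothetical example: four types, grouped as {0, 1} and {2, 3} (0-based indices).
type_sizes, n = (5, 4, 3, 2), 4
partition = [(0, 1), (2, 3)]
r = tuple(sum(type_sizes[i] for i in block) for block in partition)

dist = grouped_distribution(type_sizes, n, partition)
grouping_holds = all(p == joint_pmf(w, r, n) for w, p in dist.items())
print(grouping_holds)
```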

Conditioning

The multivariate hypergeometric distribution is also preserved when some of the counting variables are observed. Specifically, suppose that $(A, B)$ is a partition of the index set $\{1, 2, \ldots, k\}$ into nonempty, disjoint subsets. Suppose that we observe $Y_j = y_j$ for $j \in B$. Let

$$z = n - \sum_{j \in B} y_j, \quad r = \sum_{i \in A} m_i$$

Show that the conditional distribution of $(Y_i : i \in A)$ given $(Y_j = y_j : j \in B)$ is multivariate hypergeometric with parameters $r$, $(m_i : i \in A)$, and $z$.

Combinations of the basic results in Exercise 5 and Exercise 6 can be used to compute any marginal or conditional distributions of the counting variables.
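The conditioning property can also be verified on a small case by computing the conditional density straight from its definition. This is a sketch under assumptions: three types of sizes 5, 4, 3, a sample of size 4, and one observed count for the last type (0-based index 2); the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb, prod


def joint_pmf(counts, type_sizes, n):
    """Multivariate hypergeometric density in its binomial-coefficient form."""
    if sum(counts) != n:
        return Fraction(0)
    ways = prod(comb(m_i, j) for m_i, j in zip(type_sizes, counts))
    return Fraction(ways, comb(sum(type_sizes), n))


def conditional_pmf(counts, observed, type_sizes, n):
    """P(Y = counts | Y_j = y_j for j in observed), from the definition."""
    k = len(type_sizes)
    consistent = [
        c for c in product(range(n + 1), repeat=k)
        if sum(c) == n and all(c[j] == y for j, y in observed.items())
    ]
    denominator = sum(joint_pmf(c, type_sizes, n) for c in consistent)
    return joint_pmf(counts, type_sizes, n) / denominator


type_sizes, n = (5, 4, 3), 4
observed = {2: 1}                      # we observe Y_3 = 1 (0-based index 2)
z = n - sum(observed.values())         # remaining sample size
unobserved_sizes = type_sizes[:2]      # (m_i : i in A); r is their sum

conditioning_holds = all(
    conditional_pmf((a, z - a, 1), observed, type_sizes, n)
    == joint_pmf((a, z - a), unobserved_sizes, z)
    for a in range(z + 1)
)
print(conditioning_holds)
```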

Moments

We will compute the mean, variance, covariance, and correlation of the counting variables. Results from the hypergeometric distribution and the representation in terms of indicator variables in Exercise 1 are the main tools.

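Show that for $i \in \{1, 2, \ldots, k\}$,

  1. $\mathbb{E}(Y_i) = n \frac{m_i}{m}$
  2. $\operatorname{var}(Y_i) = n \frac{m_i}{m} \left(1 - \frac{m_i}{m}\right) \frac{m - n}{m - 1}$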
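The mean and variance formulas can be compared with moments computed directly from the hypergeometric marginal density. The parameter values below ($m = 12$, $m_i = 5$, $n = 4$) are an arbitrary illustration, and the helper name `marginal_pmf` is hypothetical.

```python
from fractions import Fraction
from math import comb


def marginal_pmf(j, m, m_i, n):
    """Hypergeometric density of Y_i with parameters m, m_i, n."""
    return Fraction(comb(m_i, j) * comb(m - m_i, n - j), comb(m, n))


m, m_i, n = 12, 5, 4  # hypothetical parameter values

# Moments computed from the density by brute force.
mean = sum(j * marginal_pmf(j, m, m_i, n) for j in range(n + 1))
second_moment = sum(j * j * marginal_pmf(j, m, m_i, n) for j in range(n + 1))
variance = second_moment - mean ** 2

# Closed forms from the exercise.
formula_mean = Fraction(n * m_i, m)
formula_variance = n * Fraction(m_i, m) * (1 - Fraction(m_i, m)) * Fraction(m - n, m - 1)

print(mean == formula_mean, variance == formula_variance)
```

The factor $(m - n) / (m - 1)$ is what distinguishes this variance from the with-replacement case; it equals 1 when $n = 1$ and 0 when $n = m$.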

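Suppose that $i$ and $j$ are distinct elements of $\{1, 2, \ldots, k\}$ and that $r$ and $s$ are distinct elements of $\{1, 2, \ldots, n\}$. Show that

  1. $\operatorname{cov}(I_{i r}, I_{j r}) = -\frac{m_i m_j}{m^2}$
  2. $\operatorname{cov}(I_{i r}, I_{j s}) = \frac{m_i m_j}{m^2 (m - 1)}$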

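Suppose that $i$ and $j$ are distinct elements of $\{1, 2, \ldots, k\}$ and that $r$ and $s$ are distinct elements of $\{1, 2, \ldots, n\}$. Show that

  1. $\operatorname{cor}(I_{i r}, I_{j r}) = -\sqrt{\frac{m_i m_j}{(m - m_i)(m - m_j)}}$
  2. $\operatorname{cor}(I_{i r}, I_{j s}) = \frac{1}{m - 1} \sqrt{\frac{m_i m_j}{(m - m_i)(m - m_j)}}$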

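In particular, for distinct $i$ and $j$, the indicators $I_{i r}$ and $I_{j r}$ for the same draw are negatively correlated, while $I_{i r}$ and $I_{j s}$ for distinct draws $r$ and $s$ are positively correlated: a type $i$ object on draw $r$ removes a competitor of the type $j$ objects from the pool available on draw $s$. Does this result seem reasonable?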

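Use the results of Exercise 7 and Exercise 8 to show that for distinct $i$ and $j$ in $\{1, 2, \ldots, k\}$,

  1. $\operatorname{cov}(Y_i, Y_j) = -n \frac{m_i m_j}{m^2} \frac{m - n}{m - 1}$
  2. $\operatorname{cor}(Y_i, Y_j) = -\sqrt{\frac{m_i m_j}{(m - m_i)(m - m_j)}}$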
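The covariance and correlation of two counting variables can be checked against a brute-force computation from the joint density. The sketch assumes a hypothetical population with type sizes 5, 4, 3 and sample size 4; to stay in exact arithmetic it compares the squared correlation rather than taking a square root.

```python
from fractions import Fraction
from itertools import product
from math import comb, prod


def joint_pmf(counts, type_sizes, n):
    """Multivariate hypergeometric density in its binomial-coefficient form."""
    if sum(counts) != n:
        return Fraction(0)
    ways = prod(comb(m_i, j) for m_i, j in zip(type_sizes, counts))
    return Fraction(ways, comb(sum(type_sizes), n))


def expectation(g, type_sizes, n):
    """E[g(Y_1, ..., Y_k)] by enumerating every count vector with total n."""
    k = len(type_sizes)
    return sum(
        g(c) * joint_pmf(c, type_sizes, n)
        for c in product(range(n + 1), repeat=k)
        if sum(c) == n
    )


type_sizes, n = (5, 4, 3), 4  # hypothetical example
m, (m_1, m_2) = sum(type_sizes), type_sizes[:2]

e1 = expectation(lambda c: c[0], type_sizes, n)
e2 = expectation(lambda c: c[1], type_sizes, n)
cov = expectation(lambda c: c[0] * c[1], type_sizes, n) - e1 * e2
var1 = expectation(lambda c: c[0] ** 2, type_sizes, n) - e1 ** 2
var2 = expectation(lambda c: c[1] ** 2, type_sizes, n) - e2 ** 2

formula_cov = -n * Fraction(m_1 * m_2, m * m) * Fraction(m - n, m - 1)
formula_cor_squared = Fraction(m_1 * m_2, (m - m_1) * (m - m_2))

print(cov == formula_cov, cov ** 2 / (var1 * var2) == formula_cor_squared)
```

Note that the correlation does not depend on the sample size $n$, which the enumeration makes easy to confirm by changing `n`.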

Sampling with Replacement

Suppose now that the sampling is with replacement, even though this is usually not realistic in applications.

Show that the types of the objects in the sample form a sequence of $n$ multinomial trials with parameters $\left(\frac{m_1}{m}, \frac{m_2}{m}, \ldots, \frac{m_k}{m}\right)$.

The following results now follow immediately from the general theory of multinomial trials, although modifications of the arguments above could also be used.

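Show that $(Y_1, Y_2, \ldots, Y_k)$ has the multinomial distribution with parameters $n$ and $\left(\frac{m_1}{m}, \frac{m_2}{m}, \ldots, \frac{m_k}{m}\right)$:

$$\mathbb{P}(Y_1 = j_1, Y_2 = j_2, \ldots, Y_k = j_k) = \binom{n}{j_1, j_2, \ldots, j_k} \frac{m_1^{j_1} m_2^{j_2} \cdots m_k^{j_k}}{m^n} \text{ for } (j_1, j_2, \ldots, j_k) \in \mathbb{N}^k \text{ with } \sum_{i=1}^k j_i = n$$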

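Show that for distinct $i$ and $j$ in $\{1, 2, \ldots, k\}$,

  1. $\mathbb{E}(Y_i) = n \frac{m_i}{m}$
  2. $\operatorname{var}(Y_i) = n \frac{m_i}{m} \left(1 - \frac{m_i}{m}\right)$
  3. $\operatorname{cov}(Y_i, Y_j) = -n \frac{m_i m_j}{m^2}$
  4. $\operatorname{cor}(Y_i, Y_j) = -\sqrt{\frac{m_i m_j}{(m - m_i)(m - m_j)}}$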

Convergence to the Multinomial Distribution

Suppose that the population size $m$ is very large compared to the sample size $n$. In this case, it seems reasonable that sampling without replacement is not much different from sampling with replacement, and hence the multivariate hypergeometric distribution should be well approximated by the multinomial distribution. The following exercise makes this observation precise. Practically, it is a valuable result, since in many cases we do not know the population size exactly.

Suppose that $m_i$ depends on $m$ and that $\frac{m_i}{m} \to p_i$ as $m \to \infty$ for $i \in \{1, 2, \ldots, k\}$. Show that for fixed $n$, the multivariate hypergeometric probability density function with parameters $m$, $(m_1, m_2, \ldots, m_k)$, and $n$ converges to the multinomial probability density function with parameters $n$ and $(p_1, p_2, \ldots, p_k)$. Hint: Use the representation in Exercise 3.
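The convergence can be illustrated numerically, though of course not proved that way. In the sketch below the proportions $(0.5, 0.3, 0.2)$, the count vector $(2, 2, 1)$, and the population sizes are arbitrary choices made for illustration; the type sizes are scaled so that $m_i / m = p_i$ exactly.

```python
from math import comb, factorial, prod


def mvhg_pmf(counts, type_sizes):
    """Multivariate hypergeometric density (floating point)."""
    n, m = sum(counts), sum(type_sizes)
    return prod(comb(m_i, j) for m_i, j in zip(type_sizes, counts)) / comb(m, n)


def multinomial_pmf(counts, probs):
    """Multinomial density with n = sum(counts) trials."""
    coeff = factorial(sum(counts))
    for j in counts:
        coeff //= factorial(j)
    return coeff * prod(p ** j for p, j in zip(probs, counts))


probs = (0.5, 0.3, 0.2)   # hypothetical limiting proportions
counts = (2, 2, 1)        # fixed sample of size n = 5
limit = multinomial_pmf(counts, probs)

errors = []
for scale in (1, 10, 100, 1000):
    type_sizes = (5 * scale, 3 * scale, 2 * scale)  # m = 10 * scale
    errors.append(abs(mvhg_pmf(counts, type_sizes) - limit))
    print(sum(type_sizes), errors[-1])
```

In this example the absolute error shrinks roughly in proportion to $1 / m$.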

Examples and Applications

A population of 100 voters consists of 40 republicans, 35 democrats and 25 independents. A random sample of 10 voters is chosen.

  1. Find the joint density function of the number of republicans, number of democrats, and number of independents in the sample.
  2. Find the mean of each variable in part 1.
  3. Find the variance of each variable in part 1.
  4. Find the covariance of each pair of variables in part 1.
  5. Find the probability that the sample contains at least 4 republicans, at least 3 democrats, and at least 2 independents.
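The computations in this exercise are easy to set up in code once the general formulas are in hand. The following is one possible sketch, using exact fractions; the variable and function names are illustrative.

```python
from fractions import Fraction
from math import comb

sizes = {"republican": 40, "democrat": 35, "independent": 25}
m, n = sum(sizes.values()), 10


def joint_pmf(r, d, i):
    """P(r republicans, d democrats, i independents) in a sample of size 10."""
    if r + d + i != n:
        return Fraction(0)
    return Fraction(comb(40, r) * comb(35, d) * comb(25, i), comb(m, n))


means = {name: Fraction(n * m_i, m) for name, m_i in sizes.items()}
variances = {
    name: n * Fraction(m_i, m) * (1 - Fraction(m_i, m)) * Fraction(m - n, m - 1)
    for name, m_i in sizes.items()
}


def covariance(m_i, m_j):
    """Covariance of the counts for two distinct types of sizes m_i and m_j."""
    return -n * Fraction(m_i * m_j, m * m) * Fraction(m - n, m - 1)


# At least 4 republicans, at least 3 democrats, at least 2 independents.
prob = sum(
    joint_pmf(r, d, n - r - d)
    for r in range(4, n + 1)
    for d in range(3, n + 1)
    if n - r - d >= 2
)
print(means, variances, covariance(40, 35), float(prob))
```

Only three count vectors satisfy the constraints in the last part, namely $(4, 3, 3)$, $(4, 4, 2)$, and $(5, 3, 2)$, so the sum can also be evaluated by hand.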

Cards

Recall that the general card experiment is to select $n$ cards at random and without replacement from a standard deck of 52 cards. The special case $n = 5$ is the poker experiment and the special case $n = 13$ is the bridge experiment.

In a bridge hand, find the probability density function of

  1. the number of spades, number of hearts, and number of diamonds.
  2. the number of spades and number of hearts.
  3. the number of spades.
  4. the number of red cards and the number of black cards.

In a bridge hand,

  1. Find the mean and variance of the number of spades.
  2. Find the covariance and correlation between the number of spades and the number of hearts.
  3. Find the mean and variance of the number of red cards.

In a bridge hand,

  1. Find the conditional probability density function of the number of spades and the number of hearts, given that the hand has 4 diamonds.
  2. Find the conditional probability density function of the number of spades given that the hand has 3 hearts and 2 diamonds.

In the card experiment, a hand that does not contain any cards of a particular suit is said to be void in that suit.

Use the inclusion-exclusion rule to show that the probability that a poker hand is void in at least one suit is

$$\frac{1913496}{2598960} \approx 0.736$$
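The inclusion-exclusion sum can be written out in a few lines. In this sketch the function name and its default arguments are illustrative; the term for $v$ void suits counts the ways to pick the $v$ suits and then draw all $n$ cards from the remaining suits.

```python
from fractions import Fraction
from math import comb


def prob_void_some_suit(n, suits=4, suit_size=13):
    """P(a hand of n cards is void in at least one suit), by inclusion-exclusion."""
    deck = suits * suit_size
    total = 0
    for v in range(1, suits + 1):
        # Choose v suits to be void, then all n cards from the other suits.
        total += (-1) ** (v + 1) * comb(suits, v) * comb(deck - v * suit_size, n)
    return Fraction(total, comb(deck, n))


poker = prob_void_some_suit(5)
print(poker == Fraction(1913496, 2598960), round(float(poker), 3))
```

Calling the same function with `n=13` gives the corresponding probability for a bridge hand.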

In the card experiment, set $n = 5$. Run the simulation 1000 times, updating after each run. Compute the relative frequency of the event that the hand is void in at least one suit. Compare the relative frequency with the true probability given in the previous exercise.
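For readers without access to the card experiment applet, a rough stand-in is a short Monte Carlo simulation. The sketch below is not the applet itself; the function name, the deck encoding, and the fixed seed are assumptions made so that the run is reproducible.

```python
import random


def simulate_void_frequency(n=5, runs=1000, seed=0):
    """Relative frequency of hands of size n that are void in at least one suit."""
    rng = random.Random(seed)
    deck = [(rank, suit) for suit in range(4) for rank in range(13)]
    hits = 0
    for _ in range(runs):
        hand = rng.sample(deck, n)  # n cards without replacement
        suits_present = {suit for _, suit in hand}
        if len(suits_present) < 4:
            hits += 1
    return hits / runs


frequency = simulate_void_frequency()
print(frequency)  # compare with the true probability, about 0.736
```

With 1000 runs the standard error of the relative frequency is about 0.014, so agreement to within a few hundredths is what one should expect.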

Use the inclusion-exclusion rule to show that the probability that a bridge hand is void in at least one suit is

$$\frac{32427298180}{635013559600} \approx 0.051$$