As usual, we start with a random experiment with probability measure \(\P\) on an underlying sample space \(\Omega\). Suppose that \(X\) is a random variable for the experiment, taking values in a set \(S\). The purpose of this section is to study the conditional probability measure given \(X = x\) for \(x \in S\). Thus, if \(E \subseteq \Omega\) is an event for the experiment, we would like to define and study \[\P(E \mid X = x)\] If \(X\) has a discrete distribution, the conditioning event has positive probability, so no new concepts are involved, and the simple definition of conditional probability suffices. When \(X\) has a continuous distribution, however, the conditioning event has probability 0, so a fundamentally new approach is needed.
Suppose first that \(X\) has a discrete distribution with probability density function \(g\). Thus, \(S\) is countable and we can assume that \(g(x) \gt 0\) for \(x \in S\).
If \(E\) is an event in the experiment then \[\P(E \mid X = x) = \frac{\P(E, X = x)}{g(x)}, \quad x \in S\]
The comma separating the events in the numerator of the fraction means "and", and thus functions just like the intersection symbol. This result follows immediately from the definition of conditional probability: \[ \P(E \mid X = x) = \frac{\P(E, X = x)}{\P(X = x)} = \frac{\P(E, X = x)}{g(x)} \]
If \(E\) is an event in the experiment and \(A\) is a subset of \(S\) then \[\P(E, X \in A) = \sum_{x \in A} g(x) \P(E \mid X = x)\]
This result is just a special case of the law of total probability. The countable collection of events \( \left\{\{X = x\}: x \in A\right\} \) partitions \( \{X \in A\} \) so \[ \P(E, X \in A) = \sum_{x \in A} \P(E, X = x) = \sum_{x \in A} \P(E \mid X = x) \P(X = x) = \sum_{x \in A} \P(E \mid X = x) g(x) \]
Conversely, the result in the last exercise completely characterizes the conditional distribution given \(\{X = x\}\).
Suppose that the function \(Q(x, E)\), defined for \(x \in S\) and for events \(E\), satisfies \[\P(E, X \in A) = \sum_{x \in A} g(x) Q(x, E)\] Then \(Q(x, E) = \P(E \mid X = x)\) for all \(x \in S\) and all events \(E\).
Let \( x \in S \) and \( A = \{x\} \). Then the equation gives \( \P(E, X = x) = g(x) Q(x, E) \) and hence \( Q(x, E) = \P(E, X = x) \big/ g(x) = \P(E \mid X = x) \).
Suppose now that \(X\) has a continuous distribution on \(S \subseteq \R^n\), with probability density function \(g\). We assume that \(g(x) \gt 0\) for \(x \in S\). Unlike the discrete case, we cannot use simple conditional probability to define the conditional probability of an event \(E\) given \(\{X = x\}\), because the conditioning event has probability 0 for every \(x\). Nonetheless, the concept should make sense. If we actually run the experiment, \(X\) will take on some value \(x\) (even though a priori, this event occurs with probability 0), and surely the information that \(X = x\) should in general alter the probabilities that we assign to other events. A natural approach is to use the results obtained in the discrete case as definitions in the continuous case. Thus, based on the characterization above, we define the conditional probability \[\P(E \mid X = x), \quad x \in S\] by the requirement that for any (measurable) subset \(A\) of \(S\), \[\P(E, X \in A) = \int_A g(x) \P(E \mid X = x) \, dx \] We will accept the fact that \(\P(E \mid X = x)\) can be defined by this condition. We will return to this point in the section on Conditional Expectation in the chapter on Expected Value.
Again, suppose that \(X\) is a random variable and that \(E\) is an event. From our discussion in the last two subsections, we have the basic formulas for computing the probability of \(E\) by conditioning on \(X\), in the discrete and continuous cases: \[ \begin{align} \P(E) & = \sum_{x \in S} g(x) \P(E \mid X = x) \\ \P(E) & = \int_S g(x) \P(E \mid X = x) \, dx \end{align} \] These formulas are sometimes referred to as the law of total probability. On the other hand, Bayes' Theorem, named after Thomas Bayes, gives a formula for the conditional probability density function of \(X\) given \(E\), in terms of the probability density function of \(X\) and the conditional probability of \(E\) given \(X = x\).
Suppose that \(X\) has probability density function \(g\) and that \(E\) is an event with \(\P(E) \gt 0\). The conditional probability density function of \(X\) given \(E\) is as follows, in the discrete and continuous cases, respectively. \[ \begin{align} g(x \mid E) & = \frac{g(x) \P(E \mid X = x)}{\sum_{s \in S} g(s) \P(E \mid X = s)}, \quad x \in S \\ g(x \mid E) & = \frac{g(x) \P(E \mid X = x)}{\int_S g(s) \P(E \mid X = s) \, ds}, \quad x \in S \end{align} \]
We will give the proof in the continuous case. The discrete case is similar, but simpler. Let \( A \subseteq S \). Recall that \( \P(E) = \int_S g(x) \P(E \mid X = x) \, dx \). Thus \[ \int_A g(x \mid E) \, dx = \int_A \frac{g(x) \P(E \mid X = x)}{\P(E)} \, dx = \frac{1}{\P(E)} \int_A g(x) \P(E \mid X = x) \, dx = \frac{\P(E, X \in A)}{\P(E)} = \P(X \in A \mid E) \] By the meaning of density, \( g(x \mid E) \) is the conditional density of \( X \) given \( E \).
In the context of Bayes' theorem, \(g\) is called the prior probability density function of \(X\) and \(x \mapsto g(x \mid E)\) is the posterior probability density function of \(X\) given \(E\). Note also that the conditional probability density function of \(X\) given \(E\) is proportional to the function \(x \mapsto g(x) \P(E \mid X = x)\); the sum or integral of this function that occurs in the denominator is simply the normalizing constant.
The definitions and results above apply, of course, if \(E\) is an event defined in terms of another random variable for our experiment. Thus, suppose that \(Y\) is a random variable taking values in a set \(T\). Then \((X, Y)\) is a random variable taking values in the product set \(S \times T\). We will assume that \((X, Y)\) has joint probability density function \(f\). In particular, we are assuming one of the standard distribution types: jointly discrete, jointly continuous with a probability density function, or mixed components with a probability density function. To simplify the exposition, we will give explicit results in the discrete and continuous cases. Recall that \( X \) has probability density function \( g \) given below, in the discrete and continuous cases: \[ \begin{align} g(x) & = \sum_{y \in T} f(x, y), \quad x \in S \\ g(x) & = \int_T f(x, y) \, dy, \quad x \in S \end{align} \] Similarly, the probability density function \( h \) of \( Y \) can be obtained by summing \( f \) (in the discrete case) or integrating \( f \) (in the continuous case) over \( x \in S \). In the theorems in this subsection, we will give the proofs in the more interesting continuous case. The proofs in the discrete case are simpler and, again, just use the basic definition of conditional probability.
The function defined below is a probability density function in \(y \in T\) for each \(x \in S\): \[h(y \mid x) = \frac{f(x, y)}{g(x)}, \quad x \in S, \; y \in T\]
The result is simple, since \( g(x) \) is the normalizing constant for \( y \mapsto h(y \mid x) \). Specifically, fix \( x \in S \). Then \( h(y \mid x) \ge 0 \) and
\[ \int_T h(y \mid x) \, dy = \frac{1}{g(x)} \int_T f(x, y) \, dy = \frac{g(x)}{g(x)} = 1\]The next theorem shows that \(y \mapsto h(y \mid x)\) is the conditional probability density function of \(Y\) given \(X = x\).
If \(Y\) has a discrete or continuous distribution, respectively, and \(B \subseteq T\) then \[ \begin{align} \P(Y \in B \mid X = x) & = \sum_{y \in B} h(y \mid x) \\ \P(Y \in B \mid X = x) & = \int_B h(y \mid x) \, dy \end{align} \]
For \( A \subseteq S \), \[ \int_A g(x) \int_B h(y \mid x) \, dy \, dx = \int_A \int_B g(x) h(y \mid x) \, dy \, dx = \int_{A \times B} f(x, y) \, d(x, y) = \P[(X, Y) \in A \times B] = \P(X \in A, Y \in B) \] Hence, the result follows by the defining condition for \( \P(Y \in B \mid X = x) \).
The following theorem gives Bayes' theorem for probability density functions. We use the notation established above, and additionally, let \(x \mapsto g(x \mid y)\) denote the conditional probability density function of \(X\) given \(Y = y\) for \(y \in T\).
If \(X\) has a discrete or continuous distribution, respectively, then \[ \begin{align} g(x \mid y) & = \frac{g(x) h(y \mid x)}{\sum_{s \in S} g(s) h(y \mid s)}, \quad x \in S, \; y \in T \\ g(x \mid y) & = \frac{g(x) h(y \mid x)}{\int_S g(s) h(y \mid s) ds}, \quad x \in S, \; y \in T \end{align}\]
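In the continuous case, the numerator is \( f(x, y) \) while the denominator is \( \int_S f(s, y) \, ds = h(y) \), so the ratio is \( f(x, y) \big/ h(y) = g(x \mid y) \). The discrete case is the same, with a sum in place of the integral.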
In the context of Bayes' theorem, \(g\) is the prior probability density function of \(X\) and \(x \mapsto g(x \mid y)\) is the posterior probability density function of \(X\) given \(Y = y\). Note that the posterior probability density function \(x \mapsto g(x \mid y)\) is proportional to the function \(x \mapsto g(x) h(y \mid x)\). The sum or integral in the denominator is the normalizing constant.
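As a numerical illustration of the two formulas above, here is a minimal Python sketch. It assumes, purely for illustration, the joint density \(f(x, y) = x + y\) on the unit square; the helper `integrate` (a midpoint rule) and the function names are hypothetical choices, not part of the text.

```python
# Illustrative joint density on the unit square: f(x, y) = x + y.
def f(x, y):
    return x + y

def integrate(func, a, b, n=2000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / n
    return sum(func(a + (i + 0.5) * width) for i in range(n)) * width

def g(x):
    """Marginal density of X: integrate f(x, y) over y in [0, 1]."""
    return integrate(lambda y: f(x, y), 0.0, 1.0)

def h_given_x(y, x):
    """Conditional density of Y given X = x, namely f(x, y) / g(x)."""
    return f(x, y) / g(x)

def g_given_y(x, y):
    """Posterior density of X given Y = y, computed from Bayes' theorem."""
    numerator = g(x) * h_given_x(y, x)
    denominator = integrate(lambda s: g(s) * h_given_x(y, s), 0.0, 1.0, n=200)
    return numerator / denominator

# The conditional density integrates to 1 in y, up to rounding error.
total = integrate(lambda y: h_given_x(y, 0.3), 0.0, 1.0, n=200)
```

For this density the marginal works out to \(g(x) = x + \frac{1}{2}\), and the posterior reduces to \((x + y) \big/ (y + \frac{1}{2})\), which the sketch reproduces numerically.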
Intuitively, \(X\) and \(Y\) should be independent if and only if the conditional distributions are the same as the corresponding unconditional distributions.
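The following conditions are equivalent: (a) \(X\) and \(Y\) are independent; (b) \(f(x, y) = g(x) h(y)\) for \(x \in S\), \(y \in T\); (c) \(h(y \mid x) = h(y)\) for \(x \in S\), \(y \in T\); (d) \(g(x \mid y) = g(x)\) for \(x \in S\), \(y \in T\).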
The equivalence of (a) and (b) was established in the section on joint distributions. The equivalence of (b), (c), and (d) follows from (5).
In the exercises that follow, look for special models and distributions that we have studied. A special distribution may be embedded in a larger problem, as a conditional distribution, for example. In particular, a conditional distribution sometimes arises when a parameter of a standard distribution is randomized.
A couple of special distributions will occur frequently in the exercises. First, recall that the discrete uniform distribution on a finite, nonempty set \(S\) has probability density function \(f\) given by \(f(x) = 1 \big/ \#(S)\) for \(x \in S\). This distribution governs an element selected at random from \(S\).
Recall also that Bernoulli trials (named for Jacob Bernoulli) are independent trials, each with two possible outcomes. In the usual language of reliability, the outcomes are called success and failure. The probability of success \(p\) is the same for each trial, and is the basic parameter of the random process. The number of successes in \(n\) Bernoulli trials has the binomial distribution with parameters \(n\) and \(p\). This distribution has probability density function \(f\) given by \(f(x) = \binom{n}{x} p^x (1 - p)^{n - x}\) for \(x \in \{0, 1, \ldots, n\}\). The binomial distribution is studied in more detail in the chapter on Bernoulli trials.
Suppose that two standard, fair dice are rolled and the sequence of scores \((X_1, X_2)\) is recorded. Let \(U = \min\{X_1, X_2\}\) and \(V = \max\{X_1, X_2\}\) denote the minimum and maximum scores, respectively.
\(g(u \mid v)\) | \(u = 1\) | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
\(v = 1\) | 1 | 0 | 0 | 0 | 0 | 0 |
2 | \(\frac{2}{3}\) | \(\frac{1}{3}\) | 0 | 0 | 0 | 0 |
3 | \(\frac{2}{5}\) | \(\frac{2}{5}\) | \(\frac{1}{5}\) | 0 | 0 | 0 |
4 | \(\frac{2}{7}\) | \(\frac{2}{7}\) | \(\frac{2}{7}\) | \(\frac{1}{7}\) | 0 | 0 |
5 | \(\frac{2}{9}\) | \(\frac{2}{9}\) | \(\frac{2}{9}\) | \(\frac{2}{9}\) | \(\frac{1}{9}\) | 0 |
6 | \(\frac{2}{11}\) | \(\frac{2}{11}\) | \(\frac{2}{11}\) | \(\frac{2}{11}\) | \(\frac{2}{11}\) | \(\frac{1}{11}\) |
\(h(v \mid u)\) | \(u = 1\) | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
\(v = 1\) | \(\frac{1}{11}\) | 0 | 0 | 0 | 0 | 0 |
2 | \(\frac{2}{11}\) | \(\frac{1}{9}\) | 0 | 0 | 0 | 0 |
3 | \(\frac{2}{11}\) | \(\frac{2}{9}\) | \(\frac{1}{7}\) | 0 | 0 | 0 |
4 | \(\frac{2}{11}\) | \(\frac{2}{9}\) | \(\frac{2}{7}\) | \(\frac{1}{5}\) | 0 | 0 |
5 | \(\frac{2}{11}\) | \(\frac{2}{9}\) | \(\frac{2}{7}\) | \(\frac{2}{5}\) | \(\frac{1}{3}\) | 0 |
6 | \(\frac{2}{11}\) | \(\frac{2}{9}\) | \(\frac{2}{7}\) | \(\frac{2}{5}\) | \(\frac{2}{3}\) | 1 |
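Conditional tables of this kind can be checked by brute-force enumeration. The following Python sketch is one way to do it, using exact rational arithmetic; the names `joint`, `pdf_u`, `pdf_v`, `g`, and `h` are illustrative choices that mirror the notation of the tables.

```python
from fractions import Fraction
from itertools import product

# Joint density of (U, V) = (min, max) for two fair dice, by enumeration.
joint = {}
for x1, x2 in product(range(1, 7), repeat=2):
    key = (min(x1, x2), max(x1, x2))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 36)

# Marginal densities of U and V, obtained by summing the joint density.
pdf_u = {u: sum(p for (a, _), p in joint.items() if a == u) for u in range(1, 7)}
pdf_v = {v: sum(p for (_, b), p in joint.items() if b == v) for v in range(1, 7)}

def g(u, v):
    """Conditional density of U given V = v."""
    return joint.get((u, v), Fraction(0)) / pdf_v[v]

def h(v, u):
    """Conditional density of V given U = u."""
    return joint.get((u, v), Fraction(0)) / pdf_u[u]

# Each conditional density sums to 1 over its first argument.
column_sums = [sum(h(v, u) for v in range(1, 7)) for u in range(1, 7)]
```

Calling `g(u, v)` or `h(v, u)` for each pair reproduces the table entries as exact fractions.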
In the die-coin experiment, a standard, fair die is rolled and then a fair coin is tossed the number of times showing on the die. Let \(N\) denote the die score and \(Y\) the number of heads.
The joint and marginal probability density functions are given in the first table. The conditional probability density functions of \(N\) given the different values of \(Y\) are recorded in the second table.
\(f(n, y)\) | \(n = 1\) | 2 | 3 | 4 | 5 | 6 | \(h(y)\) |
---|---|---|---|---|---|---|---|
\(y = 0\) | \(\frac{1}{12}\) | \(\frac{1}{24}\) | \(\frac{1}{48}\) | \(\frac{1}{96}\) | \(\frac{1}{192}\) | \(\frac{1}{384}\) | \(\frac{63}{384}\) |
1 | \(\frac{1}{12}\) | \(\frac{1}{12}\) | \(\frac{1}{16}\) | \(\frac{1}{24}\) | \(\frac{5}{192}\) | \(\frac{1}{64}\) | \(\frac{120}{384}\) |
2 | 0 | \(\frac{1}{24}\) | \(\frac{1}{16}\) | \(\frac{1}{16}\) | \(\frac{5}{96}\) | \(\frac{5}{128}\) | \(\frac{99}{384}\) |
3 | 0 | 0 | \(\frac{1}{48}\) | \(\frac{1}{24}\) | \(\frac{5}{96}\) | \(\frac{5}{96}\) | \(\frac{64}{384}\) |
4 | 0 | 0 | 0 | \(\frac{1}{96}\) | \(\frac{5}{192}\) | \(\frac{5}{128}\) | \(\frac{29}{384}\) |
5 | 0 | 0 | 0 | 0 | \(\frac{1}{192}\) | \(\frac{1}{64}\) | \(\frac{8}{384}\) |
6 | 0 | 0 | 0 | 0 | 0 | \(\frac{1}{384}\) | \(\frac{1}{384}\) |
\(g(n)\) | \(\frac{1}{6}\) | \(\frac{1}{6}\) | \(\frac{1}{6}\) | \(\frac{1}{6}\) | \(\frac{1}{6}\) | \(\frac{1}{6}\) | 1 |
\(g(n \mid y)\) | \(n = 1\) | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
\(y = 0\) | \(\frac{32}{63}\) | \(\frac{16}{63}\) | \(\frac{8}{63}\) | \(\frac{4}{63}\) | \(\frac{2}{63}\) | \(\frac{1}{63}\) |
1 | \(\frac{16}{60}\) | \(\frac{16}{60}\) | \(\frac{12}{60}\) | \(\frac{8}{60}\) | \(\frac{5}{60}\) | \(\frac{3}{60}\) |
2 | 0 | \(\frac{16}{99}\) | \(\frac{24}{99}\) | \(\frac{24}{99}\) | \(\frac{20}{99}\) | \(\frac{15}{99}\) |
3 | 0 | 0 | \(\frac{2}{16}\) | \(\frac{4}{16}\) | \(\frac{5}{16}\) | \(\frac{5}{16}\) |
4 | 0 | 0 | 0 | \(\frac{4}{29}\) | \(\frac{10}{29}\) | \(\frac{15}{29}\) |
5 | 0 | 0 | 0 | 0 | \(\frac{1}{4}\) | \(\frac{3}{4}\) |
6 | 0 | 0 | 0 | 0 | 0 | 1 |
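The die-coin tables follow from the fact that, given \(N = n\), the number of heads \(Y\) is binomial with parameters \(n\) and \(\frac{1}{2}\). A short Python sketch of that computation, with exact fractions, might look as follows; the function names are illustrative and simply mirror the table headings.

```python
from fractions import Fraction
from math import comb

def f(n, y):
    """Joint density of (N, Y): fair die score n, then y heads in n fair tosses."""
    if not (1 <= n <= 6 and 0 <= y <= n):
        return Fraction(0)
    return Fraction(1, 6) * comb(n, y) * Fraction(1, 2) ** n

def h(y):
    """Marginal density of Y, obtained by summing the joint density over n."""
    return sum(f(n, y) for n in range(1, 7))

def g_given_y(n, y):
    """Conditional density of N given Y = y."""
    return f(n, y) / h(y)

# The marginal density of Y sums to 1 over y = 0, ..., 6.
total = sum(h(y) for y in range(7))
```

Evaluating `f`, `h`, and `g_given_y` over the grid of values regenerates both tables.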
In the die-coin experiment, select the fair die and coin.
In the coin-die experiment, a fair coin is tossed. If the coin is tails, a standard, fair die is rolled. If the coin is heads, a standard, ace-six flat die is rolled (faces 1 and 6 have probability \(\frac{1}{4}\) each and faces 2, 3, 4, 5 have probability \(\frac{1}{8}\) each). Let \(X\) denote the coin score (0 for tails and 1 for heads) and \(Y\) the die score.
The joint and marginal probability density functions are given in the first table below. The conditional probability density functions of \(X\) given the different values of \(Y\) are recorded in the second table.
\(f(x, y)\) | \(y = 1\) | 2 | 3 | 4 | 5 | 6 | \(g(x)\) |
---|---|---|---|---|---|---|---|
\(x = 0\) | \(\frac{1}{12}\) | \(\frac{1}{12}\) | \(\frac{1}{12}\) | \(\frac{1}{12}\) | \(\frac{1}{12}\) | \(\frac{1}{12}\) | \(\frac{1}{2}\) |
1 | \(\frac{1}{8}\) | \(\frac{1}{16}\) | \(\frac{1}{16}\) | \(\frac{1}{16}\) | \(\frac{1}{16}\) | \(\frac{1}{8}\) | \(\frac{1}{2}\) |
\(h(y)\) | \(\frac{5}{24}\) | \(\frac{7}{48}\) | \(\frac{7}{48}\) | \(\frac{7}{48}\) | \(\frac{7}{48}\) | \(\frac{5}{24}\) | 1 |
\(g(x \mid y)\) | \(y = 1\) | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
\(x = 0\) | \(\frac{2}{5}\) | \(\frac{4}{7}\) | \(\frac{4}{7}\) | \(\frac{4}{7}\) | \(\frac{4}{7}\) | \(\frac{2}{5}\) |
1 | \(\frac{3}{5}\) | \(\frac{3}{7}\) | \(\frac{3}{7}\) | \(\frac{3}{7}\) | \(\frac{3}{7}\) | \(\frac{3}{5}\) |
In the coin-die experiment, select the settings of the previous exercise.
Suppose that a box contains 12 coins: 5 are fair, 4 are biased so that heads comes up with probability \(\frac{1}{3}\), and 3 are two-headed. A coin is chosen at random and tossed 2 times. Let \(P\) denote the probability of heads of the selected coin, and \(X\) the number of heads.
The joint probability density function of \((P, X)\) and the marginal probability density functions are given in the first table. The conditional probability density functions of \(P\) given the different values of \(X\) are recorded in the second table.
\(f(p, x)\) | \(x = 0\) | 1 | 2 | \(g(p)\) |
---|---|---|---|---|
\(p = \frac{1}{2}\) | \(\frac{5}{48}\) | \(\frac{10}{48}\) | \(\frac{5}{48}\) | \(\frac{5}{12}\) |
\(\frac{1}{3}\) | \(\frac{4}{27}\) | \(\frac{4}{27}\) | \(\frac{1}{27}\) | \(\frac{4}{12}\) |
1 | 0 | 0 | \(\frac{1}{4}\) | \(\frac{3}{12}\) |
\(h(x)\) | \(\frac{109}{432}\) | \(\frac{154}{432}\) | \(\frac{169}{432}\) | 1 |
\(g(p \mid x)\) | \(x = 0\) | 1 | 2 |
---|---|---|---|
\(p = \frac{1}{2}\) | \(\frac{45}{109}\) | \(\frac{45}{77}\) | \(\frac{45}{169}\) |
\(\frac{1}{3}\) | \(\frac{64}{109}\) | \(\frac{32}{77}\) | \(\frac{16}{169}\) |
1 | 0 | 0 | \(\frac{108}{169}\) |
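The second table is a direct application of Bayes' theorem with a discrete prior on \(P\) and a binomial likelihood. The sketch below shows one way to compute it in Python; the names `prior`, `likelihood`, and `posterior` are illustrative.

```python
from fractions import Fraction
from math import comb

TOSSES = 2

# Prior density of P: 5 fair coins, 4 coins with p = 1/3, 3 two-headed coins.
prior = {
    Fraction(1, 2): Fraction(5, 12),
    Fraction(1, 3): Fraction(4, 12),
    Fraction(1): Fraction(3, 12),
}

def likelihood(x, p):
    """Probability of x heads in TOSSES tosses of a coin with heads probability p."""
    return comb(TOSSES, x) * p ** x * (1 - p) ** (TOSSES - x)

def posterior(x):
    """Posterior density of P given X = x: prior times likelihood, normalized."""
    weights = {p: prior[p] * likelihood(x, p) for p in prior}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}
```

The normalizing constant `total` inside `posterior` is the marginal density \(h(x)\) from the first table.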
Compare the die-coin experiment with the box of coins experiment. In the first experiment, we toss a coin with a fixed probability of heads a random number of times. In the second experiment, we effectively toss a coin with a random probability of heads a fixed number of times.
Suppose that \(P\) has probability density function \(g(p) = 6 p (1 - p)\) for \(0 \le p \le 1\). Given \(P = p\), a coin with probability of heads \(p\) is tossed 3 times. Let \(X\) denote the number of heads.
Compare the previous experiment with the box of coins experiment. In the previous experiment, we effectively choose a coin from a box with a continuous infinity of coin types. Moreover, the prior distribution of \(P\) and each of the posterior distributions of \(P\) in part (c) are members of the family of beta distributions, one of the reasons for the importance of the beta family. Beta distributions are studied in more detail in the chapter on Special Distributions.
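For the continuous prior \(g(p) = 6 p (1 - p)\) with 3 tosses, the marginal density of \(X\) and the posterior density of \(P\) can be computed exactly. The Python sketch below relies on the standard identity \(\int_0^1 p^a (1 - p)^b \, dp = a! \, b! \big/ (a + b + 1)!\) for nonnegative integers \(a\) and \(b\), which is an assumption brought in from outside this section; the function names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial

TOSSES = 3

def beta_integral(a, b):
    """Integral of p**a * (1 - p)**b over [0, 1] for nonnegative integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def h(x):
    """Marginal density of X: integrate g(p) * P(X = x | P = p) over p."""
    return comb(TOSSES, x) * 6 * beta_integral(x + 1, TOSSES - x + 1)

def posterior(p, x):
    """Posterior density of P given X = x, evaluated at p, from Bayes' theorem."""
    prior = 6 * p * (1 - p)
    likelihood = comb(TOSSES, x) * p ** x * (1 - p) ** (TOSSES - x)
    return prior * likelihood / h(x)

# The marginal density of X sums to 1 over x = 0, 1, 2, 3.
total = sum(h(x) for x in range(TOSSES + 1))
```

The posterior is proportional to \(p^{x + 1} (1 - p)^{4 - x}\), which is the beta shape mentioned in the comparison above.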
Recall that the exponential distribution with rate parameter \(r \gt 0\) has probability density function \(f\) given by \(f(t) = r e^{-r t}\) for \(0 \le t \lt \infty\). The exponential distribution is often used to model random times, under certain assumptions. The exponential distribution is studied in more detail in the chapter on the Poisson Process.
Recall also that the continuous uniform distribution on an interval \([a, b]\), where \(a \lt b\), has probability density function \(f\) given by \(f(x) = \frac{1}{b - a}\) for \(a \le x \le b\). This distribution governs a point selected at random from the interval.
Suppose that there are 5 light bulbs in a box, labeled 1 to 5. The lifetime of bulb \(n\) (in months) has the exponential distribution with rate parameter \(n\). A bulb is selected at random from the box and tested.
Let \(N\) denote the bulb number and \(T\) the lifetime.
\(n\) | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
\(g(n \mid T \gt 1)\) | 0.6364 | 0.2341 | 0.0861 | 0.0317 | 0.0117 |
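The table values come from Bayes' theorem with the event \(\{T \gt 1\}\): the prior on \(N\) is uniform and \(\P(T \gt t \mid N = n) = e^{-n t}\) for an exponential lifetime with rate \(n\). A minimal Python sketch, with illustrative function names:

```python
from math import exp

RATES = [1, 2, 3, 4, 5]

# Uniform prior: each of the 5 bulbs is equally likely to be selected.
prior = {n: 1 / len(RATES) for n in RATES}

def survival(t, rate):
    """P(T > t) for an exponential lifetime with the given rate."""
    return exp(-rate * t)

def posterior_given_survival(t):
    """Conditional density of N given T > t: prior times survival, normalized."""
    weights = {n: prior[n] * survival(t, n) for n in RATES}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

post = posterior_given_survival(1.0)
```

Rounding the entries of `post` to four decimal places gives the row of the table.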
Suppose that \(X\) is uniformly distributed on \(\{1, 2, 3\}\), and given \(X = x\), \(Y\) is uniformly distributed on the interval \([0, x]\).
Recall that the Poisson distribution with parameter \(a \gt 0\) has probability density function \(g(n) = e^{-a} \frac{a^n}{n!}\) for \(n \in \N\). This distribution is widely used to model the number of random points in a region of time or space; the parameter \(a\) is proportional to the size of the region. The Poisson distribution is named for Simeon Poisson, and is studied in more detail in the chapter on the Poisson Process.
Suppose that \(N\) is the number of elementary particles emitted by a sample of radioactive material in a specified period of time, and has the Poisson distribution with parameter \(a\). Each particle emitted, independently of the others, is detected by a counter with probability \(p \in (0, 1)\) and missed with probability \(1 - p\). Let \(Y\) denote the number of particles detected by the counter.
The fact that \(Y\) also has a Poisson distribution is an interesting and characteristic property of the distribution. This property is explored in more depth in the section on splitting the Poisson process.
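The Poisson property of \(Y\) can be checked numerically by conditioning on \(N\): given \(N = n\), the number detected is binomial with parameters \(n\) and \(p\). The Python sketch below compares the resulting sum with the Poisson density with parameter \(a p\), which is the standard form of the result; the truncation level `terms` and the function names are illustrative assumptions.

```python
from math import comb, exp, factorial

def poisson_pdf(n, a):
    """Poisson density with parameter a, evaluated at n."""
    return exp(-a) * a ** n / factorial(n)

def detected_pdf(y, a, p, terms=100):
    """P(Y = y) by conditioning on N, truncating the infinite sum after `terms` terms."""
    return sum(
        poisson_pdf(n, a) * comb(n, y) * p ** y * (1 - p) ** (n - y)
        for n in range(y, y + terms)
    )

# Compare with the Poisson density with parameter a * p.
a, p = 4.0, 0.25
differences = [abs(detected_pdf(y, a, p) - poisson_pdf(y, a * p)) for y in range(8)]
```

For moderate values of \(a\) the truncated sum agrees with the Poisson density to within floating-point error.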
Suppose that \((X, Y)\) has probability density function \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\).
Suppose that \((X, Y)\) has probability density function \(f(x, y) = 2 (x + y)\) for \(0 \le x \le y \le 1\).
Suppose that \((X, Y)\) has probability density function \(f(x, y) = 15 x^2 y\) for \(0 \le x \le y \le 1\).
Suppose that \((X, Y)\) has probability density function \(f(x, y) = 6 x^2 y\) for \(0 \le x \le 1\), \(0 \le y \le 1\).
Suppose that \((X, Y)\) has probability density function \(f(x, y) = 2 e^{-x} e^{-y}\) for \(0 \le x \le y \lt \infty\).
Suppose that \(X\) is uniformly distributed on the interval \((0, 1)\), and that given \(X = x\), \(Y\) is uniformly distributed on the interval \((0, x)\).
Suppose that \(X\) has probability density function \(g(x) = 3 x^2\) for \(0 \lt x \lt 1\). The conditional probability density function of \(Y\) given \(X = x\) is \(h(y \mid x) = \frac{3 y^2}{x^3}\) for \(0 \lt y \lt x\).
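Multivariate uniform distributions give a geometric interpretation of some of the concepts in this section. Recall first that the standard measure \(\lambda_n\) on \(\R^n\) is \[ \lambda_n(A) = \int_A 1 \, dx, \quad A \subseteq \R^n \] In particular, \(\lambda_1(A)\) is the length of \(A \subseteq \R\), \(\lambda_2(A)\) is the area of \(A \subseteq \R^2\), and \(\lambda_3(A)\) is the volume of \(A \subseteq \R^3\).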
More technically, \( \lambda_n \) is \( n \)-dimensional Lebesgue measure on the measurable subsets of \( \R^n \), and is named for Henri Lebesgue. This should not be of concern if you are a new student of probability. On the other hand, if you are interested in the advanced theory, read the following sections:
Suppose now that \(X\) takes values in \(\R^j\), \(Y\) takes values in \(\R^k\), and that \((X, Y)\) is uniformly distributed on a set \(R \subseteq \R^{j+k}\). Thus, by definition, we must have \( 0 \lt \lambda_{j+k}(R) \lt \infty \) and then the joint probability density function of \((X, Y)\) is \( f(x, y) = 1 \big/ \lambda_{j+k}(R)\) for \( (x, y) \in R\). Now let \(S\) and \(T\) be the projections of \(R\) onto \(\R^j\) and \(\R^k\) respectively, defined as follows: \[ \begin{align} S & = \left\{x \in \R^j: (x, y) \in R \text{ for some } y \in \R^k\right\} \\ T & = \left\{y \in \R^k: (x, y) \in R \text{ for some } x \in \R^j\right\} \end{align} \] Note that \(R \subseteq S \times T\). Next we denote the cross-sections at \(x\) and at \(y\), respectively by \[ \begin{align} T_x & = \{y \in T: (x, y) \in R\}, \quad x \in S \\ S_y & = \{x \in S: (x, y) \in R\}, \quad y \in T \end{align} \]
In the last section on Joint Distributions, we saw that even though \((X, Y)\) is uniformly distributed, the marginal distributions of \(X\) and \(Y\) are not uniform in general. However, as the next theorem shows, the conditional distributions are always uniform.
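Suppose that \( (X, Y) \) is uniformly distributed on \( R \). Then (a) the conditional distribution of \( Y \) given \( X = x \) is uniform on \( T_x \) for each \( x \in S \), and (b) the conditional distribution of \( X \) given \( Y = y \) is uniform on \( S_y \) for each \( y \in T \).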
The results are symmetric, so we will prove (a). Recall that \( X \) has PDF \[ g(x) = \int_{T_x} f(x, y) \, dy = \int_{T_x} \frac{1}{\lambda_{j+k}(R)} \, dy = \frac{\lambda_k(T_x)}{\lambda_{j+k}(R)}, \quad x \in S \] Hence for \( x \in S \), the conditional PDF of \( Y \) given \( X = x \) is \[ h(y \mid x) = \frac{f(x, y)}{g(x)} = \frac{1}{\lambda_k(T_x)}, \quad y \in T_x \] and this is the PDF of the uniform distribution on \( T_x \).
Find the conditional density of each variable given a value of the other, and determine if the variables are independent, in each of the following cases:
In the bivariate uniform experiment, run the simulation 1000 times in each of the following cases. Watch the points in the scatter plot and the graphs of the marginal distributions.
Suppose that \((X, Y, Z)\) is uniformly distributed on \(R = \{(x, y, z) \in \R^3: 0 \le x \le y \le z \le 1\}\).
The subscripts 1, 2, and 3 correspond to the variables \( X \), \( Y \), and \( Z \), respectively.
Recall the discussion of the (multivariate) hypergeometric distribution given in the last section on joint distributions. As in that discussion, suppose that a population consists of \(m\) objects, and that each object is one of four types. There are \(a\) objects of type 1, \(b\) objects of type 2, and \(c\) objects of type 3, and \(m - a - b - c\) objects of type 0. The parameters \(a\), \(b\), and \(c\) are nonnegative integers with \(a + b + c \le m\). We sample \(n\) objects from the population at random, and without replacement, where \( n \in \{0, 1, \ldots, m\} \). Denote the number of type 1, 2, and 3 objects in the sample by \(X\), \(Y\), and \(Z\), respectively. Hence, the number of type 0 objects in the sample is \(n - X - Y - Z\). In the problems below, the variables \(x\), \(y\), and \(z\) are nonnegative integers.
The conditional distribution of \((X, Y)\) given \(Z = z\) is hypergeometric and has the probability density function given below.
\[ g(x, y \mid z) = \frac{\binom{a}{x} \binom{b}{y} \binom{m - a - b - c}{n - x - y - z}}{\binom{m - c}{n - z}}, \quad x + y + z \le n\] This result can be proved analytically but a combinatorial argument is better. The essence of the argument is that we are selecting a random sample of size \(n - z\) without replacement from a population of size \(m - c\), with \(a\) objects of type 1, \(b\) objects of type 2, and \(m - a - b - c\) objects of type 0.
The conditional distribution of \(X\) given \(Y = y\) and \(Z = z\) is hypergeometric, and has the probability density function given below.
\[ g(x \mid y, z) = \frac{\binom{a}{x} \binom{m - a - b - c}{n - x - y - z}}{\binom{m - b - c}{n - y - z}}, \quad x + y + z \le n\] Again, this result can be proved analytically, but a combinatorial argument is better. The essence of the argument is that we are selecting a random sample of size \(n - y - z\) from a population of size \(m - b - c\), with \(a\) objects of type 1 and \(m - a - b - c\) objects of type 0.
These results generalize in a completely straightforward way to a population with any number of types. In brief, if a random vector has a hypergeometric distribution, then the conditional distribution of some of the variables, given values of the other variables, is also hypergeometric. Moreover, it is clearly not necessary to remember the hideous formulas in the previous two exercises. You just need to recognize the problem as sampling without replacement from a multi-type population, and then identify the number of objects of each type and the sample size. The hypergeometric distribution and the multivariate hypergeometric distribution are studied in more detail in the chapter on Finite Sampling Models.
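The combinatorial argument can be checked against the analytic route (joint density divided by marginal density) for specific parameter values. The Python sketch below does this for the conditional density of \((X, Y)\) given \(Z = z\); it assumes the standard multivariate hypergeometric form of the joint density and the hypergeometric marginal of \(Z\), and the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def joint_pdf(x, y, z, m, a, b, c, n):
    """Multivariate hypergeometric density of (X, Y, Z)."""
    rest = n - x - y - z
    if min(x, y, z, rest) < 0:
        return Fraction(0)
    ways = comb(a, x) * comb(b, y) * comb(c, z) * comb(m - a - b - c, rest)
    return Fraction(ways, comb(m, n))

def conditional_pdf(x, y, z, m, a, b, c, n):
    """Density of (X, Y) given Z = z from the definition: joint / marginal of Z."""
    marginal_z = Fraction(comb(c, z) * comb(m - c, n - z), comb(m, n))
    return joint_pdf(x, y, z, m, a, b, c, n) / marginal_z

def reduced_pdf(x, y, z, m, a, b, c, n):
    """Same density via sampling n - z objects from the m - c objects not of type 3."""
    rest = n - x - y - z
    if min(x, y, rest) < 0:
        return Fraction(0)
    ways = comb(a, x) * comb(b, y) * comb(m - a - b - c, rest)
    return Fraction(ways, comb(m - c, n - z))

# Small illustrative population: m = 10, a = 3, b = 2, c = 2, sample size n = 5, z = 1.
params = (10, 3, 2, 2, 5)
agree = all(
    conditional_pdf(x, y, 1, *params) == reduced_pdf(x, y, 1, *params)
    for x in range(6) for y in range(6)
)
```

With exact fractions the two computations agree term by term, which is the content of the combinatorial argument.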
In a population of 150 voters, 60 are democrats, 50 are republicans, and 40 are independents. A sample of 15 voters is selected at random, without replacement. Let \(X\) denote the number of democrats in the sample and \(Y\) the number of republicans in the sample. Give the probability density function of each of the following:
In the formulas below, the variables \(x\) and \(y\) are nonnegative integers.
Recall that a bridge hand consists of 13 cards selected at random and without replacement from a standard deck of 52 cards. Let \(X\), \(Y\), and \(Z\) denote the number of spades, hearts, and diamonds, respectively, in the hand. Find the probability density function of each of the following:
In the formulas below, the variables \(x\), \(y\), and \(z\) are nonnegative integers.
Recall the discussion of multinomial trials in the last section on joint distributions. As in that discussion, suppose that we have a sequence of independent trials, each with 4 possible outcomes. On each trial, outcome 1 occurs with probability \(p\), outcome 2 with probability \(q\), outcome 3 with probability \(r\), and outcome 0 with probability \(1 - p - q - r\). The parameters \(p\), \(q\), and \(r\) are nonnegative numbers satisfying \(p + q + r \le 1\). Denote the number of times that outcome 1, outcome 2, and outcome 3 occur in the \(n\) trials by \(X\), \(Y\), and \(Z\) respectively. Of course, the number of times that outcome 0 occurs is \(n - X - Y - Z\). In the problems below, the variables \(x\), \(y\), and \(z\) are nonnegative integers.
The conditional distribution of \((X, Y)\) given \(Z = z\) is also multinomial, and has the probability density function given below.
\[ g(x, y \mid z) = \binom{n - z}{x, \; y} \left(\frac{p}{1 - r}\right)^x \left(\frac{q}{1 - r}\right)^y \left(1 - \frac{p}{1 - r} - \frac{q}{1 - r}\right)^{n - x - y - z}, \quad x + y + z \le n\]This result can be proved analytically, but a probability argument is better. First, let \( I \) denote the outcome of a generic trial. Then \( \P(I = 1 \mid I \ne 3) = \P(I = 1) / \P(I \ne 3) = p \big/ (1 - r) \). Similarly, \( \P(I = 2 \mid I \ne 3) = q \big/ (1 - r) \) and \( \P(I = 0 \mid I \ne 3) = (1 - p - q - r) \big/ (1 - r) \). Now, the essence of the argument is that effectively, we have \(n - z\) independent trials, and on each trial, outcome 1 occurs with probability \(p \big/ (1 - r)\) and outcome 2 with probability \(q \big/ (1 - r)\).
The conditional distribution of \(X\) given \(Y = y\) and \(Z = z\) is binomial, with the probability density function given below.
\[ g(x \mid y, z) = \binom{n - y - z}{x} \left(\frac{p}{1 - q - r}\right)^x \left(1 - \frac{p}{1 - q - r}\right)^{n - x - y - z},\quad x + y + z \le n\] Again, this result can be proved analytically, but a probability argument is better. As before, let \( I \) denote the outcome of a generic trial. Then \( \P(I = 1 \mid I \notin \{2, 3\}) = p \big/ (1 - q - r) \) and \( \P(I = 0 \mid I \notin \{2, 3\}) = (1 - p - q - r) \big/ (1 - q - r) \). Thus, the essence of the argument is that effectively, we have \(n - y - z\) independent trials, and on each trial, outcome 1 occurs with probability \(p \big/ (1 - q - r)\).
These results generalize in a completely straightforward way to multinomial trials with any number of trial outcomes. In brief, if a random vector has a multinomial distribution, then the conditional distribution of some of the variables, given values of the other variables, is also multinomial. Moreover, it is clearly not necessary to remember the specific formulas in the previous two exercises. You just need to recognize a problem as one involving independent trials, and then identify the probability of each outcome and the number of trials. The binomial distribution and the multinomial distribution are studied in more detail in the chapter on Bernoulli Trials.
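As with the hypergeometric case, the probability argument can be compared with a direct computation. The Python sketch below checks that the conditional density of \(X\) given \(Y = y\) and \(Z = z\), computed as joint over marginal, matches the binomial density with \(n - y - z\) trials and success probability \(p \big/ (1 - q - r)\). It assumes the standard multinomial form of the joint density; the function names and the parameter values are illustrative.

```python
from fractions import Fraction
from math import comb, factorial

def multinomial_pdf(x, y, z, n, p, q, r):
    """Joint density of (X, Y, Z) for n multinomial trials with outcomes 0, 1, 2, 3."""
    rest = n - x - y - z
    if min(x, y, z, rest) < 0:
        return Fraction(0)
    coef = factorial(n) // (factorial(x) * factorial(y) * factorial(z) * factorial(rest))
    return coef * p ** x * q ** y * r ** z * (1 - p - q - r) ** rest

def conditional_from_definition(x, y, z, n, p, q, r):
    """Density of X given Y = y, Z = z as joint / marginal of (Y, Z)."""
    marginal = sum(multinomial_pdf(s, y, z, n, p, q, r) for s in range(n + 1))
    return multinomial_pdf(x, y, z, n, p, q, r) / marginal

def conditional_binomial(x, y, z, n, p, q, r):
    """Binomial density with n - y - z trials and success probability p / (1 - q - r)."""
    trials = n - y - z
    if not 0 <= x <= trials:
        return Fraction(0)
    success = p / (1 - q - r)
    return comb(trials, x) * success ** x * (1 - success) ** (trials - x)

# Illustrative values: n = 6 trials with p = 2/5, q = 3/10, r = 1/5, and y = 1, z = 2.
args = (6, Fraction(2, 5), Fraction(3, 10), Fraction(1, 5))
agree = all(
    conditional_from_definition(x, 1, 2, *args) == conditional_binomial(x, 1, 2, *args)
    for x in range(7)
)
```

Passing the probabilities as `Fraction` values keeps the comparison exact rather than approximate.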
Suppose that peaches from an orchard are classified as small, medium, or large. Each peach, independently of the others, is small with probability \(\frac{3}{10}\), medium with probability \(\frac{1}{2}\), and large with probability \(\frac{1}{5}\). In a sample of 20 peaches from the orchard, let \(X\) denote the number of small peaches and \(Y\) the number of medium peaches. Give the probability density function of each of the following:
In the formulas below, the variables \(x\) and \(y\) are nonnegative integers.
For a certain crooked, 4-sided die, face 1 has probability \(\frac{2}{5}\), face 2 has probability \(\frac{3}{10}\), face 3 has probability \(\frac{1}{5}\), and face 4 has probability \(\frac{1}{10}\). Suppose that the die is thrown 50 times. Let \(X\), \(Y\), and \(Z\) denote the number of times that scores 1, 2, and 3 occur, respectively. Find the probability density function of each of the following:
In the formulas below, the variables \(x\), \(y\) and \(z\) are nonnegative integers.
Suppose that \((X, Y)\) has probability density function \[f(x, y) = \frac{1}{12 \pi} \exp\left[-\left(\frac{x^2}{8} + \frac{y^2}{18}\right)\right], \quad (x, y) \in \R^2\]
Suppose that \((X, Y)\) has probability density function
\[f(x, y) = \frac{1}{\sqrt{3} \pi} \exp\left[-\frac{2}{3} (x^2 - x y + y^2)\right], \quad (x, y) \in \R^2\]\(X\) and \(Y\) have the same distribution.
The joint distributions in the last two exercises are examples of bivariate normal distributions. The conditional distributions are also normal. Normal distributions are widely used to model physical measurements subject to small, random errors. The bivariate normal distribution is studied in more detail in the chapter on Special Distributions.
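The normality of the conditional distributions can be illustrated numerically for the second density above. A direct calculation (an assumption of this sketch, not a result stated in the text) suggests that \(X\) is standard normal and that, given \(X = x\), \(Y\) is normal with mean \(x \big/ 2\) and variance \(\frac{3}{4}\). The Python sketch below compares \(f(x, y) \big/ g(x)\), with \(g\) obtained by a midpoint rule, against that normal density; the integration range and step count are illustrative choices.

```python
from math import exp, pi, sqrt

def f(x, y):
    """Joint density f(x, y) = exp(-(2/3)(x^2 - x y + y^2)) / (sqrt(3) pi)."""
    return exp(-2.0 / 3.0 * (x * x - x * y + y * y)) / (sqrt(3.0) * pi)

def normal_pdf(t, mean, variance):
    """Density of the normal distribution with the given mean and variance."""
    return exp(-((t - mean) ** 2) / (2.0 * variance)) / sqrt(2.0 * pi * variance)

def g(x, half_width=12.0, n=4000):
    """Marginal density of X by a midpoint rule over y in [-half_width, half_width]."""
    step = 2.0 * half_width / n
    return sum(f(x, -half_width + (i + 0.5) * step) for i in range(n)) * step

def h_given_x(y, x):
    """Conditional density of Y given X = x, computed as f(x, y) / g(x)."""
    return f(x, y) / g(x)

# Compare with the conjectured normal densities at a sample point.
marginal_gap = abs(g(0.8) - normal_pdf(0.8, 0.0, 1.0))
conditional_gap = abs(h_given_x(0.3, 0.8) - normal_pdf(0.3, 0.4, 0.75))
```

Both gaps come out at the level of rounding error, consistent with the conditional distribution being normal.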
With our usual sets \(S\) and \(T\), as above, suppose that \(P_x\) is a probability measure on \(T\) for each \(x \in S\). Suppose also that \(g\) is a probability density function on \(S\). We can obtain a new probability measure on \(T\) by averaging (or mixing) the given distributions according to \(g\).
First suppose that \(S\) is countable, and that \(g\) is the probability density function of a discrete distribution on \(S\). The function \(\P\) defined below is a probability measure on \(T\): \[ \P(B) = \sum_{x \in S} g(x) P_x(B), \quad B \subseteq T \]
Clearly \( \P(B) \ge 0 \) for \( B \subseteq T \) and \( \P(T) = \sum_{x \in S} g(x) \, 1 = 1 \). Next, suppose that \( \{B_i: i \in I\} \) is a countable, disjoint collection of subsets of \( T \). Then \[ \P\left(\bigcup_{i \in I} B_i\right) = \sum_{x \in S} g(x) P_x\left(\bigcup_{i \in I} B_i\right) = \sum_{x \in S} g(x) \sum_{i \in I} P_x(B_i) = \sum_{i \in I} \sum_{x \in S} g(x) P_x(B_i) = \sum_{i \in I} \P(B_i) \] Reversing the order of summation is justified since the terms are nonnegative.
In the setting of the previous exercise, suppose that \(P_x\) is a discrete (respectively continuous) distribution with probability density function \(h_x\) for each \(x \in S\). Then \(\P\) is also discrete (respectively continuous) with probability density function \(h\) given by \[ h(y) = \sum_{x \in S} g(x) h_x(y), \quad y \in T \]
We will prove the continuous case. For \( B \subseteq T \), \[ \P(B) = \sum_{x \in S} g(x) P_x(B) = \sum_{x \in S} g(x) \int_B h_x(y) \, dy = \int_B \sum_{x \in S} g(x) h_x(y) \, dy = \int_B h(y) \, dy \] Again, the interchange of sum and integral is justified because the functions are nonnegative.
Conversely, given a probability density function \( g \) on \( S \) and a probability density function \( h_x \) on \( T \) for each \( x \in S \), the function \( h \) defined in the previous theorem is a probability density function on \( T \).
Suppose now that \(S \subseteq \R^n\) and that \(g\) is a probability density function of a continuous distribution on \(S\). The function \(\P\) defined below is a probability measure on \(T\): \[ \P(B) = \int_S g(x) P_x(B) dx, \quad B \subseteq T\]
The proof is just like the proof of Theorem 40 with integrals over \( S \) replacing the sums over \( S \).
In the setting of the previous exercise, suppose that \(P_x\) is a discrete (respectively continuous) distribution with probability density function \(h_x\) for each \(x \in S\). Then \(\P\) is also discrete (respectively continuous) with probability density function \(h\) given by \[ h(y) = \int_S g(x) h_x(y) dx, \quad y \in T\]
The proof is just like the proof of Theorem 41 with integrals over \( S \) replacing the sums over \( S \).
In both cases, the distribution \(\P\) is said to be a mixture of the set of distributions \(\{P_x: x \in S\}\), with mixing density \(g\).
One can have a mixture of distributions without having random variables defined on a common probability space. However, mixtures are intimately related to conditional distributions. Returning to our usual setup, suppose that \(X\) and \(Y\) are random variables for an experiment, taking values in \(S\) and \(T\) respectively. Suppose that \(X\) has either a discrete or a continuous distribution, with probability density function \(g\). The following result is simply a restatement of the law of total probability.
The distribution of \(Y\) is a mixture of the conditional distributions of \(Y\) given \(X = x\), over \(x \in S\), with mixing density \(g\).
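A concrete instance is the earlier exercise in which \(X\) is uniformly distributed on \(\{1, 2, 3\}\) and, given \(X = x\), \(Y\) is uniformly distributed on \([0, x]\). The Python sketch below builds the density of \(Y\) as the mixture \(h(y) = \sum_{x} g(x) h_x(y)\); the names `mixing`, `h_x`, and `h` are illustrative.

```python
from fractions import Fraction

# Mixing density g: X is uniform on {1, 2, 3}.
mixing = {x: Fraction(1, 3) for x in (1, 2, 3)}

def h_x(y, x):
    """Density of the uniform distribution on [0, x], evaluated at y."""
    return Fraction(1, x) if 0 <= y <= x else Fraction(0)

def h(y):
    """Mixture density of Y: sum over x of g(x) * h_x(y)."""
    return sum(g_x * h_x(y, x) for x, g_x in mixing.items())

# h is piecewise constant on (0, 1), (1, 2), (2, 3); each piece has length 1.
total = h(Fraction(1, 2)) + h(Fraction(3, 2)) + h(Fraction(5, 2))
```

The three constant pieces are \(\frac{11}{18}\), \(\frac{5}{18}\), and \(\frac{1}{9}\), and they add up to 1 as a density on \([0, 3]\) should.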
Suppose that \(X\) is a random variable taking values in \(S \subseteq \R^n\), with a mixed discrete and continuous distribution. The distribution of \(X\) is a mixture of a discrete distribution and a continuous distribution, in the sense defined here.