As usual, we start with a random experiment with probability measure \(\P\) on an underlying sample space. Suppose now that \(X\) and \(Y\) are random variables for the experiment, and that \(X\) takes values in \(S\) while \(Y\) takes values in \(T\). We can think of \((X, Y)\) as a random variable taking values in (a subset of) the product set \(S \times T\). The purpose of this section is to study how the distribution of \((X, Y)\) is related to the distributions of \(X\) and \(Y\) individually. In this context, the distribution of \((X, Y)\) is called the joint distribution, while the distributions of \(X\) and of \(Y\) are referred to as marginal distributions. As always, we assume that the sets and functions that we mention are measurable in the appropriate spaces. If you are a beginning student of probability, you can safely ignore this statement.
More specifically, recall that the distribution of \((X, Y)\) is the probability measure \(C \mapsto \P\left[(X, Y) \in C\right] \) for \(C \subseteq S \times T\). The distribution of \(X\) is the probability measure \( A \mapsto \P(X \in A) \) for \( A \subseteq S \) and the distribution of \( Y \) is the probability measure \( B \mapsto \P(Y \in B) \) for \( B \subseteq T \). The first simple but very important point is that the marginal distributions can be obtained from the joint distribution, but not conversely in general.
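Note that \(\P(X \in A) = \P\left[(X, Y) \in A \times T\right]\) for \(A \subseteq S\), and \(\P(Y \in B) = \P\left[(X, Y) \in S \times B\right]\) for \(B \subseteq T\).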
If \(X\) and \(Y\) are independent, then by definition, \[\P\left[(X, Y) \in A \times B\right] = \P(X \in A, Y \in B) = \P(X \in A) \P(Y \in B) \quad A \subseteq S, \, B \subseteq T \] and as we have noted before, this completely determines the distribution of \((X, Y)\) on \(S \times T\). However, if \(X\) and \(Y\) are dependent, the joint distribution cannot be determined from the marginal distributions. Thus in general, the joint distribution contains much more information than the marginal distributions individually.
Recall that probability distributions are often described in terms of probability density functions. So we need to know how the marginal probability density functions can be obtained from the joint probability density function. The discrete case is easy.
Suppose that \((X, Y)\) has a discrete distribution with probability density function \(f\) on a countable set \(S \times T\). Then \(X\) and \(Y\) have discrete distributions, with probability density functions \(g\) and \(h\), respectively, given by (a) \(g(x) = \sum_{y \in T} f(x, y)\) for \(x \in S\), and (b) \(h(y) = \sum_{x \in S} f(x, y)\) for \(y \in T\).
Note that since \( S \times T \) is countable, \( S \) and \( T \) are countable. The two results are symmetric, so we will prove (a). Note that the countable collection of events \(\left\{ \{Y = y\}: y \in T\right\} \) partitions the sample space. For \( x \in S \), by the countable additivity of probability, \[ \P(X = x) = \sum_{y \in T} \P(X = x, Y = y) = \sum_{y \in T} \P\left[(X, Y) = (x, y)\right] = \sum_{y \in T} f(x, y) \]
For the continuous case, suppose that \(S \subseteq \R^j\) and \(T \subseteq \R^k\) for some \(j, \; k \in \N_+\), so that \(S \times T \subseteq \R^{j + k}\).
Suppose that \((X, Y)\) has a continuous distribution on \(S \times T\) with probability density function \(f\). Then \(X\) and \(Y\) have continuous distributions with probability density functions \(g\) and \(h\), respectively, given by (a) \(g(x) = \int_T f(x, y) \, dy\) for \(x \in S\), and (b) \(h(y) = \int_S f(x, y) \, dx\) for \(y \in T\).
Again, the results are symmetric, so we show (a). If \( A \subseteq S \) then \[ \P(X \in A) = \P(X \in A, Y \in T) = \P\left[(X, Y) \in A \times T\right] = \int_{A \times T} f(x, y) \, d(x, y) = \int_A \int_T f(x, y) \, dy \, dx = \int_A g(x) \, dx \] Hence by the very meaning of the term, \( X \) has probability density function \( g \).
In the context of the previous two theorems, \(f\) is called the joint probability density function of \((X, Y)\), while \(g\) and \(h\) are called the marginal density functions of \(X\) and of \(Y\), respectively.
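As a quick illustration of the continuous case (the density here is chosen only for this example and is not one of the exercises below), suppose that \(f(x, y) = \frac{3}{2}\left(x^2 + y^2\right)\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Integrating out \(y\) gives the marginal density of \(X\): \[ g(x) = \int_0^1 \frac{3}{2}\left(x^2 + y^2\right) \, dy = \frac{3}{2} x^2 + \frac{1}{2}, \quad 0 \le x \le 1 \] By symmetry, \(h(y) = \frac{3}{2} y^2 + \frac{1}{2}\) for \(0 \le y \le 1\). As a check, \(\int_0^1 g(x) \, dx = \frac{1}{2} + \frac{1}{2} = 1\).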
When the variables are independent, the joint density is the product of the marginal densities.
Suppose that \(X\) and \(Y\) are independent, either both with discrete distributions or both with continuous distributions. Suppose that \( X \) has probability density function \(g\) and \( Y \) has probability density function \(h\). Then \((X, Y)\) has probability density function \(f\) given by \[f(x, y) = g(x) h(y), \quad (x, y) \in S \times T\]
In the discrete case, the events \(\{X = x\}\) and \(\{Y = y\}\) are independent for \(x \in S\) and \(y \in T\). Hence \[ \P\left[(X, Y) = (x, y)\right] = \P(X = x, Y = y) = \P(X = x) \P(Y = y) = g(x) h(y) \] For the continuous case, let \( A \subseteq S \) and \( B \subseteq T \). Then \[ \P\left[(X, Y) \in A \times B\right] = \P(X \in A, Y \in B) = \P(X \in A) \P(Y \in B) = \int_A g(x) \, dx \, \int_B h(y) \, dy = \int_{A \times B} g(x) h(y) \, d(x, y) \] A probability measure on \( S \times T \) is completely determined by its values on product sets (see the advanced section on existence and uniqueness of measures for details), so it follows that \(\P\left[(X, Y) \in C\right] = \int_C f(x, y) \, d(x, y)\) for general \(C \subseteq S \times T\). Hence \( (X, Y) \) has PDF \( f \).
The following result gives a converse to the previous theorem. If the joint probability density factors into a function of \(x\) only and a function of \(y\) only, then \(X\) and \(Y\) are independent, and we can almost identify the individual probability density functions just from the factoring.
Suppose that \((X, Y)\) has either a discrete or continuous distribution, with probability density function \(f\). Suppose that \[f(x, y) = u(x) v(y), \quad (x, y) \in S \times T\] where \(u: S \to [0, \infty)\) and \(v: T \to [0, \infty)\). Then \(X\) and \(Y\) are independent, and there exists a positive constant \(c\) such that \(X\) and \(Y\) have probability density functions \(g\) and \(h\), respectively, given by \begin{align} g(x) = & c \, u(x), \quad x \in S \\ h(y) = & \frac{1}{c} v(y), \quad y \in T \end{align}
We will consider the continuous case and leave the discrete case as an exercise. For \( A \subseteq S \) and \( B \subseteq T \), \[ \P(X \in A, Y \in B) = \P\left[(X, Y) \in A \times B\right] = \int_{A \times B} f(x, y) \, d(x, y) = \int_A u(x) \, dx \, \int_B v(y) \, dy \] Letting \( B = T \) in the displayed equation gives \( \P(X \in A) = \int_A c \, u(x) \, dx \) for \( A \subseteq S \), where \( c = \int_T v(y) \, dy \). It follows that \( X \) has PDF \( g = c \, u \). Next, letting \( A = S \) in the displayed equation gives \( \P(Y \in B) = \int_B k \, v(y) \, dy \) for \( B \subseteq T \), where \( k = \int_S u(x) \, dx \). Thus, \( Y \) has PDF \( h = k \, v \). Next, letting \( A = S \) and \( B = T \) in the displayed equation gives \( 1 = c \, k \), so \( k = 1 / c \). Now note that the displayed equation holds with \( u \) replaced by \( g \) and \( v \) replaced by \( h \), and this in turn gives \( \P(X \in A, Y \in B) = \P(X \in A) \P(Y \in B) \), so \( X \) and \( Y \) are independent.
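A small worked example may make the role of the constant \(c\) concrete; the density is chosen only for illustration. Suppose that \(f(x, y) = 4 x y\) for \(0 \le x \le 1\), \(0 \le y \le 1\), and take the (arbitrary) factoring \(u(x) = x\), \(v(y) = 4 y\). Then \[ c = \int_0^1 v(y) \, dy = \int_0^1 4 y \, dy = 2 \] so that \[ g(x) = c \, u(x) = 2 x, \quad h(y) = \frac{1}{c} v(y) = 2 y \] Both are valid probability density functions on \([0, 1]\), and \(g(x) h(y) = 4 x y = f(x, y)\), as the theorem requires. A different factoring, such as \(u(x) = 4 x\) and \(v(y) = y\), changes \(c\) but leads to the same \(g\) and \(h\).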
Again, the results in the last two theorems extend to more than two random variables, because \(X\) and \(Y\) themselves may be random vectors. To state the extension explicitly, suppose that \(X_i\) is a random variable taking values in a set \(R_i\) with probability density function \(g_i\) for \(i \in \{1, 2, \ldots, n\}\), and that this collection of random variables is independent. Then the random vector \(\bs{X} = (X_1, X_2, \ldots, X_n)\) taking values in \(S = R_1 \times R_2 \times \cdots \times R_n\) has probability density function \(f\) given by \[f(x_1, x_2, \ldots, x_n) = g_1(x_1) g_2(x_2) \cdots g_n(x_n), \quad (x_1, x_2, \ldots, x_n) \in S\] The special case where \(X_i\) has the same distribution for each \(i \in \{1, 2, \ldots, n\}\) is particularly important. In this case \(R_i = R\) and \(g_i = g\) for each \(i\), so that the probability density function of \(\bs{X}\) on \(S = R^n\) is \[f(x_1, x_2, \ldots, x_n) = g(x_1) g(x_2) \cdots g(x_n), \quad (x_1, x_2, \ldots, x_n) \in S\] In probability jargon, \(\bs{X}\) is a sequence of independent, identically distributed variables, a phrase that comes up so often that it is usually abbreviated as IID. In statistical jargon, \(\bs{X}\) is a random sample of size \(n\) from the common distribution. As is evident from the special terminology, this situation is very important in both branches of mathematics. In statistics, the joint probability density function \(f\) plays an important role in procedures such as maximum likelihood and the identification of uniformly best estimators.
Recall that (mutual) independence of random variables is a very strong property. If a collection of random variables is independent, then any subcollection is also independent. New random variables formed from disjoint subcollections are independent. For a simple example, suppose that \(X\), \(Y\), and \(Z\) are independent real-valued random variables. Then
In particular, note that statement 2 in the list above is much stronger than the conjunction of statements 4 and 5. Contrapositively, if \(X\) and \(Z\) are dependent, then \((X, Y)\) and \(Z\) are also dependent.
The results of this section have natural analogies in the case that \(X\) and \( Y \) have different distribution types, as discussed in the section on mixed distributions. The results in the subsections above on joint and marginal density functions and independence hold, with sums for the coordinate with the discrete distribution, and integrals for the coordinate with the continuous distribution.
Suppose that two standard, fair dice are rolled and the sequence of scores \((X_1, X_2)\) recorded. Our standard assumption is that the variables \(X_1\) and \(X_2\) are independent. Let \(Y = X_1 + X_2\) and \(Z = X_1 - X_2\) denote the sum and difference of the scores, respectively.
Let \(f\) denote the PDF of \((Y, Z)\), \(g\) the PDF of \(Y\), and \(h\) the PDF of \(Z\). \(Y\) and \(Z\) are dependent.
\(f(y, z)\) | \(y = 2\) | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | \(h(z)\) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
\(z = -5\) | 0 | 0 | 0 | 0 | 0 | \(\frac{1}{36}\) | 0 | 0 | 0 | 0 | 0 | \(\frac{1}{36}\) |
\(-4\) | 0 | 0 | 0 | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | 0 | 0 | 0 | \(\frac{2}{36}\) |
\(-3\) | 0 | 0 | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | 0 | 0 | \(\frac{3}{36}\) |
\(-2\) | 0 | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | 0 | \(\frac{4}{36}\) |
\(-1\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{5}{36}\) |
0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | \(\frac{6}{36}\) |
1 | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{5}{36}\) |
2 | 0 | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | 0 | \(\frac{4}{36}\) |
3 | 0 | 0 | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | 0 | 0 | \(\frac{3}{36}\) |
4 | 0 | 0 | 0 | 0 | \(\frac{1}{36}\) | 0 | \(\frac{1}{36}\) | 0 | 0 | 0 | 0 | \(\frac{2}{36}\) |
5 | 0 | 0 | 0 | 0 | 0 | \(\frac{1}{36}\) | 0 | 0 | 0 | 0 | 0 | \(\frac{1}{36}\) |
\(g(y)\) | \(\frac{1}{36}\) | \(\frac{2}{36}\) | \(\frac{3}{36}\) | \(\frac{4}{36}\) | \(\frac{5}{36}\) | \(\frac{6}{36}\) | \(\frac{5}{36}\) | \(\frac{4}{36}\) | \(\frac{3}{36}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | 1 |
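The table above can be checked by direct enumeration. The following is a minimal sketch, using only the Python standard library, that builds the joint PDF of \((Y, Z)\) from the 36 equally likely outcomes, sums it to get the marginals, and tests the product condition for independence. The variable names `joint`, `g`, `h`, and `independent` are chosen here for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint PDF of (Y, Z) = (X1 + X2, X1 - X2) for two fair dice.
joint = defaultdict(Fraction)
for x1, x2 in product(range(1, 7), repeat=2):
    joint[(x1 + x2, x1 - x2)] += Fraction(1, 36)

# Marginal PDFs: sum the joint PDF over the other coordinate.
g = defaultdict(Fraction)  # PDF of Y
h = defaultdict(Fraction)  # PDF of Z
for (y, z), prob in joint.items():
    g[y] += prob
    h[z] += prob

# Independence would require f(y, z) = g(y) h(z) at every point (y, z).
independent = all(
    joint.get((y, z), Fraction(0)) == g[y] * h[z] for y in g for z in h
)

print(g[7], h[0], joint.get((7, 0), Fraction(0)), independent)
```

Here \(g(7) h(0) = \frac{1}{36}\) while \(f(7, 0) = 0\), which is one of many points where the product condition fails.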
Suppose that two standard, fair dice are rolled and the sequence of scores \((X_1, X_2)\) recorded. Let \(U = \min\{X_1, X_2\}\) and \(V = \max\{X_1, X_2\}\) denote the minimum and maximum scores, respectively.
Let \(f\) denote the PDF of \((U, V)\), \(g\) the PDF of \(U\), and \(h\) the PDF of \(V\). \(U\) and \(V\) are dependent.
\(f(u, v)\) | \(u = 1\) | 2 | 3 | 4 | 5 | 6 | \(h(v)\) |
---|---|---|---|---|---|---|---|
\(v = 1\) | \(\frac{1}{36}\) | 0 | 0 | 0 | 0 | 0 | \(\frac{1}{36}\) |
2 | \(\frac{2}{36}\) | \(\frac{1}{36}\) | 0 | 0 | 0 | 0 | \(\frac{3}{36}\) |
3 | \(\frac{2}{36}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | 0 | 0 | 0 | \(\frac{5}{36}\) |
4 | \(\frac{2}{36}\) | \(\frac{2}{36}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | 0 | 0 | \(\frac{7}{36}\) |
5 | \(\frac{2}{36}\) | \(\frac{2}{36}\) | \(\frac{2}{36}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | 0 | \(\frac{9}{36}\) |
6 | \(\frac{2}{36}\) | \(\frac{2}{36}\) | \(\frac{2}{36}\) | \(\frac{2}{36}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | \(\frac{11}{36}\) |
\(g(u)\) | \(\frac{11}{36}\) | \(\frac{9}{36}\) | \(\frac{7}{36}\) | \(\frac{5}{36}\) | \(\frac{3}{36}\) | \(\frac{1}{36}\) | 1 |
Suppose that \((X, Y)\) has probability density function \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\).
Suppose that \((X, Y)\) has probability density function \(f(x, y) = 2 ( x + y)\) for \(0 \le x \le y \le 1\).
Suppose that \((X, Y)\) has probability density function \(f(x, y) = 6 x^2 y\) for \(0 \le x \le 1\), \(0 \le y \le 1\).
Suppose that \((X, Y)\) has probability density function \(f(x, y) = 15 x^2 y\) for \(0 \le x \le y \le 1\).
Suppose that \((X, Y, Z)\) has probability density function \(f(x, y, z) = 2 (x + y) z\) for \(0 \le x \le 1\), \(0 \le y \le 1\), \(0 \le z \le 1\).
We use subscripts for the PDFs: 1, 2, and 3 refer to \(X\), \(Y\), and \(Z\) respectively.
Suppose that \((X, Y)\) has probability density function \(f(x, y) = 2 e^{-x} e^{-y}\) for \(0 \le x \le y \lt \infty\).
In the previous exercise, \( X \) has an exponential distribution with rate parameter 2. Recall that exponential distributions are widely used to model random times, particularly in the context of the Poisson model.
Suppose that \(X\) and \(Y\) are independent, and that \(X\) has probability density function \(g(x) = 6 x (1 - x)\) for \(0 \le x \le 1\), and that \(Y\) has probability density function \(h(y) = 12 y^2 (1 - y)\) for \(0 \le y \le 1\).
Suppose that \(\Theta\) and \(\Phi\) are independent random angles, with common probability density function \(g(t) = \sin(t)\) for \(0 \le t \le \frac{\pi}{2}\).
The common distribution of \( \Theta \) and \( \Phi \) in the previous exercise governs a random angle in Bertrand's problem.
Suppose that \(X\) and \(Y\) are independent, and that \(X\) has probability density function \(g(x) = \frac{2}{x^3}\) for \(1 \le x \lt \infty\), and that \(Y\) has probability density function \(h(y) = \frac{3}{y^4}\) for \(1 \le y \lt\infty\).
Both \(X\) and \(Y\) in the previous exercise have Pareto distributions, named for Vilfredo Pareto. Recall that Pareto distributions are used to model certain economic variables and are studied in more detail in the chapter on Special Distributions.
Suppose that \((X, Y)\) has probability density function \(g\) given by \(g(x, y) = 15 x^2 y\) for \(0 \le x \le y \le 1\), and that \(Z\) has probability density function \(h\) given by \(h(z) = 4 z^3\) for \(0 \le z \le 1\), and that \((X, Y)\) and \(Z\) are independent.
Multivariate uniform distributions give a geometric interpretation of some of the concepts in this section. Recall first that the standard measure on \(\R^n\) is \[\lambda_n(A) = \int_A 1 dx, \quad A \subseteq \R^n\] In particular
More technically, \( \lambda_n \) is \( n \)-dimensional Lebesgue measure on the measurable subsets of \( \R^n \), and is named for Henri Lebesgue. This should not be of concern if you are a new student of probability. On the other hand, if you are interested in the advanced theory, read the following sections:
Suppose now that \(X\) takes values in \(\R^j\), \(Y\) takes values in \(\R^k\), and that \((X, Y)\) is uniformly distributed on a set \(R \subseteq \R^{j+k}\). Thus, by definition, we must have \(0 \lt \lambda_{j+k}(R) \lt \infty \), and then the joint probability density function \( f \) of \((X, Y)\) is given by \( f(x, y) = 1 \big/ \lambda_{j+k}(R) \) for \( (x, y) \in R \). Recall that uniform distributions always have constant density functions. Now let \(S\) and \(T\) be the projections of \(R\) onto \(\R^j\) and \(\R^k\) respectively, defined as follows: \begin{align} S & = \left\{x \in \R^j: (x, y) \in R \text{ for some } y \in \R^k\right\} \\ T & = \left\{y \in \R^k: (x, y) \in R \text{ for some } x \in \R^j\right\} \end{align} Note that \(R \subseteq S \times T\). Next we denote the cross-sections at \(x \in S\) and at \(y \in T\), respectively, by \begin{align} T_x & = \{y \in T: (x, y) \in R\} \\ S_y & = \{x \in S: (x, y) \in R\} \end{align}
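\(X\) takes values in \(S\) and \( Y \) takes values in \( T \). The probability density functions \(g\) and \( h \) of \(X\) and \( Y \) are proportional to the cross-sectional measures: (a) \(g(x) = \lambda_k\left(T_x\right) \big/ \lambda_{j+k}(R)\) for \(x \in S\), and (b) \(h(y) = \lambda_j\left(S_y\right) \big/ \lambda_{j+k}(R)\) for \(y \in T\).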
From our general theory, \( X \) has PDF \( g \) given by \[ g(x) = \int_{T_x} f(x, y) \, dy = \int_{T_x} \frac{1}{\lambda_{j+k}(R)} \, dy = \frac{\lambda_k\left(T_x\right)}{\lambda_{j+k}(R)}, \quad x \in S \] The result for \( Y \) is analogous.
In particular, note from the previous theorem that \(X\) and \(Y\) are in general neither independent nor uniformly distributed. However, these properties do hold if \(R\) is a Cartesian product set.
Suppose that \(R = S \times T\).
In this case, \( T_x = T \) and \( S_y = S \) for every \( x \in S \) and \( y \in T \). Also, \( \lambda_{j+k}(R) = \lambda_j(S) \lambda_k(T) \), so for \( x \in S \) and \( y \in T \), \( f(x, y) = 1 \big/ \left[\lambda_j(S) \lambda_k(T)\right] \), \( g(x) = 1 \big/ \lambda_j(S) \), \( h(y) = 1 \big/ \lambda_k(T) \). Hence \( f(x, y) = g(x) h(y) \), with \( g \) and \( h \) constant.
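For contrast, here is a short worked case where \(R\) is not a product set; the region is chosen only for illustration. Let \(R = \{(x, y): 0 \le x \le y \le 1\}\), a triangle with \(\lambda_2(R) = \frac{1}{2}\), so that \(f(x, y) = 2\) on \(R\). The projections are \(S = T = [0, 1]\), and the cross-sections are \(T_x = [x, 1]\) and \(S_y = [0, y]\). Hence \[ g(x) = \frac{\lambda_1\left(T_x\right)}{\lambda_2(R)} = 2 (1 - x), \quad h(y) = \frac{\lambda_1\left(S_y\right)}{\lambda_2(R)} = 2 y, \quad x, \, y \in [0, 1] \] Neither marginal distribution is uniform, and \(g(x) h(y) = 4 (1 - x) y\) is not equal to \(f(x, y)\), so \(X\) and \(Y\) are dependent.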
In each of the following cases, find the joint and marginal probability density functions, and determine if \(X\) and \(Y\) are independent.
In the bivariate uniform experiment, run the simulation 1000 times for each of the following cases. Watch the points in the scatter plot and the graphs of the marginal distributions. Interpret what you see in the context of the discussion above.
Suppose that \((X, Y, Z)\) is uniformly distributed on the cube \([0, 1]^3\).
Suppose that \((X, Y, Z)\) is uniformly distributed on \(\{(x, y, z): 0 \le x \le y \le z \le 1\}\).
We use the subscripts 1, 2, and 3 to refer to variables \(X\), \(Y\), and \(Z\) respectively.
The following result shows how an arbitrary continuous distribution can be obtained from a uniform distribution. This result is useful for simulating certain continuous distributions, as we will see.
Suppose that \(g\) is a probability density function for a continuous distribution on \(S \subseteq \R^n\). Let \[R = \{(x, y): x \in S \text{ and } 0 \le y \le g(x)\} \subseteq \R^{n+1} \] If \((X, Y)\) is uniformly distributed on \(R\), then \(X\) has probability density function \(g\).
Note that since \(g\) is a probability density function on \(S\), \[ \lambda_{n+1}(R) = \int_R 1 \, d(x, y) = \int_S \int_0^{g(x)} 1 \, dy \, dx = \int_S g(x) \, dx = 1 \] Hence the probability density function \( f \) of \((X, Y)\) is given by \(f(x, y) = 1\) for \((x, y) \in R\). Thus, the probability density function of \(X\) is \(x \mapsto \int_0^{g(x)} 1 \, dy = g(x)\) for \( x \in S \).
A picture in the case \(n = 1\) is given below:
Suppose now that \( R \subseteq T \) where \( T \subseteq \R^{n+1} \) with \( \lambda_{n+1}(T) \lt \infty \). Note that we also have \( \lambda_{n+1}(T) \ge \lambda_{n+1}(R) = 1 \). Further, suppose that \(\left((X_1, Y_1), (X_2, Y_2), \ldots\right)\) is a sequence of independent random variables with \( X_k \in \R^n \), \( Y_k \in \R \), and \( \left(X_k, Y_k\right) \) uniformly distributed on \( T \) for each \( k \in \N_+ \). Now let \[N = \min\left\{k \in \N_+: \left(X_k, Y_k\right) \in R\right\} = \min\left\{k \in \N_+: X_k \in S, \; 0 \le Y_k \le g\left(X_k\right)\right\}\] the index of the first point to fall in \(R\). Since the sequence is independent, \( N \) has the geometric distribution with success parameter \( p = \lambda_{n+1}(R) \big/ \lambda_{n+1}(T) = 1 \big/ \lambda_{n+1}(T) \), so \( N \) has probability density function \( \P(N = k) = (1 - p)^{k-1} p \) for \( k \in \N_+ \). More importantly, from our past work on independence and the uniform distribution, we know that the first point \(\left(X_N, Y_N\right)\) to fall in \( R \) has a uniform distribution on \(R\), and therefore by the previous result, \(X_N\) has probability density function \(g\).
What's the point of all this? Well, if we can simulate a sequence of independent variables that are uniformly distributed on \( T \), then we can simulate a random variable with the given probability density function \( g \). This method of simulation is known as the rejection method. Suppose in particular that \( R \) is bounded as a subset of \( \R^{n+1} \), which would mean that the domain \( S \) is bounded as a subset of \( \R^n \) and that the probability density function \( g \) is bounded on \( S \). In this case, we can find \( T \) that is the Cartesian product of \( n + 1 \) bounded intervals with \( R \subseteq T \). It turns out to be very easy to simulate a sequence of independent variables, each uniformly distributed on such a product set, so the rejection method always works in this case. As you might guess, the rejection method works best if the size of \( T \), namely \( \lambda_{n+1}(T) \), is small, so that the success parameter \( p \) is large.
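The rejection method described above can be sketched in a few lines of Python for the case \(n = 1\). This is a minimal illustration, not the code behind any applet: the function name `rejection_sample`, the target density \(g(x) = 6 x (1 - x)\) on \([0, 1]\), the bound 1.5, and the seed are all assumptions made for the example. The bounding set is the rectangle \(T = [0, 1] \times [0, 1.5]\), so the success parameter is \(p = 1 / 1.5 = 2 / 3\).

```python
import random

def rejection_sample(g, lo, hi, g_max, rng):
    """Return (x, n): a draw with density g on [lo, hi] and the number of trials used.

    Assumes 0 <= g(x) <= g_max on [lo, hi], so that the region R under the
    graph of g lies inside the rectangle T = [lo, hi] x [0, g_max].
    """
    n = 0
    while True:
        n += 1
        x = rng.uniform(lo, hi)        # (x, y) is uniformly distributed on T
        y = rng.uniform(0.0, g_max)
        if y <= g(x):                  # accept the first point that falls in R
            return x, n

def g(x):
    # Illustrative target density: g(x) = 6 x (1 - x) on [0, 1], with maximum 1.5.
    return 6.0 * x * (1.0 - x)

rng = random.Random(2024)              # fixed seed so that runs are repeatable
draws = [rejection_sample(g, 0.0, 1.0, 1.5, rng) for _ in range(20000)]
sample_mean = sum(x for x, _ in draws) / len(draws)
mean_trials = sum(n for _, n in draws) / len(draws)
print(round(sample_mean, 3), round(mean_trials, 3))
```

The sample mean should be close to \(\frac{1}{2}\), the mean of the target distribution, and the average number of trials should be close to \(1 / p = 1.5\), the mean of the geometric variable \(N\).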
The rejection method applet simulates a number of continuous distributions via the rejection method. For each of the following distributions, vary the parameters and note the shape and location of the probability density function. Then run the experiment 1000 times and observe the results.
Suppose that a population consists of \(m\) objects, and that each object is one of four types. There are \(a\) type 1 objects, \(b\) type 2 objects, \(c\) type 3 objects and \(m - a - b - c\) type 0 objects. The parameters \(a\), \(b\), and \(c\) are nonnegative integers with \(a + b + c \le m\). We sample \(n\) objects from the population at random, and without replacement. Denote the number of type 1, 2, and 3 objects in the sample by \(X\), \(Y\), and \(Z\), respectively. Hence, the number of type 0 objects in the sample is \(n - X - Y - Z\). In the problems below, the variables \(x\), \(y\), and \(z\) take values in \(\N\).
\((X, Y, Z)\) has a (multivariate) hypergeometric distribution with probability density function \(f\) given by \[f(x, y, z) = \frac{\binom{a}{x} \binom{b}{y} \binom{c}{z} \binom{m - a - b - c}{n - x - y - z}}{\binom{m}{n}}, \quad x + y + z \le n\]
From the basic theory of combinatorics, the numerator is the number of ways to select an unordered sample of size \( n \) from the population with \( x \) objects of type 1, \( y \) objects of type 2, \( z \) objects of type 3, and \( n - x - y - z \) objects of type 0. The denominator is the total number of ways to select the unordered sample.
\((X, Y)\) also has a (multivariate) hypergeometric distribution, with the probability density function \(g\) given by \[g(x, y) = \frac{\binom{a}{x} \binom{b}{y} \binom{m - a - b}{n - x - y}}{\binom{m}{n}}, \quad x + y \le n\]
This result could be obtained by summing the joint PDF in (26) over \( z \) for fixed \( (x, y) \). However, there is a much nicer combinatorial argument. Note that we are selecting a random sample of size \(n\) from a population of \(m\) objects, with \(a\) objects of type 1, \(b\) objects of type 2, and \(m - a - b\) objects of other types.
\(X\) has an ordinary hypergeometric distribution, with probability density function \(h\) given by \[h(x) = \frac{\binom{a}{x} \binom{m - a}{n - x}}{\binom{m}{n}}, \quad x \le n\]
Again, the result could be obtained by summing the joint PDF in (26) over \( (y, z) \) for fixed \( x \), or by summing the joint PDF in (27) over \( y \) for fixed \( x \). But as before, there is a much more elegant combinatorial argument. Note that we are selecting a random sample of size \(n\) from a population of size \(m\) objects, with \(a\) objects of type 1 and \(m - a\) objects of other types.
These results generalize in a straightforward way to a population with any number of types. In brief, if a random vector has a hypergeometric distribution, then any sub-vector also has a hypergeometric distribution. In other words, all of the marginal distributions of a hypergeometric distribution are themselves hypergeometric. Note however, that it's not a good idea to memorize the formulas above explicitly. It's better to just note the patterns and recall the combinatorial meaning of the binomial coefficient. The hypergeometric distribution and the multivariate hypergeometric distribution are studied in more detail in the chapter on Finite Sampling Models.
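The claim that summing the joint PDF over \(z\) recovers the PDF of \((X, Y)\) can be checked numerically with exact arithmetic. The sketch below uses only the standard library; the function names `hypergeo3` and `hypergeo2` and the parameter values are illustrative assumptions, not taken from the exercises.

```python
from fractions import Fraction
from math import comb

def hypergeo3(x, y, z, m, a, b, c, n):
    """Joint PDF of (X, Y, Z); zero outside the support."""
    rest = n - x - y - z                     # number of type 0 objects in the sample
    if min(x, y, z, rest) < 0:
        return Fraction(0)
    ways = comb(a, x) * comb(b, y) * comb(c, z) * comb(m - a - b - c, rest)
    return Fraction(ways, comb(m, n))

def hypergeo2(x, y, m, a, b, n):
    """PDF of (X, Y) from the direct combinatorial argument."""
    rest = n - x - y                         # objects that are neither type 1 nor type 2
    if min(x, y, rest) < 0:
        return Fraction(0)
    return Fraction(comb(a, x) * comb(b, y) * comb(m - a - b, rest), comb(m, n))

# Illustrative parameters.
m, a, b, c, n = 12, 4, 3, 2, 5
marginal_ok = all(
    sum(hypergeo3(x, y, z, m, a, b, c, n) for z in range(n + 1))
    == hypergeo2(x, y, m, a, b, n)
    for x in range(n + 1)
    for y in range(n + 1)
)
print(marginal_ok)
```

Note that `math.comb(k, j)` returns 0 when \(j \gt k\), which matches the usual convention for binomial coefficients and keeps the formulas valid on the whole range of summation.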
Suppose that a population of voters consists of 50 democrats, 40 republicans, and 30 independents. A sample of 10 voters is chosen at random from the population (without replacement, of course). Let \(X\) denote the number of democrats in the sample and \(Y\) the number of republicans in the sample. Find the probability density function of each of the following:
In the formulas below, the variables \(x\) and \(y\) are nonnegative integers.
Suppose that the Math Club at Enormous State University (ESU) has 50 freshmen, 40 sophomores, 30 juniors and 20 seniors. A sample of 10 club members is chosen at random to serve on the \(\pi\)-day committee. Let \(X\) denote the number of freshmen on the committee, \(Y\) the number of sophomores, and \(Z\) the number of juniors.
In the formulas below, the variables \(x\), \(y\), and \(z\) are nonnegative integers.
Suppose that we have a sequence of independent trials, each with 4 possible outcomes. On each trial, outcome 1 occurs with probability \(p\), outcome 2 with probability \(q\), outcome 3 with probability \(r\), and outcome 0 occurs with probability \(1 - p - q - r\). The parameters \(p\), \(q\), and \(r\) are nonnegative numbers with \(p + q + r \le 1\). Denote the number of times that outcome 1, outcome 2, and outcome 3 occurred in the \(n\) trials by \(X\), \(Y\), and \(Z\) respectively. Of course, the number of times that outcome 0 occurs is \(n - X - Y - Z\). In the problems below, the variables \(x\), \(y\), and \(z\) take values in \(\N\).
\((X, Y, Z)\) has a multinomial distribution with probability density function \(f\) given by \[f(x, y, z) = \binom{n}{x, \; y, \; z} p^x q^y r^z (1 - p - q - r)^{n - x - y - z}, \quad x + y + z \le n\]
The multinomial coefficient is the number of sequences of length \( n \) with 1 occurring \( x \) times, 2 occurring \( y \) times, 3 occurring \( z \) times, and 0 occurring \( n - x - y - z \) times. The result then follows by independence.
\((X, Y)\) also has a multinomial distribution with the probability density function \(g\) given below. \[g(x, y) = \binom{n}{x, \; y} p^x q^y (1 - p - q)^{n - x - y}, \quad x + y \le n\]
This result could be obtained from the PDF in (31), by summing over \( z \) for fixed \( (x, y) \). However there is a much better direct argument. Note that we have \(n\) independent trials, and on each trial, outcome 1 occurs with probability \(p\), outcome 2 with probability \(q\), and some other outcome with probability \(1 - p - q\).
\(X\) has a binomial distribution, with the probability density function \(h\) given below. \[h(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x \le n\]
Again, the result could be obtained by summing the joint PDF in (31) over \( (y, z) \) for fixed \( x \) or by summing the PDF in (32) over \( y \) for fixed \( x \). But as before, there is a much better direct argument. Note that we have \(n\) independent trials, and on each trial, outcome 1 occurs with probability \(p\) and some other outcome with probability \(1 - p\).
These results generalize in a completely straightforward way to multinomial trials with any number of trial outcomes. In brief, if a random vector has a multinomial distribution, then any sub-vector also has a multinomial distribution. In other terms, all of the marginal distributions of a multinomial distribution are themselves multinomial. The binomial distribution and the multinomial distribution are studied in more detail in the chapter on Bernoulli Trials.
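As with the hypergeometric model, the marginal results can be verified by brute-force summation. The following sketch sums the joint PDF of \((X, Y, Z)\) over \((y, z)\) and compares the result with the binomial PDF of \(X\), using exact fractions. The function names and the values of \(n\), \(p\), \(q\), \(r\) are assumptions made for this illustration.

```python
from fractions import Fraction
from math import comb, factorial

def multinomial3(x, y, z, n, p, q, r):
    """Joint PDF of (X, Y, Z); zero outside the support."""
    w = n - x - y - z                        # number of times outcome 0 occurs
    if min(x, y, z, w) < 0:
        return Fraction(0)
    coef = factorial(n) // (factorial(x) * factorial(y) * factorial(z) * factorial(w))
    return coef * p**x * q**y * r**z * (1 - p - q - r)**w

def binomial(x, n, p):
    """Binomial PDF with parameters n and p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Illustrative parameters.
n = 6
p, q, r = Fraction(1, 5), Fraction(3, 10), Fraction(1, 10)
marginal_ok = all(
    sum(multinomial3(x, y, z, n, p, q, r) for y in range(n + 1) for z in range(n + 1))
    == binomial(x, n, p)
    for x in range(n + 1)
)
print(marginal_ok)
```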
Suppose that a system consists of 10 components that operate independently. Each component is working with probability \(\frac{1}{2}\), idle with probability \(\frac{1}{3}\), or failed with probability \(\frac{1}{6}\). Let \(X\) denote the number of working components and \(Y\) the number of idle components. Give the probability density function of each of the following:
In the formulas below, the variables \(x\) and \(y\) are nonnegative integers.
Suppose that in a crooked, four-sided die, face \(i\) occurs with probability \(\frac{i}{10}\) for \(i \in \{1, 2, 3, 4\}\). The die is thrown 12 times; let \(X\) denote the number of times that score 1 occurs, \(Y\) the number of times that score 2 occurs, and \(Z\) the number of times that score 3 occurs.
In the formulas below, the variables \(x\), \(y\) and \(z\) are nonnegative integers. The subscripts 1, 2, and 3 refer to variables \( X \), \( Y \), and \( Z \) respectively.
Suppose that \((X, Y)\) has the probability density function given below:
\[f(x, y) = \frac{1}{12 \pi} \exp\left[-\left(\frac{x^2}{8} + \frac{y^2}{18}\right)\right], \quad (x, y) \in \R^2\]
Suppose that \((X, Y)\) has the probability density function given below:
\[f(x, y) = \frac{1}{\sqrt{3} \pi} \exp\left[-\frac{2}{3}\left(x^2 - x y + y^2\right)\right], \quad(x, y) \in \R^2\]
The joint distributions in the last two exercises are examples of bivariate normal distributions. Normal distributions are widely used to model physical measurements subject to small, random errors. In both exercises, the marginal distributions of \( X \) and \( Y \) are also normal, and this turns out to be true in general. The multivariate normal distribution is studied in more detail in the chapter on Special Distributions.
Recall that the exponential distribution has probability density function \[f(t) = r e^{-r t}, \quad 0 \le t \lt \infty\] where \(r \gt 0\) is the rate parameter. The exponential distribution is widely used to model random times, and is studied in more detail in the chapter on the Poisson Process.
Suppose \(X\) and \(Y\) have exponential distributions with parameters \(a\) and \(b\), respectively, and are independent. Then \(\P(X \lt Y) = \frac{a}{a + b}\).
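One way to see this result is to integrate the joint density over the region \(\{(x, y): 0 \le x \lt y \lt \infty\}\). By independence, the joint density is \(a e^{-a x} b e^{-b y}\) for \(x, \, y \in [0, \infty)\), so \[ \P(X \lt Y) = \int_0^\infty \int_x^\infty a e^{-a x} \, b e^{-b y} \, dy \, dx = \int_0^\infty a e^{-a x} e^{-b x} \, dx = \int_0^\infty a e^{-(a + b) x} \, dx = \frac{a}{a + b} \]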
Suppose \(X\), \(Y\), and \(Z\) have exponential distributions with parameters \(a\), \(b\), and \(c\), respectively, and are independent. Then
If \(X\), \(Y\), and \(Z\) are the lifetimes of devices that act independently, then the results in the previous two exercises give probabilities of various failure orders. Results of this type are also very important in the study of continuous-time Markov processes. We will continue this discussion in the section on transformations of random variables.
Suppose \(X\) takes values in the finite set \(\{1, 2, 3\}\), \(Y\) takes values in the interval \([0, 3]\), and that the joint density function \(f\) is given by \[f(x, y) = \begin{cases} \frac{1}{3}, & \quad x = 1, \; 0 \le y \le 1 \\ \frac{1}{6}, & \quad x = 2, \; 0 \le y \le 2 \\ \frac{1}{9}, & \quad x = 3, \; 0 \le y \le 3 \end{cases}\]
Suppose that \(P\) takes values in the interval \([0, 1]\), \(X\) takes values in the finite set \(\{0, 1, 2, 3\}\), and that \((P, X)\) has joint probability density function \(f\) given by \[f(p, x) = 6 \binom{3}{x} p^{x + 1} (1 - p)^{4 - x}, \quad (p, x) \in [0, 1] \times \{0, 1, 2, 3\}\]
As we will see in the section on conditional distributions, the distribution in the last exercise models the following experiment: a random probability \(P\) is selected, and then a coin with this probability of heads is tossed 3 times; \(X\) is the number of heads. Note that \( P \) has a beta distribution.
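The marginal distributions in this mixed case can be computed exactly, integrating over the continuous coordinate and summing over the discrete one. The sketch below relies on the standard beta integral \(\int_0^1 p^a (1 - p)^b \, dp = a! \, b! \big/ (a + b + 1)!\) for nonnegative integers \(a\) and \(b\); the helper names `beta_integral`, `f`, and `h` are chosen here for illustration.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(a, b):
    """Exact integral of p^a (1 - p)^b over [0, 1] for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def f(p, x):
    """Joint density of (P, X) from the exercise above."""
    return 6 * comb(3, x) * p ** (x + 1) * (1 - p) ** (4 - x)

# Marginal PDF of X: integrate f(p, x) over p in [0, 1].
h = {x: 6 * comb(3, x) * beta_integral(x + 1, 4 - x) for x in range(4)}

# Marginal PDF of P at a sample point: sum f(p, x) over x in {0, 1, 2, 3}.
p0 = Fraction(1, 3)
g_p0 = sum(f(p0, x) for x in range(4))

print(h, sum(h.values()), g_p0 == 6 * p0 * (1 - p0))
```

The last comparison reflects the binomial theorem: summing over \(x\) leaves \(6 p (1 - p)\), which is the beta density with both parameters equal to 2.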
Recall that the Bernoulli distribution with parameter \(p \in [0, 1]\) has probability density function \(g\) given by \(g(x) = p^x (1 - p)^{1 - x}\) for \(x \in \{0, 1\}\). Let \(\bs{X}\) be a random sample of size \(n\) from the distribution. Give the probability density function of \(\bs{X}\) in simplified form.
\(f(x_1, x_2, \ldots, x_n) = p^k (1 - p)^{n-k}\) for \((x_1, x_2, \ldots, x_n) \in \{0, 1\}^n\), where \(k = x_1 + x_2 + \cdots + x_n\)
The Bernoulli distribution is named for Jacob Bernoulli, and governs an indicator random variable. Hence if \(\bs{X}\) is a random sample of size \(n\) from the distribution then \(\bs{X}\) is a sequence of \(n\) Bernoulli trials. A separate chapter studies Bernoulli trials in more detail.
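Since the variables in a random sample are independent, the joint density is the product of the marginal densities. The sketch below checks, for the Bernoulli case, that this product agrees with the simplified form \(p^k (1 - p)^{n - k}\) at every point of \(\{0, 1\}^n\) and that the joint density sums to 1. The values \(p = 0.3\) and \(n = 4\) and the function names are illustrative assumptions.

```python
from itertools import product
from math import isclose, prod

def bernoulli_pdf(x, p):
    """Marginal density g(x) = p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return p ** x * (1 - p) ** (1 - x)

def joint_pdf(xs, p):
    """Joint density of an independent sample: product of the marginals."""
    return prod(bernoulli_pdf(x, p) for x in xs)

def simplified_pdf(xs, p):
    """Simplified form p^k (1 - p)^(n - k), where k is the number of ones."""
    k, n = sum(xs), len(xs)
    return p ** k * (1 - p) ** (n - k)

p, n = 0.3, 4  # illustrative values
points = list(product([0, 1], repeat=n))
agree = all(isclose(joint_pdf(xs, p), simplified_pdf(xs, p)) for xs in points)
total = sum(joint_pdf(xs, p) for xs in points)
print(agree, total)
```

The same product construction applies to the geometric, Poisson, exponential, and normal samples in the exercises that follow; only the marginal density changes.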
Recall that the geometric distribution on \(\N_+\) with parameter \(p \in (0, 1)\) has probability density function \(g\) given by \(g(x) = p (1 - p)^{x - 1}\) for \(x \in \N_+\). Let \(\bs{X}\) be a random sample of size \(n\) from the distribution. Give the probability density function of \(\bs{X}\) in simplified form.
\(f(x_1, x_2, \ldots, x_n) = p^n (1 - p)^{k-n}\) for \((x_1, x_2, \ldots, x_n) \in \N_+^n\), where \(k = x_1 + x_2 + \cdots + x_n\)
The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials. Hence the variables in the random sample can be interpreted as the number of trials between successive successes.
Recall that the Poisson distribution with parameter \(a \in (0, \infty)\) has probability density function \(g\) given by \(g(x) = e^{-a} \frac{a^x}{x!}\) for \(x \in \N\). Let \(\bs{X}\) be a random sample of size \(n\) from the distribution. Give the probability density function of \(\bs{X}\) in simplified form.
\(f(x_1, x_2, \ldots, x_n) = \frac{1}{x_1! x_2! \cdots x_n!} e^{-n a} a^{x_1 + x_2 + \cdots + x_n}\) for \((x_1, x_2, \ldots, x_n) \in \N^n\)
The Poisson distribution is named for Simeon Poisson, and governs the number of random points in a region of time or space under appropriate circumstances. The parameter \( a \) is proportional to the size of the region. The Poisson distribution is studied in more detail in the chapter on the Poisson process.
Recall again that the exponential distribution with rate parameter \(r \in (0, \infty)\) has probability density function \(g\) given by \(g(x) = r e^{-r x}\) for \(x \in (0, \infty)\). Let \(\bs{X}\) be a random sample of size \(n\) from the distribution. Give the probability density function of \(\bs{X}\) in simplified form.
\(f(x_1, x_2, \ldots, x_n) = r^n e^{-r (x_1 + x_2 + \cdots + x_n)}\) for \((x_1, x_2, \ldots, x_n) \in [0, \infty)^n\)
The exponential distribution governs failure times and other types of arrival times under appropriate circumstances. The exponential distribution is studied in more detail in the chapter on the Poisson process. The variables in the random sample can be interpreted as the times between successive arrivals in the Poisson process.
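To illustrate the interpretation as inter-arrival times, the sketch below simulates an exponential sample, forms the arrival times as partial sums, and evaluates the joint density \(r^n e^{-r (x_1 + \cdots + x_n)}\) at the simulated point. The rate \(r = 2\), the sample size, the seed, and the variable names are illustrative assumptions.

```python
import random
from itertools import accumulate
from math import exp, isclose, prod

rng = random.Random(1)
r, n = 2.0, 5  # illustrative rate and sample size

# Random sample of size n: the times between successive arrivals.
interarrival = [rng.expovariate(r) for _ in range(n)]

# Arrival times are the partial sums of the inter-arrival times.
arrivals = list(accumulate(interarrival))

# Joint density at the sample point: product form and simplified form.
product_form = prod(r * exp(-r * x) for x in interarrival)
simplified_form = r ** n * exp(-r * sum(interarrival))
print(arrivals)
print(product_form, simplified_form)
```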
Recall that the standard normal distribution has probability density function \(\phi\) given by \(\phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-z^2 / 2}\) for \(z \in \R\). Let \(\bs{Z}\) be a random sample of size \(n\) from the distribution. Give the probability density function of \(\bs{Z}\) in simplified form.
\(f(z_1, z_2, \ldots, z_n) = \frac{1}{(2 \pi)^{n/2}} e^{-\frac{1}{2}(z_1^2 + z_2^2 + \cdots + z_n^2)}\) for \((z_1, z_2, \ldots, z_n) \in \R^n\)
The standard normal distribution governs physical quantities, properly scaled and centered, subject to small, random errors. The normal distribution is studied in more generality in the chapter on the Special Distributions.
For the cicada data, \(G\) denotes gender and \(S\) denotes species type.
The empirical joint and marginal densities are given in the table below. Gender and species are probably dependent (compare the joint density with the product of the marginal densities).
\(f(i, j)\) | \(i = 0\) | 1 | \(h(j)\) |
---|---|---|---|
\(j = 0\) | \(\frac{16}{104}\) | \(\frac{28}{104}\) | \(\frac{44}{104}\) |
1 | \(\frac{3}{104}\) | \(\frac{3}{104}\) | \(\frac{6}{104}\) |
2 | \(\frac{40}{104}\) | \(\frac{14}{104}\) | \(\frac{54}{104}\) |
\(g(i)\) | \(\frac{59}{104}\) | \(\frac{45}{104}\) | 1 |
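The marginal densities are the row and column sums of the joint density, and dependence shows up as a gap between \(f(i, j)\) and \(g(i) h(j)\). The sketch below recomputes both from the joint counts (numerators over 104) in the table; the nested dictionary layout and variable names are hypothetical.

```python
from fractions import Fraction

# Joint counts from the table: counts[j][i] for species j and gender i.
counts = {
    0: {0: 16, 1: 28},
    1: {0: 3, 1: 3},
    2: {0: 40, 1: 14},
}
n = sum(sum(row.values()) for row in counts.values())  # sample size

# Empirical joint density and its marginals.
f = {(i, j): Fraction(counts[j][i], n) for j in counts for i in (0, 1)}
g = {i: sum(f[i, j] for j in counts) for i in (0, 1)}  # gender marginal
h = {j: sum(f[i, j] for i in (0, 1)) for j in counts}  # species marginal

# Largest absolute gap between the joint density and the product of marginals.
max_gap = max(abs(f[i, j] - g[i] * h[j]) for (i, j) in f)
print(n, g, h, float(max_gap))
```

A gap that is large relative to the cell probabilities suggests dependence, although a formal conclusion would need a statistical test.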
For the cicada data, let \(W\) denote body weight (in grams) and \(L\) body length (in millimeters).
The empirical joint and marginal densities, based on simple partitions of the body weight and body length ranges, are given in the table below. Body weight and body length are almost certainly dependent.
Density \((W, L)\) | \(w \in (0, 0.1]\) | \((0.1, 0.2]\) | \((0.2, 0.3]\) | \((0.3, 0.4]\) | Density \(L\) |
---|---|---|---|---|---|
\(l \in (15, 20]\) | 0 | 0.0385 | 0.0192 | 0 | 0.0058 |
\((20, 25]\) | 0.1731 | 0.9808 | 0.4231 | 0 | 0.1577 |
\((25, 30]\) | 0 | 0.1538 | 0.1731 | 0.0192 | 0.0346 |
\((30, 35]\) | 0 | 0 | 0 | 0.0192 | 0.0019 |
Density \(W\) | 0.8654 | 5.8654 | 3.0769 | 0.1923 | |
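Because these are densities rather than probabilities, each marginal is obtained by summing the joint density over the other variable and multiplying by that variable's bin width (0.1 gram for weight, 5 millimeters for length). The sketch below recomputes the marginals from the joint entries in the table; the list layout and variable names are hypothetical.

```python
w_width, l_width = 0.1, 5.0  # bin widths for weight (grams) and length (mm)

# Joint density from the table: rows are length bins, columns are weight bins.
joint = [
    [0.0,    0.0385, 0.0192, 0.0],     # l in (15, 20]
    [0.1731, 0.9808, 0.4231, 0.0],     # l in (20, 25]
    [0.0,    0.1538, 0.1731, 0.0192],  # l in (25, 30]
    [0.0,    0.0,    0.0,    0.0192],  # l in (30, 35]
]

# Marginal density of L: integrate out w (row sum times the weight bin width).
density_l = [sum(row) * w_width for row in joint]

# Marginal density of W: integrate out l (column sum times the length bin width).
density_w = [sum(col) * l_width for col in zip(*joint)]

# Total mass: joint density times cell area, summed over all cells.
total_mass = sum(sum(row) for row in joint) * w_width * l_width
print([round(d, 4) for d in density_l])
print([round(d, 4) for d in density_w])
print(round(total_mass, 4))
```

Small differences from the tabulated marginals are expected, since the table entries are rounded to four decimal places.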
For the cicada data, let \(G\) denote gender and \(W\) body weight (in grams).
The empirical joint and marginal densities, based on a simple partition of the body weight range, are given in the table below. Body weight and gender are almost certainly dependent.
Density \((W, G)\) | \(w \in (0, 0.1]\) | \((0.1, 0.2]\) | \((0.2, 0.3]\) | \((0.3, 0.4]\) | Density \(G\) |
---|---|---|---|---|---|
\(g = 0\) | 0.1923 | 2.5000 | 2.8846 | 0.0962 | 0.5673 |
1 | 0.6731 | 3.3654 | 0.1923 | 0.0962 | 0.4327 |
Density \(W\) | 0.8654 | 5.8654 | 3.0769 | 0.1923 | |
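One way to see the dependence is to compare each joint density value with the product of the corresponding marginal densities. In this mixed case the gender marginal is a sum over weight bins times the bin width, while the weight marginal is a plain sum over the two gender values. The sketch below is illustrative only; the dictionary layout and variable names are hypothetical.

```python
w_width = 0.1  # bin width for weight (grams)

# Joint density from the table: one row of weight-bin densities per gender value.
joint = {
    0: [0.1923, 2.5000, 2.8846, 0.0962],
    1: [0.6731, 3.3654, 0.1923, 0.0962],
}

# Gender marginal: integrate out w (row sum times the weight bin width).
density_g = {g: sum(row) * w_width for g, row in joint.items()}

# Weight marginal: sum over the discrete gender values.
density_w = [sum(col) for col in zip(*joint.values())]

# Gap between the joint density and the product of the marginals, cell by cell.
gaps = {
    (g, k): abs(joint[g][k] - density_g[g] * density_w[k])
    for g in joint
    for k in range(len(density_w))
}
max_gap = max(gaps.values())
print(density_g, density_w, round(max_gap, 4))
```

Under independence every gap would be near zero; here the largest gap occurs in the \((0.2, 0.3]\) weight bin.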