Recall that by taking the expected value of various transformations of a random variable, we can measure many interesting characteristics of the distribution of the variable. In this section, we will study an expected value that measures a special type of relationship between two real-valued variables. This relationship is very important both in probability and statistics.
As usual, our starting point is a random experiment with probability measure \(\P\) on an underlying sample space. Of course, we assume that all expected values mentioned in this section exist. Suppose that \(X\) and \(Y\) are real-valued random variables for the experiment with means \(\E(X)\), \(\E(Y)\) and variances \(\var(X)\), \(\var(Y)\), respectively. The covariance of \((X, Y)\) is defined by
\[ \cov(X, Y) = \E\left([X - \E(X)][Y - \E(Y)]\right) \]and (assuming the variances are positive, so that the random variables really are random) the correlation of \( (X, Y)\) is defined by
\[ \cor(X, Y) = \frac{\cov(X, Y)}{\sd(X) \sd(Y)} \]Correlation is a scaled version of covariance; note that the two parameters always have the same sign (positive, negative, or 0). When the sign is positive, the variables are said to be positively correlated; when the sign is negative, the variables are said to be negatively correlated; and when the sign is 0, the variables are said to be uncorrelated. Note also that correlation is dimensionless, since the numerator and denominator have the same physical units, namely the product of the units of \(X\) and \(Y\).
As these terms suggest, covariance and correlation measure a certain kind of dependence between the variables. One of our goals is a deep understanding of this dependence. As a start, note that \((\E(X), \E(Y))\) is the center of the joint distribution of \((X, Y)\), and the vertical and horizontal lines through this point separate \(\R^2\) into four quadrants. The function \((x, y) \mapsto [x - \E(X)][y - \E(Y)]\) is positive on the first and third of these quadrants and negative on the second and fourth.
The following exercises give some basic properties of covariance. The main tool that we will need is the fact that expected value is a linear operation. Other important properties will be derived below, in the subsection on the best linear predictor.
Our first result is a formula that is better than the definition for computational purposes:
\(\cov(X, Y) = \E(X \, Y) - \E(X) \, \E(Y)\).
Let \( \mu = \E(X) \) and \( \nu = \E(Y) \). Then
\[ \cov(X, Y) = \E[(X - \mu)(Y - \nu)] = \E(X Y - \mu Y - \nu X + \mu \nu) = \E(X Y) - \mu \E(Y) - \nu \E(X) + \mu \nu = \E(X Y) - \mu \nu \]By Exercise 1, we see that \(X\) and \(Y\) are uncorrelated if and only if \(\E(X \, Y) = \E(X) \, \E(Y)\). In particular, if \(X\) and \(Y\) are independent, then they are uncorrelated. However, the converse fails with a passion: Exercise 30 gives an example of two variables that are functionally related (the strongest form of dependence), yet uncorrelated. The computational exercises also give other examples of dependent yet uncorrelated variables.
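The agreement between the definition and the computational formula can be checked exactly on a small discrete distribution. The following is a minimal sketch; the joint probability mass function `pmf` and the helper `expect` are hypothetical, chosen only for illustration.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, Y): keys are points (x, y), values are probabilities.
pmf = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 1): F(1, 4), (2, 3): F(1, 4)}

def expect(g):
    """Expected value of g(X, Y) under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

mu = expect(lambda x, y: x)  # E(X)
nu = expect(lambda x, y: y)  # E(Y)

# Covariance from the definition and from the computational formula.
cov_definition = expect(lambda x, y: (x - mu) * (y - nu))
cov_shortcut = expect(lambda x, y: x * y) - mu * nu

print(cov_definition, cov_shortcut)
```

Exact rational arithmetic is used so that the two values can be compared with `==` rather than up to rounding error.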
Trivially, covariance is a symmetric operation.
\(\cov(X, Y) = \cov(Y, X)\).
As the name suggests, covariance generalizes variance.
\(\cov(X, X) = \var(X)\).
Let \( \mu = \E(X) \). Then \( \cov(X, X) = \E[(X - \mu)^2] = \var(X) \).
Covariance is a linear operation in the first argument, if the second argument is fixed.
If \(X\), \(Y\), \(Z\) are real-valued random variables for the experiment, and \(c\) is a constant, then (a) \(\cov(X + Y, Z) = \cov(X, Z) + \cov(Y, Z)\) and (b) \(\cov(c X, Y) = c \, \cov(X, Y)\).
We use the formula in Theorem 1. For part (a),
\[ \begin{align} \cov(X + Y, Z) & = \E[(X + Y) Z] - \E(X + Y) \E(Z) = \E(X Z + Y Z) - [\E(X) + \E(Y)] \E(Z) \\ & = [\E(X Z) - \E(X) \E(Z)] + [\E(Y Z) - \E(Y) \E(Z)] = \cov(X, Z) + \cov(Y, Z) \end{align} \]For part (b),
\[ \cov(c X, Y) = \E(c X Y) - \E(c X) \E(Y) = c \E(X Y) - c \E(X) \E(Y) = c [\E(X Y) - \E(X) \E(Y)] = c \cov(X, Y) \]By symmetry, covariance is also a linear operation in the second argument, with the first argument fixed. Thus, the covariance operator is bi-linear. The general version of this property is given in the following theorem.
Suppose that \((X_1, X_2, \ldots, X_n)\) and \((Y_1, Y_2, \ldots, Y_m)\) are sequences of real-valued random variables for the experiment, and that \((a_1, a_2, \ldots, a_n)\) and \((b_1, b_2, \ldots, b_m)\) are constants. Then
\[ \cov\left(\sum_{i=1}^n a_i \, X_i, \sum_{j=1}^m b_j \, Y_j\right) = \sum_{i=1}^n \sum_{j=1}^m a_i \, b_j \, \cov(X_i, Y_j) \]The following result shows how covariance is changed under a linear transformation of one of the variables. This is an important special case of the basic properties.
If \( a, \; b \in \R \) then \(\cov(a + bX, Y) = b \, \cov(X, Y)\).
A constant is independent of any random variable. Hence \( \cov(a + b X, Y) = \cov(a, Y) + b \, \cov(X, Y) = b \, \cov(X, Y) \).
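Bilinearity and the linear transformation rule can also be verified exactly on a concrete distribution. This sketch assumes a hypothetical joint pmf and hypothetical constants `a` and `b`; the helper names `expect` and `cov` are not part of the text.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, Y), used only to illustrate the identities.
pmf = {(-1, 0): F(1, 3), (0, 2): F(1, 6), (1, 1): F(1, 2)}

def expect(g):
    """Expected value of g(X, Y) under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

def cov(g, h):
    """Covariance of g(X, Y) and h(X, Y) via E(GH) - E(G)E(H)."""
    return expect(lambda x, y: g(x, y) * h(x, y)) - expect(g) * expect(h)

def X(x, y):
    return x

def Y(x, y):
    return y

a, b = F(5), F(-3)  # arbitrary constants

# cov(a + b X, Y) = b cov(X, Y)
lhs_linear = cov(lambda x, y: a + b * x, Y)
rhs_linear = b * cov(X, Y)

# cov(2X + 3Y, X - Y) expanded term by term using bilinearity
lhs_bilinear = cov(lambda x, y: 2 * x + 3 * y, lambda x, y: x - y)
rhs_bilinear = 2 * cov(X, X) - 2 * cov(X, Y) + 3 * cov(Y, X) - 3 * cov(Y, Y)

print(lhs_linear == rhs_linear, lhs_bilinear == rhs_bilinear)
```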
Next we will establish some basic properties of correlation. Most of these follow easily from corresponding properties of covariance. We assume that \(\var(X) \gt 0\) and \(\var(Y) \gt 0\), so that the random variables really are random and hence the correlation is well defined.
The correlation between \(X\) and \(Y\) is simply the expected product of the corresponding standard scores:
\[ \cor(X, Y) = \E\left(\frac{X - \E(X)}{\sd(X)} \frac{Y - \E(Y)}{\sd(Y)}\right) \]From the definitions and the linearity of expected value,
\[ \cor(X, Y) = \frac{\cov(X, Y)}{\sd(X) \sd(Y)} = \frac{\E([X - \E(X)][Y - \E(Y)])}{\sd(X) \sd(Y)} = \E\left(\frac{X - \E(X)}{\sd(X)} \frac{Y - \E(Y)}{\sd(Y)}\right) \]This shows again that correlation is dimensionless, since of course, the standard scores are dimensionless. Also, correlation is symmetric:
\(\cor(X, Y) = \cor(Y, X)\).
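Under a linear transformation of one of the variables, the correlation is unchanged if the slope is positive and changes sign if the slope is negative:
If \(a, \; b \in \R\) and \( b \ne 0 \) then \(\cor(a + b X, Y) = \cor(X, Y)\) if \(b \gt 0\), and \(\cor(a + b X, Y) = -\cor(X, Y)\) if \(b \lt 0\).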
Let \( Z \) denote the standard score of \( X \). If \( b \gt 0 \), the standard score of \( a + b X \) is also \( Z \). If \( b \lt 0 \), the standard score of \( a + b X \) is \( -Z \). Hence the result follows from Theorem 7.
This result reinforces the fact that correlation is a standardized measure of association, since multiplying the variable by a positive constant is equivalent to a change of scale, and adding a constant to a variable is equivalent to a change of location. For example, in the Challenger data, the underlying variables are temperature at the time of launch (in degrees Fahrenheit) and O-ring erosion (in millimeters). The correlation between these two variables is of fundamental importance. If we decide to measure temperature in degrees Celsius and O-ring erosion in inches, the correlation is unchanged.
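The invariance under a change of units can be illustrated with the empirical analog of correlation computed from data. This is only a sketch: the numbers below are made up for illustration and are not the actual Challenger data, and `correlation` is a hypothetical helper that computes the sample correlation rather than the distribution correlation studied in this section.

```python
import math

def correlation(xs, ys):
    """Sample correlation of paired data (empirical analog of cor(X, Y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative (invented) measurements.
temps_f = [53.0, 57.0, 63.0, 70.0, 75.0, 81.0]   # degrees Fahrenheit
erosion_mm = [3.0, 1.0, 1.0, 0.5, 0.0, 0.0]      # millimeters

# Change of units: both maps are linear with positive slope.
temps_c = [(t - 32) * 5 / 9 for t in temps_f]    # degrees Celsius
erosion_in = [e / 25.4 for e in erosion_mm]      # inches

r_original = correlation(temps_f, erosion_mm)
r_converted = correlation(temps_c, erosion_in)

# A negative slope flips the sign of the correlation.
r_flipped = correlation([-t for t in temps_f], erosion_mm)

print(r_original, r_converted, r_flipped)
```

The first two values agree up to floating point rounding, and the third is the negative of the first.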
The most important properties of covariance and correlation will emerge from our study of the best linear predictor below.
We will now show that the variance of a sum of variables is the sum of the pairwise covariances. This result is very useful since many random variables with common distributions can be written as sums of simpler random variables (see in particular the binomial distribution and hypergeometric distribution below).
If \((X_1, X_2, \ldots, X_n)\) is a sequence of real-valued random variables for the experiment, then
\[ \var\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \sum_{j=1}^n \cov(X_i, X_j) = \sum_{i=1}^n \var(X_i) + 2 \, \sum_{\{\{i, j\}: \; i \lt j\}} \cov(X_i, X_j) \]From Theorem 3 and Theorem 5,
\[ \var\left(\sum_{i=1}^n X_i\right) = \cov\left(\sum_{i=1}^n X_i, \sum_{j=1}^n X_j\right) = \sum_{i=1}^n \sum_{j=1}^n \cov(X_i, X_j) \]The second expression follows since \( \cov(X_i, X_i) = \var(X_i) \) for each \( i \) and \( \cov(X_i, X_j) = \cov(X_j, X_i) \) for \( i \ne j \).
Note that the variance of a sum can be greater than, smaller than, or equal to the sum of the variances, depending on the pure covariance terms. As a special case of Exercise 11, when \(n = 2\), we have
\[ \var(X + Y) = \var(X) + \var(Y) + 2 \, \cov(X, Y) \]Note that the result in the previous exercise holds, in particular, if the random variables are independent.
If \(X\) and \(Y\) are real-valued random variables then \(\var(X + Y) + \var(X - Y) = 2 \, [\var(X) + \var(Y)]\).
From Theorem 11, \( \var(X + Y) = \var(X) + \var(Y) + 2 \cov(X, Y) \). Similarly,
\[ \var(X - Y) = \var(X) + \var(-Y) + 2 \cov(X, - Y) = \var(X) + \var(Y) - 2 \cov(X, Y) \]Adding gives the result.
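Both the two-variable variance formula and the identity for \(\var(X + Y) + \var(X - Y)\) can be checked exactly. The sketch below assumes a hypothetical joint pmf of dependent variables; `expect`, `var`, and `cov` are illustrative helpers.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of dependent variables (X, Y).
pmf = {(0, 0): F(1, 2), (1, 0): F(1, 6), (1, 2): F(1, 3)}

def expect(g):
    """Expected value of g(X, Y) under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

def var(g):
    """Variance of g(X, Y)."""
    m = expect(g)
    return expect(lambda x, y: (g(x, y) - m) ** 2)

def cov(g, h):
    """Covariance of g(X, Y) and h(X, Y)."""
    return expect(lambda x, y: g(x, y) * h(x, y)) - expect(g) * expect(h)

def X(x, y):
    return x

def Y(x, y):
    return y

var_sum = var(lambda x, y: x + y)    # var(X + Y)
var_diff = var(lambda x, y: x - y)   # var(X - Y)

print(var_sum, var(X) + var(Y) + 2 * cov(X, Y))
print(var_sum + var_diff, 2 * (var(X) + var(Y)))
```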
If \(X\) and \(Y\) are real-valued random variables with \(\var(X) = \var(Y)\) then \(X + Y\) and \(X - Y\) are uncorrelated.
From the bilinearity and symmetry properties, \( \cov(X + Y, X - Y) = \cov(X, X) - \cov(X, Y) + \cov(Y, X) - \cov(Y, Y) = \var(X) - \var(Y) \), which is 0 when \( \var(X) = \var(Y) \).
In the following exercises, suppose that \((X_1, X_2, \ldots)\) is a sequence of independent, real-valued random variables with a common distribution that has mean \(\mu\) and standard deviation \(\sigma \gt 0\). In statistical terms, the variables form a random sample from the common distribution.
Let \(Y_n = \sum_{i=1}^n X_i\). Then (a) \(\E(Y_n) = n \mu\) and (b) \(\var(Y_n) = n \sigma^2\).
Part (a) follows from the additivity of expected value. Part (b) follows from the additivity of variance for independent variables (Theorem 12).
Let \(M_n = Y_n / n = \frac{1}{n} \sum_{i=1}^n X_i\). Thus, \(M_n\) is the sample mean. Then (a) \(\E(M_n) = \mu\); (b) \(\var(M_n) = \sigma^2 / n\); (c) \(\var(M_n) \to 0\) as \(n \to \infty\); (d) \(\P(|M_n - \mu| \gt \epsilon) \to 0\) as \(n \to \infty\) for every \(\epsilon \gt 0\).
Parts (a) and (b) follow from the previous theorem and basic properties of expected value and variance. Specifically, \( \E(M_n) = \E(Y_n) / n = n \mu / n = \mu \) and \( \var(M_n) = \var(Y_n) / n^2 = n \sigma^2 / n^2 = \sigma^2 / n \). Part (c) follows immediately from part (b). Finally, part (d) follows from part (c) and Chebyshev's inequality: \( \P(|M_n - \mu| \gt \epsilon) \le \var(M_n) / \epsilon^2 \to 0 \) as \( n \to \infty \).
Part (c) of the last exercise means that \(M_n \to \mu\) as \(n \to \infty\) in mean square. Part (d) means that \(M_n \to \mu\) as \(n \to \infty\) in probability. These are both versions of the weak law of large numbers, one of the fundamental theorems of probability.
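To make the rate of convergence concrete, the Chebyshev bound \(\sigma^2 / (n \epsilon^2)\) can be evaluated for a specific distribution. The sketch below assumes, purely as an example, that the common distribution is that of the score on a fair die; `var_sample_mean` and `chebyshev_bound` are hypothetical helper names.

```python
from fractions import Fraction as F

# Assumed common distribution: the score on a fair six-sided die.
faces = range(1, 7)
mu = sum(F(k, 6) for k in faces)                     # mean of one score
sigma2 = sum(F(1, 6) * (k - mu) ** 2 for k in faces) # variance of one score

def var_sample_mean(n):
    """var(M_n) = sigma^2 / n for the mean of n independent scores."""
    return sigma2 / n

def chebyshev_bound(n, eps):
    """Upper bound on P(|M_n - mu| > eps) from Chebyshev's inequality."""
    return min(F(1), var_sample_mean(n) / eps ** 2)

for n in (10, 100, 1000):
    print(n, var_sample_mean(n), chebyshev_bound(n, F(1, 2)))
```

The bound is typically far from sharp, but it is enough to show convergence in probability.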
The standard score of the sum \( Y_n \) and the standard score of the sample mean \( M_n \) are the same:
\[ Z_n = \frac{Y_n - n \, \mu}{\sqrt{n} \, \sigma} = \frac{M_n - \mu}{\sigma / \sqrt{n}} \]The equality of the standard score of \( Y_n \) and of \( M_n \) is a result of simple algebra. But recall more generally that the standard score of a variable is unchanged by a linear transformation of the variable with positive slope. Of course, parts (a) and (b) are true for any standard score.
The central limit theorem, the other fundamental theorem of probability, states that the distribution of \(Z_n\) converges to the standard normal distribution as \(n \to \infty\).
Suppose that \(A\) and \(B\) are events in a random experiment. The covariance and correlation of \(A\) and \(B\) are defined to be the covariance and correlation, respectively, of their indicator random variables \(\bs{1}_A\) and \(\bs{1}_B\).
If \(A\) and \(B\) are events then
Recall that if \( X \) is an indicator variable with \( \P(X = 1) = p \), then \( \E(X) = p \) and \( \var(X) = p (1 - p) \). Also, if \( X \) and \( Y \) are indicator variables then \( X Y \) is an indicator variable and \( \P(X Y = 1) = \P(X = 1, Y = 1) \).
In particular, note that \(A\) and \(B\) are positively correlated, negatively correlated, or independent, respectively (as defined in the section on conditional probability) if and only if the indicator variables of \(A\) and \(B\) are positively correlated, negatively correlated, or uncorrelated, as defined in this section.
If \(A\) and \(B\) are events then
Note that \( \bs{1}(A^c) = 1 - \bs{1}(A) \).
If \(A \subseteq B\) then
These results follow from Theorem 18 since \( A \cap B = A \).
What linear function of \(X\) is closest to \(Y\) in the sense of minimizing mean square error? The question is fundamentally important in the case where random variable \(X\) (the predictor variable) is observable and random variable \(Y\) (the response variable) is not. The linear function can be used to estimate \(Y\) from an observed value of \(X\). Moreover, the solution will show that covariance and correlation measure the linear relationship between \(X\) and \(Y\). To avoid trivial cases, let us assume that \(\var(X) \gt 0\) and \(\var(Y) \gt 0\), so that the random variables really are random.
Let \(\mse(a, b)\) denote the mean square error when \(a + b \, X\) is used as an estimator of \(Y\) (as a function of the parameters \(a, \; b \in \R\)):
\[ \mse(a, b) = \E\left([Y - (a + b \, X)]^2 \right) \]\(\mse(a, b)\) is minimized when
\[ b = \frac{\cov(X, Y)}{\var(X)}, \quad a = \E(Y) - \frac{\cov(X, Y)}{\var(X)} \E(X) \]In the definition of the mean square error function, expanding the square and using the linearity of expected value gives \[ \mse(a, b) = \E(Y^2) - 2 \: b \: \E(X \: Y) - 2 \: a \: \E(Y) + b^2 \: \E(X^2) + 2 \: a \: b \: \E(X) + a^2 \]
Setting the first derivatives of \( \mse \) to 0 we have
\[ \begin{align} -2 \E(Y) + 2 b \E(X) + 2 a & = 0 \\ -2 \E(X Y) + 2 b \E(X^2) + 2 a \E(X) & = 0 \end{align} \]Solving the first equation for \( a \) gives \( a = \E(Y) - b \E(X) \). Substituting this into the second equation and solving gives \( b = \cov(X, Y) / \var(X) \). Finally, the second derivative matrix is
\[ \left[ \begin{matrix} 2 & 2 \E(X) \\ 2 \E(X) & 2 \E(X^2) \end{matrix} \right] \]The diagonal entries are positive and the determinant is \( 4 \var(X) \gt 0 \) so the matrix is positive definite. It follows that the minimum of \( \mse \) occurs at the single critical point.
Thus, the best linear predictor of \(Y\) given \(X\) is the random variable \(L(Y \mid X)\) given by
\[ L(Y \mid X) = \E(Y) + \frac{\cov(X, Y)}{\var(X)} [X - \E(X)] \]The minimum value of the mean square error function \(\mse\) is
\[ \E\left([Y - L(Y \mid X)]^2 \right) = \var(Y)[1 - \cor^2(X, Y)] \]This follows from substituting \( a = \E(Y) - \E(X) \cov(X, Y) / \var(X) \) and \( b = \cov(X, Y) / \var(X) \) into \( \mse(a, b) \) and simplifying.
Our solution to the best linear predictor problem yields important properties of covariance and correlation.
Correlation satisfies the following properties:
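The coefficients of the best linear predictor and the minimum mean square error can be computed exactly for a small example. The following is a sketch under the assumption of a hypothetical joint pmf; the helper names are illustrative only.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, Y).
pmf = {(0, 1): F(1, 4), (1, 1): F(1, 4), (1, 3): F(1, 4), (2, 2): F(1, 4)}

def expect(g):
    """Expected value of g(X, Y) under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

EX = expect(lambda x, y: x)
EY = expect(lambda x, y: y)
var_x = expect(lambda x, y: x * x) - EX ** 2
var_y = expect(lambda x, y: y * y) - EY ** 2
cov_xy = expect(lambda x, y: x * y) - EX * EY

# Slope and intercept of the best linear predictor L(Y | X) = a + b X.
b = cov_xy / var_x
a = EY - b * EX

def mse(a, b):
    """Mean square error when a + b X is used to predict Y."""
    return expect(lambda x, y: (y - (a + b * x)) ** 2)

min_mse = mse(a, b)
cor_squared = cov_xy ** 2 / (var_x * var_y)  # cor^2(X, Y), avoiding square roots

print(a, b, min_mse, var_y * (1 - cor_squared))
```

Moving either coefficient away from the computed values, for example `mse(a + F(1, 10), b)`, gives a strictly larger error.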
Since mean square error is nonnegative, it follows from the previous theorem that \(\cor^2(X, Y) \le 1\). This gives parts (a) and (b). For parts (c) and (d), note that if \(\cor^2(X, Y) = 1\) then \(Y = L(Y \mid X)\) with probability 1, and that the slope in \( L(Y \mid X) \) has the same sign as \( \cor(X, Y) \).
The last two theorems show clearly that \(\cov(X, Y)\) and \(\cor(X, Y)\) measure the linear association between \(X\) and \(Y\).
Recall from our previous discussion of measures of spread and center that the best constant predictor of \(Y\), in the sense of minimizing mean square error, is \(\E(Y)\) and the minimum value of the mean square error for this predictor is \(\var(Y)\). Thus, the difference between the variance of \(Y\) and the mean square error in Exercise 21 is the reduction in the variance of \(Y\) when the linear term in \(X\) is added to the predictor.
\(\var(Y) - \E\left([Y - L(Y \mid X)]^2\right) = \var(Y) \, \cor^2(X, Y)\).
Thus \(\cor^2(X, Y)\) is the proportion of reduction in \(\var(Y)\) when \(X\) is included as a predictor variable. This quantity is called the (distribution) coefficient of determination. Now let
\[ L(Y \mid X = x) = \E(Y) + \frac{\cov(X, Y)}{\var(X)}[x - \E(X)], \quad x \in \R \]The function \(x \mapsto L(Y \mid X = x)\) is known as the distribution regression function for \(Y\) given \(X\), and its graph is known as the distribution regression line. Note that the regression line passes through \((\E(X), \E(Y))\), the center of the joint distribution.
\(\E[L(Y \mid X)] = \E(Y)\).
From the linearity of expected value,
\[ \E[L(Y \mid X)] = \E(Y) + \frac{\cov(X, Y)}{\var(X)}[\E(X) - \E(X)] = \E(Y) \]However, the choice of predictor variable and response variable is crucial.
The regression line for \(Y\) given \(X\) and the regression line for \(X\) given \(Y\) are not the same line, except in the trivial case where the variables are perfectly correlated. However, the coefficient of determination is the same, regardless of which variable is the predictor and which is the response.
The two regression lines are
\[ \begin{align} y - \E(Y) & = \frac{\cov(X, Y)}{\var(X)}[x - \E(X)] \\ x - \E(X) & = \frac{\cov(X, Y)}{\var(Y)}[y - \E(Y)] \end{align} \]The two lines are the same if and only if \( \cov^2(X, Y) = \var(X) \var(Y) \). But this is equivalent to \( \cor^2(X, Y) = 1 \).
Suppose that \(A\) and \(B\) are events in a random experiment with \(0 \lt \P(A) \lt 1\) and \(0 \lt \P(B) \lt 1\). Then
The concept of best linear predictor is more powerful than might first appear, because it can be applied to transformations of the variables. Specifically, suppose that \(X\) and \(Y\) are random variables for our experiment, taking values in general spaces \(S\) and \(T\), respectively. Suppose also that \(g\) and \(h\) are real-valued functions defined on \(S\) and \(T\), respectively. We can find \(L[h(Y) \mid g(X)]\), the linear function of \(g(X)\) that is closest to \(h(Y)\) in the mean square sense. The results of this subsection apply, of course, with \(g(X)\) replacing \(X\) and \(h(Y)\) replacing \(Y\).
Suppose that \(Z\) is another real-valued random variable for the experiment and that \(c\) is a constant. Then (a) \(L(Y + Z \mid X) = L(Y \mid X) + L(Z \mid X)\) and (b) \(L(c Y \mid X) = c \, L(Y \mid X)\).
These results follow easily from the linearity of expected value and covariance. For part (a),
\[ \begin{align} L(Y + Z \mid X) & = \E(Y + Z) + \frac{\cov(X, Y + Z)}{\var(X)}[X - \E(X)] \\ &= \left(\E(Y) + \frac{\cov(X, Y)}{\var(X)} [X - \E(X)]\right) + \left(\E(Z) + \frac{\cov(X, Z)}{\var(X)}[X - \E(X)]\right) \\ & = L(Y \mid X) + L(Z \mid X) \end{align} \]For part (b),
\[ L(c Y \mid X) = \E(c Y) + \frac{\cov(X, cY)}{\var(X)}[X - \E(X)] = c \E(Y) + c \frac{\cov(X, Y)}{\var(X)}[X - \E(X)] = c L(Y \mid X) \]There are several extensions and generalizations of the ideas in the subsection:
Suppose that \(X\) is uniformly distributed on the interval \([-1, 1]\) and \(Y = X^2\). Then \(X\) and \(Y\) are uncorrelated even though \(Y\) is a function of \(X\) (the strongest form of dependence).
Note that \( \E(X) = 0 \) and \( \E(Y) = \E(X^2) = 1 / 3 \) and \( \E(X Y) = \E(X^3) = 0 \). Hence \( \cov(X, Y) = \E(X Y) - \E(X) \E(Y) = 0 \).
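The moments in this example can be computed exactly from the closed form \(\E(X^k) = \frac{1}{2} \int_{-1}^1 x^k \, dx\). The helper `moment` below is a hypothetical name used only for this sketch.

```python
from fractions import Fraction as F

def moment(k):
    """E(X^k) for X uniform on [-1, 1]: (1/2) * integral of x^k over [-1, 1]."""
    return F(1 - (-1) ** (k + 1), 2 * (k + 1))

EX = moment(1)    # E(X)
EY = moment(2)    # E(Y) = E(X^2)
EXY = moment(3)   # E(X Y) = E(X^3)
cov_xy = EXY - EX * EY

# Y is not constant, so the zero covariance is not a degenerate case.
var_y = moment(4) - moment(2) ** 2

print(EX, EY, EXY, cov_xy, var_y)
```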
Suppose that \((X, Y)\) is uniformly distributed on the region \(S \subseteq \R^2\). Find \(\cov(X, Y)\) and \(\cor(X, Y)\) and determine whether the variables are independent in each of the following cases:
In the bivariate uniform experiment, select each of the regions below in turn. For each region, run the simulation 2000 times and note the value of the correlation and the shape of the cloud of points in the scatterplot. Compare with the results in the last exercise.
Suppose that \(X\) is uniformly distributed on the interval \((0, 1)\) and that given \(X = x \in (0, 1)\), \(Y\) is uniformly distributed on the interval \((0, x)\). Find each of the following:
Recall that a standard die is a six-sided die. A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 have probability \(\frac{1}{4}\) each, and faces 2, 3, 4, and 5 have probability \(\frac{1}{8}\) each.
A pair of standard, fair dice are thrown and the scores \((X_1, X_2)\) recorded. Let \(Y = X_1 + X_2\) denote the sum of the scores, \(U = \min\{X_1, X_2\}\) the minimum score, and \(V = \max\{X_1, X_2\}\) the maximum score. Find the covariance and correlation of each of the following pairs of variables:
Suppose that \(n\) fair dice are thrown. Find the mean and variance of each of the following variables:
In the dice experiment, select the following random variables. In each case, increase the number of dice and observe the size and location of the probability density function and the mean-standard deviation bar. With \(n = 20\) dice, run the experiment 1000 times and note the apparent convergence of the empirical moments to the distribution moments.
Repeat Exercise 35 for ace-six flat dice.
Repeat Exercise 36 for ace-six flat dice.
A pair of fair dice are thrown and the scores \((X_1, X_2)\) recorded. Let \(Y = X_1 + X_2\) denote the sum of the scores, \(U = \min\{X_1, X_2\}\) the minimum score, and \(V = \max\{X_1, X_2\}\) the maximum score. Find each of the following:
Recall that a Bernoulli trials process is a sequence \(\boldsymbol{X} = (X_1, X_2, \ldots)\) of independent, identically distributed indicator random variables. In the usual language of reliability, \(X_i\) denotes the outcome of trial \(i\), where 1 denotes success and 0 denotes failure. The probability of success \(p = \P(X_i = 1)\) is the basic parameter of the process. The process is named for Jacob Bernoulli. A separate chapter on the Bernoulli Trials explores this process in detail.
The number of successes in the first \(n\) trials is \(Y = \sum_{i=1}^n X_i\). Recall that this random variable has the binomial distribution with parameters \(n\) and \(p\), which has probability density function
\[ f(y) = \binom{n}{y} p^y (1 - p)^{n - y}, \quad y \in \{0, 1, \ldots, n\} \]The mean and variance of \(Y\) are
These results could be derived from the PDF of \( Y \), of course, but a derivation based on the sum of IID variables is much better. Recall that \( \E(X_i) = p \) and \( \var(X_i) = p (1 - p) \) so the results follow immediately from Theorem 14.
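For comparison, the moments can also be computed directly from the PDF and checked against the standard values \(n p\) and \(n p (1 - p)\) obtained from the sum of indicator variables. This is a sketch with hypothetical parameter values; `binomial_moments` is an illustrative helper.

```python
from fractions import Fraction as F
from math import comb

def binomial_moments(n, p):
    """Mean and variance of the binomial distribution, computed from its PDF."""
    pdf = {y: comb(n, y) * p ** y * (1 - p) ** (n - y) for y in range(n + 1)}
    mean = sum(y * f for y, f in pdf.items())
    second = sum(y * y * f for y, f in pdf.items())
    return mean, second - mean ** 2

n, p = 10, F(3, 10)   # hypothetical parameters
mean, variance = binomial_moments(n, p)

# Values from the representation of Y as a sum of n IID indicators.
print(mean, n * p)
print(variance, n * p * (1 - p))
```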
In the binomial coin experiment, select the number of heads. Vary \(n\) and \(p\) and note the shape of the probability density function and the size and location of the mean-standard deviation bar. For selected values of the parameters, run the experiment 1000 times and note the apparent convergence of the sample mean and standard deviation to the distribution mean and standard deviation.
The proportion of successes in the first \(n\) trials is \(M = Y / n\). This random variable is sometimes used as a statistical estimator of the parameter \(p\), when the parameter is unknown.
The mean and variance of \(M\) are
These results follow immediately from the previous theorem and Theorem 15.
In the binomial coin experiment, select the proportion of heads. Vary \(n\) and \(p\) and note the shape of the probability density function and the size and location of the mean-standard deviation bar. For selected values of the parameters, run the experiment 1000 times and note the apparent convergence of the sample mean and standard deviation to the distribution mean and standard deviation.
Suppose that a population consists of \(m\) objects; \(r\) of the objects are type 1 and \(m - r\) are type 0. A sample of \(n\) objects is chosen at random, without replacement. Let \(X_i\) denote the type of the \(i\)th object selected. Recall that \((X_1, X_2, \ldots, X_n)\) is a sequence of identically distributed (but not independent) indicator random variables.
Let \(Y\) denote the number of type 1 objects in the sample, so that \(Y = \sum_{i=1}^n X_i\). Recall that this random variable has the hypergeometric distribution, which has probability density function
\[ f(y) = \frac{\binom{r}{y} \binom{m - r}{n - y}}{\binom{m}{n}}, \quad y \in \{0, 1, \ldots, n\} \]For distinct \(i\) and \(j\),
Recall that the sequence of indicator variables is exchangeable and that \( \E(X_i) = \P(X_i = 1) = \frac{r}{m} \) for each \( i \) and \( \E(X_i X_j) = \P(X_i = 1, X_j = 1) = \frac{r}{m} \frac{r - 1}{m - 1} \) for each \( i \ne j \). The results now follow from the definitions and simple algebra.
Note that the event of a type 1 object on draw \(i\) and the event of a type 1 object on draw \(j\) are negatively correlated, but the correlation depends only on the population size and not on the number of type 1 objects. Note also that the correlation is perfect if \(m = 2\). Think about these results intuitively.
The mean and variance of \(Y\) are
Again, a derivation from the representation of \( Y \) as a sum of indicator variables is far preferable to a derivation based on the PDF of \( Y \). These results follow immediately from the previous theorem, the additivity of expected value, and Theorem 11.
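The two routes to the moments can be compared numerically. The sketch below computes the mean and variance from the hypergeometric PDF and again from the indicator representation, using \(\E(X_i) = r / m\), \(\E(X_i X_j) = \frac{r}{m} \frac{r - 1}{m - 1}\), and the variance-of-a-sum formula. The parameter values are hypothetical.

```python
from fractions import Fraction as F
from math import comb

m, r, n = 20, 8, 5   # hypothetical population size, type 1 count, sample size

# Moments computed directly from the hypergeometric PDF.
pdf = {y: F(comb(r, y) * comb(m - r, n - y), comb(m, n)) for y in range(n + 1)}
mean_pdf = sum(y * f for y, f in pdf.items())
var_pdf = sum(y * y * f for y, f in pdf.items()) - mean_pdf ** 2

# Moments from the representation Y = X_1 + ... + X_n with exchangeable indicators.
p = F(r, m)                                # E(X_i)
cov_ij = p * F(r - 1, m - 1) - p * p       # cov(X_i, X_j) for distinct i, j
cor_ij = cov_ij / (p * (1 - p))            # cor(X_i, X_j)
mean_ind = n * p
var_ind = n * p * (1 - p) + n * (n - 1) * cov_ij

print(mean_pdf, mean_ind)
print(var_pdf, var_ind)
print(cor_ij)
```

The last value depends only on the population size `m`, in line with the remark on correlation between draws.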
In the ball and urn experiment, select sampling without replacement. Vary \(m\), \(r\), and \(n\) and note the shape of the probability density function and the size and location of the mean-standard deviation bar. For selected values of the parameters, run the experiment 1000 times and note the apparent convergence of the sample mean and standard deviation to the distribution mean and standard deviation.
Suppose that \(X\) and \(Y\) are real-valued random variables with \(\cov(X, Y) = 3\). Find \(\cov(2 X - 5, 4 Y + 2)\).
24
Suppose \(X\) and \(Y\) are real-valued random variables with \(\var(X) = 5\), \(\var(Y) = 9\), and \(\cov(X, Y) = - 3\). Find \(\var(2 X + 3 Y - 7)\).
65
Suppose that \(X\) and \(Y\) are independent, real-valued random variables with \(\var(X) = 6\) and \(\var(Y) = 8\). Find \(\var(3 X - 4 Y + 5)\).
182
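The three answers above follow from bilinearity of covariance and the formula for the variance of a sum. As a quick check, the arithmetic can be sketched as follows; `cov_linear` and `var_linear` are hypothetical helper names.

```python
def cov_linear(b, d, cov_xy):
    """cov(a + b X, c + d Y) = b d cov(X, Y); the additive constants drop out."""
    return b * d * cov_xy

def var_linear(b, d, var_x, var_y, cov_xy):
    """var(b X + d Y + c) = b^2 var(X) + d^2 var(Y) + 2 b d cov(X, Y)."""
    return b ** 2 * var_x + d ** 2 * var_y + 2 * b * d * cov_xy

print(cov_linear(2, 4, 3))          # cov(2X - 5, 4Y + 2)
print(var_linear(2, 3, 5, 9, -3))   # var(2X + 3Y - 7)
print(var_linear(3, -4, 6, 8, 0))   # var(3X - 4Y + 5); independence gives cov = 0
```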
Suppose that \(A\) and \(B\) are events in an experiment with \(\P(A) = \frac{1}{2}\), \(\P(B) = \frac{1}{3}\), and \(\P(A \cap B) = \frac{1}{8}\). Find the covariance and correlation between \(A\) and \(B\).
Suppose that \((X, Y)\) has probability density function \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Find each of the following
Suppose that \((X, Y)\) has probability density function \(f(x, y) = 2 (x + y)\) for \(0 \le x \le y \le 1\). Find each of the following:
Suppose again that \((X, Y)\) has probability density function \(f(x, y) = 2 (x + y)\) for \(0 \le x \le y \le 1\).
Suppose that \((X, Y)\) has probability density function \(f(x, y) = 6 x^2 y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Find each of the following:
Note that \(X\) and \(Y\) are independent.
Suppose that \((X, Y)\) has probability density function \(f(x, y) = 15 x^2 y\) for \(0 \le x \le y \le 1\). Find each of the following:
Suppose again that \((X, Y)\) has probability density function \(f(x, y) = 15 x^2 y\) for \(0 \le x \le y \le 1\).
Covariance is closely related to the concept of inner product in the theory of vector spaces. This connection can help illustrate many of the properties of covariance from a different point of view.
In this section, our vector space \(\mathscr{V}_2\) consists of all real-valued random variables defined on a fixed probability space \((\Omega, \mathscr{F}, \P)\) (that is, relative to the same random experiment) that have finite second moment. Recall that two random variables are equivalent if they are equal with probability 1. As usual, we consider two such random variables as the same vector, so that technically, our vector space consists of equivalence classes under this equivalence relation. The addition operator corresponds to the usual addition of two real-valued random variables, and the operation of scalar multiplication corresponds to the usual multiplication of a real-valued random variable by a real (non-random) number.
If \(X\) and \(Y\) are random variables in \(\mathscr{V}_2\), we define the inner product of \(X\) and \(Y\) by
\[ \langle X, Y \rangle = \E(X Y) \]The following exercise gives results that are analogs of the basic properties of covariance given above, and shows that this definition really does give an inner product on the vector space.
Inner product satisfies the following properties:
Part (a) is trivial from the definition. Parts (b) and (c) follow from the basic inequality properties for the expected value of a nonnegative random variable: \( \E(X^2) \ge 0 \) and \( \E(X^2) = 0 \) if and only if \( \P(X = 0) = 1 \). Parts (d) and (e) follow from the linearity properties of expected value: \( \E(a X Y) = a \E(X Y) \), \( \E[(X + Y) Z] = \E(X Z) + \E(Y Z) \).
Covariance and correlation can easily be expressed in terms of this inner product. The covariance of two random variables is the inner product of the corresponding centered variables. The correlation is the inner product of the corresponding standard scores.
Covariance and correlation can be expressed in terms of inner product as follows:
Part (a) is simply a restatement of the definition of covariance. Part (b) is a restatement of Theorem 7.
The norm associated with the inner product is the 2-norm studied in the last section, and corresponds to the root mean square operation on a random variable. This fact is a fundamental reason why the 2-norm plays such a special, honored role; of all the \(k\)-norms, only the 2-norm corresponds to an inner product. In turn, this is one of the reasons that root mean square difference is of fundamental importance in probability and statistics. Technically, the vector space \( \mathscr{V}_2 \) is a Hilbert space, named for David Hilbert.
\(\langle X, X \rangle = \|X\|_2^2 = \E(X^2)\).
Let \(X\) and \(Y\) be random variables in \(\mathscr{V}_2\).
The following set is a subspace of \(\mathscr{V}_2\). In fact, it is the subspace generated by \(X\) and 1.
\[ \mathscr{W} = \{a + b X: a \in \R, \; b \in \R\} \]Note that \( \mathscr{W} \) is the set of all linear combinations of the vectors \( 1 \) and \( X \). The statement that \( \mathscr{W} \) is a vector space means that \( \mathscr{W} \) is closed under addition and scalar multiplication: if \( U, \; V \in \mathscr{W} \) then \( U + V \in \mathscr{W} \). If \( U \in \mathscr{W} \) and \( c \in \R \) then \( c U \in \mathscr{W} \).
The best linear predictor of \(Y\) given \(X\) can be characterized as the projection of \(Y\) onto the subspace \(\mathscr{W}\). That is, \(L(Y \mid X)\) is the only random variable \(W \in \mathscr{W}\) with the property that \(Y - W\) is perpendicular to \(\mathscr{W}\). Specifically \(L(Y \mid X)\) is the only \(W \in \mathscr{W}\) that satisfies the following properties:
Let \( W = L(Y \mid X) = \E(Y) + [\cov(X, Y) / \var(X)] [X - \E(X)] \). Clearly \( W \in \mathscr{W} \). Note that \( \E\left(X [X - \E(X)]\right) = \var(X) \). Hence \( \E(X W) = \E(X) \E(Y) + \cov(X, Y) = \E(X Y) \). This gives (a). We already showed in Theorem 24 that \( \E(W) = \E(Y) \) so (b) holds as well.
Conversely, suppose that \( W = a + b X \) for some \( a, \; b \in \R \) and that (a) and (b) are satisfied. Then from (b), \( \E(W) = a + b \E(X) = \E(Y) \) so \( a = \E(Y) - b \E(X) \). From (a), \( \E(X W) = a \E(X) + b \E(X^2) = \E(X Y) \). Substituting the expression for \( a \) gives \( b = \cov(X, Y) / \var(X) \). Substituting back gives \( a = \E(Y) - \cov(X, Y) \E(X) / \var(X) \). Hence \( W = L(Y \mid X) \).
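The projection property can be checked exactly: the residual \(Y - L(Y \mid X)\) should have inner product 0 with both \(1\) and \(X\), and hence with every element of \(\mathscr{W}\). The sketch below assumes a hypothetical joint pmf; the names `expect` and `W` are illustrative.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, Y), chosen only for illustration.
pmf = {(-1, 2): F(1, 5), (0, 0): F(2, 5), (1, 1): F(1, 5), (2, 4): F(1, 5)}

def expect(g):
    """Expected value of g(X, Y) under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

EX = expect(lambda x, y: x)
EY = expect(lambda x, y: y)
var_x = expect(lambda x, y: x * x) - EX ** 2
cov_xy = expect(lambda x, y: x * y) - EX * EY

def W(x):
    """Value of the best linear predictor L(Y | X) when X = x."""
    return EY + cov_xy / var_x * (x - EX)

# Inner products <X, Y - W> and <1, Y - W>.
inner_with_X = expect(lambda x, y: x * (y - W(x)))
inner_with_1 = expect(lambda x, y: y - W(x))

# Inner product of the residual with an arbitrary element 3 - 7X of the subspace.
inner_with_combo = expect(lambda x, y: (3 - 7 * x) * (y - W(x)))

print(inner_with_X, inner_with_1, inner_with_combo)
```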
The next exercise gives Hölder's inequality, named for Otto Hölder.
Suppose that \(j \gt 1\), \(k \gt 1\), and \(\frac{1}{j} + \frac{1}{k} = 1\). Then \(\langle |X|, |Y| \rangle \le \|X\|_j \|Y\|_k \).
Note that \(S = \{(x, y) \in \R^2: x \ge 0, \; y \ge 0\}\) is a convex set and \(g(x, y) = x^{1/j} y^{1/k}\) is concave on \(S\). From Jensen's inequality, if \(U\) and \(V\) are nonnegative random variables then \(\E(U^{1/j} V^{1/k}) \le [\E(U)]^{1/j} [\E(V)]^{1/k}\). Substituting \(U = |X|^j\) and \(V = |Y|^k\) gives the result.
To show that \( g \) really is concave on \( S \), we compute the second derivative matrix:
\[ \left[ \begin{matrix} (1 / j)(1 / j - 1) x^{1 / j - 2} y^{1 / k} & (1 / j)(1 / k) x^{1 / j - 1} y^{1 / k - 1} \\ (1 / j)(1 / k) x^{1 / j - 1} y^{1 / k - 1} & (1 / k)(1 / k - 1) x^{1 / j} y^{1 / k - 2} \end{matrix} \right] \]Since \( 1 / j \lt 1 \) and \( 1 / k \lt 1 \), the diagonal entries are negative on \( S \). The determinant simplifies to
\[ (1 / j)(1 / k) x^{2 / j - 2} y^{2 / k - 2} [1 - (1 / j + 1 / k)] = 0 \]In the context of the last theorem, \(j\) and \(k\) are called conjugate exponents. If we let \(j = k = 2\) in Hölder's inequality, then we get the Cauchy-Schwarz inequality, named for Augustin Cauchy and Karl Schwarz. In turn, this is equivalent to the inequalities in Exercise 22.
\[ \E(|X| |Y|) \le \sqrt{\E(X^2)} \sqrt{\E(Y^2)} \]Suppose that \((X, Y)\) has probability density function \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Verify Hölder's inequality in the following cases:
If \(j\) and \(k\) are conjugate exponents then
The following exercise is an analog of the result in Exercise 12.
The parallelogram rule: if \(X, \; Y \in \mathscr{V}_2\) then
\[ \|X + Y\|_2^2 + \|X - Y\|_2^2 = 2 \|X\|_2^2 + 2 \|Y\|_2^2\]The following exercise is an analog of the result in Exercise 11.
The Pythagorean theorem, named for Pythagoras of course: if \((X_1, X_2, \ldots, X_n)\) is a sequence of random variables in \(\mathscr{V}_2\) with \(\langle X_i, X_j \rangle = 0\) for \(i \ne j\) then
\[ \left \| \sum_{i=1}^n X_i \right \|_2^2 = \sum_{i=1}^n \|X_i\|_2^2 \]