Recall that by taking the expected value of various transformations of a random variable, we can measure many interesting characteristics of the distribution of the variable. In this section, we will study an expected value that measures a special type of relationship between two real-valued variables. This relationship is very important both in probability and statistics.
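As usual, our starting point is a random experiment with probability measure $\mathbb{P}$ on an underlying sample space. Suppose that $X$ and $Y$ are real-valued random variables for the experiment with means $\mathbb{E}(X)$, $\mathbb{E}(Y)$ and variances $\operatorname{var}(X)$, $\operatorname{var}(Y)$, respectively (assumed finite). The covariance of $X$ and $Y$ is defined by
$$\operatorname{cov}(X, Y) = \mathbb{E}\left[\left(X - \mathbb{E}(X)\right)\left(Y - \mathbb{E}(Y)\right)\right]$$
and (assuming the variances are positive) the correlation of $X$ and $Y$ is defined by
$$\operatorname{cor}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\operatorname{sd}(X) \operatorname{sd}(Y)}$$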
Correlation is a scaled version of covariance; note that the two parameters always have the same sign (positive, negative, or 0). When the sign is positive, the variables are said to be positively correlated; when the sign is negative, the variables are said to be negatively correlated; and when the sign is 0, the variables are said to be uncorrelated. Note also that correlation is dimensionless, since the numerator and denominator have the same physical units.
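To make the two definitions concrete, here is a minimal sketch that computes covariance and correlation from a joint probability function. The joint distribution in `joint` and the helper names `expect`, `covariance`, and `correlation` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import sqrt

# Hypothetical joint probability function of (X, Y), chosen only for illustration.
joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}

def expect(g, pmf):
    """Expected value of g(x, y) under the joint probability function pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

def covariance(pmf):
    """Expected product of the deviations of X and Y from their means."""
    mx = expect(lambda x, y: x, pmf)
    my = expect(lambda x, y: y, pmf)
    return expect(lambda x, y: (x - mx) * (y - my), pmf)

def correlation(pmf):
    """Covariance divided by the product of the standard deviations."""
    mx = expect(lambda x, y: x, pmf)
    my = expect(lambda x, y: y, pmf)
    var_x = expect(lambda x, y: (x - mx) ** 2, pmf)
    var_y = expect(lambda x, y: (y - my) ** 2, pmf)
    return float(covariance(pmf)) / sqrt(var_x * var_y)

cov_xy = covariance(joint)   # exact rational value
cor_xy = correlation(joint)  # floating point value between -1 and 1
```

For this particular table the covariance and correlation are both positive, in agreement with the remark that the two parameters always have the same sign.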
As these terms suggest, covariance and correlation measure a certain kind of dependence between the variables. One of our goals is a deep understanding of this dependence. As a start, note that $\left(\mathbb{E}(X), \mathbb{E}(Y)\right)$ is the center of the joint distribution of $(X, Y)$, and the vertical and horizontal lines through this point separate $\mathbb{R}^2$ into four quadrants. The function $(x, y) \mapsto \left(x - \mathbb{E}(X)\right)\left(y - \mathbb{E}(Y)\right)$ is positive on the first and third of these quadrants and negative on the second and fourth.
The following exercises give some basic properties of covariance. The main tool that you will need is the fact that expected value is a linear operation. Other important properties will be derived below, in the subsection on the best linear predictor.
Show that $\operatorname{cov}(X, Y) = \mathbb{E}(X Y) - \mathbb{E}(X) \mathbb{E}(Y)$.
By Exercise 1, we see that $X$ and $Y$ are uncorrelated if and only if $\mathbb{E}(X Y) = \mathbb{E}(X) \mathbb{E}(Y)$. In particular, if $X$ and $Y$ are independent, then they are uncorrelated. However, the converse fails with a passion, as the following exercise shows. (Other examples of dependent yet uncorrelated variables occur in the computational exercises.)
Suppose that $X$ is uniformly distributed on the interval $[-a, a]$, where $a > 0$, and $Y = X^2$. Show that $X$ and $Y$ are uncorrelated even though $Y$ is a function of $X$ (the strongest form of dependence).
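The same phenomenon can be seen exactly in a discrete analog. The sketch below is an assumption made for illustration, not the continuous case of the exercise: it takes a variable uniformly distributed on the symmetric set $\{-2, -1, 0, 1, 2\}$ together with its square, and computes the covariance as the mean of the product minus the product of the means.

```python
from fractions import Fraction

# Discrete analog (an assumption for illustration): X uniform on {-2,...,2}, Y = X^2.
values = [-2, -1, 0, 1, 2]
p = Fraction(1, len(values))

e_x = sum(p * x for x in values)              # mean of X
e_y = sum(p * x**2 for x in values)           # mean of Y = X^2
e_xy = sum(p * x * x**2 for x in values)      # mean of X * Y = X^3
cov_xy = e_xy - e_x * e_y

# Dependence: P(X = 2, Y = 4) differs from P(X = 2) * P(Y = 4).
p_joint = p                # the event {X = 2} forces Y = 4
p_product = p * (2 * p)    # P(Y = 4) = P(X = -2) + P(X = 2)
```

The covariance is exactly zero because the odd moments of a symmetric distribution vanish, yet the joint probability does not factor, so the variables are dependent.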
Show that $\operatorname{cov}(X, Y) = \operatorname{cov}(Y, X)$.
Show that $\operatorname{cov}(X, X) = \operatorname{var}(X)$. Thus, covariance subsumes variance.
Show that
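Suppose that $(X_1, X_2, \ldots, X_n)$ and $(Y_1, Y_2, \ldots, Y_m)$ are sequences of real-valued random variables for an experiment. Prove the following property (known as bi-linearity): for real constants $a_1, \ldots, a_n$ and $b_1, \ldots, b_m$,
$$\operatorname{cov}\left(\sum_{i=1}^n a_i X_i, \sum_{j=1}^m b_j Y_j\right) = \sum_{i=1}^n \sum_{j=1}^m a_i b_j \operatorname{cov}(X_i, Y_j)$$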
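Show that the correlation between $X$ and $Y$ is simply the covariance of the corresponding standard scores:
$$\operatorname{cor}(X, Y) = \operatorname{cov}\left(\frac{X - \mathbb{E}(X)}{\operatorname{sd}(X)}, \frac{Y - \mathbb{E}(Y)}{\operatorname{sd}(Y)}\right)$$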
We will now show that the variance of a sum of variables is the sum of the pairwise covariances. This result is very useful since many random variables with common distributions can be written as sums of simpler random variables (see in particular the binomial distribution and hypergeometric distribution below).
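Suppose that $(X_1, X_2, \ldots, X_n)$ is a sequence of real-valued random variables. Use Exercise 3, Exercise 4, and Exercise 5 to show that
$$\operatorname{var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \sum_{j=1}^n \operatorname{cov}(X_i, X_j) = \sum_{i=1}^n \operatorname{var}(X_i) + 2 \sum_{i \lt j} \operatorname{cov}(X_i, X_j)$$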
Note that the variance of a sum can be greater than, smaller than, or equal to the sum of the variances, depending on the pure covariance terms.
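Note that if the random variables are pairwise uncorrelated, then the pure covariance terms vanish and the variance of the sum equals the sum of the variances. This holds, in particular, if the random variables are mutually independent.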
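The variance of a sum can be checked exactly on a small example. The following sketch is an illustration under assumed choices: it uses the sample space of two fair dice and three hypothetical variables (the first score, the second score, and the maximum score), and compares the variance of their sum with the sum of all pairwise covariances.

```python
from fractions import Fraction
from itertools import product

# Sample space of two fair dice; each outcome has probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))

# Three random variables on this space (chosen only for illustration).
variables = [
    lambda w: w[0],      # first score
    lambda w: w[1],      # second score
    lambda w: max(w),    # maximum score
]

def mean(f):
    return sum(p * f(w) for w in outcomes)

def cov(f, g):
    """Covariance computed as E(fg) - E(f)E(g); cov(f, f) is the variance."""
    return mean(lambda w: f(w) * g(w)) - mean(f) * mean(g)

def total(w):
    return sum(f(w) for f in variables)

var_of_sum = cov(total, total)
sum_of_covs = sum(cov(f, g) for f in variables for g in variables)
sum_of_vars = sum(cov(f, f) for f in variables)
```

Here the variance of the sum exceeds the sum of the variances, because the maximum score is positively correlated with each individual score while the two scores themselves are uncorrelated.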
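Suppose that $X$ and $Y$ are real-valued random variables. Show that $\operatorname{var}(X + Y) + \operatorname{var}(X - Y) = 2 \left[\operatorname{var}(X) + \operatorname{var}(Y)\right]$.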
Suppose that $X$ and $Y$ are real-valued random variables with $\operatorname{var}(X) = \operatorname{var}(Y)$. Show that $X + Y$ and $X - Y$ are uncorrelated.
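In the following exercises, suppose that $(X_1, X_2, \ldots)$ is a sequence of independent, real-valued random variables with a common distribution that has mean $\mu$ and standard deviation $\sigma$. (Thus, the variables form a random sample from the common distribution).
Let $Y_n = \sum_{i=1}^n X_i$. Show that (a) $\mathbb{E}(Y_n) = n \mu$; (b) $\operatorname{var}(Y_n) = n \sigma^2$.
Let $M_n = Y_n / n$. Thus, $M_n$ is the sample mean. Show that (a) $\mathbb{E}(M_n) = \mu$; (b) $\operatorname{var}(M_n) = \mathbb{E}\left[(M_n - \mu)^2\right] = \sigma^2 / n \to 0$ as $n \to \infty$; (c) $\mathbb{P}\left(\left|M_n - \mu\right| \gt \epsilon\right) \to 0$ as $n \to \infty$ for every $\epsilon \gt 0$.
Part (b) of the last exercise means that $M_n \to \mu$ as $n \to \infty$ in mean square. Part (c) means that $M_n \to \mu$ as $n \to \infty$ in probability. These are both versions of the weak law of large numbers, one of the fundamental theorems of probability.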
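The shrinking variance of the sample mean can be verified exactly for a concrete distribution. The sketch below assumes, purely for illustration, that the common distribution is that of a fair die, and builds the distribution of the sum by repeated convolution; the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)
mu = sum(p * x for x in faces)                    # mean of one fair die
sigma2 = sum(p * (x - mu) ** 2 for x in faces)    # variance of one fair die

def sum_distribution(n):
    """Probability function of the sum of n independent dice, by convolution."""
    dist = {0: Fraction(1)}
    for _ in range(n):
        new = defaultdict(Fraction)
        for s, q in dist.items():
            for x in faces:
                new[s + x] += q * p
        dist = dict(new)
    return dist

def mean_and_variance_of_sample_mean(n):
    """Exact mean and variance of the sample mean of n dice."""
    dist = sum_distribution(n)
    m = sum(q * Fraction(s, n) for s, q in dist.items())
    v = sum(q * (Fraction(s, n) - m) ** 2 for s, q in dist.items())
    return m, v

results = {n: mean_and_variance_of_sample_mean(n) for n in (1, 2, 5, 10)}
```

For each sample size the mean stays at the distribution mean while the variance equals the distribution variance divided by the sample size, which is the mean square convergence described above.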
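Let $Z_n = \frac{Y_n - n \mu}{\sqrt{n} \sigma}$. Thus, $Z_n$ is the standard score associated with $Y_n$. Show that (a) $Z_n = \frac{M_n - \mu}{\sigma / \sqrt{n}}$; (b) $\mathbb{E}(Z_n) = 0$; (c) $\operatorname{var}(Z_n) = 1$.
The central limit theorem, the other fundamental theorem of probability, states that the distribution of $Z_n$ converges to the standard normal distribution as $n \to \infty$.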
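Suppose that $A$ and $B$ are events in a random experiment. The covariance and correlation of $A$ and $B$ are defined to be the covariance and correlation, respectively, of their indicator random variables $\mathbf{1}_A$ and $\mathbf{1}_B$.
Show that (a) $\operatorname{cov}(A, B) = \mathbb{P}(A \cap B) - \mathbb{P}(A) \mathbb{P}(B)$; (b) $\operatorname{cor}(A, B) = \frac{\mathbb{P}(A \cap B) - \mathbb{P}(A) \mathbb{P}(B)}{\sqrt{\mathbb{P}(A)\left[1 - \mathbb{P}(A)\right] \mathbb{P}(B)\left[1 - \mathbb{P}(B)\right]}}$.
In particular, note that $A$ and $B$ are positively correlated, negatively correlated, or independent, respectively (as defined in the section on conditional probability) if and only if the indicator variables of $A$ and $B$ are positively correlated, negatively correlated, or uncorrelated, as defined in this section.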
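As a small illustration, the covariance of two events can be computed directly from their indicator variables. The sketch below assumes two fair dice and three hypothetical events chosen for the example; the helper names `prob` and `indicator_cov` are likewise not from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space of two fair dice; each outcome has probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))

def prob(event):
    """Probability of an event given as a true/false function of the outcome."""
    return sum(p for w in outcomes if event(w))

def indicator_cov(a, b):
    """Covariance of the indicator variables of events a and b."""
    e_ab = sum(p * int(a(w)) * int(b(w)) for w in outcomes)
    return e_ab - prob(a) * prob(b)

# Hypothetical events, chosen only for illustration.
A = lambda w: w[0] + w[1] >= 10   # the sum is at least 10
B = lambda w: w[0] == 6           # the first die shows 6
C = lambda w: w[1] % 2 == 0       # the second die is even

cov_ab = indicator_cov(A, B)   # positive: a 6 makes a large sum more likely
cov_bc = indicator_cov(B, C)   # zero: the two dice are independent
```

The first pair of events is positively correlated, while the second pair is independent and so has covariance zero.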
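Show that (a) $\operatorname{cov}(A, B^c) = -\operatorname{cov}(A, B)$; (b) $\operatorname{cov}(A^c, B^c) = \operatorname{cov}(A, B)$.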
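Suppose that $A \subseteq B$. Show that (a) $\operatorname{cov}(A, B) = \mathbb{P}(A)\left[1 - \mathbb{P}(B)\right]$; (b) $\operatorname{cor}(A, B) = \sqrt{\frac{\mathbb{P}(A)\left[1 - \mathbb{P}(B)\right]}{\mathbb{P}(B)\left[1 - \mathbb{P}(A)\right]}}$.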
What linear function of $X$ is closest to $Y$ in the sense of minimizing mean square error? The question is fundamentally important in the case where random variable $X$ (the predictor variable) is observable and random variable $Y$ (the response variable) is not. The linear function can be used to estimate $Y$ from an observed value of $X$. Moreover, the solution will show that covariance and correlation measure the linear relationship between $X$ and $Y$. To avoid trivial cases, let us assume that $\operatorname{var}(X) \gt 0$ and $\operatorname{var}(Y) \gt 0$, so that the random variables really are random.
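Let $\operatorname{MSE}(a, b)$ denote the mean square error when $a X + b$ is used as an estimator of $Y$ (as a function of the parameters $a$ and $b$):
$$\operatorname{MSE}(a, b) = \mathbb{E}\left(\left[Y - (a X + b)\right]^2\right)$$
Show that
$$\operatorname{MSE}(a, b) = \mathbb{E}(Y^2) - 2 a \mathbb{E}(X Y) - 2 b \mathbb{E}(Y) + a^2 \mathbb{E}(X^2) + 2 a b \mathbb{E}(X) + b^2$$
Use basic calculus to show that $\operatorname{MSE}(a, b)$ is minimized when
$$a = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}, \quad b = \mathbb{E}(Y) - \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} \mathbb{E}(X)$$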
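Thus, the best linear predictor of $Y$ given $X$ is the random variable $L(Y \mid X)$ given by
$$L(Y \mid X) = \mathbb{E}(Y) + \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} \left[X - \mathbb{E}(X)\right]$$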
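Show that the minimum value of the mean square error function MSE is
$$\mathbb{E}\left(\left[Y - L(Y \mid X)\right]^2\right) = \operatorname{var}(Y) \left[1 - \operatorname{cor}^2(X, Y)\right]$$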
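From the last exercise, verify the following important properties: (a) $-1 \le \operatorname{cor}(X, Y) \le 1$; (b) $-\operatorname{sd}(X) \operatorname{sd}(Y) \le \operatorname{cov}(X, Y) \le \operatorname{sd}(X) \operatorname{sd}(Y)$; (c) $\operatorname{cor}(X, Y) = 1$ if and only if $Y = a X + b$ with probability 1 for some constants $a \gt 0$ and $b$; (d) $\operatorname{cor}(X, Y) = -1$ if and only if $Y = a X + b$ with probability 1 for some constants $a \lt 0$ and $b$.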
These exercises show clearly that $\operatorname{cov}(X, Y)$ and $\operatorname{cor}(X, Y)$ measure the linear association between $X$ and $Y$.
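The best linear predictor and its mean square error can be worked out exactly on a small example. The sketch below is an illustration under assumed choices: two fair dice, with the first score as predictor and the sum of the scores as response. It takes the slope to be the covariance divided by the variance of the predictor and picks the intercept so that the line passes through the means; all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space of two fair dice; each outcome has probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))

X = lambda w: w[0]           # predictor: first score (illustrative choice)
Y = lambda w: w[0] + w[1]    # response: sum of the scores (illustrative choice)

def mean(f):
    return sum(p * f(w) for w in outcomes)

def cov(f, g):
    """Covariance computed as E(fg) - E(f)E(g); cov(f, f) is the variance."""
    return mean(lambda w: f(w) * g(w)) - mean(f) * mean(g)

def mse(a, b):
    """Mean square error when a * X + b is used to predict Y."""
    return mean(lambda w: (Y(w) - (a * X(w) + b)) ** 2)

a = cov(X, Y) / cov(X, X)        # slope of the best linear predictor
b = mean(Y) - a * mean(X)        # intercept of the best linear predictor
cor_squared = cov(X, Y) ** 2 / (cov(X, X) * cov(Y, Y))
min_mse = mse(a, b)
```

For this choice the slope is 1 and the intercept is 7/2: the best linear guess for the sum is the observed first score plus the mean of the second. The minimum mean square error is the variance of the response times one minus the squared correlation, and moving either parameter away from these values increases the error.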
Recall that the best constant predictor of $Y$, in the sense of minimizing mean square error, is $\mathbb{E}(Y)$ and the minimum value of the mean square error for this predictor is $\operatorname{var}(Y)$. Thus, the difference between the variance of $Y$ and the mean square error in Exercise 20 is the reduction in the variance of $Y$ when the linear term in $X$ is added to the predictor.
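Show that $\operatorname{var}(Y) - \mathbb{E}\left(\left[Y - L(Y \mid X)\right]^2\right) = \operatorname{var}(Y) \operatorname{cor}^2(X, Y)$. The fraction of the reduction is $\operatorname{cor}^2(X, Y)$, and hence this quantity is called the (distribution) coefficient of determination.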
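Now let
$$L(Y \mid X = x) = \mathbb{E}(Y) + \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} \left[x - \mathbb{E}(X)\right], \quad x \in \mathbb{R}$$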
The function $x \mapsto L(Y \mid X = x)$ is known as the distribution regression function for $Y$ given $X$, and its graph is known as the distribution regression line. Note that the regression line passes through $\left(\mathbb{E}(X), \mathbb{E}(Y)\right)$, the center of the joint distribution.
Show that $\mathbb{E}\left[L(Y \mid X)\right] = \mathbb{E}(Y)$.
However, the choice of predictor variable and response variable is crucial.
Show that the regression line for $Y$ given $X$ and the regression line for $X$ given $Y$ are not the same line, except in the trivial case where the variables are perfectly correlated. However, the coefficient of determination is the same, regardless of which variable is the predictor and which is the response.
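Suppose that $A$ and $B$ are events in a random experiment with $0 \lt \mathbb{P}(A) \lt 1$ and $0 \lt \mathbb{P}(B) \lt 1$. Show that (a) $A$ and $B$ have correlation 1 if and only if $\mathbb{P}(A \setminus B) = 0$ and $\mathbb{P}(B \setminus A) = 0$; (b) $A$ and $B$ have correlation $-1$ if and only if $\mathbb{P}(A \setminus B^c) = 0$ and $\mathbb{P}(B^c \setminus A) = 0$.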
The concept of best linear predictor is more powerful than might first appear, because it can be applied to transformations of the variables. Specifically, suppose that $X$ and $Y$ are random variables for our experiment, taking values in general spaces $S$ and $T$, respectively. Suppose also that $g$ and $h$ are real-valued functions defined on $S$ and $T$, respectively. We can find $L\left[h(Y) \mid g(X)\right]$, the linear function of $g(X)$ that is closest to $h(Y)$ in the mean square sense. The results of this subsection apply, of course, with $g(X)$ replacing $X$ and $h(Y)$ replacing $Y$.
Suppose that $Z$ is another real-valued random variable for the experiment and that $c$ is a constant. Show that (a) $L(Y + Z \mid X) = L(Y \mid X) + L(Z \mid X)$; (b) $L(c Y \mid X) = c L(Y \mid X)$.
There are several extensions and generalizations of the ideas in the subsection:
Suppose that $(X, Y)$ is uniformly distributed on the region $S \subseteq \mathbb{R}^2$. Find $\operatorname{cov}(X, Y)$ and $\operatorname{cor}(X, Y)$ and determine whether the variables are independent in each of the following cases:
In the bivariate uniform experiment, select each of the regions below in turn. For each region, run the simulation 2000 times, updating every 10 runs. Note the value of the correlation and the shape of the cloud of points in the scatterplot. Compare with the results in the last exercise.
Suppose that is uniformly distributed on the interval and that given , is uniformly distributed on the interval .
Recall that a standard die is a six-sided die. A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 have probability $\frac{1}{4}$ each, and faces 2, 3, 4, and 5 have probability $\frac{1}{8}$ each.
A pair of standard, fair dice are thrown and the scores recorded. Let $Y$ denote the sum of the scores, $U$ the minimum score, and $V$ the maximum score. Find the covariance and correlation of each of the following pairs of variables:
Suppose that $n$ fair dice are thrown. Find the mean and variance of each of the following variables:
In the dice experiment, select the following random variables. In each case, increase the number of dice and observe the size and location of the density function and the mean-standard deviation bar. With dice, run the experiment 1000 times, updating every 10 runs, and note the apparent convergence of the empirical moments to the distribution moments.
Repeat Exercise 31 for ace-six flat dice.
Repeat Exercise 32 for ace-six flat dice.
A pair of fair dice are thrown and the scores recorded. Let $Y$ denote the sum of the scores, $U$ the minimum score, and $V$ the maximum score. Find each of the following:
Recall that a Bernoulli trials process is a sequence $(X_1, X_2, \ldots)$ of independent, identically distributed indicator random variables. In the usual language of reliability, $X_i$ denotes the outcome of trial $i$, where 1 denotes success and 0 denotes failure. The probability of success $p = \mathbb{P}(X_i = 1)$ is the basic parameter of the process. The process is named for James Bernoulli. A separate chapter on the Bernoulli Trials explores this process in detail.
The number of successes in the first $n$ trials is $Y_n = \sum_{i=1}^n X_i$. Recall that this random variable has the binomial distribution with parameters $n$ and $p$, which has probability density function
$$\mathbb{P}(Y_n = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k \in \{0, 1, \ldots, n\}$$
Show that (a) $\mathbb{E}(Y_n) = n p$; (b) $\operatorname{var}(Y_n) = n p (1 - p)$.
In the binomial coin experiment, select the number of heads. Vary $n$ and $p$ and note the shape of the density function and the size and location of the mean-standard deviation bar. For selected values of $n$ and $p$, run the experiment 1000 times, updating every 10 runs, and note the apparent convergence of the sample mean and standard deviation to the distribution mean and standard deviation.
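The mean and variance of the binomial distribution can also be checked exactly from the probability density function, without simulation. The sketch below uses hypothetical parameter values (10 trials with success probability 1/3) and hypothetical function names; it compares the moments computed from the density with the product of the number of trials and the success probability, and with that product times the failure probability.

```python
from fractions import Fraction
from math import comb

def binomial_pdf(n, p):
    """Probability density function of the number of successes in n trials."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

def mean_and_variance(pdf):
    """Mean and variance of a distribution given as a value-to-probability map."""
    m = sum(k * q for k, q in pdf.items())
    v = sum((k - m) ** 2 * q for k, q in pdf.items())
    return m, v

n, p = 10, Fraction(1, 3)   # hypothetical parameter values
pdf = binomial_pdf(n, p)
mean_y, var_y = mean_and_variance(pdf)
```

Because the arithmetic uses exact fractions, the computed moments agree with the closed forms exactly rather than approximately.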
The proportion of successes in the first $n$ trials is $M_n = Y_n / n$. This random variable is sometimes used as a statistical estimator of the parameter $p$, when the parameter is unknown.
Show that (a) $\mathbb{E}(M_n) = p$; (b) $\operatorname{var}(M_n) = p (1 - p) / n$.
In the binomial coin experiment, select the proportion of heads. Vary $n$ and $p$ and note the shape of the density function and the size and location of the mean-standard deviation bar. For selected values of $n$ and $p$, run the experiment 1000 times, updating every 10 runs, and note the apparent convergence of the sample mean and standard deviation to the distribution mean and standard deviation.
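Suppose that a population consists of $m$ objects; $r$ of the objects are type 1 and $m - r$ are type 0. A sample of $n$ objects is chosen at random, without replacement. Let $X_i$ denote the type of the $i$th object selected. Recall that $(X_1, X_2, \ldots, X_n)$ is a sequence of identically distributed (but not independent) indicator random variables. In fact the sequence is exchangeable.
Let $Y$ denote the number of type 1 objects in the sample, so that $Y = \sum_{i=1}^n X_i$. Recall that this random variable has the hypergeometric distribution, which has probability density function
$$\mathbb{P}(Y = k) = \frac{\binom{r}{k} \binom{m - r}{n - k}}{\binom{m}{n}}, \quad k \in \{0, 1, \ldots, n\}$$
Show that for distinct $i$ and $j$, (a) $\operatorname{cov}(X_i, X_j) = -\frac{r}{m} \left(1 - \frac{r}{m}\right) \frac{1}{m - 1}$; (b) $\operatorname{cor}(X_i, X_j) = -\frac{1}{m - 1}$.
Note that the event of a type 1 object on draw $i$ and the event of a type 1 object on draw $j$ are negatively correlated, but the correlation depends only on the population size and not on the number of type 1 objects. Note also that the correlation is perfect if $m = 2$. Think about these results intuitively.
Show that (a) $\mathbb{E}(Y) = n \frac{r}{m}$; (b) $\operatorname{var}(Y) = n \frac{r}{m} \left(1 - \frac{r}{m}\right) \frac{m - n}{m - 1}$.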
In the ball and urn experiment, select sampling without replacement. Vary $m$, $r$, and $n$ and note the shape of the density function and the size and location of the mean-standard deviation bar. For selected values of the parameters, run the experiment 1000 times, updating every 10 runs, and note the apparent convergence of the sample mean and standard deviation to the distribution mean and standard deviation.
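As a complement to the simulation, the hypergeometric mean and variance can be computed exactly from the probability density function. The sketch below assumes hypothetical parameter values (a population of 10 objects, 4 of type 1, and a sample of size 3); the function name is also hypothetical. Note the finite population factor in the variance, which comes from the negative covariance between draws.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pdf(m, r, n):
    """Density of the number of type 1 objects in a sample of size n,
    drawn without replacement from m objects of which r are type 1."""
    total = comb(m, n)
    return {k: Fraction(comb(r, k) * comb(m - r, n - k), total)
            for k in range(n + 1)}

m, r, n = 10, 4, 3   # hypothetical parameter values
pdf = hypergeometric_pdf(m, r, n)

mean_y = sum(k * q for k, q in pdf.items())
var_y = sum((k - mean_y) ** 2 * q for k, q in pdf.items())

ratio = Fraction(r, m)   # proportion of type 1 objects in the population
var_with_replacement = n * ratio * (1 - ratio)   # binomial variance, for comparison
```

The variance without replacement is smaller than the corresponding binomial variance by the factor given by the number of unsampled objects divided by the population size minus one.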
Suppose that and are events in an experiment with , , and . Find the covariance and correlation between and .
Suppose again that has probability density function .
Suppose again that has probability density function .
Covariance is closely related to the concept of inner product in the theory of vector spaces. This connection can help illustrate many of the properties of covariance from a different point of view.
In this section, our vector space $V_2$ consists of all real-valued random variables defined on a fixed probability space (that is, relative to the same random experiment) that have finite second moment. Recall that two random variables are equivalent if they are equal with probability 1. As usual, we consider two such random variables as the same vector, so that technically, our vector space consists of equivalence classes under this equivalence relation. The addition operator corresponds to the usual addition of two real-valued random variables, and the operation of scalar multiplication corresponds to the usual multiplication of a real-valued random variable by a real (non-random) number.
If $X$ and $Y$ are random variables in $V_2$, we define the inner product of $X$ and $Y$ by
$$\langle X, Y \rangle = \mathbb{E}(X Y)$$
The following exercise gives results that are analogs of the basic properties of covariance given above, and shows that this definition really does give an inner product on the vector space $V_2$.
Show that (a) $\langle X, Y \rangle = \langle Y, X \rangle$; (b) $\langle X, X \rangle \ge 0$; (c) $\langle X, X \rangle = 0$ if and only if $\mathbb{P}(X = 0) = 1$; (d) $\langle a X, Y \rangle = a \langle X, Y \rangle$ for a constant $a$; (e) $\langle X + Y, Z \rangle = \langle X, Z \rangle + \langle Y, Z \rangle$.
Covariance and correlation can easily be expressed in terms of this inner product. The covariance of two random variables is the inner product of the corresponding centered variables. The correlation is the inner product of the corresponding standard scores.
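Show that (a) $\operatorname{cov}(X, Y) = \left\langle X - \mathbb{E}(X), Y - \mathbb{E}(Y) \right\rangle$; (b) $\operatorname{cor}(X, Y) = \left\langle \frac{X - \mathbb{E}(X)}{\operatorname{sd}(X)}, \frac{Y - \mathbb{E}(Y)}{\operatorname{sd}(Y)} \right\rangle$.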
The norm associated with the inner product is the 2-norm studied in the last section, and corresponds to the root mean square operation on a random variable. This fact is a fundamental reason why the 2-norm plays such a special, honored role; of all the $k$-norms, only the 2-norm corresponds to an inner product. In turn, this is one of the reasons that root mean square difference is of fundamental importance in probability and statistics.
Show that $\langle X, X \rangle = \|X\|_2^2 = \mathbb{E}(X^2)$.
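Let $X$ and $Y$ be random variables in $V_2$.
Show that the following set is a subspace of $V_2$. In fact, it is the subspace generated by $X$ and 1.
$$\mathscr{W} = \{a X + b: a \in \mathbb{R}, \; b \in \mathbb{R}\}$$
Show that the best linear predictor of $Y$ given $X$ can be characterized as the projection of $Y$ onto the subspace $\mathscr{W}$. That is, show that $L(Y \mid X)$ is the only random variable $W \in \mathscr{W}$ with the property that $Y - W$ is perpendicular to $\mathscr{W}$. Specifically, find $W \in \mathscr{W}$ that satisfies the following two conditions: (a) $\langle Y - W, X \rangle = 0$; (b) $\langle Y - W, 1 \rangle = 0$.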
The next exercise gives Hölder's inequality, named for Otto Hölder.
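Suppose that $j \gt 1$, $k \gt 1$, and $\frac{1}{j} + \frac{1}{k} = 1$. Show that $\left\langle |X|, |Y| \right\rangle \le \|X\|_j \|Y\|_k$ using the steps below: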
In the context of the last exercise, $j$ and $k$ are called conjugate exponents. If we let $j = k = 2$ in Hölder's inequality, then we get the Cauchy-Schwarz inequality, named for Augustin Cauchy and Karl Schwarz. In turn, this is equivalent to the inequalities in Exercise 21.
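A quick numerical check of Hölder's inequality, with Cauchy-Schwarz as the special case where both exponents equal 2, may help build intuition. The sketch below assumes a hypothetical finite sample space of 50 equally likely outcomes with arbitrary values for the two variables; it is a floating point check on one example, not a proof.

```python
import random

random.seed(0)

# Hypothetical finite sample space with equally likely outcomes (an assumption).
n_outcomes = 50
xs = [random.uniform(-3, 3) for _ in range(n_outcomes)]
ys = [random.uniform(-3, 3) for _ in range(n_outcomes)]

def mean(values):
    return sum(values) / len(values)

def norm(values, k):
    """k-norm of a random variable: the k-th root of the mean of |X|^k."""
    return mean([abs(v) ** k for v in values]) ** (1.0 / k)

def holder_gap(j):
    """Right side minus left side of Hölder's inequality for an exponent j > 1."""
    k = j / (j - 1)   # conjugate exponent, so that 1/j + 1/k = 1
    lhs = mean([abs(x * y) for x, y in zip(xs, ys)])
    return norm(xs, j) * norm(ys, k) - lhs

# The gap should be nonnegative for every pair of conjugate exponents tried.
gaps = {j: holder_gap(j) for j in (1.5, 2.0, 3.0, 10.0)}
```

The entry for exponent 2 is the Cauchy-Schwarz case: the mean of the absolute product is bounded by the product of the root mean squares.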
Suppose that $j$ and $k$ are conjugate exponents.
The following exercise is an analog of the result in Exercise 10.
Prove the parallelogram rule:
$$\|X + Y\|_2^2 + \|X - Y\|_2^2 = 2 \|X\|_2^2 + 2 \|Y\|_2^2$$
The following exercise is an analog of the result in Exercise 9.
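Prove the Pythagorean theorem, named for Pythagoras of course: if $(X_1, X_2, \ldots, X_n)$ is a sequence of real-valued random variables with $\langle X_i, X_j \rangle = 0$ for $i \ne j$ then
$$\left\| \sum_{i=1}^n X_i \right\|_2^2 = \sum_{i=1}^n \|X_i\|_2^2$$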