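Suppose again that we have a basic random experiment, and that $X$ and $Y$ are real-valued random variables for the experiment. Equivalently, $(X, Y)$ is a random vector taking values in $\mathbb{R}^2$. Please recall the basic properties of the means, $\mu_X$ and $\mu_Y$, the variances, $\sigma_X^2$ and $\sigma_Y^2$, and the covariance $\operatorname{cov}(X, Y)$. In particular, recall that the correlation is
\[ \operatorname{cor}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \]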
We will also need a higher order bivariate moment. Let
\[ d_2(X, Y) = E\left[(X - \mu_X)^2 (Y - \mu_Y)^2\right] \]
Now suppose that we run the basic experiment $n$ times. This creates a compound experiment with a sequence of independent random vectors $((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n))$, each with the same distribution as $(X, Y)$. In statistical terms, this is a random sample of size $n$ from the distribution of $(X, Y)$. As usual, we will let $\boldsymbol{X} = (X_1, X_2, \ldots, X_n)$ denote the sequence of first coordinates; this is a random sample of size $n$ from the distribution of $X$. Similarly, we will let $\boldsymbol{Y} = (Y_1, Y_2, \ldots, Y_n)$ denote the sequence of second coordinates; this is a random sample of size $n$ from the distribution of $Y$.
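Recall that the sample mean and sample variances for $\boldsymbol{X}$ are defined as follows (and of course analogous definitions hold for $\boldsymbol{Y}$):
\[ M(\boldsymbol{X}) = \frac{1}{n} \sum_{i=1}^n X_i, \quad W^2(\boldsymbol{X}) = \frac{1}{n} \sum_{i=1}^n (X_i - \mu_X)^2, \quad S^2(\boldsymbol{X}) = \frac{1}{n - 1} \sum_{i=1}^n \left[X_i - M(\boldsymbol{X})\right]^2 \]
Here $W^2$ is the version of the sample variance that uses the known distribution mean $\mu_X$, and $S^2$ is the version that uses the sample mean.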
In this section, we will define and study statistics that are natural estimators of the distribution covariance and correlation. These statistics will be measures of the linear relationship of the sample points in the plane. As usual, the definitions depend on what other parameters are known and unknown.
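Suppose first that the distribution means $\mu_X$ and $\mu_Y$ are known. This is usually an unrealistic assumption, of course, but is still a good place to start because the analysis is very simple and the results we obtain will be useful below. A natural estimator of $\operatorname{cov}(X, Y)$ in this case is
\[ W(\boldsymbol{X}, \boldsymbol{Y}) = \frac{1}{n} \sum_{i=1}^n (X_i - \mu_X)(Y_i - \mu_Y) \]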
Show that $W(\boldsymbol{X}, \boldsymbol{Y})$ is the sample mean for a random sample of size $n$ from the distribution of $(X - \mu_X)(Y - \mu_Y)$.
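Use the result of Exercise 1 to show that
(a) $E[W(\boldsymbol{X}, \boldsymbol{Y})] = \operatorname{cov}(X, Y)$
(b) $\operatorname{var}[W(\boldsymbol{X}, \boldsymbol{Y})] = \frac{1}{n}\left[d_2(X, Y) - \operatorname{cov}^2(X, Y)\right]$, where $d_2(X, Y) = E\left[(X - \mu_X)^2 (Y - \mu_Y)^2\right]$
(c) $W(\boldsymbol{X}, \boldsymbol{Y}) \to \operatorname{cov}(X, Y)$ as $n \to \infty$ with probability 1
In particular, $W(\boldsymbol{X}, \boldsymbol{Y})$ is an unbiased and consistent estimator of $\operatorname{cov}(X, Y)$.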
The formula in the following exercise is sometimes better than the definition for computational purposes.
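With $\boldsymbol{X}\boldsymbol{Y}$ defined to be the sequence $(X_1 Y_1, X_2 Y_2, \ldots, X_n Y_n)$, show that
\[ W(\boldsymbol{X}, \boldsymbol{Y}) = M(\boldsymbol{X}\boldsymbol{Y}) - \mu_X M(\boldsymbol{Y}) - \mu_Y M(\boldsymbol{X}) + \mu_X \mu_Y \]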
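As a quick numerical check, the following sketch computes the known-mean covariance estimator in two ways: as the average of the products of deviations from the known means, and from the shortcut based on the mean of the products. The data, the "known" means 3 and 4, and the function names are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def known_mean_cov(xs, ys, mu_x, mu_y):
    """Average of (x_i - mu_x)(y_i - mu_y): the estimator when the means are known."""
    return mean([(x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)])

def known_mean_cov_shortcut(xs, ys, mu_x, mu_y):
    """Same statistic from M(XY) - mu_x M(Y) - mu_y M(X) + mu_x mu_y."""
    m_xy = mean([x * y for x, y in zip(xs, ys)])
    return m_xy - mu_x * mean(ys) - mu_y * mean(xs) + mu_x * mu_y

xs = [1.0, 2.0, 4.0, 7.0]   # illustrative data, not from the text
ys = [2.0, 1.0, 5.0, 8.0]
mu_x, mu_y = 3.0, 4.0       # assumed known distribution means

w_definition = known_mean_cov(xs, ys, mu_x, mu_y)
w_shortcut = known_mean_cov_shortcut(xs, ys, mu_x, mu_y)
print(w_definition, w_shortcut)
```

The shortcut needs only the three sample means, which is convenient when they have already been computed.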
The properties established in the following exercises are analogues of properties of the distribution covariance.
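Show that $W(\boldsymbol{X}, \boldsymbol{X}) = W^2(\boldsymbol{X})$, the sample variance of $\boldsymbol{X}$ computed with the known mean $\mu_X$.
Show that $W(\boldsymbol{X}, \boldsymbol{Y}) = W(\boldsymbol{Y}, \boldsymbol{X})$.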
Show that if $a$ is a constant then $W(a \boldsymbol{X}, \boldsymbol{Y}) = a \, W(\boldsymbol{X}, \boldsymbol{Y})$.
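Show that $W(\boldsymbol{X} + \boldsymbol{Y}, \boldsymbol{Z}) = W(\boldsymbol{X}, \boldsymbol{Z}) + W(\boldsymbol{Y}, \boldsymbol{Z})$, where $\boldsymbol{Z}$ is a third sample of size $n$ from the same experiment.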
The following exercise gives a formula for the sample variance of a sum. The result extends naturally to larger sums.
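Show that
\[ W^2(\boldsymbol{X} + \boldsymbol{Y}) = W^2(\boldsymbol{X}) + W^2(\boldsymbol{Y}) + 2 \, W(\boldsymbol{X}, \boldsymbol{Y}) \]
where $W^2$ denotes the sample variance computed with the known distribution mean.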
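The sum formula can be checked numerically: with known means, the sample variance of the termwise sum should equal the two sample variances plus twice the sample covariance. The sketch below uses made-up data and assumed known means; the helper names are ours.

```python
def mean(values):
    return sum(values) / len(values)

def w_cov(xs, ys, mu_x, mu_y):
    """Known-mean sample covariance: average of (x - mu_x)(y - mu_y)."""
    return mean([(x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)])

def w_var(xs, mu_x):
    """Known-mean sample variance, the special case w_cov(xs, xs)."""
    return w_cov(xs, xs, mu_x, mu_x)

xs = [1.0, 2.0, 4.0, 7.0]   # illustrative data
ys = [2.0, 1.0, 5.0, 8.0]
mu_x, mu_y = 3.0, 4.0       # assumed known distribution means

sums = [x + y for x, y in zip(xs, ys)]
lhs = w_var(sums, mu_x + mu_y)   # the sum X + Y has mean mu_x + mu_y
rhs = w_var(xs, mu_x) + w_var(ys, mu_y) + 2 * w_cov(xs, ys, mu_x, mu_y)
print(lhs, rhs)
```

Note that the known mean of the sum is the sum of the known means; using any other centering constant would break the identity.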
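Consider now the more realistic assumption that the distribution means $\mu_X$ and $\mu_Y$ are unknown. A natural approach in this case is to average $[X_i - M(\boldsymbol{X})][Y_i - M(\boldsymbol{Y})]$ over $i \in \{1, 2, \ldots, n\}$, where $M(\boldsymbol{X})$ and $M(\boldsymbol{Y})$ are the sample means. But rather than dividing by $n$ in our average, we should divide by whatever constant gives an unbiased estimator of $\operatorname{cov}(X, Y)$.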
Interpret the sign of $[X_i - M(\boldsymbol{X})][Y_i - M(\boldsymbol{Y})]$ geometrically, in terms of the scatterplot of points and its center $(M(\boldsymbol{X}), M(\boldsymbol{Y}))$.
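Use the bilinearity of the covariance operator to show that
\[ \operatorname{cov}[M(\boldsymbol{X}), M(\boldsymbol{Y})] = \frac{\operatorname{cov}(X, Y)}{n} \]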
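Expand and sum term by term to show that
\[ \sum_{i=1}^n [X_i - M(\boldsymbol{X})][Y_i - M(\boldsymbol{Y})] = \sum_{i=1}^n X_i Y_i - n \, M(\boldsymbol{X}) M(\boldsymbol{Y}) \]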
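Use the results of Exercises 10 and 11, and basic properties of expected value, to show that
\[ E\left(\sum_{i=1}^n [X_i - M(\boldsymbol{X})][Y_i - M(\boldsymbol{Y})]\right) = (n - 1) \operatorname{cov}(X, Y) \]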
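Therefore, to have an unbiased estimator of $\operatorname{cov}(X, Y)$, we should define the sample covariance to be the random variable
\[ S(\boldsymbol{X}, \boldsymbol{Y}) = \frac{1}{n - 1} \sum_{i=1}^n [X_i - M(\boldsymbol{X})][Y_i - M(\boldsymbol{Y})] \]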
As with the sample variance, when the sample size $n$ is large, it makes little difference whether we divide by $n$ or $n - 1$.
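The effect of the divisor can be seen exactly in a tiny example. The sketch below assumes a hypothetical joint distribution, uniform on three points, enumerates every possible sample of size 3, and averages the sum of products of deviations from the sample means. Exact rational arithmetic avoids any rounding questions.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution: (X, Y) is uniform on these three points.
points = [(0, 0), (1, 1), (2, 1)]
n = 3  # sample size

def cross_deviation_sum(sample):
    """Sum of (x_i - M(X)) (y_i - M(Y)) over the sample, in exact arithmetic."""
    size = len(sample)
    m_x = Fraction(sum(x for x, _ in sample), size)
    m_y = Fraction(sum(y for _, y in sample), size)
    return sum((x - m_x) * (y - m_y) for x, y in sample)

# Distribution covariance, computed directly from the equally likely points.
mu_x = Fraction(sum(x for x, _ in points), len(points))
mu_y = Fraction(sum(y for _, y in points), len(points))
true_cov = sum((x - mu_x) * (y - mu_y) for x, y in points) / len(points)

# All 27 samples are equally likely, so a plain average is the expected value.
samples = list(product(points, repeat=n))
expected_sum = sum(cross_deviation_sum(s) for s in samples) / len(samples)

expected_unbiased = expected_sum / (n - 1)  # divide by n - 1
expected_biased = expected_sum / n          # divide by n
print(true_cov, expected_unbiased, expected_biased)
```

In this example the distribution covariance is 1/3. Dividing by $n - 1$ gives expected value 1/3, while dividing by $n$ gives 2/9, which is too small by the factor $(n - 1)/n$.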
The formula in the following exercise is sometimes better than the definition for computational purposes.
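With $\boldsymbol{X}\boldsymbol{Y}$ defined as in Exercise 3, show that the sample covariance satisfies
\[ S(\boldsymbol{X}, \boldsymbol{Y}) = \frac{n}{n - 1} \left[M(\boldsymbol{X}\boldsymbol{Y}) - M(\boldsymbol{X}) M(\boldsymbol{Y})\right] \]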
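Use the result of the previous exercise and the strong law of large numbers to show that the sample covariance $S(\boldsymbol{X}, \boldsymbol{Y}) \to \operatorname{cov}(X, Y)$ as $n \to \infty$ with probability 1.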
The properties established in the following exercises are analogues of properties of the distribution covariance.
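Show that $S(\boldsymbol{X}, \boldsymbol{X}) = S^2(\boldsymbol{X})$, the sample variance of $\boldsymbol{X}$.
Show that $S(\boldsymbol{X}, \boldsymbol{Y}) = S(\boldsymbol{Y}, \boldsymbol{X})$.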
Show that if $a$ is a constant then $S(a \boldsymbol{X}, \boldsymbol{Y}) = a \, S(\boldsymbol{X}, \boldsymbol{Y})$.
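Show that $S(\boldsymbol{X} + \boldsymbol{Y}, \boldsymbol{Z}) = S(\boldsymbol{X}, \boldsymbol{Z}) + S(\boldsymbol{Y}, \boldsymbol{Z})$, where $\boldsymbol{Z}$ is a third sample of size $n$ from the same experiment.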
Show that
The following exercise gives a formula for the sample variance of a sum. The result extends naturally to larger sums.
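Show that
\[ S^2(\boldsymbol{X} + \boldsymbol{Y}) = S^2(\boldsymbol{X}) + S^2(\boldsymbol{Y}) + 2 \, S(\boldsymbol{X}, \boldsymbol{Y}) \]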
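The covariance-like properties of the sample covariance (symmetry, scaling, additivity, and the formula for the sample variance of a sum) can be spot-checked on data. The sketch below uses three made-up samples and exact rational arithmetic, so each identity is tested with strict equality. A check on one data set is of course not a proof.

```python
from fractions import Fraction

def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1, in exact arithmetic."""
    n = len(xs)
    m_x = Fraction(sum(xs), n)
    m_y = Fraction(sum(ys), n)
    return sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_var(xs):
    """Sample variance, the special case sample_cov(xs, xs)."""
    return sample_cov(xs, xs)

xs = [1, 2, 4, 7]   # illustrative data
ys = [2, 1, 5, 8]
zs = [3, 0, 1, 5]
a = 3               # an arbitrary constant

xy_sum = [x + y for x, y in zip(xs, ys)]

symmetric = sample_cov(xs, ys) == sample_cov(ys, xs)
scaling = sample_cov([a * x for x in xs], ys) == a * sample_cov(xs, ys)
additive = sample_cov(xy_sum, zs) == sample_cov(xs, zs) + sample_cov(ys, zs)
sum_variance = sample_var(xy_sum) == (
    sample_var(xs) + sample_var(ys) + 2 * sample_cov(xs, ys)
)
print(symmetric, scaling, additive, sum_variance)
```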
In this subsection we will derive a formula for the variance of the sample covariance. The derivation was contributed by Ranjith Unnikrishnan, and is similar to the derivation of the variance of the sample variance.
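Verify the following result. Hint: Start with the expression on the right. Expand the product $(X_i - X_j)(Y_i - Y_j)$, and take the sums term by term.
\[ S(\boldsymbol{X}, \boldsymbol{Y}) = \frac{1}{2 n (n - 1)} \sum_{i=1}^n \sum_{j=1}^n (X_i - X_j)(Y_i - Y_j) \]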
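The sample covariance can be written as a double sum over pairs of sample points, namely the sum of $(X_i - X_j)(Y_i - Y_j)$ over all $i$ and $j$, divided by $2 n (n - 1)$. We assume this is the representation intended in the exercise above. The sketch below compares it with the usual definition on made-up data; the function names are ours.

```python
def sample_cov(xs, ys):
    """Usual definition: deviations from the sample means, divisor n - 1."""
    n = len(xs)
    m_x, m_y = sum(xs) / n, sum(ys) / n
    return sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys)) / (n - 1)

def pairwise_cov(xs, ys):
    """Double-sum form: products of pairwise differences, no sample means needed."""
    n = len(xs)
    total = sum(
        (xs[i] - xs[j]) * (ys[i] - ys[j]) for i in range(n) for j in range(n)
    )
    return total / (2 * n * (n - 1))

xs = [1.0, 2.0, 4.0, 7.0]   # illustrative data
ys = [2.0, 1.0, 5.0, 8.0]
print(sample_cov(xs, ys), pairwise_cov(xs, ys))
```

The double sum takes on the order of $n^2$ operations, so it is useful for derivations rather than for computation.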
It follows that $\operatorname{var}[S(\boldsymbol{X}, \boldsymbol{Y})]$, suitably normalized, is the sum of all of the pairwise covariances of the terms in the expansion of Exercise 21.
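Now, derive the formula for the variance of the sample covariance by showing that
\[ \operatorname{var}[S(\boldsymbol{X}, \boldsymbol{Y})] = \frac{1}{n} \left[ d_2(X, Y) + \frac{1}{n - 1} \sigma_X^2 \sigma_Y^2 - \frac{n - 2}{n - 1} \operatorname{cov}^2(X, Y) \right] \]
where $d_2(X, Y) = E\left[(X - \mu_X)^2 (Y - \mu_Y)^2\right]$.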
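Show that $\operatorname{var}[S(\boldsymbol{X}, \boldsymbol{Y})] > \operatorname{var}[W(\boldsymbol{X}, \boldsymbol{Y})]$, where $W(\boldsymbol{X}, \boldsymbol{Y})$ is the estimator based on known distribution means. Does this seem reasonable?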
Show that $\operatorname{var}[S(\boldsymbol{X}, \boldsymbol{Y})] \to 0$ as $n \to \infty$. Thus, the sample covariance is a consistent estimator of the distribution covariance.
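To get a feel for these variances, the sketch below evaluates the standard formulas $\frac{1}{n}\left[d_2 + \frac{1}{n-1}\sigma_X^2 \sigma_Y^2 - \frac{n-2}{n-1}\operatorname{cov}^2\right]$ for the sample covariance and $\frac{1}{n}\left[d_2 - \operatorname{cov}^2\right]$ for the known-mean estimator, where $d_2 = E[(X - \mu_X)^2 (Y - \mu_Y)^2]$. As an illustrative assumption, we take a standard bivariate normal distribution with correlation $\rho = 1/2$, for which $d_2 = 1 + 2 \rho^2$. The function names are ours.

```python
from fractions import Fraction

def var_sample_cov(n, d2, var_x, var_y, cov):
    """Variance of the sample covariance (unknown means, divisor n - 1)."""
    n = Fraction(n)
    return (d2 + var_x * var_y / (n - 1) - (n - 2) * cov**2 / (n - 1)) / n

def var_known_mean_cov(n, d2, cov):
    """Variance of the estimator that uses the known distribution means."""
    return (d2 - cov**2) / Fraction(n)

# Assumed example: standard bivariate normal with correlation 1/2.
rho = Fraction(1, 2)
d2 = 1 + 2 * rho**2   # E[X^2 Y^2] for the standard bivariate normal

for n in (10, 100, 1000):
    v_s = var_sample_cov(n, d2, 1, 1, rho)
    v_w = var_known_mean_cov(n, d2, rho)
    print(n, float(v_s), float(v_w), float(v_s - v_w))
```

In this normal example the first formula simplifies to $(1 + \rho^2)/(n - 1)$. Both variances shrink to 0 as $n$ grows, and the gap between them shrinks faster, on the order of $1/n^2$.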
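By analogy with the distribution correlation, the sample correlation is obtained by dividing the sample covariance by the product of the sample standard deviations:
\[ R(\boldsymbol{X}, \boldsymbol{Y}) = \frac{S(\boldsymbol{X}, \boldsymbol{Y})}{S(\boldsymbol{X}) \, S(\boldsymbol{Y})} \]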
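Use the strong law of large numbers to show that the sample correlation $R(\boldsymbol{X}, \boldsymbol{Y}) \to \operatorname{cor}(X, Y)$ as $n \to \infty$ with probability 1.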
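A minimal sketch of the sample correlation, on made-up data, follows. It also illustrates two facts that are easy to verify by hand: the statistic is unchanged when one variable is rescaled by a positive factor and shifted, and it equals $-1$ when the points lie exactly on a line with negative slope.

```python
import math

def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    m_x, m_y = sum(xs) / n, sum(ys) / n
    return sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_cor(xs, ys):
    """Sample covariance divided by the product of sample standard deviations."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

xs = [1.0, 2.0, 4.0, 7.0]   # illustrative data
ys = [2.0, 1.0, 5.0, 8.0]

r = sample_cor(xs, ys)
r_rescaled = sample_cor([2 * x + 1 for x in xs], ys)   # positive rescaling and shift of x
r_line = sample_cor(xs, [3 - 2 * x for x in xs])       # points exactly on a decreasing line
print(r, r_rescaled, r_line)
```

Because the same divisor appears in the numerator and the denominator, the sample correlation would be the same if $n$ were used in place of $n - 1$ throughout.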
Click in the interactive scatterplot to define 20 points and try to come as close as possible to the following conditions: sample means 0, sample standard deviations 1, sample correlation as follows: 0, 0.5, −0.5, 0.7, −0.7, 0.9, −0.9.
Click in the interactive scatterplot to define 20 points and try to come as close as possible to the following conditions: $x$ sample mean 1, $y$ sample mean 3, $x$ sample standard deviation 2, $y$ sample standard deviation 1, sample correlation as follows: 0, 0.5, −0.5, 0.7, −0.7, 0.9, −0.9.
Recall that in the section on (distribution) correlation and regression, we showed that the best linear predictor of $Y$ based on $X$, in the sense of minimizing mean square error, is the random variable
\[ L(Y \mid X) = \mu_Y + \frac{\operatorname{cov}(X, Y)}{\sigma_X^2} (X - \mu_X) \]
Moreover, the (minimum) value of the mean square error is
\[ E\left\{[Y - L(Y \mid X)]^2\right\} = \sigma_Y^2 \left[1 - \operatorname{cor}^2(X, Y)\right] \]
The distribution regression line is given by
\[ y = \mu_Y + \frac{\operatorname{cov}(X, Y)}{\sigma_X^2} (x - \mu_X) \]
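Of course, in real applications, we are unlikely to know the distribution parameters $\mu_X$, $\mu_Y$, $\sigma_X^2$, and $\operatorname{cov}(X, Y)$. Thus, in this section, we are interested in the problem of estimating the best linear predictor of $Y$ based on $X$ from our random sample $((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n))$. One natural approach is to find the line $y = a + b x$ that fits the sample points best. This is a basic and important problem in many areas of mathematics, not just statistics. The term best means that we want to find the line (that is, find $a$ and $b$) that minimizes the average of the squared errors between the actual $y$ values in our data and the predicted $y$ values:
\[ \operatorname{MSE}(a, b) = \frac{1}{n - 1} \sum_{i=1}^n \left[Y_i - (a + b X_i)\right]^2 \]
The divisor $n - 1$ matches the one in the sample variance and has no effect on which line is best.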
Finding $a$ and $b$ that minimize MSE is a standard problem in calculus.
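Show that MSE is minimized for
\[ b = \frac{S(\boldsymbol{X}, \boldsymbol{Y})}{S^2(\boldsymbol{X})}, \quad a = M(\boldsymbol{Y}) - \frac{S(\boldsymbol{X}, \boldsymbol{Y})}{S^2(\boldsymbol{X})} M(\boldsymbol{X}) \]
and thus the sample regression line is
\[ y = M(\boldsymbol{Y}) + \frac{S(\boldsymbol{X}, \boldsymbol{Y})}{S^2(\boldsymbol{X})} \left[x - M(\boldsymbol{X})\right] \]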
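The least squares line has slope equal to the sample covariance divided by the sample variance of the predictor, and intercept chosen so that the line passes through the point of sample means. The sketch below computes both on made-up data; `fit_line` is a name of our own choosing.

```python
def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    m_x, m_y = sum(xs) / n, sum(ys) / n
    return sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys)) / (n - 1)

def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line for y based on x."""
    slope = sample_cov(xs, ys) / sample_cov(xs, xs)
    intercept = sum(ys) / len(ys) - slope * sum(xs) / len(xs)
    return intercept, slope

xs = [1.0, 2.0, 4.0, 7.0]   # illustrative data
ys = [2.0, 1.0, 5.0, 8.0]

intercept, slope = fit_line(xs, ys)
m_x, m_y = sum(xs) / len(xs), sum(ys) / len(ys)
value_at_center = intercept + slope * m_x   # should reproduce the sample mean of y
print(intercept, slope, value_at_center, m_y)
```

For these data the sample covariance is 8 and the sample variance of the $x$ values is 7, so the slope is $8/7$, and the intercept is 0 up to rounding.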
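Show that the minimum mean square error, using the coefficients in the previous exercise, is
\[ S^2(\boldsymbol{Y}) \left[1 - R^2(\boldsymbol{X}, \boldsymbol{Y})\right] \]
where $R(\boldsymbol{X}, \boldsymbol{Y})$ is the sample correlation.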
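Use the result of the previous exercise to show that the sample correlation $R(\boldsymbol{X}, \boldsymbol{Y})$ satisfies the following properties:
(a) $-1 \le R(\boldsymbol{X}, \boldsymbol{Y}) \le 1$
(b) $R(\boldsymbol{X}, \boldsymbol{Y}) = -1$ if and only if the sample points lie on a line with negative slope
(c) $R(\boldsymbol{X}, \boldsymbol{Y}) = 1$ if and only if the sample points lie on a line with positive slope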
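The sketch below checks numerically that the mean square error at the least squares coefficients equals the sample variance of the $y$ values times one minus the squared sample correlation, and that nearby coefficients do worse. It assumes that the mean square error is computed with divisor $n - 1$, to match the sample variance; the data are made up.

```python
def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    m_x, m_y = sum(xs) / n, sum(ys) / n
    return sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys)) / (n - 1)

def mse(xs, ys, a, b):
    """Mean square error of the line y = a + b x (divisor n - 1, an assumption)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (len(xs) - 1)

xs = [1.0, 2.0, 4.0, 7.0]   # illustrative data
ys = [2.0, 1.0, 5.0, 8.0]

s_xy, s_xx, s_yy = sample_cov(xs, ys), sample_cov(xs, xs), sample_cov(ys, ys)
b = s_xy / s_xx                              # least squares slope
a = sum(ys) / len(ys) - b * sum(xs) / len(xs)  # least squares intercept
r_squared = s_xy**2 / (s_xx * s_yy)          # squared sample correlation

min_mse = mse(xs, ys, a, b)
formula = s_yy * (1 - r_squared)
perturbed = [mse(xs, ys, a + da, b + db) for da, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]]
print(min_mse, formula, perturbed)
```

Since the mean square error is nonnegative, the formula forces the squared sample correlation to be at most 1, with equality exactly when every residual is 0.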
Thus, the sample correlation measures the degree of linearity of the sample points. The results in the previous exercise can also be obtained by noting that the sample correlation is simply the correlation of the empirical distribution. Of course, properties (a), (b), and (c) are known for the distribution correlation.
The fact that the results in Exercise 28 and Exercise 29 are the sample analogues of the corresponding distribution results is beautiful and reassuring. Note that the sample regression line passes through $(M(\boldsymbol{X}), M(\boldsymbol{Y}))$, the center of the empirical distribution. Naturally, the coefficients of the sample regression line can be viewed as estimators of the respective coefficients in the distribution regression line.
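Assuming that the appropriate higher order moments are finite, use the law of large numbers to show that, with probability 1, the coefficients of the sample regression line converge to the coefficients of the distribution regression line as $n \to \infty$:
(a) $\dfrac{S(\boldsymbol{X}, \boldsymbol{Y})}{S^2(\boldsymbol{X})} \to \dfrac{\operatorname{cov}(X, Y)}{\sigma_X^2}$
(b) $M(\boldsymbol{Y}) - \dfrac{S(\boldsymbol{X}, \boldsymbol{Y})}{S^2(\boldsymbol{X})} M(\boldsymbol{X}) \to \mu_Y - \dfrac{\operatorname{cov}(X, Y)}{\sigma_X^2} \mu_X$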
As with the distribution regression lines, the choice of predictor and response variables is important.
Show that the sample regression line for $Y$ based on $X$ and the sample regression line for $X$ based on $Y$ are not the same line, except in the trivial case where the sample points all lie on a line.
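The following sketch illustrates the point on made-up data. Both regression lines pass through the point of sample means, so they coincide only if they have the same slope when drawn in the same coordinate system. The product of the two regression slopes is the squared sample correlation, which is why the lines agree only when the points are exactly collinear.

```python
def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    m_x, m_y = sum(xs) / n, sum(ys) / n
    return sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys)) / (n - 1)

xs = [1.0, 2.0, 4.0, 7.0]   # illustrative data
ys = [2.0, 1.0, 5.0, 8.0]

s_xy = sample_cov(xs, ys)
slope_y_on_x = s_xy / sample_cov(xs, xs)   # line y = a + b x
slope_x_on_y = s_xy / sample_cov(ys, ys)   # line x = c + d y

# Rewrite x = c + d y as a line in the (x, y) plane: its slope is 1 / d.
second_line_slope = 1 / slope_x_on_y
r_squared = slope_y_on_x * slope_x_on_y

print(slope_y_on_x, second_line_slope, r_squared)
```

Here the first line has slope $8/7$ and the second has slope $5/4$ in the $(x, y)$ plane, so the two lines cross at the center of the data and nowhere else.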
Recall that the constant $a$ that minimizes
\[ \operatorname{MSE}(a) = \frac{1}{n - 1} \sum_{i=1}^n (Y_i - a)^2 \]
is the sample mean $M(\boldsymbol{Y})$, and the minimum value of the mean square error is the sample variance $S^2(\boldsymbol{Y})$. Thus, the difference between this value of the mean square error and the one in Exercise 29, namely $S^2(\boldsymbol{Y}) R^2(\boldsymbol{X}, \boldsymbol{Y})$, is the reduction in the variability of the $Y$ data when the linear term in $X$ is added to the predictor. The fractional reduction is $R^2(\boldsymbol{X}, \boldsymbol{Y})$, and hence this statistic is called the (sample) coefficient of determination.
Click in the interactive scatterplot, in various places, and watch how the regression line changes.
Click in the interactive scatterplot to define 20 points. Try to generate a scatterplot in which the mean of the values is 0, the standard deviation of the values is 1, and in which the regression line has
Click in the interactive scatterplot to define 20 points with the following properties: the mean of the $x$ values is 1, the mean of the $y$ values is 1, and the regression line has slope 1 and intercept 2.
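If you had a difficult time with the previous exercise, it's because the conditions imposed are impossible to satisfy! The sample regression line always passes through the center of the data, but the line with slope 1 and intercept 2 has height 3, not 1, at $x = 1$.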
Run the bivariate uniform experiment 2000 times, with an update frequency of 10, in each of the following cases. Note the apparent convergence of the sample means to the distribution means, the sample standard deviations to the distribution standard deviations, the sample correlation to the distribution correlation, and the sample regression line to the distribution regression line.
Run the bivariate normal experiment 2000 times, with an update frequency of 10, in each of the following cases. Note the apparent convergence of the sample means to the distribution means, the sample standard deviations to the distribution standard deviations, the sample correlation to the distribution correlation, and the sample regression line to the distribution regression line.
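For readers without access to the applet, a rough stand-in for this experiment can be simulated directly. The sketch below assumes a standard bivariate normal distribution with correlation 0.5 (a hypothetical choice of parameters) and a fixed seed so that the run is reproducible. For this distribution the means are 0, the standard deviations are 1, and the distribution regression line is $y = 0.5 x$.

```python
import math
import random

def summarize(xs, ys):
    """Return sample means, standard deviations, correlation, and regression slope."""
    n = len(xs)
    m_x, m_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - m_x) ** 2 for x in xs) / (n - 1)
    s_yy = sum((y - m_y) ** 2 for y in ys) / (n - 1)
    s_xy = sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys)) / (n - 1)
    return {
        "mean_x": m_x, "mean_y": m_y,
        "sd_x": math.sqrt(s_xx), "sd_y": math.sqrt(s_yy),
        "cor": s_xy / math.sqrt(s_xx * s_yy),
        "slope": s_xy / s_xx,
    }

rho = 0.5                  # hypothetical distribution correlation
rng = random.Random(2024)  # fixed seed, chosen arbitrarily
xs, ys = [], []
for run in range(1, 2001):
    x, z = rng.gauss(0, 1), rng.gauss(0, 1)
    xs.append(x)
    ys.append(rho * x + math.sqrt(1 - rho**2) * z)   # gives correlation rho with x
    if run in (10, 100, 1000, 2000):
        stats = summarize(xs, ys)
        print(run, {name: round(value, 3) for name, value in stats.items()})

final = summarize(xs, ys)
```

The printed summaries should drift toward the distribution values as the number of runs grows, although any single run can wander, especially for small samples.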
Compute the correlation between petal length and petal width for the following cases in Fisher's iris data. Comment on the differences.
Compute the correlation between each pair of color count variables in the M&M data.
Consider all cases in Fisher's iris data.
Consider the Setosa cases only in Fisher's iris data.