Recall the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that we make on these objects. We select objects from the population and record the variables for these objects; these become our data. Our first discussion is from a purely descriptive point of view. That is, we do not assume that the data are generated by an underlying probability distribution.
Suppose that \(x\) and \(y\) are real-valued variables for a population, and that \(((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))\) is an observed sample of size \(n\) from \((x, y)\). We will let \(\bs{x} = (x_1, x_2, \ldots, x_n)\) denote the sample from \(x\) and \(\bs{y} = (y_1, y_2, \ldots, y_n)\) the sample from \(y\). In this section, we are interested in measures of association between the \(x\) and \(y\) data, and in finding the line (or other curve) that best fits the data.
Recall that the sample means are
\[ m(\bs{x}) = \frac{1}{n} \sum_{i=1}^n x_i, \quad m(\bs{y}) = \frac{1}{n} \sum_{i=1}^n y_i \]and the sample variances are
\[ s^2(\bs{x}) = \frac{1}{n - 1} \sum_{i=1}^n [x_i - m(\bs{x})]^2, \quad s^2(\bs{y}) = \frac{1}{n - 1} \sum_{i=1}^n [y_i - m(\bs{y})]^2 \]Often, the first step in exploratory data analysis is to draw a graph of the points; this is called a scatterplot and can give a visual sense of the statistical relationship between the variables.
In particular, we are interested in whether the cloud of points seems to show a linear trend or whether some nonlinear curve might fit the cloud of points. We are interested in the extent to which one variable \(x\) can be used to predict the other variable \(y\).
Our next goal is to define statistics that measure the association between the \(x\) and \(y\) data. The sample covariance is defined to be
\[ s(\bs{x}, \bs{y}) = \frac{1}{n - 1} \sum_{i=1}^n [x_i - m(\bs{x})][y_i - m(\bs{y})] \]Note that the sample covariance is an average of the product of the deviations of the \(x\) and \(y\) data from their means. The sample correlation is defined to be
\[ r(\bs{x}, \bs{y}) = \frac{s(\bs{x}, \bs{y})}{s(\bs{x}) \, s(\bs{y})} \]assuming that the data vectors are not constant, so that the standard deviations are positive. Correlation is a standardized version of covariance. In particular, correlation is dimensionless (has no physical units), since the covariance in the numerator and the product of the standard deviations in the denominator have the same units (the product of the units of \(x\) and \(y\)). Note also that covariance and correlation have the same sign: positive, negative, or zero. In the first case, the data \(\bs{x}\) and \(\bs{y}\) are said to be positively correlated; in the second case \(\bs{x}\) and \(\bs{y}\) are said to be negatively correlated; and in the third case \(\bs{x}\) and \(\bs{y}\) are said to be uncorrelated.
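The definitions of sample covariance and sample correlation can be sketched directly in code. The following is a minimal illustration in plain Python; the function names `mean`, `cov`, and `cor` and the small data set are assumptions made for this example and are not part of the text.

```python
import math

def mean(v):
    """Sample mean m(v)."""
    return sum(v) / len(v)

def cov(x, y):
    """Sample covariance s(x, y), averaging with the n - 1 divisor."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def cor(x, y):
    """Sample correlation r(x, y); the data vectors must not be constant."""
    # cov(v, v) is the sample variance of v, so the square root of the
    # product is s(x) * s(y).
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

print(cov(x, y))  # 2.0
print(cor(x, y))  # 0.8
```

Here the covariance is positive, so the two hypothetical data vectors are positively correlated.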
To see that the sample covariance is a measure of association, recall first that the point \((m(\bs{x}), m(\bs{y}))\) is a measure of the center of the bivariate data. Indeed, if each point is the location of a unit mass, then \((m(\bs{x}), m(\bs{y}))\) is the center of mass as defined in physics. Horizontal and vertical lines through this center point divide the plane into four quadrants. The product deviation \([x_i - m(\bs{x})][y_i - m(\bs{y})]\) is positive in the first and third quadrants and negative in the second and fourth quadrants. After we study linear regression below, we will have a much deeper sense of what covariance measures.
You may be perplexed that we average the product deviations by dividing by \(n - 1\) rather than \(n\). The best explanation is that in the bivariate probability model, the sample covariance is an unbiased estimator of the distribution covariance. However, the mode of averaging can also be understood in terms of degrees of freedom, as was done for sample variance. Initially, we have \(2 \, n\) degrees of freedom in the bivariate data. We lose two by computing the sample means \(m(\bs{x})\) and \(m(\bs{y})\). Of the remaining \(2 \, n - 2\) degrees of freedom, we lose \(n - 1\) by computing the product deviations. Thus, we are left with \(n - 1\) degrees of freedom total. As is typical in statistics, we average not by dividing by the number of terms in the sum but rather by the number of degrees of freedom in those terms. However, from a purely descriptive point of view, it would also be reasonable to divide by \(n\).
Recall that there is a natural probability distribution associated with the data, namely the empirical distribution that gives probability \(\frac{1}{n}\) to each data point \((x_i, y_i)\). (Thus, if these points are distinct this is the uniform distribution on the data.) The sample means are simply the expected values of this bivariate distribution, and except for a constant multiple (dividing by \(n - 1\) rather than \(n\)), the sample variances are simply the variances of this bivariate distribution. Similarly, except for a constant multiple (again dividing by \(n - 1\) rather than \(n\)), the sample covariance is the covariance of the bivariate distribution and the sample correlation is the correlation of the bivariate distribution. All of the following results in our discussion of descriptive statistics are actually special cases of more general results for probability distributions.
The next few exercises establish some essential properties of sample covariance. As usual, bold symbols denote samples of a fixed size \(n\) from the corresponding population variables (that is, vectors of length \(n\)), while symbols in regular type denote real numbers. Our first result is a formula for sample covariance that is sometimes better than the definition for computational purposes. To state the result succinctly, let \(\bs{x} \, \bs{y} = (x_1 \, y_1, x_2 \, y_2, \ldots, x_n \, y_n)\) denote the sample from the product variable \(x \, y\).
The sample covariance can be computed as follows:
\[ s(\bs{x}, \bs{y}) = \frac{1}{n - 1} \sum_{i=1}^n x_i \, y_i - \frac{n}{n - 1} m(\bs{x}) \, m(\bs{y}) = \frac{n}{n - 1} [m(\bs{x} \, \bs{y}) - m(\bs{x}) \, m(\bs{y})] \]As the name suggests, sample covariance generalizes sample variance.
\(s(\bs{x}, \bs{x}) = s^2(\bs{x})\).
Sample covariance is symmetric.
\(s(\bs{x}, \bs{y}) = s(\bs{y}, \bs{x})\).
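The computational formula, the identity \(s(\bs{x}, \bs{x}) = s^2(\bs{x})\), and the symmetry property can be checked numerically. This is a small sketch in plain Python; the helper names and the data are hypothetical choices for the illustration.

```python
def mean(v):
    return sum(v) / len(v)

def var(v):
    """Sample variance s^2(v), from the definition."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def cov(x, y):
    """Sample covariance s(x, y), from the definition."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def cov_shortcut(x, y):
    """Computational formula: n / (n - 1) * [m(x y) - m(x) m(y)]."""
    n = len(x)
    xy = [a * b for a, b in zip(x, y)]  # sample from the product variable
    return n / (n - 1) * (mean(xy) - mean(x) * mean(y))

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

shortcut_agrees = abs(cov(x, y) - cov_shortcut(x, y)) < 1e-12
generalizes_variance = abs(cov(x, x) - var(x)) < 1e-12
symmetric = cov(x, y) == cov(y, x)
print(shortcut_agrees, generalizes_variance, symmetric)
```

In floating point arithmetic the shortcut formula subtracts two quantities that can be nearly equal when the means are large relative to the spread, so the definition is often the safer choice numerically.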
Sample covariance is linear in the first argument with the second argument fixed.
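If \(\bs{x}\), \(\bs{y}\), and \(\bs{z}\) are data vectors from population variables \(x\), \(y\), and \(z\), respectively, and if \(c\) is a constant then \(s(\bs{x} + \bs{y}, \bs{z}) = s(\bs{x}, \bs{z}) + s(\bs{y}, \bs{z})\) and \(s(c \, \bs{x}, \bs{y}) = c \, s(\bs{x}, \bs{y})\).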
By symmetry, sample covariance is also linear in the second argument with the first argument fixed, and hence is bi-linear. The general version of the bi-linear property is given in the following exercise:
Suppose that \(\bs{x}_i\) is a data vector from a population variable \(x_i\) for \(i \in \{1, 2, \ldots, k\}\) and that \(\bs{y}_j\) is a data vector from a population variable \(y_j\) for \(j \in \{1, 2, \ldots, l\}\). Suppose also that \(a_1, \; a_2, \ldots, \; a_k\) and \(b_1, \; b_2, \ldots, b_l\) are constants. Then
\[ s \left( \sum_{i=1}^k a_i \, \bs{x}_i, \sum_{j = 1}^l b_j \, \bs{y}_j \right) = \sum_{i=1}^k \sum_{j=1}^l a_i \, b_j \, s(\bs{x}_i, \bs{y}_j) \]A special case of the bi-linear property provides a nice way to compute the sample variance of a sum.
\(s^2(\bs{x} + \bs{y}) = s^2(\bs{x}) + 2 \, s(\bs{x}, \bs{y}) + s^2(\bs{y})\).
The generalization of this result to sums of three or more vectors is completely straightforward. Note that the sample variance of a sum can be greater than, less than, or equal to the sum of the sample variances, depending on the sign and magnitude of the pure covariance term. In particular, if the vectors are uncorrelated, then the variance of the sum is the sum of the variances.
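As a sketch of how the variance of a sum follows from the bi-linear and symmetric properties, expand the covariance of \(\bs{x} + \bs{y}\) with itself:
\[ \begin{align} s^2(\bs{x} + \bs{y}) & = s(\bs{x} + \bs{y}, \bs{x} + \bs{y}) \\ & = s(\bs{x}, \bs{x}) + s(\bs{x}, \bs{y}) + s(\bs{y}, \bs{x}) + s(\bs{y}, \bs{y}) \\ & = s^2(\bs{x}) + 2 \, s(\bs{x}, \bs{y}) + s^2(\bs{y}) \end{align} \]
The same expansion with a third data vector \(\bs{z}\) gives
\[ s^2(\bs{x} + \bs{y} + \bs{z}) = s^2(\bs{x}) + s^2(\bs{y}) + s^2(\bs{z}) + 2 \, s(\bs{x}, \bs{y}) + 2 \, s(\bs{x}, \bs{z}) + 2 \, s(\bs{y}, \bs{z}) \]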
If \(\bs{c}\) is a constant data set then \(s(\bs{x}, \bs{c}) = 0\).
Combining the result in the last exercise with the bi-linear property, we see that covariance is unchanged if constants are added to the data sets. That is, if \(\bs{c}\) and \(\bs{d}\) are constant vectors then \(s(\bs{x} + \bs{c}, \bs{y} + \bs{d}) = s(\bs{x}, \bs{y})\).
A few simple properties of correlation are given in the following exercises. Most of these follow easily from the corresponding properties of covariance. First, recall that the standard scores of \(x_i\) and \(y_i\) are, respectively,
\[ u_i = \frac{x_i - m(\bs{x})}{s(\bs{x})}, \quad v_i = \frac{y_i - m(\bs{y})}{s(\bs{y})} \]The standard scores from a data set are dimensionless quantities that have mean 0 and variance 1.
The correlation between \(\bs{x}\) and \(\bs{y}\) is the covariance of the standard scores associated with \(\bs{x}\) and \(\bs{y}\). That is, in the notation above, \(r(\bs{x}, \bs{y}) = s(\bs{u}, \bs{v})\).
Correlation is symmetric.
\(r(\bs{x}, \bs{y}) = r(\bs{y}, \bs{x})\).
Unlike covariance, correlation is unaffected by multiplying one of the data sets by a positive constant (recall that this can always be thought of as a change of scale in the underlying variable). On the other hand, multiplying a data set by a negative constant changes the sign of the correlation.
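If \(c\) is a nonzero constant then \(r(c \, \bs{x}, \bs{y}) = r(\bs{x}, \bs{y})\) if \(c \gt 0\) and \(r(c \, \bs{x}, \bs{y}) = -r(\bs{x}, \bs{y})\) if \(c \lt 0\).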
Like covariance, correlation is unaffected by adding constants to the data sets.
If \(\bs{c}\) and \(\bs{d}\) are constant vectors then \(r(\bs{x} + \bs{c}, \bs{y} + \bs{d}) = r(\bs{x}, \bs{y})\).
The last couple of properties reinforce the fact that correlation is a standardized measure of association that is not affected by changing the units of measurement. In the first Challenger data set, for example, the variables of interest are temperature at time of launch (in degrees Fahrenheit) and O-ring erosion (in millimeters). The correlation between these variables is of critical importance. If we were to measure temperature in degrees Celsius and O-ring erosion in inches, the correlation between the two variables would be unchanged.
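The invariance of correlation under a change of units can be illustrated with a short computation. The numbers below are hypothetical and are not the actual Challenger data; the helper names are also assumptions for this sketch.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def cor(x, y):
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

# Hypothetical measurements (NOT the Challenger data set)
temp_f = [53.0, 57.0, 63.0, 70.0, 75.0]   # degrees Fahrenheit
erosion_mm = [3.0, 1.5, 1.0, 0.5, 0.0]    # millimeters

# Change of units: a positive scale factor plus a shift for temperature,
# a positive scale factor for erosion.
temp_c = [(t - 32.0) * 5.0 / 9.0 for t in temp_f]   # degrees Celsius
erosion_in = [e / 25.4 for e in erosion_mm]          # inches

r_original = cor(temp_f, erosion_mm)
r_converted = cor(temp_c, erosion_in)

# Multiplying one data set by a negative constant flips the sign.
r_flipped = cor([-t for t in temp_f], erosion_mm)

print(r_original, r_converted, r_flipped)
```

The first two values agree up to rounding error, and the third has the same magnitude with the opposite sign.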
The most important properties of correlation arise from studying the line that best fits the data, our next topic.
We are interested in finding the line \(y = a + b \, x\) that best fits the sample points \(((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))\). This is a basic and important problem in many areas of mathematics, not just statistics. We think of \(x\) as the predictor variable and \(y\) as the response variable. Thus, the term best means that we want to find the line (that is, find the coefficients \(a\) and \(b\)) that minimizes the average of the squared errors between the actual \(y\) values in our data and the predicted \(y\) values:
\[ \mse(a, b) = \frac{1}{n - 1} \sum_{i=1}^n [y_i - (a + b \, x_i)]^2 \]Note that the minimizing value of \((a, b)\) would be the same if the function were simply the sum of the squared errors, or if we averaged by dividing by \(n\) rather than \(n - 1\), or if we used the square root of any of these functions. Of course the actual minimum value of the function would be different if we changed the function, but again, not the point \((a, b)\) where the minimum occurs. Our particular choice of \(\mse\) as the error function is best for statistical purposes. Finding \((a, b)\) that minimize \(\mse\) is a standard problem in calculus.
The graph of \(\mse\) is a paraboloid opening upward. The function \(\mse\) is minimized when
\[ \begin{align} b(\bs{x}, \bs{y}) & = \frac{s(\bs{x}, \bs{y})}{s^2(\bs{x})} \\ a(\bs{x}, \bs{y}) & = m(\bs{y}) - b(\bs{x}, \bs{y}) \, m(\bs{x}) = m(\bs{y}) - \frac{s(\bs{x}, \bs{y})}{s^2(\bs{x})} \, m(\bs{x}) \end{align} \]Thus the sample regression line is
\[ y = m(\bs{y}) + \frac{s(\bs{x}, \bs{y})}{s^2(\bs{x})} [x - m(\bs{x})] \]
Note that the regression line passes through the point \((m(\bs{x}), m(\bs{y}))\), the center of the sample of points.
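The slope and intercept formulas translate directly into code. The following is a minimal sketch in plain Python; the function name `regression_line` and the data are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def regression_line(x, y):
    """Return (a, b) for the least squares line y = a + b x."""
    b = cov(x, y) / cov(x, x)        # slope: s(x, y) / s^2(x)
    a = mean(y) - b * mean(x)        # intercept: m(y) - b m(x)
    return a, b

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

a, b = regression_line(x, y)

# The fitted line evaluated at m(x) should return m(y).
value_at_center = a + b * mean(x)
print(a, b, value_at_center)
```

For this data the slope is \(0.8\) and the intercept is \(0.6\), up to rounding error, and the line passes through the center \((3, 3)\).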
The minimum mean square error is
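\[ \mse[a(\bs{x},\bs{y}), b(\bs{x}, \bs{y})] = s^2(\bs{y}) [1 - r^2(\bs{x}, \bs{y})] \]Sample correlation and covariance satisfy the following properties, which follow from the fact that the mean square error is nonnegative and is 0 if and only if the sample points lie on the regression line.
\(-1 \le r(\bs{x}, \bs{y}) \le 1\)
\(-s(\bs{x}) \, s(\bs{y}) \le s(\bs{x}, \bs{y}) \le s(\bs{x}) \, s(\bs{y})\)
\(r(\bs{x}, \bs{y}) = -1\) if and only if the sample points lie on a line with negative slope.
\(r(\bs{x}, \bs{y}) = 1\) if and only if the sample points lie on a line with positive slope.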
Thus, we now see in a deep way that the sample covariance and correlation measure the degree of linearity of the sample points. Recall from our discussion of measures of center and spread that the constant \(a\) that minimizes
\[ \mse(a) = \frac{1}{n - 1} \sum_{i=1}^n (y_i - a)^2 \]is the sample mean \(m(\bs{y})\), and the minimum value of the mean square error is the sample variance \(s^2(\bs{y})\). Thus, the difference between this value of the mean square error and the one in Exercise 9, namely \(s^2(\bs{y}) \, r^2(\bs{x}, \bs{y})\), is the reduction in the variability of the \(y\) data when the linear term in \(x\) is added to the predictor. The fractional reduction is \(r^2(\bs{x}, \bs{y})\), and hence this statistic is called the (sample) coefficient of determination. Note that if the data vectors \(\bs{x}\) and \(\bs{y}\) are uncorrelated, then \(x\) has no value as a predictor of \(y\); the regression line in this case is the horizontal line \(y = m(\bs{y})\) and the mean square error is \(s^2(\bs{y})\).
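The formula for the minimum mean square error and the interpretation of \(r^2\) as a fractional reduction can be checked numerically. This sketch uses plain Python with hypothetical data and helper names.

```python
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def mse(a, b, x, y):
    """Mean square error of the line y = a + b x, with the n - 1 divisor."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (len(x) - 1)

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

b = cov(x, y) / cov(x, x)
a = mean(y) - b * mean(x)
r_squared = cov(x, y) ** 2 / (cov(x, x) * cov(y, y))

min_mse = mse(a, b, x, y)                    # error of the regression line
formula = cov(y, y) * (1.0 - r_squared)      # s^2(y) [1 - r^2(x, y)]
baseline = mse(mean(y), 0.0, x, y)           # constant predictor m(y)
fractional_reduction = (baseline - min_mse) / baseline

print(min_mse, formula, fractional_reduction, r_squared)
```

The first two printed values agree, and the fractional reduction agrees with the coefficient of determination, up to rounding error.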
The choice of predictor and response variables is important.
The sample regression line with predictor variable \(x\) and response variable \(y\) is not the same as the sample regression line with predictor variable \(y\) and response variable \(x\), except in the trivial case where the sample points all lie on a line.
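A short sketch of why the two lines differ, obtained by swapping the roles of \(x\) and \(y\) in the formula for the regression line: the line with predictor \(x\) and the line with predictor \(y\) are, respectively,
\[ \begin{align} y - m(\bs{y}) & = \frac{s(\bs{x}, \bs{y})}{s^2(\bs{x})} [x - m(\bs{x})] \\ x - m(\bs{x}) & = \frac{s(\bs{x}, \bs{y})}{s^2(\bs{y})} [y - m(\bs{y})] \end{align} \]
Both lines pass through the center \((m(\bs{x}), m(\bs{y}))\). If \(s(\bs{x}, \bs{y}) \ne 0\), the second line has slope \(s^2(\bs{y}) \big/ s(\bs{x}, \bs{y})\) in the \((x, y)\) plane, so the two slopes agree if and only if \(s^2(\bs{x}, \bs{y}) = s^2(\bs{x}) \, s^2(\bs{y})\), that is, if and only if \(r^2(\bs{x}, \bs{y}) = 1\). If \(s(\bs{x}, \bs{y}) = 0\), the first line is horizontal and the second is vertical.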
The difference between the actual \(y\) value of a data point and the value predicted by the regression line is called the residual of that data point. Thus, the residual corresponding to \((x_i, y_i)\) is
\[ d_i = y_i - \left( m(\bs{y}) + \frac{s(\bs{x}, \bs{y})}{s^2(\bs{x})} [x_i - m(\bs{x})] \right) \]Various plots of the residuals can help one understand the relationship between the \(x\) and \(y\) data. Some of the more common are
Linear regression is a much more powerful idea than might first appear. By applying various transformations to \(y\) or \(x\) or both, we can fit a variety of two-parameter curves to the given data \(((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))\). In this section, we will consider some of the most common transformations.
Consider the function \(y = a + b \, x^2\).
Consider the function \(y = \frac{1}{a + b \, x}\).
Consider the function \(y = \frac{x}{a + b \, x}\).
Consider the function \(y = a \, e^{b \, x}\).
Consider the function \(y = a \, x^b\).
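As one illustration of the transformation idea, the exponential curve \(y = a \, e^{b \, x}\) becomes linear after taking logarithms: \(\ln y = \ln a + b \, x\). The sketch below fits the line to \((x_i, \ln y_i)\) and then transforms the intercept back. It assumes all \(y_i \gt 0\); the data are hypothetical and generated exactly from the curve with \(a = 2\) and \(b = 0.5\), and the helper names are assumptions for the example.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((p - mx) * (q - my) for p, q in zip(x, y)) / (len(x) - 1)

def regression_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    slope = cov(x, y) / cov(x, x)
    return mean(y) - slope * mean(x), slope

# Hypothetical data lying exactly on y = 2 exp(0.5 x)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * math.exp(0.5 * t) for t in x]

# Linearize: regress ln(y) on x, then undo the log on the intercept.
log_y = [math.log(v) for v in y]
intercept, b = regression_line(x, log_y)
a = math.exp(intercept)

print(a, b)
```

Note that this minimizes the squared error in \(\ln y\) rather than in \(y\) itself, so with noisy data the fitted curve is generally not the least squares fit on the original scale.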
Suppose now that we have a basic random experiment, and that \(X\) and \(Y\) are real-valued random variables for the experiment. Equivalently, \((X, Y)\) is a random vector taking values in \(\R^2\). Let \(\mu = \E(X)\) and \(\nu = \E(Y)\) denote the distribution means, \(\sigma^2 = \var(X)\) and \(\tau^2 = \var(Y)\) the distribution variances, and let \(\delta = \cov(X, Y)\) denote the distribution covariance, so that the distribution correlation is
\[ \rho = \cor(X, Y) = \frac{\cov(X, Y)}{\sd(X) \, \sd(Y)} = \frac{\delta}{\sigma \, \tau} \]Eventually we will also need a higher order bivariate moment. Let \(\delta_2 = \E[(X - \mu)^2 (Y - \nu)^2]\).
Now suppose that we run the basic experiment \(n\) times. This creates a compound experiment with a sequence of independent random vectors \(((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n))\) each with the same distribution as \((X, Y)\). In statistical terms, this is a random sample of size \(n\) from the distribution of \((X, Y)\). The statistics discussed above, in the section on descriptive statistics, are well defined but now they are all random variables. We use the notation established above, except that we use our usual convention of denoting random variables with capital letters. Of course, the deterministic properties and relations established in the section on descriptive statistics still hold. Note that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the distribution of \(X\) and \(\bs{Y} = (Y_1, Y_2, \ldots, Y_n)\) is a random sample of size \(n\) from the distribution of \(Y\).
In this section, we will define and study statistics that are natural estimators of the distribution covariance and correlation. As usual, the definitions depend on what other parameters are known and unknown.
Suppose first that the distribution means \(\mu\) and \(\nu\) are known. This is usually an unrealistic assumption, of course, but is still a good place to start because the analysis is very simple and the results we obtain will be useful below. A natural estimator of the distribution covariance \(\delta = \cov(X, Y)\) in this case is
\[ W(\bs{X}, \bs{Y}) = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)(Y_i - \nu) \]\(W(\bs{X}, \bs{Y})\) is the sample mean for a random sample of size \(n\) from the distribution of \((X - \mu)(Y - \nu)\) and satisfies the following properties:
As an estimator of \(\delta\), part (a) means that \(W(\bs{X}, \bs{Y})\) is unbiased and part (b) means that \(W(\bs{X}, \bs{Y})\) is consistent.
Consider now the more realistic assumption that the distribution means \(\mu\) and \(\nu\) are unknown. A natural approach in this case is to average \([X_i - M(\bs{X})][Y_i - M(\bs{Y})]\) over \(i \in \{1, 2, \ldots, n\}\). But rather than dividing by \(n\) in our average, we should divide by whatever constant gives an unbiased estimator of \(\delta\). This constant turns out to be \(n - 1\), leading to the standard sample covariance:
\[ S(\bs{X}, \bs{Y}) = \frac{1}{n - 1} \sum_{i=1}^n [X_i - M(\bs{X})][Y_i - M(\bs{Y})] \]\(\cov[M(\bs{X}), M(\bs{Y})] = \delta / n\).
Use the bilinearity of the covariance operator.
\(\E[S(\bs{X}, \bs{Y})] = \delta\).
Use the previous exercise and Exercise 1.
\(S(\bs{X}, \bs{Y}) \to \delta\) as \(n \to \infty\) with probability 1.
Use Exercise 1 and the strong law of large numbers.
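The unbiased property \(\E[S(\bs{X}, \bs{Y})] = \delta\) can be illustrated by simulation. The sketch below assumes a hypothetical model, chosen only for this example, in which \(X = Z_1\) and \(Y = Z_1 + Z_2\) with \(Z_1\), \(Z_2\) independent standard normal variables, so that \(\delta = 1\). The seed, the sample size, and the number of runs are arbitrary choices.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    """Sample covariance with the n - 1 divisor."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

random.seed(12345)  # fixed seed so that the run is reproducible

def sample_pair():
    """One draw of (X, Y) with X = Z1 and Y = Z1 + Z2, so cov(X, Y) = 1."""
    z1 = random.gauss(0.0, 1.0)
    z2 = random.gauss(0.0, 1.0)
    return z1, z1 + z2

n, runs = 5, 20000
values = []
for _ in range(runs):
    pairs = [sample_pair() for _ in range(n)]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    values.append(cov(xs, ys))

estimate = mean(values)              # average of S over the runs
biased = estimate * (n - 1) / n      # what dividing by n would give instead

print(estimate, biased)
```

The average of the simulated sample covariances should be close to 1, while the version that divides by \(n\) should be close to \((n - 1) / n = 0.8\).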
Of course, the sample correlation is
\[ R(\bs{X}, \bs{Y}) = \frac{S(\bs{X}, \bs{Y})}{S(\bs{X}) \, S(\bs{Y})} \]Since the sample correlation \(R(\bs{X}, \bs{Y})\) is a nonlinear function of the sample covariance and sample standard deviations, it will not in general be an unbiased estimator of the distribution correlation \(\rho\). In most cases, it would be difficult to even compute the mean and variance of \(R(\bs{X}, \bs{Y})\). Nonetheless, we can show convergence of the sample correlation to the distribution correlation.
\(R(\bs{X}, \bs{Y}) \to \rho\) as \(n \to \infty\) with probability 1.
Use the strong law of large numbers.
In this subsection we will derive a formula for the variance of the sample covariance. The derivation was contributed by Ranjith Unnikrishnan, and is similar to the derivation of the variance of the sample variance.
The sample covariance can be written in terms of the sample variables as follows:
\[ S(\bs{X}, \bs{Y}) = \frac{1}{2 \, n \, (n - 1)} \sum_{i=1}^n \sum_{j=1}^n (X_i - X_j)(Y_i - Y_j) \]Start with the expression on the right. Expand the product \((X_i - X_j)(Y_i - Y_j)\), and take the sums term by term.
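The expansion suggested in the hint can be sketched as follows. Expanding the product and summing term by term,
\[ \begin{align} \sum_{i=1}^n \sum_{j=1}^n (X_i - X_j)(Y_i - Y_j) & = \sum_{i=1}^n \sum_{j=1}^n (X_i \, Y_i - X_i \, Y_j - X_j \, Y_i + X_j \, Y_j) \\ & = 2 \, n \sum_{i=1}^n X_i \, Y_i - 2 \, n^2 M(\bs{X}) \, M(\bs{Y}) \end{align} \]
since \(\sum_{i=1}^n \sum_{j=1}^n X_i \, Y_j = \left(\sum_{i=1}^n X_i\right) \left(\sum_{j=1}^n Y_j\right) = n^2 M(\bs{X}) \, M(\bs{Y})\). Dividing by \(2 \, n \, (n - 1)\) gives
\[ \frac{1}{n - 1} \sum_{i=1}^n X_i \, Y_i - \frac{n}{n - 1} M(\bs{X}) \, M(\bs{Y}) \]
which is the computational formula for the sample covariance.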
The variance of the sample covariance is
\[ \var[S(\bs{X}, \bs{Y})] = \frac{1}{n} \left( \delta_2 + \frac{1}{n - 1} \sigma^2 \, \tau^2 - \frac{n - 2}{n - 1} \delta^2 \right) \]Note that \(\var[S(\bs{X}, \bs{Y})]\) is \(1 \big/ [4 \, n^2 (n - 1)^2]\) times the sum of all of the pairwise covariances of the terms in the expansion in the last exercise. First, \[ \cov[(X_i - X_j)(Y_i - Y_j), (X_k - X_l)(Y_k - Y_l)] = 0 \] if \(i = j\) or \(k = l\) or if \(i\), \(j\), \(k\), \(l\) are distinct. Next, \[ \cov[(X_i - X_j)(Y_i - Y_j), (X_i - X_j)(Y_i - Y_j)] = 2 \, \delta_2 + 2 \, \sigma^2 \, \tau^2 \] if \(i \ne j\), and there are \(2 \, n \, (n - 1)\) such terms in the sum of covariances. Finally, \[ \cov[(X_i - X_j)(Y_i - Y_j), (X_k - X_j)(Y_k - Y_j)] = \delta_2 - \delta^2 \] if \(i\), \(j\), \(k\) are distinct, and there are \(4 \, n \, (n - 1) \, (n - 2)\) such terms in the sum of covariances.
\(\var[S(\bs{X}, \bs{Y})] \gt \var[W(\bs{X}, \bs{Y})]\).
\(\var[S(\bs{X}, \bs{Y})] \to 0\) as \(n \to \infty\). Thus, the sample covariance is a consistent estimator of the distribution covariance.
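A sketch of the comparison between the two estimators: since \(W(\bs{X}, \bs{Y})\) is the sample mean of \(n\) independent copies of \((X - \mu)(Y - \nu)\), a variable with mean \(\delta\) and second moment \(\delta_2\), we have \(\var[W(\bs{X}, \bs{Y})] = (\delta_2 - \delta^2) \big/ n\). Subtracting this from the formula for the variance of the sample covariance gives
\[ \var[S(\bs{X}, \bs{Y})] - \var[W(\bs{X}, \bs{Y})] = \frac{1}{n} \left( \frac{\sigma^2 \, \tau^2}{n - 1} - \frac{n - 2}{n - 1} \delta^2 + \delta^2 \right) = \frac{\sigma^2 \, \tau^2 + \delta^2}{n \, (n - 1)} \]
which is positive when \(\sigma\) and \(\tau\) are positive. The difference is of order \(1 / n^2\), so for large \(n\) little is lost by not knowing the distribution means. Assuming that \(\delta_2\) is finite, the terms in parentheses in the formula for \(\var[S(\bs{X}, \bs{Y})]\) are bounded in \(n\), so the factor \(1 / n\) forces the variance to 0.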
In the subsection on descriptive statistics, we studied regression from a deterministic, descriptive point of view. The results obtained applied only to the sample. Statistically more interesting and deeper questions arise when the data come from a random experiment, and we try to draw inferences about the underlying distribution from the sample regression. There are two models that commonly arise. One is where the response variable is random, but the predictor variable is deterministic. The other is the model of this subsection, where the predictor variable and the response variable are both random, so that the data form a random sample from a bivariate distribution.
Thus, consider again the model at the beginning of this section, where we have a basic random vector \((X, Y)\) for an experiment. Recall that in the section on (distribution) correlation and regression, we showed that the best linear predictor of \(Y\) given \(X\), in the sense of minimizing mean square error, is the random variable
\[ L(Y \mid X) = \E(Y) + \frac{\cov(X, Y)}{\var(X)}[X - \E(X)] = \nu + \frac{\delta}{\sigma^2}(X - \mu) \]so that the distribution regression line is given by
\[ y = L(Y \mid X = x) = \nu + \frac{\delta}{\sigma^2}(x - \mu) \]Moreover, the (minimum) value of the mean square error is \(\E\{[Y - L(Y \mid X)]^2\} = \var(Y)[1 - \cor^2(X, Y)] = \tau^2 (1 - \rho^2)\).
Of course, in real applications, we are unlikely to know the distribution parameters \(\mu\), \(\nu\), \(\sigma^2\), and \(\delta\). If we want to estimate the distribution regression line, a natural approach would be to consider a random sample \(((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n))\) from the distribution of \((X, Y)\) and compute the sample regression line. Of course, the results are exactly the same as in our discussion of regression from a descriptive point of view, except that all of the relevant quantities are random variables. The sample regression line is
\[ y = M(\bs{Y}) + \frac{S(\bs{X}, \bs{Y})}{S^2(\bs{X})}[x - M(\bs{X})] \]The mean square error is \(S^2(\bs{Y})[1 - R^2(\bs{X}, \bs{Y})]\) and the coefficient of determination is \(R^2(\bs{X}, \bs{Y})\).
The fact that the sample regression line and mean square error are completely analogous to the distribution regression line and mean square error is mathematically elegant and reassuring. Again, the coefficients of the sample regression line can be viewed as estimators of the respective coefficients in the distribution regression line.
Assuming that the appropriate higher order moments are finite, the coefficients of the sample regression line converge to the coefficients of the distribution regression line with probability 1.
Use the strong law of large numbers.
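The convergence of the sample regression coefficients can be illustrated by simulation. The sketch below assumes a hypothetical model, chosen only for this example, in which \(X\) is standard normal and \(Y = 1 + 2 \, X + Z\) with \(Z\) standard normal and independent of \(X\). In this model \(\mu = 0\), \(\nu = 1\), \(\sigma^2 = 1\), and \(\delta = 2\), so the distribution regression line is \(y = 1 + 2 \, x\). The seed and the sample sizes are arbitrary.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

random.seed(2024)  # fixed seed so that the run is reproducible

def draw():
    """One draw of (X, Y) with X standard normal and Y = 1 + 2 X + noise."""
    x = random.gauss(0.0, 1.0)
    return x, 1.0 + 2.0 * x + random.gauss(0.0, 1.0)

def fit(n):
    """Intercept and slope of the sample regression line for a sample of size n."""
    pairs = [draw() for _ in range(n)]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    b = cov(xs, ys) / cov(xs, xs)
    a = mean(ys) - b * mean(xs)
    return a, b

a_small, b_small = fit(20)       # small sample: coefficients can be far off
a_large, b_large = fit(20000)    # large sample: close to (1, 2)

print(a_small, b_small)
print(a_large, b_large)
```

With the large sample, the intercept and slope should be close to the distribution values 1 and 2.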
Of course, if the linear relationship between \(X\) and \(Y\) is not strong, as measured by the mean square error, then one of the transformations discussed above can be applied.
Click in the interactive scatterplot, in various places, and watch how the means, standard deviations, correlation, and regression line change.
Click in the interactive scatterplot to define 20 points and try to come as close as possible to each of the following sample correlations: \(0\), \(0.5\), \(-0.5\), \(0.7\), \(-0.7\), \(0.9\), \(-0.9\).
Click in the interactive scatterplot to define 20 points. Try to generate a scatterplot in which the regression line has
Run the bivariate uniform experiment 2000 times in each of the following cases. Note the apparent convergence of the sample means to the distribution means, the sample standard deviations to the distribution standard deviations, the sample correlation to the distribution correlation, and the sample regression line to the distribution regression line.
Run the bivariate normal experiment 2000 times for various values of the distribution standard deviations and the distribution correlation. Note the apparent convergence of the sample means to the distribution means, the sample standard deviations to the distribution standard deviations, the sample correlation to the distribution correlation, and the sample regression line to the distribution regression line.
Consider Pearson's height data, with the height of the father as the predictor variable and the height of the son as the response variable.
Compute the correlation between petal length and petal width for the following cases in Fisher's iris data.
Consider the M&M data with number of candies as the predictor variable and net weight as the response variable.
Consider the SAT by state data set with response rate as the predictor variable and total SAT score as the response variable.
Consider the SAT by year data set with verbal score (all students) as the predictor variable and math score (all students) as the response variable.
Consider the first data set in the Challenger data with temperature as the predictor variable and O-ring erosion as the response variable.
Consider the second data set in the Challenger data.