Recall the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that we make on these objects. We select objects from the population and record the variables for the objects in the sample; these become our data. Once again, our first discussion is from a descriptive point of view. That is, we do not assume that the data are generated by an underlying probability distribution. Remember however, that the data themselves form a probability distribution.
Suppose that \(\bs{x} = (x_1, x_2, \ldots, x_n)\) is a sample of size \(n\) from a real-valued variable \(x\). Recall that the sample mean is
\[ m = \frac{1}{n} \sum_{i=1}^n x_i \]and is the most important measure of the center of the data set. The sample variance is defined to be
\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^n (x_i - m)^2 \]If we need to indicate the dependence on the data vector \(\bs{x}\), we write \(s^2(\bs{x})\). The difference \(x_i - m\) is the deviation of \(x_i\) from the mean \(m\) of the data set. Thus, the variance is the mean square deviation and is a measure of the spread of the data set with respect to the mean. The reason for dividing by \(n - 1\) rather than \(n\) is best understood in terms of the inferential point of view that we discuss in the next section; this definition makes the sample variance an unbiased estimator of the distribution variance. However, the reason for the averaging can also be understood in terms of a related concept.
\(\sum_{i=1}^n (x_i - m) = 0\).
\(\sum_{i=1}^n (x_i - m) = \sum_{i=1}^n x_i - \sum_{i=1}^n m = n m - n m = 0\).
Thus, if we know \(n - 1\) of the deviations, we can compute the last one. This means that there are only \(n - 1\) freely varying deviations, that is to say, \(n - 1\) degrees of freedom in the set of deviations. In the definition of sample variance, we average the squared deviations, not by dividing by the number of terms, but rather by dividing by the number of degrees of freedom in those terms. However, this argument notwithstanding, it would be reasonable, from a purely descriptive point of view, to divide by \(n\) in the definition of the sample variance. Moreover, when \(n\) is sufficiently large, it hardly matters whether we divide by \(n\) or by \(n - 1\).
In any event, the square root \(s\) of the sample variance \(s^2\) is the sample standard deviation. It is the root mean square deviation and is also a measure of the spread of the data with respect to the mean. Both measures of spread are important. Variance has nicer mathematical properties, but its physical unit is the square of the unit of \(x\). For example, if the underlying variable \(x\) is the height of a person in inches, the variance is in square inches. On the other hand, the standard deviation has the same physical unit as the original variable, but its mathematical properties are not as nice.
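As a quick numerical illustration, the sketch below (Python, using the standard library's `statistics` module and a small hypothetical sample) computes the sample variance and standard deviation directly from the definitions, and checks the \(n - 1\) versus \(n\) distinction: `statistics.variance` divides by \(n - 1\), while `statistics.pvariance` divides by \(n\).

```python
import math
import statistics

# A small hypothetical sample
x = [1.0, 3.0, 5.0, 7.0]
n = len(x)
m = sum(x) / n                                 # sample mean

# Sample variance: mean square deviation, dividing by n - 1
s2 = sum((xi - m) ** 2 for xi in x) / (n - 1)
s = math.sqrt(s2)                              # sample standard deviation

# statistics.variance divides by n - 1; statistics.pvariance divides by n
assert math.isclose(s2, statistics.variance(x))
assert math.isclose(s2 * (n - 1) / n, statistics.pvariance(x))
```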
Recall that the data set \(\bs{x}\) naturally gives rise to a probability distribution, namely the empirical distribution that places probability \(\frac{1}{n}\) at \(x_i\) for each \(i\). Thus, if the data are distinct, this is the uniform distribution on \(\{x_1, x_2, \ldots, x_n\}\). The sample mean \(m\) is simply the expected value of the empirical distribution. Similarly, if we were to divide by \(n\) rather than \(n - 1\), the sample variance would be the variance of the empirical distribution. Most of the properties and results in this section follow from much more general properties and results for the variance of a probability distribution (although for the most part, we give independent proofs).
Measures of center and measures of spread are best thought of together, in the context of an error function. The error function measures how well a single number \(a\) represents the entire data set \(\bs{x}\). The values of \(a\) (if they exist) that minimize the error functions are our measures of center; the minimum value of the error function is the corresponding measure of spread. Of course, we hope for a single value of \(a\) that minimizes the error function, so that we have a unique measure of center.
Let's apply this procedure to the mean square error function defined by
\[ \mse(a) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - a)^2, \quad a \in \R \]Minimizing \(\mse\) is a standard problem in calculus.
The graph of \(\mse\) is a parabola opening upward.
We can tell from the form of \(\mse\) that the graph is a parabola opening upward. Taking the derivative gives
\[ \frac{d}{da} \mse(a) = -\frac{2}{n - 1}\sum_{i=1}^n (x_i - a) = -\frac{2}{n - 1}(n m - n a) \]Hence \(a = m\) is the unique value that minimizes \(\mse\). Of course, \(\mse(m) = s^2\).
Trivially, if we defined the mean square error function by dividing by \(n\) rather than \(n - 1\), then the minimum value would still occur at \(m\), the sample mean, but the minimum value would be the alternate version of the sample variance in which we divide by \(n\). On the other hand, if we were to use the root mean square deviation function \(\text{rmse}(a) = \sqrt{\mse(a)}\), then because the square root function is strictly increasing on \([0, \infty)\), the minimum value would again occur at \(m\), the sample mean, but the minimum value would be \(s\), the sample standard deviation. The important point is that with all of these error functions, the unique measure of center is the sample mean, and the corresponding measures of spread are the various ones that we are studying.
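The fact that the sample mean minimizes \(\mse\) is easy to check numerically. The sketch below (Python, with a small hypothetical data set) scans a fine grid of values of \(a\) and confirms that the minimizer agrees with the sample mean and that the minimum value is the sample variance.

```python
import math
import statistics

x = [1.0, 2.0, 4.0, 9.0]    # hypothetical data; sample mean is 4.0
n = len(x)
m = statistics.mean(x)

def mse(a):
    # Mean square error, dividing by n - 1 as in the text
    return sum((xi - a) ** 2 for xi in x) / (n - 1)

# Scan a fine grid; the minimizer should be the sample mean
grid = [i / 100 for i in range(-500, 1500)]
a_star = min(grid, key=mse)

assert abs(a_star - m) < 1e-9                  # minimized at the mean
assert math.isclose(mse(m), statistics.variance(x))  # minimum value is s^2
```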
Next, let's apply our procedure to the mean absolute error function defined by
\[ \mae(a) = \frac{1}{n - 1} \sum_{i=1}^n |x_i - a|, \quad a \in \R \]The mean absolute error function satisfies the following properties: (a) \(\mae\) is a continuous function of \(a\), and (b) the graph of \(\mae\) is piecewise linear.
For parts (a) and (b), note that for each \(i\), \(|x_i - a|\) is a continuous function of \(a\) with the graph consisting of two lines (of slopes \(\pm 1\)) meeting at \(x_i\).
Mathematically, \(\mae\) has some problems as an error function. First, the function will not be smooth (differentiable) at points where two lines of different slopes meet. More importantly, the values that minimize \(\mae\) may occupy an entire interval, thus leaving us without a unique measure of center. The error function exercises below will show you that these pathologies can really happen. It turns out that \(\mae\) is minimized at any point in the median interval of the data set \(\bs{x}\). The proof of this result follows from a much more general result for probability distributions. Thus, the medians are the natural measures of center associated with \(\mae\) as a measure of error, in the same way that the sample mean is the measure of center associated with \(\mse\) as a measure of error.
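The non-uniqueness can be seen concretely. The Python sketch below (with a small hypothetical data set of even size) shows that \(\mae\) takes the same minimum value at every point of the median interval, and is larger outside it.

```python
x = [1.0, 2.0, 4.0, 9.0]    # hypothetical data; the median interval is [2, 4]
n = len(x)

def mae(a):
    # Mean absolute error, dividing by n - 1 as in the text
    return sum(abs(xi - a) for xi in x) / (n - 1)

# mae is constant on the median interval and larger outside it,
# so every point of [2, 4] minimizes mae: no unique measure of center.
assert mae(2.0) == mae(3.0) == mae(4.0)
assert mae(1.9) > mae(2.0) and mae(4.1) > mae(4.0)
```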
In this section, we establish some essential properties of the sample variance and standard deviation. First, the following alternate formula for the sample variance is better for computational purposes, and for certain theoretical purposes as well.
The sample variance can be computed as
\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^n x_i^2 - \frac{n}{n - 1} m^2 \]Note that
\[\begin{align} \sum_{i=1}^n (x_i - m)^2 & = \sum_{i=1}^n (x_i^2 - 2 m x_i + m^2) = \sum_{i=1}^n x_i^2 - 2 m \sum_{i=1}^n x_i + \sum_{i=1}^n m^2\\ & = \sum_{i=1}^n x_i^2 - 2 n m^2 + n m^2 = \sum_{i=1}^n x_i^2 - n m^2 \end{align} \]Dividing by \(n - 1\) gives the result.
If we let \(\bs{x}^2 = (x_1^2, x_2^2, \ldots, x_n^2)\) denote the sample from the variable \(x^2\), then the computational formula in the last exercise can be written succinctly as
\[ s^2(\bs{x}) = \frac{n}{n - 1} [m(\bs{x}^2) - m^2(\bs{x})] \]The following theorem gives another computational formula for the sample variance, directly in terms of the variables and thus without the computation of an intermediate statistic.
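The computational formula is easy to verify numerically. The sketch below (Python standard library, small hypothetical data set) computes \(s^2\) from the means of \(x\) and \(x^2\) and compares with a direct computation.

```python
import math
import statistics

x = [3.0, 1.0, 2.0, 0.0, 2.0, 4.0, 3.0, 2.0, 1.0, 2.0]   # small data set
n = len(x)
m = statistics.mean(x)                        # m(x)
m2 = statistics.mean([xi * xi for xi in x])   # m(x^2), mean of the squares

# Computational formula: s^2 = n / (n - 1) * [m(x^2) - m(x)^2]
s2_formula = n / (n - 1) * (m2 - m * m)

assert math.isclose(s2_formula, statistics.variance(x))
```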
The sample variance can be computed as
\[ s^2 = \frac{1}{2 n (n - 1)} \sum_{i=1}^n \sum_{j=1}^n (x_i - x_j)^2 \]Note that
\[ \begin{align} \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n (x_i - x_j)^2 & = \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n (x_i - m + m - x_j)^2 \\ & = \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n [(x_i - m)^2 + 2 (x_i - m)(m - x_j) + (m - x_j)^2] \\ & = \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n (x_i - m)^2 + \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n (x_i - m)(m - x_j) + \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n (m - x_j)^2 \\ & = \frac{1}{2} \sum_{i=1}^n (x_i - m)^2 + 0 + \frac{1}{2} \sum_{j=1}^n (m - x_j)^2 \\ & = \sum_{i=1}^n (x_i - m)^2 \end{align} \]Dividing by \(n - 1\) gives the result.
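A quick numerical check of the pairwise formula (Python, hypothetical data): the double sum over all ordered pairs, divided by \(2 n (n - 1)\), reproduces the sample variance without computing the mean first.

```python
import math
import statistics

x = [1.0, 2.0, 4.0, 9.0]    # hypothetical data
n = len(x)

# s^2 = (1 / (2 n (n - 1))) * sum over all pairs (i, j) of (x_i - x_j)^2
s2_pairs = sum((xi - xj) ** 2 for xi in x for xj in x) / (2 * n * (n - 1))

assert math.isclose(s2_pairs, statistics.variance(x))
```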
The sample variance is nonnegative: (a) \(s^2 \ge 0\), and (b) \(s^2 = 0\) if and only if \(\bs{x}\) is a constant vector.
Part (a) is obvious. For part (b) note that if \(s^2 = 0\) then \(x_i = m\) for each \(i\). Conversely, if \(\bs{x}\) is a constant vector, then \(m\) is that same constant.
Thus, \(s^2 = 0\) if and only if the data set is constant (and then, of course, the mean is the common value).
If \(c\) is a constant then (a) \(s^2(c \bs{x}) = c^2 s^2(\bs{x})\), and (b) \(s(c \bs{x}) = |c| s(\bs{x})\).
For part (a), recall that \(m(c \bs{x}) = c m(\bs{x})\). Hence
\[ s^2(c \bs{x}) = \frac{1}{n - 1}\sum_{i=1}^n [c x_i - c m(\bs{x})]^2 = \frac{1}{n - 1} \sum_{i=1}^n c^2 [x_i - m(\bs{x})]^2 = c^2 s^2(\bs{x}) \]If \(\bs{c}\) is a sample of size \(n\) from a constant \(c\), then \(s^2(\bs{x} + \bs{c}) = s^2(\bs{x})\) and \(s(\bs{x} + \bs{c}) = s(\bs{x})\).
Recall that \(m(\bs{x} + \bs{c}) = m(\bs{x}) + c\). Hence
\[ s^2(\bs{x} + \bs{c}) = \frac{1}{n - 1} \sum_{i=1}^n \{(x_i + c) - [m(\bs{x}) + c]\}^2 = \frac{1}{n - 1} \sum_{i=1}^n [x_i - m(\bs{x})]^2 = s^2(\bs{x})\]As a special case of these results, suppose that \(\bs{x} = (x_1, x_2, \ldots, x_n)\) is a sample of size \(n\) corresponding to a real variable \(x\), and that \(a\) and \(b\) are constants. The sample corresponding to the variable \(y = a + b x\), in our vector notation, is \(\bs{a} + b \bs{x}\). Then \(m(\bs{a} + b \bs{x}) = a + b m(\bs{x})\) and \(s(\bs{a} + b \bs{x}) = |b| s(\bs{x})\). Linear transformations of this type, when \(b \gt 0\), arise frequently when physical units are changed. In this case, the transformation is often called a location-scale transformation; \(a\) is the location parameter and \(b\) is the scale parameter. For example, if \(x\) is the length of an object in inches, then \(y = 2.54 x\) is the length of the object in centimeters. If \(x\) is the temperature of an object in degrees Fahrenheit, then \(y = \frac{5}{9}(x - 32)\) is the temperature of the object in degrees Celsius.
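The location-scale rules \(m(\bs{a} + b \bs{x}) = a + b\, m(\bs{x})\) and \(s(\bs{a} + b \bs{x}) = |b|\, s(\bs{x})\) can be verified numerically. The sketch below (Python, with hypothetical height data) uses the inches-to-centimeters conversion, so \(a = 0\) and \(b = 2.54\).

```python
import math
import statistics

x = [68.0, 72.0, 64.5, 70.0]       # hypothetical heights in inches
a, b = 0.0, 2.54                   # inches to centimeters: y = a + b x
y = [a + b * xi for xi in x]

# m(a + b x) = a + b m(x)  and  s(a + b x) = |b| s(x)
assert math.isclose(statistics.mean(y), a + b * statistics.mean(x))
assert math.isclose(statistics.stdev(y), abs(b) * statistics.stdev(x))
```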
Now, for \(i \in \{1, 2, \ldots, n\}\), let \( z_i = (x_i - m) / s\). The number \(z_i\) is the standard score associated with \(x_i\). Note that since \(x_i\), \(m\), and \(s\) have the same physical units, the standard score \(z_i\) is dimensionless (that is, has no physical units); it measures the directed distance from the mean \(m\) to the data value \(x_i\) in standard deviations.
The sample of standard scores \(\bs{z} = (z_1, z_2, \ldots, z_n)\) has mean 0 and variance 1. That is, \(m(\bs{z}) = 0\) and \(s^2(\bs{z}) = 1\).
These results follow from Theorems 7 and 8. In vector notation, note that \(\bs{z} = (\bs{x} - \bs{m})/s\). Hence \(m(\bs{z}) = (m - m) / s = 0\) and \(s(\bs{z}) = s / s = 1\).
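A numerical check (Python, hypothetical data) that standardizing always produces mean 0 and standard deviation 1:

```python
import math
import statistics

x = [1.0, 2.0, 4.0, 9.0]          # hypothetical data
m, s = statistics.mean(x), statistics.stdev(x)
z = [(xi - m) / s for xi in x]    # standard scores

# Standard scores always have mean 0 and standard deviation 1
assert math.isclose(statistics.mean(z), 0.0, abs_tol=1e-12)
assert math.isclose(statistics.stdev(z), 1.0)
```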
Suppose that instead of the actual data \(\bs{x}\), we have a frequency distribution corresponding to a partition with classes (intervals) \((A_1, A_2, \ldots, A_k)\), class marks (midpoints of the intervals) \((t_1, t_2, \ldots, t_k)\), and frequencies \((n_1, n_2, \ldots, n_k)\). Recall that the relative frequency of class \(A_j\) is \(p_j = n_j / n\). In this case, approximate values of the sample mean and variance are, respectively,
\[ \begin{align} m & = \frac{1}{n} \sum_{j=1}^k n_j \, t_j = \sum_{j = 1}^k p_j \, t_j \\ s^2 & = \frac{1}{n - 1} \sum_{j=1}^k n_j (t_j - m)^2 = \frac{n}{n - 1} \sum_{j=1}^k p_j (t_j - m)^2 \end{align} \]These approximations are based on the hope that the data values in each class are well represented by the class mark. In fact, these are the standard definitions of sample mean and variance for the data set in which \(t_j\) occurs \(n_j\) times for each \(j\).
Suppose that \(x\) is the temperature (in degrees Fahrenheit) for a certain type of electronic component after 10 hours of operation. A sample of 30 components has mean \(113°\) and standard deviation \(18°\).
Suppose that \(x\) is the length (in inches) of a machined part in a manufacturing process. A sample of 50 parts has mean 10.0 and standard deviation 2.0.
Professor Moriarity has a class of 25 students in her section of Stat 101 at Enormous State University (ESU). The mean grade on the first midterm exam was 64 (out of a possible 100 points) and the standard deviation was 16. Professor Moriarity thinks the grades are a bit low and is considering various transformations for increasing the grades. In each case below give the mean and standard deviation of the transformed grades, or state that there is not enough information.
One of the students did not study at all, and received a 10 on the midterm. Professor Moriarity considers this score to be an outlier.
All statistical software packages will compute means, variances and standard deviations, draw dotplots and histograms, and in general perform the numerical and graphical procedures discussed in this section. For real statistical experiments, particularly those with large data sets, the use of statistical software is essential. On the other hand, there is some value in performing the computations by hand, with small, artificial data sets, in order to master the concepts and definitions. In this subsection, do the computations and draw the graphs with minimal technological aids.
Suppose that \(x\) is the number of math courses completed by an ESU student. A sample of 10 ESU students gives the data \(\bs{x} = (3, 1, 2, 0, 2, 4, 3, 2, 1, 2)\).
\(i\) | \(x_i\) | \(x_i - m\) | \((x_i - m)^2\) |
---|---|---|---|
\(1\) | \(3\) | \(1\) | \(1\) |
\(2\) | \(1\) | \(-1\) | \(1\) |
\(3\) | \(2\) | \(0\) | \(0\) |
\(4\) | \(0\) | \(-2\) | \(4\) |
\(5\) | \(2\) | \(0\) | \(0\) |
\(6\) | \(4\) | \(2\) | \(4\) |
\(7\) | \(3\) | \(1\) | \(1\) |
\(8\) | \(2\) | \(0\) | \(0\) |
\(9\) | \(1\) | \(-1\) | \(1\) |
\(10\) | \(2\) | \(0\) | \(0\) |
Total | 20 | 0 | 12 |
Mean | 2 | 0 | \(12/9\) |
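The sample mean and variance for this data set can be computed exactly with Python's `fractions` module, avoiding any floating-point roundoff:

```python
from fractions import Fraction

x = [3, 1, 2, 0, 2, 4, 3, 2, 1, 2]
n = len(x)

m = Fraction(sum(x), n)                  # sample mean
ss = sum((xi - m) ** 2 for xi in x)      # sum of squared deviations
s2 = ss / (n - 1)                        # sample variance, dividing by n - 1

assert m == 2
assert ss == 12
assert s2 == Fraction(4, 3)
```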
Suppose that a sample of size 12 from a discrete variable \(x\) has empirical density function given by \(f(-2) = 1/12\), \(f(-1) = 1/4\), \(f(0) = 1/3\), \(f(1) = 1/6\), \(f(2) = 1/6\).
The following table gives a frequency distribution for the commuting distance to the math/stat building (in miles) for a sample of ESU students.
Class | Freq | Rel Freq | Density | Cum Freq | Cum Rel Freq | Midpoint |
---|---|---|---|---|---|---|
\((0, 2]\) | 6 | |||||
\((2, 6]\) | 16 | |||||
\((6, 10]\) | 18 | |||||
\((10, 20]\) | 10 | |||||
Total |
Class | Freq | Rel Freq | Density | Cum Freq | Cum Rel Freq | Midpoint |
---|---|---|---|---|---|---|
\((0, 2]\) | 6 | 0.12 | 0.06 | 6 | 0.12 | 1 |
\((2, 6]\) | 16 | 0.32 | 0.08 | 22 | 0.44 | 4 |
\((6, 10]\) | 18 | 0.36 | 0.09 | 40 | 0.80 | 8 |
\((10, 20]\) | 10 | 0.20 | 0.02 | 50 | 1 | 15 |
Total | 50 | 1 |
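Using the class marks and frequencies from the completed table, the grouped-data formulas from earlier in this section give approximate values of the sample mean and variance (a Python sketch):

```python
# Class marks and frequencies from the completed table
marks = [1, 4, 8, 15]
freqs = [6, 16, 18, 10]
n = sum(freqs)                            # sample size: 50

# Grouped-data approximations to the sample mean and variance
m = sum(nj * tj for tj, nj in zip(marks, freqs)) / n
s2 = sum(nj * (tj - m) ** 2 for tj, nj in zip(marks, freqs)) / (n - 1)

assert n == 50
assert abs(m - 7.28) < 1e-9               # approximate mean, in miles
```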
In the error function applet, select root mean square error. As you add points, note the shape of the graph of the error function, the value that minimizes the function, and the minimum value of the function.
In the error function applet, select mean absolute error. As you add points, note the shape of the graph of the error function, the values that minimize the function, and the minimum value of the function.
Suppose that our data vector is \((2, 1, 5, 7)\). Explicitly give \(\mae\) as a piecewise function and sketch its graph. Note that
Suppose that our data vector is \((3, 5, 1)\). Explicitly give \(\mae\) as a piecewise function and sketch its graph. Note that
Statistical software should be used for the problems in this subsection.
Consider the petal length and species variables in Fisher's iris data.
Consider the erosion variable in the Challenger data set.
Consider Michelson's velocity of light data.
Consider Short's parallax of the sun data.
Consider Cavendish's density of the earth data.
Consider the M&M data.
Consider the body weight, species, and gender variables in the Cicada data.
Consider Pearson's height data.