
## 5. The Sample Variance I

Recall the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that we make on these objects. We select objects from the population and record the variables for the objects in the sample; these become our data. Once again, our first discussion is from a descriptive point of view. That is, we do not assume that the data are generated by an underlying probability distribution. Remember, however, that the data themselves form a probability distribution.

#### Variance and Standard Deviation

Suppose that $$\bs{x} = (x_1, x_2, \ldots, x_n)$$ is a sample of size $$n$$ from a real-valued variable $$x$$. Recall that the sample mean is

$m = \frac{1}{n} \sum_{i=1}^n x_i$

and is the most important measure of the center of the data set. The sample variance is defined to be

$s^2 = \frac{1}{n - 1} \sum_{i=1}^n (x_i - m)^2$

If we need to indicate the dependence on the data vector $$\bs{x}$$, we write $$s^2(\bs{x})$$. The difference $$x_i - m$$ is the deviation of $$x_i$$ from the mean $$m$$ of the data set. Thus, the variance is the mean square deviation and is a measure of the spread of the data set with respect to the mean. The reason for dividing by $$n - 1$$ rather than $$n$$ is best understood in terms of the inferential point of view that we discuss in the next section; this definition makes the sample variance an unbiased estimator of the distribution variance. However, the reason for the averaging can also be understood in terms of a related concept.

$$\sum_{i=1}^n (x_i - m) = 0$$.

Proof:

$$\sum_{i=1}^n (x_i - m) = \sum_{i=1}^n x_i - \sum_{i=1}^n m = n m - n m = 0$$.

Thus, if we know $$n - 1$$ of the deviations, we can compute the last one. This means that there are only $$n - 1$$ freely varying deviations, that is to say, $$n - 1$$ degrees of freedom in the set of deviations. In the definition of sample variance, we average the squared deviations, not by dividing by the number of terms, but rather by dividing by the number of degrees of freedom in those terms. However, this argument notwithstanding, it would be reasonable, from a purely descriptive point of view, to divide by $$n$$ in the definition of the sample variance. Moreover, when $$n$$ is sufficiently large, it hardly matters whether we divide by $$n$$ or by $$n - 1$$.
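The degrees of freedom argument can be checked numerically. The following is a minimal sketch in Python, using only the standard library; the data vector is hypothetical and chosen only for illustration.

```python
from statistics import mean, variance, pvariance

# Hypothetical data, chosen only for illustration.
x = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
n = len(x)
m = mean(x)

# Deviations from the mean; they must sum to zero.
deviations = [xi - m for xi in x]

# The last deviation is determined by the first n - 1 of them.
last_from_others = -sum(deviations[:-1])

sum_sq = sum(d ** 2 for d in deviations)
s2 = sum_sq / (n - 1)   # sample variance: divide by the degrees of freedom
v_n = sum_sq / n        # descriptive alternative: divide by n

print(sum(deviations))                   # 0.0
print(last_from_others, deviations[-1])  # 3.0 3.0
print(s2, v_n)                           # 4.8 4.0
print(variance(x), pvariance(x))         # library versions of the same two quantities
```

In the standard library, `variance` divides by $$n - 1$$ and `pvariance` divides by $$n$$, so the two conventions discussed above are both available.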

In any event, the square root $$s$$ of the sample variance $$s^2$$ is the sample standard deviation. It is the root mean square deviation and is also a measure of the spread of the data with respect to the mean. Both measures of spread are important. Variance has nicer mathematical properties, but its physical unit is the square of the unit of $$x$$. For example, if the underlying variable $$x$$ is the height of a person in inches, the variance is in square inches. On the other hand, the standard deviation has the same physical unit as the original variable, but its mathematical properties are not as nice.

Recall that the data set $$\bs{x}$$ naturally gives rise to a probability distribution, namely the empirical distribution that places probability $$\frac{1}{n}$$ at $$x_i$$ for each $$i$$. Thus, if the data are distinct, this is the uniform distribution on $$\{x_1, x_2, \ldots, x_n\}$$. The sample mean $$m$$ is simply the expected value of the empirical distribution. Similarly, if we were to divide by $$n$$ rather than $$n - 1$$, the sample variance would be the variance of the empirical distribution. Most of the properties and results in this section follow from much more general properties and results for the variance of a probability distribution (although for the most part, we give independent proofs).

#### Measures of Center and Spread

Measures of center and measures of spread are best thought of together, in the context of an error function. The error function measures how well a single number $$a$$ represents the entire data set $$\bs{x}$$. The values of $$a$$ (if they exist) that minimize the error functions are our measures of center; the minimum value of the error function is the corresponding measure of spread. Of course, we hope for a single value of $$a$$ that minimizes the error function, so that we have a unique measure of center.

Let's apply this procedure to the mean square error function defined by

$\mse(a) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - a)^2, \quad a \in \R$

Minimizing $$\mse$$ is a standard problem in calculus.

The graph of $$\mse$$ is a parabola opening upward.

1. $$\mse$$ is minimized when $$a = m$$, the sample mean.
2. The minimum value of $$\mse$$ is $$s^2$$, the sample variance.
Proof:

We can tell from the form of $$\mse$$ that the graph is a parabola opening upward. Taking the derivative gives

$\frac{d}{da} \mse(a) = -\frac{2}{n - 1}\sum_{i=1}^n (x_i - a) = -\frac{2}{n - 1}(n m - n a)$

The derivative is negative for $$a \lt m$$, zero at $$a = m$$, and positive for $$a \gt m$$. Hence $$a = m$$ is the unique value that minimizes $$\mse$$. Of course, $$\mse(m) = s^2$$.

Trivially, if we defined the mean square error function by dividing by $$n$$ rather than $$n - 1$$, then the minimum value would still occur at $$m$$, the sample mean, but the minimum value would be the alternate version of the sample variance in which we divide by $$n$$. On the other hand, if we were to use the root mean square deviation function $$\text{rmse}(a) = \sqrt{\mse(a)}$$, then because the square root function is strictly increasing on $$[0, \infty)$$, the minimum value would again occur at $$m$$, the sample mean, but the minimum value would be $$s$$, the sample standard deviation. The important point is that with all of these error functions, the unique measure of center is the sample mean, and the corresponding measures of spread are the various ones that we are studying.
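As a rough numerical illustration of the minimization of $$\mse$$, the sketch below evaluates the error function on a grid of candidate values and compares the minimizer with the sample mean. The data vector and the grid are assumptions made only for this example; a grid search is not a proof, just a sanity check.

```python
from statistics import mean, variance

# Hypothetical data, chosen only for illustration.
x = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
n = len(x)

def mse(a):
    """Mean square error of the single number a, with the n - 1 divisor."""
    return sum((xi - a) ** 2 for xi in x) / (n - 1)

m = mean(x)

# Candidate values 0.00, 0.01, ..., 10.00 (an arbitrary grid containing m).
grid = [i / 100 for i in range(0, 1001)]
best = min(grid, key=mse)

print(best, m)               # the grid minimizer and the sample mean
print(mse(m), variance(x))   # the minimum value and the sample variance
```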

Next, let's apply our procedure to the mean absolute error function defined by

$\mae(a) = \frac{1}{n - 1} \sum_{i=1}^n |x_i - a|, \quad a \in \R$

The mean absolute error function satisfies the following properties:

1. $$\mae$$ is a continuous function.
2. The graph of $$\mae$$ consists of lines.
3. The slope of the line at $$a$$ depends on where $$a$$ is in the data set $$\bs{x}$$.
Proof:

For parts (a) and (b), note that for each $$i$$, $$|x_i - a|$$ is a continuous function of $$a$$ with the graph consisting of two lines (of slopes $$\pm 1$$) meeting at $$x_i$$.

Mathematically, $$\mae$$ has some problems as an error function. First, the function will not be smooth (differentiable) at points where two lines of different slopes meet. More importantly, the values that minimize $$\mae$$ may occupy an entire interval, thus leaving us without a unique measure of center. The error function exercises below will show you that these pathologies can really happen. It turns out that $$\mae$$ is minimized at any point in the median interval of the data set $$\bs{x}$$. The proof of this result follows from a much more general result for probability distributions. Thus, the medians are the natural measures of center associated with $$\mae$$ as a measure of error, in the same way that the sample mean is the measure of center associated with the $$\mse$$ as a measure of error.
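The non-uniqueness of the minimizer can be seen in a small computation. The sketch below uses a hypothetical data vector of even size, so that the median interval has positive length; the function name `mae` and the test points are choices made only for this illustration.

```python
from statistics import median_low, median_high

def mae(a, x):
    """Mean absolute error of the single number a, with the n - 1 divisor."""
    return sum(abs(xi - a) for xi in x) / (len(x) - 1)

# Hypothetical data with an even number of points.
x = [1, 3, 6, 10]

# Endpoints of the median interval.
lo, hi = median_low(x), median_high(x)

# Error at three points inside the median interval and two points outside.
inside = [mae(a, x) for a in (lo, (lo + hi) / 2, hi)]
outside = [mae(a, x) for a in (lo - 1, hi + 1)]

print(lo, hi)     # 3 6
print(inside)     # the same value at every point of the median interval
print(outside)    # strictly larger values
```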

#### Properties

In this section, we establish some essential properties of the sample variance and standard deviation. First, the following alternate formula for the sample variance is better for computational purposes, and for certain theoretical purposes as well.

The sample variance can be computed as

$s^2 = \frac{1}{n - 1} \sum_{i=1}^n x_i^2 - \frac{n}{n - 1} m^2$
Proof:

Note that

\begin{align} \sum_{i=1}^n (x_i - m)^2 & = \sum_{i=1}^n (x_i^2 - 2 m x_i + m^2) = \sum_{i=1}^n x_i^2 - 2 m \sum_{i=1}^n x_i + \sum_{i=1}^n m^2\\ & = \sum_{i=1}^n x_i^2 - 2 n m^2 + n m^2 = \sum_{i=1}^n x_i^2 - n m^2 \end{align}

Dividing by $$n - 1$$ gives the result.

If we let $$\bs{x}^2 = (x_1^2, x_2^2, \ldots, x_n^2)$$ denote the sample from the variable $$x^2$$, then the computational formula in the last theorem can be written succinctly as

$s^2(\bs{x}) = \frac{n}{n - 1} [m(\bs{x}^2) - m^2(\bs{x})]$
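The agreement between the definition and the computational formula can be checked with exact rational arithmetic. This is a minimal sketch with a hypothetical data vector; `Fraction` is used so that the comparison is exact rather than subject to rounding.

```python
from fractions import Fraction

# Hypothetical data, stored as exact rationals.
x = [Fraction(v) for v in (3, 1, 4, 1, 5, 9)]
n = len(x)
m = sum(x) / n

# Definition: average squared deviation, dividing by n - 1.
definition = sum((xi - m) ** 2 for xi in x) / (n - 1)

# Computational formula: sum of squares, then subtract n m^2 / (n - 1).
shortcut = sum(xi ** 2 for xi in x) / (n - 1) - Fraction(n, n - 1) * m ** 2

# Succinct form: n / (n - 1) times [m(x^2) - m(x)^2].
m_sq = sum(xi ** 2 for xi in x) / n
succinct = Fraction(n, n - 1) * (m_sq - m ** 2)

print(definition, shortcut, succinct)   # 269/30 269/30 269/30
```

With floating point numbers rather than exact rationals, the computational formula subtracts two quantities that can be nearly equal, so it may lose accuracy when the mean is large relative to the spread.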

The following theorem gives another computational formula for the sample variance, directly in terms of the variables and thus without the computation of an intermediate statistic.

The sample variance can be computed as

$s^2 = \frac{1}{2 n (n - 1)} \sum_{i=1}^n \sum_{j=1}^n (x_i - x_j)^2$
Proof:

Note that

\begin{align} \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n (x_i - x_j)^2 & = \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n (x_i - m + m - x_j)^2 \\ & = \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n [(x_i - m)^2 + 2 (x_i - m)(m - x_j) + (m - x_j)^2] \\ & = \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n (x_i - m)^2 + \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n (x_i - m)(m - x_j) + \frac{1}{2 n} \sum_{i=1}^n \sum_{j=1}^n (m - x_j)^2 \\ & = \frac{1}{2} \sum_{i=1}^n (x_i - m)^2 + 0 + \frac{1}{2} \sum_{j=1}^n (m - x_j)^2 \\ & = \sum_{i=1}^n (x_i - m)^2 \end{align}

Dividing by $$n - 1$$ gives the result.
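The pairwise formula can be verified in the same way. The sketch below, again with hypothetical data and exact rational arithmetic, computes the double sum over all ordered pairs and compares it with the definition.

```python
from fractions import Fraction

# Hypothetical data, stored as exact rationals.
x = [Fraction(v) for v in (3, 1, 4, 1, 5, 9)]
n = len(x)
m = sum(x) / n

# Definition of the sample variance.
definition = sum((xi - m) ** 2 for xi in x) / (n - 1)

# Pairwise formula: no intermediate statistic such as m is needed.
pairwise = sum((xi - xj) ** 2 for xi in x for xj in x) / (2 * n * (n - 1))

print(definition, pairwise)   # 269/30 269/30
```

Note that the double sum has $$n^2$$ terms, so this form is mainly of theoretical interest for large samples.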

The sample variance is nonnegative:

1. $$s^2 \ge 0$$
2. $$s^2 = 0$$ if and only if $$x_i = x_j$$ for each $$i, \; j \in \{1, 2, \ldots, n\}$$.
Proof:

Part (a) is obvious. For part (b) note that if $$s^2 = 0$$ then $$x_i = m$$ for each $$i$$. Conversely, if $$\bs{x}$$ is a constant vector, then $$m$$ is that same constant.

Thus, $$s^2 = 0$$ if and only if the data set is constant (and then, of course, the mean is the common value).

If $$c$$ is a constant then

1. $$s^2(c \, \bs{x}) = c^2 \, s^2(\bs{x})$$
2. $$s(c \, \bs{x}) = |c| \, s(\bs{x})$$
Proof:

For part (a), recall that $$m(c \bs{x}) = c m(\bs{x})$$. Hence

$s^2(c \bs{x}) = \frac{1}{n - 1}\sum_{i=1}^n [c x_i - c m(\bs{x})]^2 = \frac{1}{n - 1} \sum_{i=1}^n c^2 [x_i - m(\bs{x})]^2 = c^2 s^2(\bs{x})$

If $$\bs{c}$$ is a sample of size $$n$$ from a constant $$c$$ then

1. $$s^2(\bs{x} + \bs{c}) = s^2(\bs{x})$$.
2. $$s(\bs{x} + \bs{c}) = s(\bs{x})$$
Proof:

Recall that $$m(\bs{x} + \bs{c}) = m(\bs{x}) + c$$. Hence

$s^2(\bs{x} + \bs{c}) = \frac{1}{n - 1} \sum_{i=1}^n \{(x_i + c) - [m(\bs{x}) + c]\}^2 = \frac{1}{n - 1} \sum_{i=1}^n [x_i - m(\bs{x})]^2 = s^2(\bs{x})$

As a special case of these results, suppose that $$\bs{x} = (x_1, x_2, \ldots, x_n)$$ is a sample of size $$n$$ corresponding to a real variable $$x$$, and that $$a$$ and $$b$$ are constants. The sample corresponding to the variable $$y = a + b x$$, in our vector notation, is $$\bs{a} + b \bs{x}$$. Then $$m(\bs{a} + b \bs{x}) = a + b m(\bs{x})$$ and $$s(\bs{a} + b \bs{x}) = |b| s(\bs{x})$$. Linear transformations of this type, when $$b \gt 0$$, arise frequently when physical units are changed. In this case, the transformation is often called a location-scale transformation; $$a$$ is the location parameter and $$b$$ is the scale parameter. For example, if $$x$$ is the length of an object in inches, then $$y = 2.54 x$$ is the length of the object in centimeters. If $$x$$ is the temperature of an object in degrees Fahrenheit, then $$y = \frac{5}{9}(x - 32)$$ is the temperature of the object in degrees Celsius.
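The effect of a location-scale transformation on the mean and standard deviation can be illustrated with the temperature conversion. The temperatures below are hypothetical; the sketch compares the statistics of the transformed data with the values predicted by the formulas.

```python
from statistics import mean, stdev

# Hypothetical temperatures in degrees Fahrenheit.
x = [95.0, 104.0, 113.0, 122.0, 131.0]

# y = a + b x with a = -160/9 and b = 5/9, that is, y = (5/9)(x - 32).
a, b = -160 / 9, 5 / 9
y = [a + b * xi for xi in x]

# Statistics computed directly from the transformed data.
print(mean(y), stdev(y))

# Statistics predicted by m(a + b x) = a + b m(x) and s(a + b x) = |b| s(x).
print(a + b * mean(x), abs(b) * stdev(x))
```

The two printed lines should agree up to floating point rounding: the location parameter shifts the mean but has no effect on the standard deviation.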

Now, for $$i \in \{1, 2, \ldots, n\}$$, let $$z_i = (x_i - m) / s$$. The number $$z_i$$ is the standard score associated with $$x_i$$. Note that since $$x_i$$, $$m$$, and $$s$$ have the same physical units, the standard score $$z_i$$ is dimensionless (that is, has no physical units); it measures the directed distance from the mean $$m$$ to the data value $$x_i$$ in standard deviations.

The sample of standard scores $$\bs{z} = (z_1, z_2, \ldots, z_n)$$ has mean 0 and variance 1. That is,

1. $$m(\bs{z}) = 0$$
2. $$s^2(\bs{z}) = 1$$
Proof:

These results follow from the scaling and translation results above (Theorems 7 and 8). In vector notation, note that $$\bs{z} = (\bs{x} - \bs{m})/s$$. Hence $$m(\bs{z}) = (m - m) / s = 0$$ and $$s(\bs{z}) = s / s = 1$$.
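A short computation confirms the result for standard scores. The data vector is hypothetical, and the comparison is only up to floating point rounding.

```python
from statistics import mean, stdev

# Hypothetical data, chosen only for illustration.
x = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
m, s = mean(x), stdev(x)

# Standard scores: directed distance from the mean in standard deviations.
z = [(xi - m) / s for xi in x]

print(z)
print(mean(z), stdev(z))   # approximately 0 and 1
```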

#### Approximating the Variance

Suppose that instead of the actual data $$\bs{x}$$, we have a frequency distribution corresponding to a partition with classes (intervals) $$(A_1, A_2, \ldots, A_k)$$, class marks (midpoints of the intervals) $$(t_1, t_2, \ldots, t_k)$$, and frequencies $$(n_1, n_2, \ldots, n_k)$$. Recall that the relative frequency of class $$A_j$$ is $$p_j = n_j / n$$. In this case, approximate values of the sample mean and variance are, respectively,

\begin{align} m & = \frac{1}{n} \sum_{j=1}^k n_j \, t_j = \sum_{j = 1}^k p_j \, t_j \\ s^2 & = \frac{1}{n - 1} \sum_{j=1}^k n_j (t_j - m)^2 = \frac{n}{n - 1} \sum_{j=1}^k p_j (t_j - m)^2 \end{align}

These approximations are based on the hope that the data values in each class are well represented by the class mark. In fact, these are the standard definitions of sample mean and variance for the data set in which $$t_j$$ occurs $$n_j$$ times for each $$j$$.
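The approximation formulas can be sketched as follows. The frequency distribution here is hypothetical (three classes with marks 5, 15, and 30), and the last lines check the remark that the formulas agree with the ordinary mean and variance of the data set in which $$t_j$$ occurs $$n_j$$ times.

```python
from statistics import mean, variance

# Hypothetical frequency distribution: class marks t_j and frequencies n_j.
marks = [5.0, 15.0, 30.0]   # midpoints of (0, 10], (10, 20], (20, 40]
freqs = [4, 10, 6]

n = sum(freqs)

# Approximate sample mean and variance from the grouped data.
m = sum(nj * tj for nj, tj in zip(freqs, marks)) / n
s2 = sum(nj * (tj - m) ** 2 for nj, tj in zip(freqs, marks)) / (n - 1)

# The data set in which each class mark t_j occurs n_j times.
expanded = [tj for nj, tj in zip(freqs, marks) for _ in range(nj)]

print(m, mean(expanded))
print(s2, variance(expanded))
```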

### Exercises

#### Basic Properties

Suppose that $$x$$ is the temperature (in degrees Fahrenheit) for a certain type of electronic component after 10 hours of operation. A sample of 30 components has mean 113° and standard deviation $$18°$$.

1. Classify $$x$$ by type and level of measurement.
2. Find the sample mean and standard deviation if the temperature is converted to degrees Celsius. The transformation is $$y = \frac{5}{9}(x - 32)$$.
1. continuous, interval
2. $$m = 45°$$, $$s = 10°$$

Suppose that $$x$$ is the length (in inches) of a machined part in a manufacturing process. A sample of 50 parts has mean 10.0 and standard deviation 2.0.

1. Classify $$x$$ by type and level of measurement.
2. Find the sample mean and standard deviation if length is measured in centimeters. The transformation is $$y = 2.54 x$$.
1. continuous, ratio
2. $$m = 25.4$$, $$s = 5.08$$

Professor Moriarity has a class of 25 students in her section of Stat 101 at Enormous State University (ESU). The mean grade on the first midterm exam was 64 (out of a possible 100 points) and the standard deviation was 16. Professor Moriarity thinks the grades are a bit low and is considering various transformations for increasing the grades. In each case below give the mean and standard deviation of the transformed grades, or state that there is not enough information.

1. Add 10 points to each grade, so the transformation is $$y = x + 10$$.
2. Multiply each grade by 1.2, so the transformation is $$z = 1.2 x$$
3. Use the transformation $$w = 10 \sqrt{x}$$. Note that this is a non-linear transformation that curves the grades greatly at the low end and very little at the high end. For example, a grade of 100 is still 100, but a grade of 36 is transformed to 60.

One of the students did not study at all, and received a 10 on the midterm. Professor Moriarity considers this score to be an outlier.

4. Find the mean and standard deviation if this score is omitted.
1. $$m = 74$$, $$s = 16$$
2. $$m = 76.8$$, $$s = 19.2$$
3. Not enough information
4. $$m = 66.25$$, $$s = 11.62$$

#### Computational Exercises

All statistical software packages will compute means, variances and standard deviations, draw dotplots and histograms, and in general perform the numerical and graphical procedures discussed in this section. For real statistical experiments, particularly those with large data sets, the use of statistical software is essential. On the other hand, there is some value in performing the computations by hand, with small, artificial data sets, in order to master the concepts and definitions. In this subsection, do the computations and draw the graphs with minimal technological aids.

Suppose that $$x$$ is the number of math courses completed by an ESU student. A sample of 10 ESU students gives the data $$\bs{x} = (3, 1, 2, 0, 2, 4, 3, 2, 1, 2)$$.

1. Classify $$x$$ by type and level of measurement.
2. Sketch the dotplot.
3. Construct a table with rows corresponding to cases and columns corresponding to $$i$$, $$x_i$$, $$x_i - m$$, and $$(x_i - m)^2$$. Add rows at the bottom in the $$i$$ column for totals and means.
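1. discrete, ratio
2. The table is as follows:

| $$i$$ | $$x_i$$ | $$x_i - m$$ | $$(x_i - m)^2$$ |
|---|---|---|---|
| $$1$$ | $$3$$ | $$1$$ | $$1$$ |
| $$2$$ | $$1$$ | $$-1$$ | $$1$$ |
| $$3$$ | $$2$$ | $$0$$ | $$0$$ |
| $$4$$ | $$0$$ | $$-2$$ | $$4$$ |
| $$5$$ | $$2$$ | $$0$$ | $$0$$ |
| $$6$$ | $$4$$ | $$2$$ | $$4$$ |
| $$7$$ | $$3$$ | $$1$$ | $$1$$ |
| $$8$$ | $$2$$ | $$0$$ | $$0$$ |
| $$9$$ | $$1$$ | $$-1$$ | $$1$$ |
| $$10$$ | $$2$$ | $$0$$ | $$0$$ |
| Total | 20 | 0 | 12 |
| Mean | 2 | 0 | $$12/9$$ |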

Suppose that a sample of size 12 from a discrete variable $$x$$ has empirical density function given by $$f(-2) = 1/12$$, $$f(-1) = 1/4$$, $$f(0) = 1/3$$, $$f(1) = 1/6$$, $$f(2) = 1/6$$.

1. Sketch the graph of $$f$$.
2. Compute the sample mean and variance.
3. Give the sample values, ordered from smallest to largest.
1. $$m = 1/12$$, $$s^2 = 203/132$$
2. $$(-2, -1, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2)$$

The following table gives a frequency distribution for the commuting distance to the math/stat building (in miles) for a sample of ESU students.

| Class | Freq | Rel Freq | Density | Cum Freq | Cum Rel Freq | Midpoint |
|---|---|---|---|---|---|---|
| $$(0, 2]$$ | 6 | | | | | |
| $$(2, 6]$$ | 16 | | | | | |
| $$(6, 10]$$ | 18 | | | | | |
| $$(10, 20]$$ | 10 | | | | | |
| Total | | | | | | |

1. Complete the table.
2. Sketch the density histogram.
3. Sketch the cumulative relative frequency ogive.
4. Compute an approximation to the mean and standard deviation.
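1. The completed table:

| Class | Freq | Rel Freq | Density | Cum Freq | Cum Rel Freq | Midpoint |
|---|---|---|---|---|---|---|
| $$(0, 2]$$ | 6 | 0.12 | 0.06 | 6 | 0.12 | 1 |
| $$(2, 6]$$ | 16 | 0.32 | 0.08 | 22 | 0.44 | 4 |
| $$(6, 10]$$ | 18 | 0.36 | 0.09 | 40 | 0.80 | 8 |
| $$(10, 20]$$ | 10 | 0.20 | 0.02 | 50 | 1 | 15 |
| Total | 50 | 1 | | | | |

2. $$m = 7.28$$, $$s = 4.549$$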

#### Error Function Exercises

In the error function applet, select root mean square error. As you add points, note the shape of the graph of the error function, the value that minimizes the function, and the minimum value of the function.

In the error function applet, select mean absolute error. As you add points, note the shape of the graph of the error function, the values that minimize the function, and the minimum value of the function.

Suppose that our data vector is $$(2, 1, 5, 7)$$. Explicitly give $$\mae$$ as a piecewise function and sketch its graph. Note that

1. All values of $$a \in [2, 5]$$ minimize $$\mae$$.
2. $$\mae$$ is not differentiable at $$a \in \{1, 2, 5, 7\}$$.

Suppose that our data vector is $$(3, 5, 1)$$. Explicitly give $$\mae$$ as a piecewise function and sketch its graph. Note that

1. $$\mae$$ is minimized at $$a = 3$$.
2. $$\mae$$ is not differentiable at $$a \in \{1, 3, 5\}$$.

#### Data Analysis Exercises

Statistical software should be used for the problems in this subsection.

Consider the petal length and species variables in Fisher's iris data.

1. Classify the variables by type and level of measurement.
2. Compute the sample mean and standard deviation, and plot a density histogram for petal length.
3. Compute the sample mean and standard deviation, and plot a density histogram for petal length by species.
1. petal length: continuous, ratio. species: discrete, nominal
2. $$m = 37.8$$, $$s = 17.8$$
3. $$m(0) = 14.6$$, $$s(0) = 1.7$$; $$m(1) = 55.5$$, $$s(1) = 30.5$$; $$m(2) = 43.2$$, $$s(2) = 28.7$$

Consider the erosion variable in the Challenger data set.

1. Classify the variable by type and level of measurement.
2. Compute the mean and standard deviation
3. Plot a density histogram with the classes $$[0, 5)$$, $$[5, 40)$$, $$[40, 50)$$, $$[50, 60)$$.
1. continuous, ratio
2. $$m = 7.7$$, $$s = 17.2$$

Consider Michelson's velocity of light data.

1. Classify the variable by type and level of measurement.
2. Plot a density histogram.
3. Compute the sample mean and standard deviation.
4. Find the sample mean and standard deviation if the variable is converted to $$\text{km}/\text{s}$$. The transformation is $$y = x + 299\,000$$
1. continuous, interval
2. $$m = 852.4$$, $$s = 79.0$$
3. $$m = 299\,852.4$$, $$s = 79.0$$

Consider Short's parallax of the sun data.

1. Classify the variable by type and level of measurement.
2. Plot a density histogram.
3. Compute the sample mean and standard deviation.
4. Find the sample mean and standard deviation if the variable is converted to degrees. There are 3600 seconds in a degree.
5. Find the sample mean and standard deviation if the variable is converted to radians. There are $$\pi/180$$ radians in a degree.
1. continuous, ratio
2. $$m = 8.616$$, $$s = 0.749$$
3. $$m = 0.00239$$, $$s = 0.000208$$
4. $$m = 0.0000418$$, $$s = 0.00000363$$

Consider Cavendish's density of the earth data.

1. Classify the variable by type and level of measurement.
2. Compute the sample mean and standard deviation.
3. Plot a density histogram.
1. continuous, ratio
2. $$m = 5.448$$, $$s = 0.221$$

Consider the M&M data.

1. Classify the variables by type and level of measurement.
2. Compute the sample mean and standard deviation for each color count variable.
3. Compute the sample mean and standard deviation for the total number of candies.
4. Plot a relative frequency histogram for the total number of candies.
5. Compute the sample mean and standard deviation, and plot a density histogram for the net weight.
1. color counts: discrete ratio. net weight: continuous ratio.
2. $$m(r) = 9.60$$, $$s(r) = 4.12$$; $$m(g) = 7.40$$, $$s(g) = 0.57$$; $$m(bl) = 7.23$$, $$s(bl) = 4.35$$; $$m(o) = 6.63$$, $$s(o) = 3.69$$; $$m(y) = 13.77$$, $$s(y) = 6.06$$; $$m(br) = 12.47$$, $$s(br) = 5.13$$
3. $$m(n) = 57.10$$, $$s(n) = 2.4$$
4. $$m(w) = 49.215$$, $$s(w) = 1.522$$

Consider the body weight, species, and gender variables in the Cicada data.

1. Classify the variables by type and level of measurement.
2. Compute the relative frequency function for species and plot the graph.
3. Compute the relative frequency function for gender and plot the graph.
4. Compute the sample mean and standard deviation, and plot a density histogram for body weight.
5. Compute the sample mean and standard deviation, and plot a density histogram for body weight by species.
6. Compute the sample mean and standard deviation, and plot a density histogram for body weight by gender.
1. body weight: continuous, ratio. species: discrete, nominal. gender: discrete, nominal.
2. $$f(0) = 0.423$$, $$f(1) = 0.519$$, $$f(2) = 0.058$$
3. $$f(0) = 0.567$$, $$f(1) = 0.433$$
4. $$m = 0.180$$, $$s = 0.059$$
5. $$m(0) = 0.168$$, $$s(0) = 0.054$$; $$m(1) = 0.185$$, $$s(1) = 0.185$$; $$m(2) = 0.225$$, $$s(2) = 0.107$$
6. $$m(0) = 0.206$$, $$s(0) = 0.052$$; $$m(1) = 0.145$$, $$s(1) = 0.051$$

Consider Pearson's height data.

1. Classify the variables by type and level of measurement.
2. Compute the sample mean and standard deviation, and plot a density histogram for the height of the father.
3. Compute the sample mean and standard deviation, and plot a density histogram for the height of the son.
2. $$m(x) = 67.69$$, $$s(x) = 2.75$$
3. $$m(y) = 68.68$$, $$s(y) = 2.82$$