Suppose that we have a basic random experiment, and that $X$ is a real-valued random variable for the experiment with mean $\mu$ and standard deviation $\sigma$. Additionally, let
$$d_k = E\left[(X - \mu)^k\right]$$
denote the $k$th moment about the mean. In particular, note that $d_0 = 1$, $d_1 = 0$, and $d_2 = \sigma^2$. We assume that $d_4 < \infty$.
We repeat the basic experiment $n$ times to form a new, compound experiment, with a sequence of independent random variables $(X_1, X_2, \ldots, X_n)$, each with the same distribution as $X$. In statistical terms, $(X_1, X_2, \ldots, X_n)$ is a random sample of size $n$ from the distribution of $X$. Recall that the sample mean
$$M_n = \frac{1}{n} \sum_{i=1}^n X_i$$
is a natural measure of the center of the data and a natural estimator of the distribution mean $\mu$. In this section, we will derive statistics that are natural measures of the dispersion of the data and are natural estimators of the distribution variance $\sigma^2$. The statistics that we will derive are different, depending on whether $\mu$ is known or unknown; for this reason, $\mu$ is referred to as a nuisance parameter for the problem of estimating $\sigma^2$.
First we will assume that $\mu$ is known. Although this is almost always an artificial assumption, it is a nice place to start because the analysis is relatively easy. Let
$$W_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2$$
Show that $W_n^2$ is the sample mean for a random sample of size $n$ from the distribution of $(X - \mu)^2$.
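Use the result of Exercise 1 to show that
(a) $E(W_n^2) = \sigma^2$
(b) $\operatorname{var}(W_n^2) = (d_4 - \sigma^4) / n$
(c) $W_n^2 \to \sigma^2$ as $n \to \infty$ with probability 1
In particular, 2(a) means that $W_n^2$ is an unbiased estimator of $\sigma^2$.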
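The unbiasedness and variance claims for the special sample variance (the average squared deviation from the known mean) can be checked exactly on a small example. The following is a minimal sketch, not part of the original exercises: the three-point distribution and the sample size are hypothetical choices made only for illustration, and the expectations are computed by enumerating every possible sample with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

# Hypothetical distribution of X (an assumption for illustration only).
values = [0, 1, 3]
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
n = 3  # sample size

mu = sum(p * x for x, p in zip(values, probs))

def central_moment(k):
    """k-th moment about the mean, E[(X - mu)^k], computed exactly."""
    return sum(p * (x - mu) ** k for x, p in zip(values, probs))

sigma2, d4 = central_moment(2), central_moment(4)

def special_variance(sample):
    """Average squared deviation from the known distribution mean mu."""
    return sum((x - mu) ** 2 for x in sample) / len(sample)

# Enumerate every possible sample of size n together with its probability.
mean_w2 = Fraction(0)
mean_w2_sq = Fraction(0)
for idx in product(range(len(values)), repeat=n):
    prob = Fraction(1)
    for i in idx:
        prob *= probs[i]
    w2 = special_variance([values[i] for i in idx])
    mean_w2 += prob * w2
    mean_w2_sq += prob * w2 ** 2

var_w2 = mean_w2_sq - mean_w2 ** 2

print("E(W^2) equals sigma^2:", mean_w2 == sigma2)
print("var(W^2) equals (d4 - sigma^4)/n:", var_w2 == (d4 - sigma2 ** 2) / n)
```

Because the arithmetic is exact, the comparisons test the identities themselves rather than a simulation approximation.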
Use basic properties of covariance to show that $\operatorname{cov}(M_n, W_n^2) = d_3 / n$. It follows that the sample mean and the special sample variance are uncorrelated if $d_3 = 0$ and are asymptotically uncorrelated in any case.
The square root of the special sample variance is a special version of the sample standard deviation, denoted $W_n$.
Use Jensen's inequality to show that $E(W_n) \le \sigma$. Thus, $W_n$ is a biased estimator that tends to underestimate $\sigma$.
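To see the downward bias concretely, the expected value of the special sample standard deviation (the square root of the average squared deviation from the known mean) can be computed by enumeration and compared with the distribution standard deviation. This is only a sketch under assumed inputs: the three-point distribution below is hypothetical, and the sample sizes are kept small so that full enumeration stays cheap.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

# Hypothetical distribution of X (an assumption for illustration only).
values = [0, 1, 3]
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]

mu = sum(p * x for x, p in zip(values, probs))
sigma = sqrt(sum(p * (x - mu) ** 2 for x, p in zip(values, probs)))

def expected_w(n):
    """E of the square root of the special sample variance, sample size n."""
    total = 0.0
    for idx in product(range(len(values)), repeat=n):
        prob = Fraction(1)
        for i in idx:
            prob *= probs[i]
        w2 = sum((values[i] - mu) ** 2 for i in idx) / n
        total += float(prob) * sqrt(w2)
    return total

for n in (1, 2, 4, 8):
    print(n, expected_w(n), sigma)
```

For this example each computed expectation falls below the distribution standard deviation, as Jensen's inequality predicts for the concave square root function.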
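Show that if $c$ is a constant then the special sample variance of the scaled sample $(c X_1, c X_2, \ldots, c X_n)$ is $c^2 W_n^2$.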
Consider now the more realistic case in which $\mu$ is unknown. In this case, a natural approach is to average, in some sense, $(X_i - M_n)^2$ over $i = 1, 2, \ldots, n$. It might seem that we should average by dividing by $n$. However, another approach is to divide by whatever constant would give us an unbiased estimator of $\sigma^2$.
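Use basic algebra to show that
$$\sum_{i=1}^n (X_i - M_n)^2 = \sum_{i=1}^n (X_i - \mu)^2 - n (M_n - \mu)^2$$
Use the result in Exercise 6 and basic properties of expected value to show that
$$E\left[\sum_{i=1}^n (X_i - M_n)^2\right] = (n - 1) \sigma^2$$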
From Exercise 7, the random variable
$$S_n^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - M_n)^2$$
is an unbiased estimator of $\sigma^2$; it is called the sample variance. As a practical matter, when $n$ is large, it makes little difference whether we divide by $n$ or $n - 1$.
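The effect of the divisor can be illustrated exactly. The sketch below computes the expected sum of squared deviations from the sample mean by enumerating all samples, then divides by the sample size and by the sample size minus one. The distribution and the sample size are hypothetical choices, not taken from the text.

```python
from fractions import Fraction
from itertools import product

# Hypothetical distribution of X (an assumption for illustration only).
values = [0, 1, 3]
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
n = 4  # sample size

mu = sum(p * x for x, p in zip(values, probs))
sigma2 = sum(p * (x - mu) ** 2 for x, p in zip(values, probs))

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample)

# Exact expected value of the sum of squared deviations.
expected_ss = Fraction(0)
for idx in product(range(len(values)), repeat=n):
    prob = Fraction(1)
    for i in idx:
        prob *= probs[i]
    expected_ss += prob * sum_sq_dev([values[i] for i in idx])

print("sigma^2:            ", sigma2)
print("E[SS] / n (biased): ", expected_ss / n)
print("E[SS] / (n - 1):    ", expected_ss / (n - 1))
```

Dividing by the sample size gives an estimator whose expected value is the distribution variance scaled by $(n - 1)/n$, which is why the two divisors matter less and less as the sample grows.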
The following alternate formula follows immediately from Exercise 6, and it is better for some purposes.
Show that
$$S_n^2 = \frac{1}{n - 1} \sum_{i=1}^n X_i^2 - \frac{n}{n - 1} M_n^2$$
Use the formula in the previous exercise and the (strong) law of large numbers to show that $S_n^2 \to \sigma^2$ as $n \to \infty$ with probability 1.
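The convergence of the sample variance to the distribution variance can be observed numerically. The following sketch is a simulation, so it illustrates the result rather than proving it. It assumes a normal distribution with mean 10 and standard deviation 2 and a fixed seed, all of which are hypothetical choices.

```python
import random
import statistics

# Assumed parameters and seed, chosen only for illustration.
rng = random.Random(12345)
mu, sigma = 10.0, 2.0

data = [rng.gauss(mu, sigma) for _ in range(100_000)]

# Sample variance (divisor n - 1) of the first n observations.
for n in (10, 100, 1000, 10_000, 100_000):
    print(n, statistics.variance(data[:n]))

print("distribution variance:", sigma ** 2)
```

`statistics.variance` uses the divisor $n - 1$, so it matches the sample variance as defined in this section.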
Show that if $c$ is a constant then the sample variance of the scaled sample $(c X_1, c X_2, \ldots, c X_n)$ is $c^2 S_n^2$.
Show that
The square root of the sample variance is the sample standard deviation, denoted $S_n$.
Use Jensen's inequality to show that $E(S_n) \le \sigma$. Thus, $S_n$ is a biased estimator that tends to underestimate $\sigma$.
In this section we will derive formulas for the variance of the sample variance and the covariance between the sample mean and the sample variance. Our first series of exercises will show that
$$\operatorname{var}(S_n^2) = \frac{1}{n} \left(d_4 - \frac{n - 3}{n - 1} \sigma^4\right)$$
Verify the following result. Hint: Start with the expression on the right. Expand the term $(X_i - X_j)^2$, and take the sums term by term.
$$S_n^2 = \frac{1}{2 n (n - 1)} \sum_{i=1}^n \sum_{j=1}^n (X_i - X_j)^2$$
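The sample variance can be written in terms of pairwise differences only, as one half of the average of the squared differences over all ordered pairs of distinct observations. A quick exact check of that identity on a small data set is sketched below. The data values are hypothetical, and `statistics.variance` is used as the reference because it uses the divisor $n - 1$.

```python
from fractions import Fraction
from statistics import variance

# Hypothetical data, stored as exact fractions so the comparison is exact.
data = [Fraction(x) for x in (2, 3, 5, 7, 11)]
n = len(data)

# Double sum of squared pairwise differences (terms with i == j are zero).
pairwise = sum((xi - xj) ** 2 for xi in data for xj in data) / (2 * n * (n - 1))

print("pairwise-difference form:", pairwise)
print("usual sample variance:   ", variance(data))
```

This form never refers to the sample mean, which is what makes it convenient for computing the variance of the sample variance.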
It follows that $\operatorname{var}(S_n^2)$ is the sum of all of the pairwise covariances of the terms in the expansion of Exercise 13.
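Suppose that $i \ne j$. Verify the following results. (Hint: In $(X_i - X_j)^2$, add and subtract $\mu$, and then expand and use independence.)
(a) $\operatorname{cov}\left[(X_i - X_j)^2, (X_k - X_l)^2\right] = 0$ if $k = l$ or if $i$, $j$, $k$, $l$ are distinct
(b) $\operatorname{cov}\left[(X_i - X_j)^2, (X_k - X_l)^2\right] = 2 d_4 + 2 \sigma^4$ if $k \ne l$ and $\{k, l\} = \{i, j\}$
(c) $\operatorname{cov}\left[(X_i - X_j)^2, (X_k - X_l)^2\right] = d_4 - \sigma^4$ if $k \ne l$ and $\{k, l\}$ has exactly one element in common with $\{i, j\}$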
Finally, derive the formula for $\operatorname{var}(S_n^2)$ by counting how many covariance terms of each type appear in the sum.
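One way the final assembly might go is sketched below. It assumes the covariance between two squared differences is $2 d_4 + 2 \sigma^4$ when both indices are shared, $d_4 - \sigma^4$ when exactly one index is shared, and $0$ otherwise. It also assumes the counts $2 n (n - 1)$ and $4 n (n - 1)(n - 2)$ of ordered index pairs of the first two types, which should be verified as part of the exercise.

```latex
\begin{align*}
\operatorname{var}(S_n^2)
  &= \frac{1}{4 n^2 (n-1)^2} \sum_{i \ne j} \sum_{k \ne l}
     \operatorname{cov}\left[(X_i - X_j)^2, (X_k - X_l)^2\right] \\
  &= \frac{2 n (n-1) (2 d_4 + 2 \sigma^4) + 4 n (n-1)(n-2) (d_4 - \sigma^4)}{4 n^2 (n-1)^2} \\
  &= \frac{(n-1) d_4 - (n-3) \sigma^4}{n (n-1)}
   = \frac{1}{n} \left(d_4 - \frac{n-3}{n-1} \sigma^4\right)
\end{align*}
```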
Show that $\operatorname{var}(S_n^2) > \operatorname{var}(W_n^2)$. Does this seem reasonable?
Show that $\operatorname{var}(S_n^2) \to 0$ as $n \to \infty$.
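Use similar techniques to show that $\operatorname{cov}(M_n, S_n^2) = d_3 / n$. In particular, note that $\operatorname{cov}(M_n, S_n^2) = \operatorname{cov}(M_n, W_n^2)$. Again, the sample mean and variance are uncorrelated if $d_3 = 0$, and asymptotically uncorrelated otherwise.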
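The formulas for the variance of the sample variance and for the covariance between the sample mean and the sample variance can be checked exactly on a small example. The sketch below assumes a hypothetical three-point distribution and sample size. The helper `expect` is an illustrative name, not something from the text, and it computes expectations by enumerating every sample.

```python
from fractions import Fraction
from itertools import product

# Hypothetical distribution of X (an assumption for illustration only).
values = [0, 1, 3]
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
n = 4  # sample size

mu = sum(p * x for x, p in zip(values, probs))

def central_moment(k):
    """k-th moment about the mean, E[(X - mu)^k], computed exactly."""
    return sum(p * (x - mu) ** k for x, p in zip(values, probs))

sigma2, d3, d4 = central_moment(2), central_moment(3), central_moment(4)

def expect(f):
    """Exact E[f(sample)], enumerating all samples of size n."""
    total = Fraction(0)
    for idx in product(range(len(values)), repeat=n):
        prob = Fraction(1)
        for i in idx:
            prob *= probs[i]
        total += prob * f([values[i] for i in idx])
    return total

def m(sample):
    """Sample mean."""
    return Fraction(sum(sample), len(sample))

def s2(sample):
    """Sample variance with divisor n - 1."""
    mean = m(sample)
    return sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)

var_s2 = expect(lambda s: s2(s) ** 2) - expect(s2) ** 2
cov_m_s2 = expect(lambda s: m(s) * s2(s)) - expect(m) * expect(s2)

print("var(S^2) matches formula:", var_s2 == (d4 - Fraction(n - 3, n - 1) * sigma2 ** 2) / n)
print("cov(M, S^2) equals d3/n: ", cov_m_s2 == d3 / n)
```

The chosen distribution is skewed, so its third central moment is nonzero and the covariance check is not trivially zero.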
Many of the applets in this project are simulations of experiments with a basic random variable of interest. When you run the simulation, you are performing independent replications of the experiment. In most cases, the applet displays the standard deviation of the distribution, both numerically in a table and graphically as the radius of the blue horizontal bar in the graph box. As the simulation runs, the sample standard deviation is also displayed numerically in the table and graphically as the radius of the red horizontal bar in the graph box.
In the binomial coin experiment, the random variable is the number of heads. Run the simulation 1000 times updating every 10 runs and note the apparent convergence of the sample standard deviation to the distribution standard deviation.
In the simulation of the matching experiment, the random variable is the number of matches. Run the simulation 1000 times updating every 10 runs and note the apparent convergence of the sample standard deviation to the distribution standard deviation.
Run the simulation of the exponential experiment 1000 times with an update frequency of 10. Note the apparent convergence of the sample standard deviation to the distribution standard deviation.
The sample mean and standard deviation are often computed in exploratory data analysis, as measures of the center and spread of the data, respectively.
Compute the sample mean and standard deviation for Michelson's velocity of light data.
Compute the sample mean and standard deviation for Cavendish's density of the earth data.
Compute the sample mean and standard deviation of the net weight in the M&M data.
Compute the sample mean and standard deviation of the petal length variable for the following cases in Fisher's iris data. Compare the results.
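Suppose that instead of the actual data, we have a frequency distribution with classes $(A_1, A_2, \ldots, A_k)$, class marks $(x_1, x_2, \ldots, x_k)$, and frequencies $(n_1, n_2, \ldots, n_k)$. Thus,
$$n_j = \#\{i \in \{1, 2, \ldots, n\} : X_i \in A_j\}, \quad j = 1, 2, \ldots, k$$
In this case, approximate values of the sample mean and variance are, respectively,
$$M = \frac{1}{n} \sum_{j=1}^k n_j x_j, \qquad S^2 = \frac{1}{n - 1} \sum_{j=1}^k n_j (x_j - M)^2$$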
These approximations are based on the hope that the data values in each class are well represented by the class mark.
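A minimal sketch of the grouped-data computation follows. The function name `grouped_mean_variance` and the frequency distribution are hypothetical, introduced only for illustration. The check at the end expands the frequency table into a data set in which every value equals its class mark, which is exactly the situation in which the approximation is exact.

```python
import math
import statistics

def grouped_mean_variance(marks, freqs):
    """Approximate sample mean and variance from class marks and frequencies."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(marks, freqs)) / n
    var = sum(f * (x - mean) ** 2 for x, f in zip(marks, freqs)) / (n - 1)
    return mean, var

# Hypothetical frequency distribution (an assumption for illustration only).
marks = [1.0, 2.0, 3.0, 4.0]
freqs = [2, 5, 2, 1]

mean, var = grouped_mean_variance(marks, freqs)
print("mean:", mean)
print("variance:", var)
print("standard deviation:", math.sqrt(var))

# Expand to one value per observation, each equal to its class mark.
expanded = [x for x, f in zip(marks, freqs) for _ in range(f)]
print("check against expanded data:", statistics.variance(expanded))
```

With real data, the grouped results differ from the exact ones by an amount that depends on how far the values stray from their class marks.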
In the interactive histogram, select mean and standard deviation. Set the class width to 0.1 and construct a frequency distribution with at least 6 nonempty classes and at least 10 values. Compute the mean, variance, and standard deviation by hand, and verify that you get the same results as the applet.
In the interactive histogram, select mean and standard deviation. Set the class width to 0.1 and construct a distribution with at least 30 values of each of the types indicated below. Then increase the class width to each of the other four values. As you perform these operations, note the position and size of the mean ± standard deviation bar.
In the interactive histogram, construct a distribution that has the largest possible standard deviation.
Based on your answer to Exercise 28, characterize the distributions (on a fixed interval $[a, b]$) that have the largest possible standard deviation.