Recall that by taking the expected value of various transformations of a random variable, we can measure many interesting characteristics of the distribution of the variable. In this section, we will study expected values that measure spread, skewness and other properties.
As usual, we start with a random experiment with probability measure \(\P\) on an underlying sample space. Suppose that \(X\) is a random variable for the experiment, taking values in \(S \subseteq \R\). Recall that \( \E(X) \), the expected value (or mean) of \(X\), gives the center of the distribution of \(X\). The variance of \(X\) is a measure of the spread of the distribution about the mean and is defined by
\[ \var(X) = \E\left([X - \E(X)]^2\right) \]Recall that the second moment of \(X\) about \(a\) is \(\E[(X - a)^2]\). Thus, the variance is the second moment of \(X\) about \(\mu = \E(X)\), or equivalently, the second central moment of \(X\). Second moments have a nice interpretation in physics, if we think of the distribution of \(X\) as a mass distribution in \(\R\). Then the second moment of \(X\) about \(a\) is the moment of inertia of the mass distribution about \(a\). This is a measure of the resistance of the mass distribution to any change in its rotational motion about \(a\). In particular, the variance of \(X\) is the moment of inertia of the mass distribution about the center of mass \(\mu\).
Suppose that \(X\) has a discrete distribution with probability density function \(f\) and mean \(\mu\). Then
\[ \var(X) = \sum_{x \in S} (x - \mu)^2 f(x) \]This follows from the discrete version of the change of variables theorem.
Suppose that \(X\) has a continuous distribution with probability density function \(f\) and mean \(\mu\). Then
\[ \var(X) = \int_S (x - \mu)^2 f(x) \, dx \]This follows from the continuous version of the change of variables theorem.
The standard deviation of \(X\) is the square root of the variance. It also measures dispersion about the mean but has the same physical units as the variable \(X\).
\[ \sd(X) = \sqrt{\var(X)} \]When the random variable \(X\) is understood, the standard deviation is often denoted by \(\sigma\), so that the variance is \(\sigma^2\).
The following exercises give some basic properties of variance, which in turn rely on basic properties of expected value. As usual, we assume that the stated expected values exist. Our first result is a variance formula that is usually better than the definition for computational purposes.
\(\var(X) = \E(X^2) - [\E(X)]^2\).
Let \( \mu = \E(X) \). Using the linearity of expected value we have
\[ \var(X) = \E\left[(X - \mu)^2\right] = \E(X^2 - 2 \mu X + \mu^2) = \E(X^2) - 2 \mu \E(X) + \mu^2 = \E(X^2) - 2 \mu^2 + \mu^2 = \E(X^2) - \mu^2 \]Variance is always nonnegative, since it's the expected value of a nonnegative random variable. Moreover, any random variable that really is random (not a constant) will have strictly positive variance.
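The nonnegative property: (a) \(\var(X) \ge 0\), and (b) \(\var(X) = 0\) if and only if \(\P(X = c) = 1\) for some constant \(c\) (and then of course \(c = \E(X)\)).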
These results follow from the basic inequality properties of expected value. Let \( \mu = \E(X) \). First \( (X - \mu)^2 \ge 0 \) with probability 1 so \( \E[(X - \mu)^2] \ge 0 \). In addition, \( \E[(X - \mu)^2] = 0 \) if and only if \( \P(X = \mu) = 1 \).
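As a quick numerical check of the computational formula, the following minimal Python sketch computes the variance of a small discrete distribution both from the definition and from \(\E(X^2) - [\E(X)]^2\). The distribution and the function names are hypothetical, chosen only for illustration; exact fractions are used so that the two results can be compared without rounding error.

```python
from fractions import Fraction

def mean(pmf):
    """Expected value of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance_by_definition(pmf):
    """E([X - E(X)]^2), computed directly from the definition."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def variance_by_shortcut(pmf):
    """E(X^2) - [E(X)]^2, the computational formula."""
    second_moment = sum(x ** 2 * p for x, p in pmf.items())
    return second_moment - mean(pmf) ** 2

# Hypothetical example distribution (not from the text): values 0, 1, 3.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 4)}

print(mean(pmf), variance_by_definition(pmf), variance_by_shortcut(pmf))
# prints: 1 3/2 3/2

# A constant "random" variable has variance 0.
print(variance_by_definition({5: Fraction(1)}))
# prints: 0
```

Both routes give the same value, as the result above guarantees; the shortcut only needs the first two moments, which is usually why it is preferred for hand computation.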
Our next result shows how the variance and standard deviation are changed by a linear transformation of the random variable. In particular, note that variance, unlike general expected value, is not a linear operation. This is not really surprising since the variance is the expected value of a nonlinear function of the variable.
If \(a\) and \(b\) are constants then (a) \(\var(a + b X) = b^2 \var(X)\) and (b) \(\sd(a + b X) = |b| \sd(X)\).
Let \( \mu = \E(X) \). By linearity, \( \E(a + b X) = a + b \mu \). Hence \( \var(a + b X) = \E\left([(a + b X) - (a + b \mu)]^2\right) = \E\left(b^2 (X - \mu)^2\right) = b^2 \var(X) \). Part (b) follows from (a) by taking square roots.
Recall that when \( b \gt 0 \), the linear transformation \( x \mapsto a + b \, x \) is called a location-scale transformation and often corresponds to a change of location and change of scale in the physical units. For example, the change from inches to centimeters in a measurement of length is a scale transformation, and the change from Fahrenheit to Celsius in a measurement of temperature is both a location and scale transformation. The previous result shows that when a location-scale transformation is applied to a random variable, the standard deviation does not depend on the location parameter, but is multiplied by the scale factor.
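The effect of a linear transformation on the variance can be checked numerically. The sketch below, which uses a hypothetical three-point distribution and hypothetical constants \(a = 7\) and \(b = -2\), builds the distribution of \(a + b X\) and compares its variance with \(b^2 \var(X)\). The helper names are illustrative only.

```python
from fractions import Fraction

def variance(pmf):
    """Variance of a discrete distribution given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def transform(pmf, a, b):
    """Distribution of a + b X when X has the given pmf (assumes b != 0)."""
    return {a + b * x: p for x, p in pmf.items()}

# Hypothetical example distribution and constants (not from the text).
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 4)}
a, b = 7, -2

print(variance(transform(pmf, a, b)), b ** 2 * variance(pmf))
# prints: 6 6
```

Changing `a` leaves both numbers unchanged, which matches the statement that the spread does not depend on the location parameter.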
The random variable \(Z\) given below has mean 0 and variance 1:
\[ Z = \frac{X - \E(X)}{\sd(X)} \]This result follows from the previous theorem. Let \( \mu = \E(X) \) and \( \sigma = \sd(X) \) so that \( Z = \frac{1}{\sigma} (X - \mu) \). Then \( \E(Z) = \frac{1}{\sigma} [\E(X) - \mu] = 0 \) and \( \var(Z) = \frac{1}{\sigma^2} \var(X) = 1 \).
The random variable \(Z\) in Exercise 6 is sometimes called the standard score associated with \(X\). Since \(X\) and its mean and standard deviation all have the same physical units, the standard score \(Z\) is dimensionless. It measures the directed distance from \(\E(X)\) to \(X\) in terms of standard deviations.
Let \( Z \) denote the standard score of \( X \), and suppose that \( Y = a + b X \) where \( a, \; b \in \R \) and \( b \ne 0 \). If \( b \gt 0 \), the standard score of \( Y \) is \( Z \) and if \( b \lt 0 \), the standard score of \( Y \) is \( -Z \).
\( \E(Y) = a + b \E(X) \) and \( \sd(Y) = |b| \sd(X) \). Hence
\[ \frac{Y - \E(Y)}{\sd(Y)} = \frac{b}{|b|} \frac{X - \E(X)}{\sd(X)} \]As just noted, when \( b \gt 0 \), the variable \(Y = a + b X \) is a location-scale transformation and often corresponds to a change of physical units. Since the standard score is dimensionless, it's reasonable that the standard scores of \( X \) and \( Y \) are the same. On the other hand, when \(X \ge 0\), the ratio of standard deviation to mean is called the coefficient of variation. This quantity also is dimensionless, and is sometimes used to compare variability for random variables with different means.
\[ \text{cv}(X) = \frac{\sd(X)}{\E(X)} \]Chebyshev's inequality (named after Pafnuty Chebyshev) gives an upper bound on the probability that a random variable will be more than a specified distance from its mean. This is often useful in applied problems where the distribution is unknown, but the mean and variance are at least approximately known. In the following two exercises, suppose that \(X\) is a real-valued random variable with mean \(\mu = \E(X)\) and standard deviation \(\sigma = \sd(X)\).
Chebyshev's inequality:
\[ \P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}, \quad t \gt 0 \]From Markov's inequality, \(\P(|X - \mu| \ge t) = \P[(X - \mu)^2 \ge t^2] \le \E[(X - \mu)^2] / t^2 = \sigma^2 / t^2\).
An alternate version of Chebyshev's inequality is
\[\P(|X - \mu| \ge k \sigma) \le \frac{1}{k^2}, \quad k \gt 0 \]Let \( t = k \sigma \) in the first version of Chebyshev's inequality.
The usefulness of the Chebyshev inequality comes from the fact that it holds for any distribution (assuming only that the mean and variance exist). The tradeoff is that for many specific distributions, the Chebyshev bound is rather crude. Note in particular that in the last exercise, the bound is useless when \(k \le 1\), since 1 is an upper bound for the probability of any event.
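To see how crude the Chebyshev bound can be, the following sketch compares the exact value of \(\P(|X - \mu| \ge k \sigma)\) with the bound \(1 / k^2\) for a small discrete distribution. The distribution, the values of \(k\), and the function name are hypothetical choices for illustration. Squared distances are compared so that no square root is needed.

```python
from fractions import Fraction

def chebyshev_check(pmf, k):
    """Return (exact tail probability, Chebyshev bound) for P(|X - mu| >= k sigma)."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    # |x - mu| >= k sigma  is equivalent to  (x - mu)^2 >= k^2 var
    exact = sum(p for x, p in pmf.items() if (x - mu) ** 2 >= k ** 2 * var)
    bound = 1 / Fraction(k) ** 2
    return exact, bound

# Hypothetical example distribution (not from the text).
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 4)}

for k in (1, Fraction(3, 2), 2):
    exact, bound = chebyshev_check(pmf, k)
    print(k, exact, bound)
# prints:
# 1 1/4 1
# 3/2 1/4 4/9
# 2 0 1/4
```

In each case the exact probability is below the bound, as it must be, but the gap is substantial.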
Suppose that \(X\) is an indicator variable with \(p = \P(X = 1)\), where \(p \in [0, 1]\). Then (a) \(\E(X) = p\) and (b) \(\var(X) = p (1 - p)\).
We proved part (a) in the section on expected value, although the result is so simple that the derivation is trivial. For part (b), note that \( X^2 = X \) since \( X \) only takes values 0 and 1. Hence \( \E(X^2) = p \) and therefore \( \var(X) = p - p^2 = p (1 - p) \).
The graph of \(\var(X)\) as a function of \(p\) is a parabola, opening downward, with roots at 0 and 1. Thus the minimum value of \(\var(X)\) is 0, and occurs when \(p = 0\) or \(p = 1\) (when \( X \) is deterministic). The maximum value is \(\frac{1}{4}\) and occurs when \(p = \frac{1}{2}\).
Suppose that \(X\) has the discrete uniform distribution on \(\{m, m+1, \ldots, n\}\) where \(m \le n\). Then
Suppose that \(X\) has the continuous uniform distribution on the interval \([a, b]\) where \( a \lt b \). Then
Note that in both the discrete and continuous cases, the variance depends only on the length of the interval.
Recall that a fair die is one in which the faces are equally likely. In addition to fair dice, there are various types of crooked dice. Here are three:
A flat die, as the name suggests, is a die that is not a cube, but rather is shorter in one of the three directions. The particular probabilities that we use (\( \frac{1}{4} \) and \( \frac{1}{8} \)) are fictitious, but the essential property of a flat die is that the opposite faces on the shorter axis have slightly larger probabilities than the other four faces. Flat dice are sometimes used by gamblers to cheat. In the following problems, you will compute the mean and variance for each of the various types of dice. Be sure to compare the results.
A standard, fair die is thrown and the score \(X\) is recorded. Sketch the graph of the probability density function and compute each of the following:
An ace-six flat die is thrown and the score \(X\) is recorded. Sketch the graph of the probability density function and compute each of the following:
A two-five flat die is thrown and the score \(X\) is recorded. Sketch the graph of the probability density function and compute each of the following:
A three-four flat die is thrown and the score \(X\) is recorded. Sketch the graph of the probability density function and compute each of the following:
In the dice experiment, select one die. Run the experiment 1000 times and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation in each of the following cases:
Recall that the Poisson distribution has probability density function
\[ f(n) = e^{-a} \, \frac{a^n}{n!}, \quad n \in \N\]where \(a \gt 0\) is a parameter. The Poisson distribution is named after Simeon Poisson and is widely used to model the number of random points in a region of time or space; the parameter \(a\) is proportional to the size of the region. The Poisson distribution is studied in detail in the chapter on the Poisson Process.
Suppose that \(N\) has the Poisson distribution with parameter \(a\). Then (a) \(\E(N) = a\) and (b) \(\var(N) = a\).
Part (a) was shown in the section on expected value. For part (b), we compute the second factorial moment:
\[ \E[N (N - 1)] = \sum_{n=1}^\infty n (n - 1) e^{-a} \frac{a^n}{n!} = \sum_{n=2}^\infty e^{-a} \frac{a^n}{(n - 2)!} = e^{-a} a^2 \sum_{n=2}^\infty \frac{a^{n-2}}{(n - 2)!} = a^2 e^{-a} e^a = a^2\]Hence, \( \E(N^2) = \E[N(N - 1)] + \E(N) = a^2 + a \), so finally \( \var(N) = (a^2 + a) - a^2 = a \).
Thus, the parameter is both the mean and the variance of the distribution.
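The equality of mean and variance for the Poisson distribution can be confirmed numerically by summing the series directly. The sketch below truncates the sums at a fixed number of terms, which is an approximation; the function name, the truncation point, and the parameter value \(a = 3.5\) are hypothetical choices.

```python
from math import exp

def poisson_mean_var(a, terms=200):
    """Approximate the mean and variance of the Poisson distribution with
    parameter a by truncating the defining series after `terms` terms."""
    p = exp(-a)            # f(0) = e^{-a}
    mean = 0.0
    second_moment = 0.0
    for n in range(terms):
        mean += n * p
        second_moment += n * n * p
        p *= a / (n + 1)   # f(n + 1) = f(n) * a / (n + 1)
    return mean, second_moment - mean ** 2

m, v = poisson_mean_var(3.5)
print(m, v)   # both values should be very close to 3.5
```

The variance here is computed with the formula \(\E(N^2) - [\E(N)]^2\), so the sketch also illustrates the computational formula in a case where the sum is infinite.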
In the Poisson experiment, the parameter is \(a = r \, t\). Vary the parameter and note the size and location of the mean-standard deviation bar. For selected values of the parameter, run the experiment 1000 times and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.
Recall that the geometric distribution on \(\N_+\) is a discrete distribution with probability density function
\[ f(n) = p \, (1 - p)^{n - 1}, \quad n \in \N_+ \]where \(p \in (0, 1]\) is a parameter. The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials with success parameter \(p\).
Suppose that \(N\) has the geometric distribution on \(\N_+\) with success parameter \(p\). Then (a) \(\E(N) = 1 / p\) and (b) \(\var(N) = (1 - p) / p^2\).
We proved part (a) in the section on expected value. For part (b) we will compute the second factorial moment. Thus
\[ \E[N(N - 1)] = \sum_{n = 2}^\infty n (n - 1) (1 - p)^{n-1} p = p(1 - p) \frac{d^2}{dp^2} \sum_{n=0}^\infty (1 - p)^n = p (1 - p) \frac{d^2}{dp^2} \frac{1}{p} = p (1 - p) \frac{2}{p^3} = \frac{2 (1 - p)}{p^2}\]Hence \( \E(N^2) = \E[N(N - 1)] + \E(N) = 2 / p^2 - 1 / p \) and hence \( \var(N) = 2 / p^2 - 1 / p - 1 / p^2 = 1 / p^2 - 1 / p = (1 - p) / p^2 \).
Note that the variance is 0 when \(p = 1\), not surprising since \( N \) is deterministic in this case.
In the negative binomial experiment, set \(k = 1\) to get the geometric distribution. Vary \(p\) with the scroll bar and note the size and location of the mean-standard deviation bar. For selected values of \(p\), run the experiment 1000 times and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.
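For readers without access to the applet, the same kind of experiment can be imitated in a few lines of Python. The sketch below simulates Bernoulli trials until the first success and compares the empirical mean and variance of the trial number with \(1 / p\) and \(1 / p^2 - 1 / p\). It is only a rough stand-in for the applet: the function names, the seed, the number of runs, and the value \(p = 0.25\) are all hypothetical choices.

```python
import random

def first_success_trial(p, rng):
    """Run Bernoulli trials with success probability p and return the
    trial number of the first success."""
    n = 1
    while rng.random() >= p:   # a failure occurs with probability 1 - p
        n += 1
    return n

def empirical_mean_var(p, runs, seed=0):
    """Empirical mean and variance of the first-success trial number."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    sample = [first_success_trial(p, rng) for _ in range(runs)]
    m = sum(sample) / runs
    v = sum((x - m) ** 2 for x in sample) / runs
    return m, v

p = 0.25
m, v = empirical_mean_var(p, runs=100_000)
print(m, 1 / p)                  # empirical mean vs. distribution mean
print(v, 1 / p ** 2 - 1 / p)     # empirical variance vs. distribution variance
```

The empirical values vary with the seed and the number of runs, so only approximate agreement should be expected.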
Suppose that \(N\) has the geometric distribution with parameter \(p = \frac{3}{4}\). Compute the true value and the Chebyshev bound for the probability that \(N\) is at least 2 standard deviations away from the mean.
Recall that the exponential distribution is a continuous distribution with probability density function
\[ f(t) = r \, e^{-r \, t}, \quad 0 \le t \lt \infty \]where \(r \gt 0\) is the rate parameter. This distribution is widely used to model failure times and other arrival times. The exponential distribution is studied in detail in the chapter on the Poisson Process.
Suppose that \(T\) has the exponential distribution with rate parameter \(r\). Then
Thus, for the exponential distribution, the mean and standard deviation are the same.
In the gamma experiment, set \(k = 1\) to get the exponential distribution. Vary \(r\) with the scroll bar and note the size and location of the mean-standard deviation bar. For selected values of \(r\), run the experiment 1000 times and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.
Suppose that \(X\) has the exponential distribution with rate parameter \(r \gt 0\). Compute the true value and the Chebyshev bound for the probability that \(X\) is at least \(k\) standard deviations away from the mean.
Recall that the Pareto distribution is a continuous distribution with probability density function
\[ f(x) = \frac{a}{x^{a + 1}}, \quad 1 \le x \lt \infty \]where \(a \gt 0\) is a parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is widely used to model financial variables such as income. The Pareto distribution is studied in detail in the chapter on Special Distributions.
Suppose that \(X\) has the Pareto distribution with shape parameter \(a\). Then
In the special distribution simulator, select the Pareto distribution. Vary \(a\) with the scroll bar and note the size and location of the mean-standard deviation bar. For each of the following values of \(a\), run the experiment 1000 times and note the behavior of the empirical mean and standard deviation.
Recall that the standard normal distribution is a continuous distribution with density function
\[ \phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2}, \quad z \in \R \]Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in detail in the chapter on Special Distributions.
Suppose that \(Z\) has the standard normal distribution. Then (a) \(\E(Z) = 0\) and (b) \(\var(Z) = 1\).
We showed that \( \E(Z) = 0 \) in the section on properties of expected value. Hence \( \var(Z) = \E(Z^2) = \int_{-\infty}^\infty z^2 \phi(z) \, dz \). Integrate by parts with \( u = z \) and \( dv = z \phi(z) \, dz \). Thus, \( du = dz \) and \( v = -\phi(z) \). Hence
\[ \var(Z) = -z \phi(z) \bigg|_{-\infty}^\infty + \int_{-\infty}^\infty \phi(z) \, dz = 0 + 1 \]Suppose again that \(Z\) has the standard normal distribution and that \(\mu \in (-\infty, \infty)\) and \(\sigma \in (0, \infty)\). Recall that \(X = \mu + \sigma Z\) has the normal distribution with location parameter \(\mu\) and scale parameter \(\sigma\). Then
These results follow directly from Theorem 5: \( \E(X) = \mu + \sigma \E(Z) = \mu + 0 = \mu \) and \( \var(X) = \sigma^2 \var(Z) = \sigma^2 \cdot 1 = \sigma^2 \).
Thus, as the notation suggests, the location parameter \(\mu\) is also the mean and the scale parameter \(\sigma\) is also the standard deviation.
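The integrals behind these results can also be approximated numerically. The sketch below applies a simple midpoint rule to \(\phi\), \(z \phi(z)\), and \(z^2 \phi(z)\) on the interval \([-10, 10]\), and then applies the location-scale rules for \(X = \mu + \sigma Z\). The truncation to \([-10, 10]\), the number of steps, the helper names, and the values \(\mu = 2\), \(\sigma = 3\) are assumptions made only for this illustration.

```python
from math import exp, pi, sqrt

def phi(z):
    """Standard normal probability density function."""
    return exp(-0.5 * z * z) / sqrt(2 * pi)

def midpoint_integral(g, lo, hi, steps):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(g(lo + (i + 0.5) * width) for i in range(steps))

total = midpoint_integral(phi, -10, 10, 20000)                       # close to 1
mean_z = midpoint_integral(lambda z: z * phi(z), -10, 10, 20000)     # close to 0
var_z = midpoint_integral(lambda z: z * z * phi(z), -10, 10, 20000)  # close to 1

# Hypothetical location and scale parameters for X = mu + sigma Z.
mu, sigma = 2.0, 3.0
mean_x = mu + sigma * mean_z      # close to mu
var_x = sigma ** 2 * var_z        # close to sigma^2

print(total, mean_z, var_z)
print(mean_x, var_x)
```

Since \(\E(Z) = 0\), the second moment \(\E(Z^2)\) computed here is the variance of \(Z\).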
In the special distribution simulator, select the normal distribution. Vary the parameters and note the shape and location of the mean-standard deviation bar. For selected parameter values, run the experiment 1000 times and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.
The distributions in this subsection belong to the family of beta distributions, which are widely used to model random proportions and probabilities. The beta distribution is studied in detail in the chapter on Special Distributions.
Graph the density functions below and compute the mean and variance of each.
The particular beta distribution in part (d) is also known as the arcsine distribution.
Suppose that \(X\) is a real-valued random variable with \(\E(X) = 5\) and \(\var(X) = 4\). Find each of the following:
Suppose that \(X\) is a real-valued random variable with \(\E(X) = 2\) and \(\E[X(X - 1)] = 8\). Find each of the following:
The expected value \(\E[X(X - 1)]\) is an example of a factorial moment.
Suppose that \(X_1\) and \(X_2\) are independent, real-valued random variables with \(\E(X_i) = \mu_i\) and \(\var(X_i) = \sigma_i^2\) for \(i \in \{1, 2\}\). Then
\[ \var(X_1 X_2) = (\sigma_1^2 + \mu_1^2) (\sigma_2^2 + \mu_2^2) - \mu_1^2 \mu_2^2 \]Marilyn Vos Savant has an IQ of 228. Assuming that the distribution of IQ scores has mean 100 and standard deviation 15, find Marilyn's standard score.
\(z = 8.53\)
Suppose that \(X\) is a real-valued random variable. Recall again that the variance of \(X\) is the second moment of \(X\) about the mean, and measures the spread of the distribution of \(X\) about the mean. The third and fourth moments of \(X\) about the mean also measure interesting features of the distribution. The third moment measures skewness, the lack of symmetry, while the fourth moment measures kurtosis, the degree to which the distribution is peaked. The actual numerical measures of these characteristics are standardized to eliminate the physical units, by dividing by an appropriate power of the standard deviation. As usual, we assume that all expected values given below exist, and we will let \(\mu = \E(X)\) and \(\sigma^2 = \var(X)\). We assume that \(\sigma \gt 0\), so that the random variable is really random.
The skewness of \(X\) is the third moment of the standard score \(Z = (X - \mu) / \sigma\):
\[ \skew(X) = \E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right] \]The distribution of \(X\) is said to be positively skewed, negatively skewed or unskewed depending on whether \(\skew(X)\) is positive, negative, or 0. In the unimodal case, if the distribution is positively skewed then the probability density function has a long tail to the right, and if the distribution is negatively skewed then the probability density function has a long tail to the left.
Suppose that the distribution of \(X\) is symmetric about \(a\). That is, the distribution of \( a - X \) is the same as the distribution of \( X - a \). Then (a) \(\E(X) = a\) and (b) \(\skew(X) = 0\).
We proved part (a) in the section on Properties of Expected Value. Thus, \( \skew(X) = \E[(X - a)^3] / \sigma^3 \). But by symmetry and linearity, \( \E[(X - a)^3] = \E[(a - X)^3] = - \E[(X - a)^3] \), so it follows that \( \E[(X - a)^3] = 0 \).
\(\skew(X)\) can be expressed in terms of the moments of \(X\).
\[ \skew(X) = \frac{\E(X^3) - 3 \mu \E(X^2) + 2 \mu^3}{\sigma^3} \]\( (X - \mu)^3 = X^3 - 3 X^2 \mu + 3 X \mu^2 - \mu^3 \). From the linearity of expected value we have
\[ \E[(X - \mu)^3] = \E(X^3) - 3 \mu \E(X^2) + 3 \mu^2 \E(X) - \mu^3 = \E(X^3) - 3 \mu \E(X^2) + 2 \mu^3 \]Since skewness is defined in terms of an odd power of the standard score, it's invariant under a linear transformation with positive slope (a location-scale transformation of the distribution). On the other hand, if the slope is negative, skewness changes sign.
Suppose that \(a \in \R\) and \(b \in \R \setminus \{0\}\). Then
\[ \skew(a + b X) = \begin{cases} \skew(X), & b \gt 0 \\ -\skew(X), & b \lt 0 \end{cases} \]This follows directly from the definition and Theorem 29.
The kurtosis of \(X\) is the fourth moment of the standard score \(Z = (X - \mu) / \sigma\):
\[ \kurt(X) = \E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] \]Kurtosis comes from the Greek word for bulging. In the unimodal case, the probability density function of a distribution with large kurtosis has a sharper peak and fatter tails, compared with the probability density function of a distribution with smaller kurtosis.
\(\kurt(X)\) can be expressed in terms of the moments of \(X\).
\[ \kurt(X) = \frac{\E(X^4) - 4 \mu \E(X^3) + 6 \mu^2 \E(X^2) - 3 \mu^4}{\sigma^4} \]\( (X - \mu)^4 = X^4 - 4 X^3 \mu + 6 X^2 \mu^2 - 4 X \mu^3 + \mu^4 \). From linearity of expected value, we have
\[ \E[(X - \mu)^4] = \E(X^4) - 4 \mu \E(X^3) + 6 \mu^2 \E(X^2) - 4 \mu^3 \E(X) + \mu^4 = \E(X^4) - 4 \mu \E(X^3) + 6 \mu^2 \E(X^2) - 3 \mu^4 \]Since kurtosis is defined in terms of an even power of the standard score, it's invariant under linear transformations.
Suppose that \(a \in \R\) and \(b \in \R \setminus\{0\}\). Then \(\kurt(a + b X) = \kurt(X)\).
This follows directly from the definition and Theorem 29.
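The behavior of skewness and kurtosis under linear transformations is easy to observe numerically. The following sketch computes both quantities from central moments for a hypothetical discrete distribution, and then again for \(a + b X\) with the hypothetical constants \(a = 7\) and \(b = -2\). The function names are illustrative only.

```python
def central_moment(pmf, k):
    """k-th moment about the mean for a distribution given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** k * p for x, p in pmf.items())

def skewness(pmf):
    """Third moment of the standard score: E[(X - mu)^3] / sigma^3."""
    return central_moment(pmf, 3) / central_moment(pmf, 2) ** 1.5

def kurtosis(pmf):
    """Fourth moment of the standard score: E[(X - mu)^4] / sigma^4."""
    return central_moment(pmf, 4) / central_moment(pmf, 2) ** 2

def transform(pmf, a, b):
    """Distribution of a + b X when X has the given pmf (assumes b != 0)."""
    return {a + b * x: p for x, p in pmf.items()}

# Hypothetical example distribution (not from the text).
pmf = {0: 0.5, 1: 0.25, 3: 0.25}
flipped = transform(pmf, 7, -2)

print(skewness(pmf), skewness(flipped))   # same magnitude, opposite signs
print(kurtosis(pmf), kurtosis(flipped))   # equal
```

For this distribution the central moments are \(\frac{3}{2}\), \(\frac{3}{2}\), and \(\frac{9}{2}\) for orders 2, 3, and 4, so the skewness is \(1 / \sqrt{3/2} \approx 0.816\) and the kurtosis is 2.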
You will show in Exercise 48 below that the kurtosis of the standard normal distribution is 3. Using the standard normal distribution as a benchmark, the excess kurtosis of a random variable \(X\) is defined to be \(\kurt(X) - 3\). Some authors use the term kurtosis to mean what we have defined as excess kurtosis.
In Exercise 9 you computed the mean and variance of an indicator variable. In the next exercise, you will complement the analysis by computing the skewness and kurtosis.
Suppose that \(X\) is an indicator variable with \(\P(X = 1) = p\), where \(0 \lt p \lt 1\). Then
Recall the definitions of fair, ace-six flat, two-five flat, and three-four flat dice given earlier. In the following problems you will compute the skewness and kurtosis for the various types of dice. Be sure to compare the results.
A standard, fair die is thrown and the score \(X\) is recorded. Sketch the graph of the probability density function and compute each of the following:
An ace-six flat die is thrown and the score \(X\) is recorded. Sketch the graph of the probability density function and compute each of the following:
A two-five flat die is thrown and the score \(X\) is recorded. Sketch the graph of the probability density function and compute each of the following:
A three-four flat die is thrown and the score \(X\) is recorded. Sketch the graph of the probability density function and compute each of the following:
Suppose that \(X\) has the uniform distribution on the interval \([a, b]\). Compute each of the following:
Suppose that \(X\) has the exponential distribution with rate parameter \(r \gt 0\). Compute each of the following:
Suppose that \(X\) has the Pareto distribution with shape parameter \(a \gt 4\). Compute each of the following:
Suppose that \(Z\) has the standard normal distribution. Compute each of the following:
Graph the following probability density functions and compute the mean, variance, skewness and kurtosis of each distribution. The distributions are all members of the family of beta distributions; the last one is also known as the arcsine distribution.
Variance and higher moments are related to the concept of norm and distance in the theory of vector spaces. This connection can help unify and illuminate some of the ideas.
Our vector space \(\mathscr{V}\) consists of all real-valued random variables defined on a fixed probability space \((\Omega, \mathscr{F}, \P)\) (that is, relative to a given random experiment). Recall that two random variables are equivalent if they are equal with probability 1. We consider two such random variables as the same vector, so that technically, our vector space consists of equivalence classes under this equivalence relation. The addition operator corresponds to the usual addition of two real-valued random variables, and the operation of scalar multiplication corresponds to the usual multiplication of a real-valued random variable by a real (non-random) number.
Let \(X\) be a real-valued random variable. For \(k \ge 1\), we define the \(k\)-norm by
\[ \|X\|_k = \left[\E(|X|^k)\right]^{1 / k} \]Thus, \(\|X\|_k\) is a measure of the size of \(X\) in a certain sense. The following exercises establish the fundamental properties.
For \(X \in \mathscr{V}\), (a) \(\|X\|_k \ge 0\), and (b) \(\|X\|_k = 0\) if and only if \(\P(X = 0) = 1\).
These results follow from the basic inequality properties of expected value. First \( |X|^k \ge 0 \) with probability 1, so \( \E(|X|^k) \ge 0 \). In addition, \( \E(|X|^k) = 0 \) if and only if \( \P(X = 0) = 1 \).
\(\|c X\|_k = |c| \, \|X\|_k\) for \(X \in \mathscr{V}\) and \(c \in \R\).
The next exercise gives Minkowski's inequality, named for Hermann Minkowski. It is also known as the triangle inequality.
\(\|X + Y\|_k \le \|X\|_k + \|Y\|_k\) for \(X \in \mathscr{V}\) and \(Y \in \mathscr{V}\).
The first quadrant \(S = \{(x, y) \in \R^2: x \ge 0, \; y \ge 0\}\) is a convex set and \(g(x, y) = (x ^{1/k} + y^{1/k})^k\) is concave on \(S\). From Jensen's inequality, if \(U\) and \(V\) are nonnegative random variables, then
\[ \E\left[(U^{1/k} + V^{1/k})^k\right] \le \left([\E(U)]^{1/k} + [\E(V)]^{1/k}\right)^k \]Letting \(U = |X|^k\) and \(V = |Y|^k\) and simplifying gives the result. To show that \( g \) really is concave on \( S \), we can compute the second partial derivatives. Let \( h(x, y) = x^{1/k} + y^{1/k} \) so that \( g = h^k \). Then
\[ \begin{align} g_{xx} & = \frac{k-1}{k} h^{k-2} x^{1/k - 2}(x^{1/k} - h) \\ g_{yy} & = \frac{k-1}{k} h^{k-2} y^{1/k - 2}(y^{1/k} - h) \\ g_{xy} & = \frac{k-1}{k} h^{k-2} x^{1/k - 1} y^{1/k - 1} \end{align} \]Clearly \( h(x, y) \ge x^{1/k} \) and \( h(x, y) \ge y^{1/k} \) for \( x \ge 0 \) and \( y \ge 0 \), so \( g_{xx} \) and \( g_{yy} \), the diagonal entries of the second derivative matrix, are nonpositive on \( S \). Since \( x^{1/k} - h = -y^{1/k} \) and \( y^{1/k} - h = -x^{1/k} \), a little algebra shows that the determinant of the second derivative matrix, \( g_{xx} g_{yy} - g_{xy}^2 \), is 0 on \( S \). Thus, the second derivative matrix of \( g \) is negative semi-definite.
It follows from Exercises 51-53 that the set of random variables with finite \(k\)th moment forms a subspace of our parent vector space \(\mathscr{V}\), and that the \(k\)-norm really is a norm on this vector space:
\[ \mathscr{V}_k = \{X \in \mathscr{V}: \|X\|_k \lt \infty\} \]Our next exercise gives Lyapunov's inequality, named for Aleksandr Lyapunov. This inequality shows that the \(k\)-norm of a random variable is increasing in \(k\).
If \(j \le k\) then \(\|X\|_j \le \|X\|_k\) for \(X \in \mathscr{V}\).
Note that \(S = \{x \in \R: x \ge 0\}\) is convex and \(g(x) = x^{k/j}\) is convex on \(S\). From Jensen's inequality, if \(U\) is a nonnegative random variable then \([\E(U)]^{k/j} \le \E(U^{k/j})\). Letting \(U = |X|^j\) and simplifying gives the result.
Lyapunov's inequality shows that if \(j \le k\) and \(X\) has a finite \(k\)th moment, then \(X\) has a finite \(j\)th moment as well. Thus, \(\mathscr{V}_k\) is a subspace of \(\mathscr{V}_j\).
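Both Minkowski's inequality and Lyapunov's inequality can be illustrated on a finite sample space, where a random variable is just a list of values, one for each outcome. In the sketch below, the three-point sample space, its probabilities, and the variables `X` and `Y` are hypothetical, as is the function name `k_norm`.

```python
def k_norm(values, probs, k):
    """k-norm [E(|X|^k)]^(1/k) of a random variable on a finite sample space.

    values[i] is the value of the variable at outcome i, and probs[i] is the
    probability of outcome i."""
    return sum(abs(x) ** k * p for x, p in zip(values, probs)) ** (1 / k)

# Hypothetical sample space with three outcomes and two random variables.
probs = [0.2, 0.3, 0.5]
X = [-2, 0, 1]
Y = [3, -1, 1]
X_plus_Y = [x + y for x, y in zip(X, Y)]

for k in (1, 2, 3, 4):
    lhs = k_norm(X_plus_Y, probs, k)
    rhs = k_norm(X, probs, k) + k_norm(Y, probs, k)
    # Minkowski: the norm of the sum is at most the sum of the norms.
    print(k, lhs <= rhs, k_norm(X, probs, k))
```

The last column shows the \(k\)-norm of `X` for \(k = 1, 2, 3, 4\); the values increase with \(k\), in agreement with Lyapunov's inequality.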
Suppose that \(X\) is uniformly distributed on the interval \([0, 1]\).
Suppose that \(X\) has probability density function \(f(x) = \frac{a}{x^{a+1}}\) for \(1 \le x \lt \infty\), where \(a \gt 0\) is a parameter. Thus, \(X\) has the Pareto distribution with shape parameter \(a\).
Suppose that \((X, Y)\) has probability density function \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Verify Minkowski's inequality.
The \(k\)-norm, like any norm on a vector space, can be used to measure distance; we simply compute the norm of the difference between two vectors. Thus, we define the \(k\)-distance (or \(k\)-metric) between real-valued random variables \(X\) and \(Y\) to be
\[ d_k(X, Y) = \|X - Y\|_k = \left[\E(|X - Y|^k)\right]^{1/k} \]The properties in the following exercises are analogues of the properties in Exercises 51-53 (and thus very little additional work is required for the proofs). These properties show that the \(k\)-metric really is a metric on \( \mathscr{V}_k \).
Suppose that \(X, \; Y \in \mathscr{V}\). Then
These results follow directly from Theorem 51.
\( d_k(X, Y) = d_k(Y, X) \) for \( X, \; Y \in \mathscr{V} \).
\(d_k(X, Z) \le d_k(X, Y) + d_k(Y, Z)\) for \(X, \; Y, \; Z \in \mathscr{V}\) (this is another version of the triangle inequality).
From Minkowski's inequality (Theorem 53),
\[ d_k(X, Z) = \|X - Z\|_k = \|(X - Y) + (Y - Z) \|_k \le \|X - Y\|_k + \|Y - Z\|_k = d_k(X, Y) + d_k(Y, Z) \]Thus, the standard deviation is simply the 2-distance from \(X\) to its mean \( \mu = \E(X) \):
\[ \sd(X) = d_2(X, \mu) = \|X - \mu\|_2 = \sqrt{\E[(X - \mu)^2]} \]and the variance is the square of this. More generally, the \(k\)th moment of \(X\) about \(a\) is simply the \(k\)th power of the \(k\)-distance from \(X\) to \(a\). The 2-distance is especially important for reasons that will become clear below and in the next section. This distance is also called the root mean square distance.
Measures of center and measures of spread are best thought of together, in the context of a measure of distance. For a real-valued random variable \(X\), we first try to find the constants \(t \in \R\) that are closest to \(X\), as measured by the given distance; any such \(t\) is a measure of center relative to the distance. The minimum distance itself is the corresponding measure of spread.
Let us apply this procedure to the 2-distance. Thus, we define the root mean square error function by
\[ d_2(X, t) = \|X - t\|_2 = \sqrt{\E[(X - t)^2]}, \quad t \in \R \]\(d_2(X, t)\) is minimized when \(t = \E(X)\), and the minimum value is \(\sd(X)\).
Note that the minimum value of \(d_2(X, t)\) occurs at the same points as the minimum value of \(d_2^2(X, t) = \E[(X - t)^2]\) (this is the mean square error function). Expanding and taking expected values term by term gives
\[ \E[(X - t)^2] = \E(X^2) - 2 t \E(X) + t^2 \]This is a quadratic function of \( t \) and hence the graph is a parabola opening upward. The minimum occurs at \( t = \E(X) \), and the minimum value is \( \var(X) \). Hence the minimum value of \( t \mapsto d_2(X, t) \) also occurs at \( t = \E(X) \) and the minimum value is \( \sd(X) \).
The physical interpretation of this result is that the moment of inertia of the mass distribution of \(X\) about \(t\) is minimized when \(t = \mu\), the center of mass.
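The minimization can be checked directly for a small example. The sketch below evaluates the mean square error function \(t \mapsto \E[(X - t)^2]\) on a grid for a hypothetical discrete distribution, locates the minimizer, and also verifies the identity \(\E[(X - t)^2] = \var(X) + (t - \mu)^2\), which follows from completing the square in the quadratic above. The distribution, the grid, and the function name are illustrative assumptions.

```python
from fractions import Fraction

def mean_square_error(pmf, t):
    """E[(X - t)^2] for a discrete distribution given as {value: probability}."""
    return sum((x - t) ** 2 * p for x, p in pmf.items())

# Hypothetical example distribution (not from the text).
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 4)}
mu = sum(x * p for x, p in pmf.items())
var = mean_square_error(pmf, mu)

# Grid of candidate centers t from -2 to 4 in steps of 1/4.
grid = [Fraction(i, 4) for i in range(-8, 17)]
best_t = min(grid, key=lambda t: mean_square_error(pmf, t))

print(mu, best_t, var)
# prints: 1 1 3/2

# Completing the square: E[(X - t)^2] = var(X) + (t - mu)^2 for every t.
print(all(mean_square_error(pmf, t) == var + (t - mu) ** 2 for t in grid))
# prints: True
```

The identity makes the result transparent: the error exceeds \(\var(X)\) by exactly the squared distance from \(t\) to the mean.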
In the error function applet, select the root mean square error function. Click on the \( x \)-axis to generate an empirical distribution, and note the shape and location of the graph of the error function.
Next, let us apply our procedure to the 1-distance. Thus, we define the mean absolute error function by
\[ d_1(X, t) = \|X - t\|_1 = \E[|X - t|] \]We will show that \(d_1(X, t)\) is minimized when \(t\) is any median of \(X\). (Recall that the set of medians of \( X \) forms a closed, bounded interval.) We start with a discrete case, because it's easier and has special interest.
Suppose that \(X\) has a discrete distribution with values in a finite set \(S \subseteq \R\). Then \(d_1(X, t)\) is minimized when \(t\) is any median of \(X\).
Note first that \(\E(|X - t|) = \E(t - X, \, X \le t) + \E(X - t, \, X \gt t)\). Hence \(\E(|X - t|) = a_t \, t + b_t\), where \(a_t = 2 \, \P(X \le t) - 1\) and where \(b_t = \E(X) - 2 \, \E(X, \, X \le t)\). Note that \(\E(|X - t|)\) is a continuous, piecewise linear function of \(t\), with corners at the values in \(S\). That is, the function is a linear spline. Let \(m\) be the smallest median of \(X\). If \(t \lt m\) and \(t \notin S\), then the slope of the linear piece at \(t\) is negative. Let \(M\) be the largest median of \(X\). If \(t \gt M\) and \(t \notin S\), then the slope of the linear piece at \(t\) is positive. If \(t \in (m, M)\) then the slope of the linear piece at \(t\) is 0. Thus \(\E(|X - t|)\) is minimized for every \(t\) in the median interval \([m, M]\).
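The last exercise shows that mean absolute error has a couple of basic deficiencies as a measure of error: the error function may not be smooth, since its graph can have corners at the values in \(S\), and the minimizing value of \(t\) may not be unique.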
Indeed, when \(X\) does not have a unique median, there is no compelling reason to choose one value in the median interval, as the measure of center, over any other value in the interval.
In the error function applet, select the mean absolute error function. Click on the \( x \)-axis to generate an empirical distribution, and note the shape and location of the graph of the error function.
Let \(X\) be an indicator random variable with \(\P(X = 1) = p\), where \(0 \le p \le 1\). Graph \(\E(|X - t|)\) as a function of \(t \in \R\) in each of the cases below. In each case, find the minimum value of the function and the values of \(t\) where the minimum occurs.
Suppose now that \(X\) has a general distribution on \(\R\). Then \(d_1(X, t)\) is minimized when \(t\) is any median of \(X\).
Suppose that \(s \lt t\). Computing the expected value over the events \(X \le s\), \(s \lt X \le t\), and \(X \gt t\), and simplifying gives
\[ \E(|X - t|) = \E(|X - s|) + (t - s) \, [2 \, \P(X \le s) - 1] + 2 \, \E(t - X, \, s \lt X \le t) \]Suppose that \(t \lt s\). Using similar methods gives
\[ \E(|X - t|) = \E(|X - s|) + (t - s) \, [2 \, \P(X \lt s) - 1] + 2 \, \E(X - t, \, t \le X \lt s) \]Note that the last terms on the right in these equations are nonnegative. If we take \(s\) to be a median of \(X\), then the middle terms on the right in the equations are also nonnegative. Hence if \(s\) is a median of \(X\) and \(t\) is any other number then \(\E(|X - t|) \ge \E(|X - s|)\).
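Both decompositions hold exactly for an empirical distribution, so they can be verified numerically. The following sketch uses a hypothetical sample with repeated values (so that \(\le\) and \(\lt\) genuinely differ) and takes \(s\) to be a median of the sample; the function names are illustrative.

```python
import numpy as np

def mae(x, t):
    """E|X - t| for the empirical distribution of x."""
    return float(np.mean(np.abs(x - t)))

def rhs_s_less_t(x, s, t):
    """Right side of the decomposition for the case s < t."""
    middle = (t - s) * (2.0 * np.mean(x <= s) - 1.0)
    last = 2.0 * np.mean((t - x) * ((s < x) & (x <= t)))
    return float(mae(x, s) + middle + last)

def rhs_t_less_s(x, s, t):
    """Right side of the decomposition for the case t < s."""
    middle = (t - s) * (2.0 * np.mean(x < s) - 1.0)
    last = 2.0 * np.mean((x - t) * ((t <= x) & (x < s)))
    return float(mae(x, s) + middle + last)

# Hypothetical sample with ties.
x = np.array([0.0, 1.0, 1.0, 3.0, 3.0, 3.0, 6.0, 8.0])
s = float(np.median(x))  # a median of the empirical distribution

for t in [4.5, 7.0]:          # cases with s < t
    print(t, mae(x, t), rhs_s_less_t(x, s, t))
for t in [-1.0, 1.0, 2.0]:    # cases with t < s
    print(t, mae(x, t), rhs_t_less_s(x, s, t))
```

In each row the two computed values should agree up to rounding, and neither should fall below `mae(x, s)`.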
Suppose that \(X\) is uniformly distributed on the interval \([0, 1]\). Find \(d_1(X, t) = \E(|X - t|)\) as a function of \(t\) and sketch the graph. Find the minimum value of the function and the value of \(t\) where the minimum occurs.
Suppose that \(X\) is uniformly distributed on the set \([0, 1] \cup [2, 3]\). Find \(d_1(X, t) = \E(|X - t|)\) as a function of \(t\) and sketch the graph. Find the minimum value of the function and the values of \(t\) where the minimum occurs.
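One way to check answers to the last two exercises is numerical integration. The sketch below approximates \(\E(|X - t|)\) by a midpoint Riemann sum when \(X\) is uniform on a union of disjoint intervals; the function name and the number of subintervals are our own choices. For the uniform distribution on \([0, 1]\), direct integration gives \(t^2 - t + \frac{1}{2}\) for \(t \in [0, 1]\), which is printed alongside for comparison.

```python
import numpy as np

def mae_uniform_union(t, intervals, n=20_000):
    """Approximate E|X - t| when X is uniform on a union of disjoint intervals.

    Uses a midpoint Riemann sum with n subintervals per interval.
    """
    total_length = sum(b - a for a, b in intervals)
    value = 0.0
    for a, b in intervals:
        h = (b - a) / n
        mid = a + h * (np.arange(n) + 0.5)  # midpoints of the subintervals
        value += np.sum(np.abs(mid - t)) * h / total_length
    return float(value)

# X uniform on [0, 1]: compare with the closed form t^2 - t + 1/2 on [0, 1].
for t in [0.0, 0.25, 0.5, 1.0]:
    print(t, mae_uniform_union(t, [(0.0, 1.0)]), t * t - t + 0.5)

# X uniform on [0, 1] U [2, 3]: the median interval is [1, 2].
for t in [0.5, 1.0, 1.5, 2.0, 2.5]:
    print(t, mae_uniform_union(t, [(0.0, 1.0), (2.0, 3.0)]))
```

In the second case the printed values should be (approximately) constant for \(t \in [1, 2]\), reflecting the fact that every point of the median interval minimizes the error.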
Whenever we have a measure of distance, we automatically have a criterion for convergence. Let \((X_1, X_2, \ldots)\) and \(X\) be real-valued random variables defined on the same sample space (that is, defined for the same random experiment). We say that \(X_n \to X\) as \(n \to \infty\) in \(k\)th mean if
\[ d_k(X_n, X) = \|X_n - X\|_k \to 0 \text{ as } n \to \infty \]or equivalently
\[ \E(|X_n - X|^k) \to 0 \text{ as } n \to \infty \]When \(k = 1\), we simply say that \(X_n \to X\) as \(n \to \infty\) in mean; when \(k = 2\), we say that \(X_n \to X\) as \(n \to \infty\) in mean square. These are the most important special cases.
If \(j \lt k\), then \(X_n \to X\) as \(n \to \infty\) in \(k\)th mean implies \(X_n \to X\) as \(n \to \infty\) in \(j\)th mean.
This follows from Lyapunov's inequality: \( 0 \le d_j(X_n, X) \le d_k(X_n, X) \to 0 \) as \( n \to \infty \).
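Lyapunov's inequality holds in particular for empirical distributions, so the monotonicity of \(k \mapsto \|Y\|_k\) can be observed directly. In the sketch below, the array `diff` is a hypothetical set of equally likely values of \(X_n - X\) for some fixed \(n\), and `k_norm` is our name for \(\|\cdot\|_k\).

```python
import numpy as np

def k_norm(y, k):
    """||Y||_k = (E|Y|^k)^(1/k) for the empirical distribution of y."""
    y = np.asarray(y, dtype=float)
    return float(np.mean(np.abs(y) ** k) ** (1.0 / k))

# Hypothetical values of the difference X_n - X, each with probability 1/7.
diff = np.array([-2.0, -0.5, 0.0, 0.1, 0.3, 1.5, 4.0])

norms = [k_norm(diff, k) for k in (1, 2, 3, 4)]
print(norms)  # should be nondecreasing in k
```

Since \(d_j \le d_k\) for \(j \lt k\), forcing the larger distance to 0 forces the smaller one to 0 as well.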
Our next sequence of exercises shows that convergence in mean is stronger than convergence in probability.
If \(X_n \to X\) as \(n \to \infty\) in mean, then \(X_n \to X\) as \(n \to \infty\) in probability.
This follows from Markov's inequality. For \( \epsilon \gt 0 \), \(0 \le \P(|X_n - X| \gt \epsilon) \le \E(|X_n - X|) / \epsilon \to 0 \) as \( n \to \infty \).
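The bound used in this proof can be illustrated with a small empirical distribution. The sketch below compares \(\P(|Y| \gt \epsilon)\) with \(\E(|Y|) / \epsilon\) for several values of \(\epsilon\), where the array `diff` is a hypothetical set of equally likely values of \(Y = X_n - X\).

```python
import numpy as np

def markov_check(y, eps):
    """Return (P(|Y| > eps), E|Y| / eps) for the empirical distribution of y."""
    y = np.abs(np.asarray(y, dtype=float))
    return float(np.mean(y > eps)), float(np.mean(y) / eps)

# Hypothetical values of X_n - X, each with probability 1/8.
diff = np.array([0.0, 0.0, 0.1, 0.2, 0.2, 0.5, 1.0, 3.0])

for eps in (0.1, 0.5, 1.0, 2.0):
    tail, bound = markov_check(diff, eps)
    print(f"eps = {eps}: P(|Y| > eps) = {tail:.3f}, E|Y| / eps = {bound:.3f}")
```

The bound is often far from tight, but it is enough here: if \(\E(|X_n - X|) \to 0\), then the tail probability is squeezed to 0 for each fixed \(\epsilon\).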
The converse is not true. Moreover, convergence with probability 1 does not imply convergence in \(k\)th mean and convergence in \(k\)th mean does not imply convergence with probability 1. The next two exercises give some counterexamples.
Suppose that \((X_1, X_2, \ldots)\) is a sequence of independent random variables with
\[ \P(X_n = n^3) = \frac{1}{n^2}, \; \P(X_n = 0) = 1 - \frac{1}{n^2}; \quad n \in \N_+ \]Then (a) \(X_n \to 0\) as \(n \to \infty\) with probability 1, (b) \(X_n \to 0\) as \(n \to \infty\) in probability, and (c) \(\E(X_n) \to \infty\) as \(n \to \infty\). Part (a) follows from the basic characterization of convergence with probability 1: \( \sum_{n=1}^\infty \P(X_n \gt \epsilon) = \sum_{n=1}^\infty 1 / n^2 \lt \infty \) for \( 0 \lt \epsilon \lt 1 \). Part (b) follows since convergence with probability 1 implies convergence in probability. For (c), note that \( \E(X_n) = n^3 / n^2 = n \) for \( n \in \N_+ \).
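The two quantities that drive this counterexample can be computed exactly. The sketch below uses Python's `fractions` module to evaluate \(\P(X_n = n^3) = 1 / n^2\) and \(\E(X_n) = n\), and to show that the partial sums of the tail probabilities stay bounded while the means grow. The function names are illustrative.

```python
from fractions import Fraction

def prob_nonzero(n):
    """P(X_n = n^3) = 1 / n^2."""
    return Fraction(1, n * n)

def mean(n):
    """E(X_n) = n^3 * P(X_n = n^3)."""
    return n ** 3 * prob_nonzero(n)

# The tail probabilities are summable: the partial sums stay below pi^2 / 6.
partial_sum = sum(prob_nonzero(n) for n in range(1, 1001))
print(float(partial_sum))

# The means grow without bound, so X_n does not converge to 0 in mean.
print([int(mean(n)) for n in (1, 10, 100)])
```

The rare but very large values \(n^3\) are what separate the two modes of convergence: they occur too seldom to affect almost sure convergence, but often enough to dominate the expected value.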
Suppose that \((X_1, X_2, \ldots)\) is a sequence of independent indicator random variables with
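\[ \P(X_n = 1) = \frac{1}{n}, \; \P(X_n = 0) = 1 - \frac{1}{n}; \quad n \in \N_+ \]Then (a) \(\P(X_n = 0 \text{ for infinitely many } n) = 1\), (b) \(\P(X_n = 1 \text{ for infinitely many } n) = 1\), (c) \(\P(X_n \text{ does not converge as } n \to \infty) = 1\), and (d) \(X_n \to 0\) as \(n \to \infty\) in \(k\)th mean for every \(k \ge 1\). Parts (a) and (b) follow from the second Borel-Cantelli lemma since \( \sum_{n=1}^\infty \P(X_n = 1) = \sum_{n=1}^\infty 1 / n = \infty \) and \( \sum_{n=1}^\infty \P(X_n = 0) = \sum_{n=1}^\infty (1 - 1 / n) = \infty \). Part (c) follows from parts (a) and (b). For part (d), note that \(X_n^k = X_n\) since \(X_n\) is an indicator variable, so \( \E(X_n^k) = \E(X_n) = 1 / n \to 0 \) as \( n \to \infty \).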
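The divergence of the two series and the decay of the moments in this counterexample are easy to tabulate. The following sketch is only a numerical illustration of the quantities in the proof; the function names are ours, and \(\ln N\) is printed as a reference point for the growth of the harmonic sums.

```python
import math

def harmonic_partial_sum(N):
    """Sum of P(X_n = 1) = 1/n for n = 1, ..., N."""
    return sum(1.0 / n for n in range(1, N + 1))

def zero_partial_sum(N):
    """Sum of P(X_n = 0) = 1 - 1/n for n = 1, ..., N."""
    return sum(1.0 - 1.0 / n for n in range(1, N + 1))

def kth_moment(n, k):
    """E(X_n^k); equal to 1/n for every k >= 1 because X_n^k = X_n for an indicator."""
    return 1.0 / n

for N in (10, 1_000, 100_000):
    print(N, harmonic_partial_sum(N), math.log(N), zero_partial_sum(N), kth_moment(N, 2))
```

Both partial sums keep growing with \(N\), which is what the second Borel-Cantelli lemma requires, while the moments shrink to 0.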
The implications in the various modes of convergence are shown below; no other implications hold in general.
For a related statistical topic, see the section on the Sample Variance in the chapter on Random Samples. The variance of a sum of random variables is best understood in terms of a related concept known as covariance, which will be studied in detail in the next section.