\(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\) \(\newcommand{\bs}{\boldsymbol}\)

1. Definitions and Properties

Expected value is one of the most important concepts in probability. The expected value of a real-valued random variable gives the center of the distribution of the variable, in a special sense. Additionally, by computing expected values of various real transformations of a general random variable, we can extract a number of interesting characteristics of the distribution of the variable, including measures of spread, symmetry, and correlation. In a sense, expected value is a more general concept than probability itself.

Basic Concepts

Definitions

As usual, we start with a random experiment with probability measure \(\P\) on an underlying sample space \(\Omega\). Suppose that \(X\) is a random variable for the experiment, taking values in \(S \subseteq \R\). If \(X\) has a discrete distribution with probability density function \(f\) (so that \(S\) is countable), then the expected value of \(X\) is defined by

\[ \E(X) = \sum_{x \in S} x f(x) \]

assuming that the sum is absolutely convergent (that is, assuming that the sum with \(x\) replaced by \(|x|\) is finite). The assumption of absolute convergence is necessary to ensure that the sum in the expected value above does not depend on the order of the terms. (Of course, if \(S\) is finite there are no convergence problems.) If \(X\) has a continuous distribution with probability density function \(f\) (and so \(S\) is typically an interval), then the expected value of \(X\) is defined by

\[ \E(X) = \int_S x f(x) dx \]

assuming that the integral is absolutely convergent (that is, assuming that the integral with \(x\) replaced by \(|x|\) is finite). Finally, suppose that \(X\) has a mixed distribution, with partial discrete density \(g\) on \(D\) and partial continuous density \(h\) on \(C\), where \(D\) and \(C\) are disjoint, \(D\) is countable, \(C\) is typically an interval, and \(S = D \cup C\). The expected value of \(X\) is defined by

\[ \E(X) = \sum_{x \in D} x g(x) + \int_C x h(x) dx \]

assuming again that the sum and integral converge absolutely.

Interpretation

The expected value of \(X\) is also called the mean of the distribution of \(X\) and is frequently denoted \(\mu\). The mean is the center of the probability distribution of \(X\) in a special sense. Indeed, if we think of the distribution as a mass distribution (with total mass 1), then the mean is the center of mass as defined in physics. The two pictures below show discrete and continuous probability density functions; in each case the mean \(\mu\) is the center of mass, the balance point.

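[Figures: a discrete probability density function and a continuous probability density function, each with the mean \(\mu\) marked as the balance point.]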

Recall the other measures of the center of a distribution that we have studied: the mode and the median.

To understand expected value in a probabilistic way, suppose that we create a new, compound experiment by repeating the basic experiment over and over again. This gives a sequence of independent random variables \((X_1, X_2, \ldots)\), each with the same distribution as \(X\). In statistical terms, we are sampling from the distribution of \(X\). The average value, or sample mean, after \(n\) runs is

\[ M_n = \frac{1}{n} \sum_{i=1}^n X_i \]

The average value \(M_n\) converges to the expected value \(\E(X)\) as \(n \to \infty\). The precise statement of this is the law of large numbers, one of the fundamental theorems of probability. You will see the law of large numbers at work in many of the simulation exercises given below.
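The convergence can be previewed with a small simulation. The following is a minimal sketch, assuming that the basic experiment is a roll of a fair six-sided die (so that \(\E(X) = 3.5\)); the function name `running_sample_means` and the seed are illustrative choices, not part of the text.

```python
import random

def running_sample_means(n, seed=0):
    """Roll a fair die n times and return the sample means M_1, ..., M_n."""
    rng = random.Random(seed)      # seeded so that the run is reproducible
    total = 0
    means = []
    for i in range(1, n + 1):
        total += rng.randint(1, 6)  # one observation X_i
        means.append(total / i)     # M_i = (X_1 + ... + X_i) / i
    return means

means = running_sample_means(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    # The printed sample means should drift toward E(X) = 3.5 as n grows.
    print(n, round(means[n - 1], 4))
```

The early sample means can be far from 3.5; only the long-run behavior is governed by the law of large numbers.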

Moments

If \(a \in \R\) and \(n \in \N\), the moment of \(X\) about \(a\) of order \(n\) is defined to be

\[ \E[(X - a)^n ]\]

The moments about 0 are simply referred to as moments (or sometimes raw moments). The moments about \(\mu\) are the central moments. The second central moment is particularly important, and is studied in detail in the section on variance. In some cases, if we know all of the moments of \(X\), we can determine the entire distribution of \(X\). This idea is explored in the section on generating functions.

Conditional Expected Value

The expected value of a random variable \(X\) is based, of course, on the probability measure \(\P\) for the experiment. This probability measure could be a conditional probability measure, conditioned on a given event \(B \subseteq \Omega\) for the experiment (with \(\P(B) \gt 0\)). The usual notation is \(\E(X \mid B)\), and this expected value is computed by the definitions given above, except that the conditional probability density function \(x \mapsto f(x \mid B)\) replaces the ordinary probability density function \(f\). It is very important to realize that, except for notation, no new concepts are involved. All results that we obtain for expected value in general have analogues for these conditional expected values.

Basic Results

The purpose of this section is to study some of the essential properties of expected value. Unless otherwise noted, we will assume that the indicated expected values exist. We start with two trivial but still essential results.

A constant \(c\) can be thought of as a random variable (on any probability space) that takes only the value \(c\) with probability 1. The corresponding distribution is sometimes called point mass at \(c\). With this understanding, \(\E(c) = c\).

Let \(X\) be an indicator random variable (that is, a variable that takes only the values 0 and 1). Then \(\E(X) = \P(X = 1)\).

Proof:

\( X \) is discrete so by definition, \( \E(X) = 1 \cdot \P(X = 1) + 0 \cdot \P(X = 0) = \P(X = 1) \).

In particular, if \(\bs{1}_A\) is the indicator variable of an event \(A\), then \(\E(\bs{1}_A) = \P(A)\), so in a sense, expected value subsumes probability. For a book that takes expected value, rather than probability, as the fundamental starting concept, see Probability via Expectation, by Peter Whittle.

Change of Variables Theorem

The expected value of a real-valued random variable gives the center of the distribution of the variable. This idea is much more powerful than might first appear. By finding expected values of various functions of a general random variable, we can measure many interesting features of its distribution.

Thus, suppose that \(X\) is a random variable taking values in a general set \(S\), and suppose that \(r\) is a function from \(S\) into \(\R\). Then \(r(X)\) is a real-valued random variable, and so it makes sense to compute \(\E[r(X)]\). However, to compute this expected value from the definition would require that we know the probability density function of the transformed variable \(r(X)\) (a difficult problem, in general). Fortunately, there is a much better way, given by the change of variables theorem for expected value. This theorem is sometimes referred to as the law of the unconscious statistician, presumably because it is so basic and natural that it is often used without the realization that it is a theorem, and not a definition.

If \(X\) has a discrete distribution on a countable set \(S\) with probability density function \(f\), then

\[ \E[r(X)] = \sum_{x \in S} r(x) f(x) \]
Proof:

Let \(Y = r(X)\) and let \(T \subseteq \R\) denote the range of \(r\). Then \(T\) is countable so \(Y\) has a discrete distribution. Thus

\[ \E(Y) = \sum_{y \in T} y \, \P(Y = y) = \sum_{y \in T} y \, \sum_{x \in r^{-1}\{y\}} f(x) = \sum_{y \in T} \sum_{x \in r^{-1}\{y\}} r(x) f(x) = \sum_{x \in S} r(x) f(x) \]
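The two routes to \(\E[r(X)]\) in the discrete case can be compared on a small example. This is a sketch under an assumed setup that is not in the text: \(X\) is uniform on \(\{-2, -1, 0, 1, 2\}\) and \(r(x) = x^2\). Exact rational arithmetic is used so that the two answers can be compared without rounding.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical example: X uniform on S = {-2, -1, 0, 1, 2}, r(x) = x^2.
S = [-2, -1, 0, 1, 2]
f = {x: Fraction(1, len(S)) for x in S}

def r(x):
    return x * x

# Route 1: build the PDF of Y = r(X) by summing f over each inverse image
# r^{-1}{y}, then apply the definition of expected value to Y.
pdf_y = defaultdict(Fraction)
for x, p in f.items():
    pdf_y[r(x)] += p
mean_from_definition = sum(y * p for y, p in pdf_y.items())

# Route 2: the change of variables theorem, summing r(x) f(x) over S.
mean_from_theorem = sum(r(x) * p for x, p in f.items())

print(mean_from_definition, mean_from_theorem)  # 2 2
```

Route 2 never needs the distribution of \(Y\), which is the practical point of the theorem.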

Similarly, if \(X\) has a continuous distribution on \(S \subseteq \R^n\) with probability density function \(f\), then

\[ \E[r(X)] = \int_S r(x) f(x) dx \]

We will prove the continuous version in stages, through Theorem 3, Exercise 53, and Exercise 56. Even though the proof is delayed, however, we will use the change of variables theorem in the proofs of many of the other properties of expected value.

The change of variables theorem holds if \(X\) has a continuous distribution and \(r\) has countable range.

Proof:

Let \(T\) denote the set of values of \(Y = r(X)\). \(T\) is countable so \(Y\) has a discrete distribution. Thus

\[ \E(Y) = \sum_{y \in T} y \, \P(Y = y) = \sum_{y \in T} y \, \int_{r^{-1}\{y\}} f(x) \, dx = \sum_{y \in T} \int_{r^{-1}\{y\}} r(x) f(x) \, dx = \int_{S} r(x) f(x) \, dx \]

The exercises below give basic properties of expected value. These properties are true in general, but we will restrict the proofs primarily to the continuous case. The change of variables theorem is the main tool we will need. In these theorems \(X\) and \(Y\) are real-valued random variables for an experiment and \(c\) is a constant. We assume that the indicated expected values exist.

Linearity

\(\E(X + Y) = \E(X) + \E(Y)\)

Proof:

We apply the change of variables theorem with the function \(r(x, y) = x + y\). Suppose that \( (X, Y) \) has a continuous distribution with PDF \( f \), and that \( X \) takes values in \( S \subseteq \R \) and \( Y \) takes values in \( T \subseteq \R \). Recall that \( X \) has PDF \( g(x) = \int_T f(x, y) \, dy \) for \( x \in S \) and \( Y \) has PDF \( h(y) = \int_S f(x, y) \, dx \) for \( y \in T \). Thus

\[ \begin{align} \E(X + Y) & = \int_{S \times T} (x + y) f(x, y) \, d(x, y) = \int_{S \times T} x f(x, y) \, d(x, y) + \int_{S \times T} y f(x, y) \, d(x, y) \\ & = \int_S x \left( \int_T f(x, y) \, dy \right) \, dx + \int_T y \left( \int_S f(x, y) \, dx \right) \, dy = \int_S x g(x) \, dx + \int_T y h(y) \, dy = \E(X) + \E(Y) \end{align}\]

\(\E(c X) = c \, \E(X)\)

Proof:

We apply the change of variables theorem with the function \(r(x) = c \, x\). Suppose that \( X \) has a continuous distribution on \( S \subseteq \R \) with PDF \( f \). Then

\[ \E(c X) = \int_S c \, x f(x) \, dx = c \int_S x f(x) \, dx = c \E(X) \]

Suppose that \((X_1, X_2, \ldots)\) is a sequence of real-valued random variables for our experiment and that \((a_1, a_2, \ldots, a_n)\) is a sequence of constants. Then, as a consequence of the previous two results,

\[\E\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n a_i \E(X_i)\]

Thus, expected value is a linear operation. The linearity of expected value is so basic that it is important to understand this property on an intuitive level. Indeed, it is implied by the interpretation of expected value given in the law of large numbers.

Suppose that \((X_1, X_2, \ldots, X_n)\) is a sequence of real-valued random variables, with common mean \(\mu\). If the random variables are also independent and identically distributed, then in statistical terms, the sequence is a random sample of size \(n\) from the common distribution.

  1. Let \(Y = \sum_{i=1}^n X_i\), the sum of the variables. Then \(\E(Y) = n \mu\).
  2. Let \(M = \frac{1}{n} \sum_{i=1}^n X_i\), the average of the variables. Then \(\E(M) = \mu\).
Proof:

For part (a),

\[ \E(Y) = \E\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \E(X_i) = \sum_{i=1}^n \mu = n \mu \]

For part (b), note that \( M = Y / n \). Hence \( \E(M) = \E(Y) / n = \mu \).

In several important cases, a random variable from a special distribution can be decomposed into a sum of simpler random variables, and then part (a) of Theorem 7 can be used to compute the expected value.

Inequalities

The following exercises give some basic inequalities for expected value. The first is the most obvious, but is also the main tool for proving the others.

Suppose that \(\P(X \ge 0) = 1\). Then

  1. \(\E(X) \ge 0\)
  2. If \( \P(X \gt 0) \gt 0 \) then \( \E(X) \gt 0 \).
Proof:

Part (a) follows from the definition. We can take the set of values \( S \) of \( X \) to be a subset of \( [0, \infty) \). For part (b), suppose that \( \P(X \gt 0) \gt 0 \) (in addition to \( \P(X \ge 0) = 1 \)). By the continuity theorem for increasing events, there exists \( \epsilon \gt 0 \) such that \( \P(X \ge \epsilon) \gt 0 \). Therefore \( X - \epsilon \bs{1}(X \ge \epsilon) \ge 0 \) (with probability 1). By part (a), linearity, and Theorem 2, \( \E(X) - \epsilon \P(X \ge \epsilon) \ge 0 \) so \( \E(X) \ge \epsilon \P(X \ge \epsilon) \gt 0 \).

Suppose that \(\P(X \le Y) = 1\). Then

  1. \(\E(X) \le \E(Y)\)
  2. If \( \P(X \lt Y) \gt 0 \) then \( \E(X) \lt \E(Y) \).
Proof:

The assumption is equivalent to \( \P(Y - X \ge 0) = 1 \). Thus \( \E(Y - X) \ge 0 \) by part (a) of Theorem 8. But then \( \E(Y) - \E(X) \ge 0 \) by the linearity of expected value. Similarly, part (b) follows from part (b) of Theorem 8.

Thus, expected value is an increasing operator. This is perhaps the second most important property of expected value, after linearity.

Absolute value inequalities:

  1. \(|\E(X)| \le \E(|X|)\)
  2. If \( \P(X \gt 0) \gt 0 \) and \( \P(X \lt 0) \gt 0 \) then \( |\E(X)| \lt \E(|X|) \).
Proof:

\( -|X| \le X \le |X| \) (with probability 1) so by Theorem 9 (a), \( \E(-|X|) \le \E(X) \le \E(|X|) \). By linearity, \( -\E(|X|) \le \E(X) \le \E(|X|) \) which implies \( |\E(X)| \le \E(|X|) \). For part (b), if \( \P(X \gt 0) \gt 0 \) then \( \P(-|X| \lt X ) \gt 0 \), and if \( \P(X \lt 0) \gt 0 \) then \( \P(X \lt |X|) \gt 0 \). Hence by Theorem 9 (b), \( -\E(|X|) \lt \E(X) \lt \E(|X|) \) and therefore \( |\E(X)| \lt \E(|X|) \).

Only in Lake Wobegon are all of the children above average:

If \( \P[X \ne \E(X)] \gt 0 \) then

  1. \(\P[X \gt \E(X)] \gt 0\)
  2. \(\P[X \lt \E(X)] \gt 0\)
Proof:

We prove the contrapositive. Thus suppose that \( \P[X \gt \E(X)] = 0 \) so that \( \P[X \le \E(X)] = 1 \). If \( \P[X \lt \E(X)] \gt 0 \) then by Theorem 9 we have \( \E(X) \lt \E(X) \), a contradiction. Thus \( \P[X = \E(X)] = 1\). Similarly, if \( \P[X \lt \E(X)] = 0 \) then \( \P[X = \E(X)] = 1 \).

Thus, if \( X \) is not a constant (with probability 1), then \( X \) must take values greater than its mean with positive probability and values less than its mean with positive probability.

Symmetry

Suppose that the distribution of \( X \) is symmetric about \( a \in \R \). That is, the distribution of \( a - X \) is the same as the distribution of \( X - a \). If \(\E(X)\) exists, then \(\E(X) = a\).

Proof:

Since \( \E(X) \) exists we have \( \E(a - X) = \E(X - a) \) so by linearity \( a - \E(X) = \E(X) - a \). Equivalently \( 2 \E(X) = 2 a \).

The previous result applies if \(X\) has a continuous distribution on \(\R\) with a probability density \(f\) that is symmetric about \(a\): \(f(a + t) = f(a - t)\) for \(t \in \R\).

Independence

If \(X\) and \(Y\) are independent real-valued random variables then \(\E(X Y) = \E(X) \E(Y)\).

Proof:

We apply the change of variables theorem with the function \(r(x, y) = x y\). Suppose that \( X \) has a continuous distribution on \( S \subseteq \R \) with PDF \( g \) and that \( Y \) has a continuous distribution on \( T \subseteq \R \) with PDF \( h \). Then \( (X, Y) \) has PDF \( f(x, y) = g(x) h(y) \) on \( S \times T \). Hence

\[ \E(X Y) = \int_{S \times T} x y f(x, y) \, d(x, y) = \int_{S \times T} x y g(x) h(y) \, d(x, y) = \int_S x g(x) \, dx \int_T y h(y) \, dy = \E(X) \E(Y) \]

It follows from the last exercise that independent random variables are uncorrelated (a concept that we will study in a later section). Moreover, this result is more powerful than might first appear. Suppose that \(X\) and \(Y\) are independent random variables taking values in general spaces \(S\) and \(T\) respectively, and that \(u: S \to \R\) and \(v: T \to \R\). Then \(u(X)\) and \(v(Y)\) are independent, real-valued random variables and hence

\[ \E[u(X) v(Y)] = \E[u(X)] \E[v(Y)] \]

Examples and Applications

Uniform Distributions

Suppose that \(X\) has the discrete uniform distribution on a finite set \(S \subseteq \R\).

  1. \(\E(X)\) is the arithmetic average of the numbers in \(S\).
  2. If \(X\) is uniformly distributed on \(\{m, m + 1, \ldots, n\}\) where \(m \le n\), then \(\E(X) = \frac{m + n}{2}\), the average of the endpoints.

Suppose that \(X\) has the continuous uniform distribution on an interval \([a, b]\).

  1. \(\E(X) = \frac{a + b}{2}\), the midpoint of the interval.
  2. Find a general formula for the moments of \(X\).
Answer:
  2. \(\E(X^n) = \frac{b^{n+1} - a^{n+1}}{(n + 1)(b - a)}\)

Suppose that \(X\) is uniformly distributed on the interval \([a, b]\), and that \(g\) is an integrable function from \([a, b]\) into \(\R\). Then \(\E[g(X)]\) is the average value of \(g\) on \([a, b]\), as defined in calculus.

\[ \E[g(X)] = \frac{1}{b - a} \int_a^b g(x) dx \]
Proof:

This result follows immediately from the change of variables theorem, since \( X \) has PDF \( f(x) = 1 / (b - a) \) for \( a \le x \le b \).

Find the average value of the following functions on the given intervals:

  1. \(f(x) = x\) on \([2, 4]\)
  2. \(g(x) = x^2\) on \([0, 1]\)
  3. \(h(x) = \sin(x)\) on \([0, \pi]\).
Answer:
  1. \(3\)
  2. \(\frac{1}{3}\)
  3. \(\frac{2}{\pi}\)

Suppose that \(X\) is uniformly distributed on \([-1, 3]\).

  1. Find the probability density function of \(X^2\).
  2. Find \(\E(X^2)\) using the probability density function in (a).
  3. Find \(\E(X^2)\) using the change of variables theorem.
Answer:
  1. \(g(y) = \begin{cases} \frac{1}{4} y^{-1/2}, & 0 \lt y \lt 1 \\ \frac{1}{8} y^{-1/2}, & 1 \lt y \lt 9 \end{cases}\)
  2. \(\frac{7}{3}\)
  3. \(\frac{7}{3}\)
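The two computations of \(\E(X^2)\) can be written out as follows; this is a worked sketch using the density of \(X^2\) given in the answer to the first part. From the probability density function of \(X^2\),

\[ \E(X^2) = \int_0^1 y \cdot \frac{1}{4} y^{-1/2} \, dy + \int_1^9 y \cdot \frac{1}{8} y^{-1/2} \, dy = \frac{1}{4} \cdot \frac{2}{3} y^{3/2} \bigg|_0^1 + \frac{1}{8} \cdot \frac{2}{3} y^{3/2} \bigg|_1^9 = \frac{1}{6} + \frac{26}{12} = \frac{7}{3} \]

From the change of variables theorem, since \(X\) has density \(\frac{1}{4}\) on \([-1, 3]\),

\[ \E(X^2) = \int_{-1}^3 x^2 \cdot \frac{1}{4} \, dx = \frac{x^3}{12} \bigg|_{-1}^3 = \frac{27 + 1}{12} = \frac{7}{3} \]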

Dice

Recall that a standard die is a six-sided die. A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 have probability \(\frac{1}{4}\) each, and faces 2, 3, 4, and 5 have probability \(\frac{1}{8}\) each.

Two standard, fair dice are thrown, and the scores \((X_1, X_2)\) recorded. Find the expected value of each of the following variables.

  1. \(Y = X_1 + X_2\), the sum of the scores.
  2. \(M = \frac{1}{2} (X_1 + X_2)\), the average of the scores.
  3. \(Z = X_1 X_2\), the product of the scores.
  4. \(U = \min\{X_1, X_2\}\), the minimum score
  5. \(V = \max\{X_1, X_2\}\), the maximum score.
Answer:
  1. \(7\)
  2. \(\frac{7}{2}\)
  3. \(\frac{49}{4}\)
  4. \(\frac{91}{36}\)
  5. \(\frac{161}{36}\)

In the dice experiment, select two fair dice. Note the shape of the probability density function and the location of the mean for the sum, minimum, and maximum variables. Run the experiment 1000 times and note the apparent convergence of the sample mean to the distribution mean for each of these variables.

Repeat Exercise 19 for ace-six flat dice.

Answer:
  1. \(7\)
  2. \(\frac{7}{2}\)
  3. \(\frac{49}{4}\)
  4. \(\frac{77}{32}\)
  5. \(\frac{147}{32}\)
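The expected values in the two-dice exercises above can be checked by exact enumeration of the 36 outcomes. The following is a minimal sketch; the function name `dice_means` and the dictionary layout are illustrative assumptions. Since the dice are independent, the joint probability of each outcome is the product of the face probabilities.

```python
from fractions import Fraction
from itertools import product

def dice_means(pmf):
    """Expected values of sum, average, product, min, max for two independent dice.

    pmf maps each face to its probability (the same PMF is used for both dice).
    """
    stats = {
        "sum": lambda a, b: a + b,
        "average": lambda a, b: Fraction(a + b, 2),
        "product": lambda a, b: a * b,
        "min": min,
        "max": max,
    }
    means = {name: Fraction(0) for name in stats}
    for (a, pa), (b, pb) in product(pmf.items(), repeat=2):
        for name, g in stats.items():
            means[name] += g(a, b) * pa * pb  # change of variables on (X1, X2)
    return means

fair = {k: Fraction(1, 6) for k in range(1, 7)}
flat = {1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8),
        4: Fraction(1, 8), 5: Fraction(1, 8), 6: Fraction(1, 4)}  # ace-six flat

for label, pmf in (("fair", fair), ("ace-six flat", flat)):
    print(label, {name: str(value) for name, value in dice_means(pmf).items()})
```

In both cases the minimum and maximum means add up to 7, as they must, since \(U + V = X_1 + X_2\).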

Repeat Exercise 20 for ace-six flat dice.

Bernoulli Trials

Recall that a Bernoulli trials process is a sequence \(\bs{X} = (X_1, X_2, \ldots)\) of independent, identically distributed indicator random variables. In the usual language of reliability, \(X_i\) denotes the outcome of trial \(i\), where 1 denotes success and 0 denotes failure. The probability of success \(p = \P(X_i = 1)\) is the basic parameter of the process. The process is named for Jacob Bernoulli. A separate chapter on the Bernoulli Trials explores this process in detail.

The number of successes in the first \(n\) trials is \(Y = \sum_{i=1}^n X_i\). Recall that this random variable has the binomial distribution with parameters \(n\) and \(p\), and has probability density function

\[ f(y) = \binom{n}{y} p^y (1 - p)^{n - y}, \quad y \in \{0, 1, \ldots, n\} \]

If \(Y\) has the binomial distribution with parameters \(n\) and \(p\) then \(\E(Y) = n p\).

Proof:

This result can be proved, of course, from the definition of expected value. The critical identity is \(y \binom{n}{y} = n \binom{n - 1}{y - 1}\). A better proof is based on the representation of \(Y\) as a sum of indicator variables. The result follows immediately from Theorem 7, since \( \E(X_i) = p \) for each \( i \in \N_+ \).
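For comparison, the direct proof from the definition can be sketched as follows. The term with \(y = 0\) vanishes, the critical identity is applied to the remaining terms, and then the substitution \(k = y - 1\) and the binomial theorem finish the computation:

\[ \E(Y) = \sum_{y=0}^n y \binom{n}{y} p^y (1 - p)^{n - y} = \sum_{y=1}^n n \binom{n - 1}{y - 1} p^y (1 - p)^{n - y} = n p \sum_{k=0}^{n-1} \binom{n - 1}{k} p^k (1 - p)^{(n - 1) - k} = n p \]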

In the binomial coin experiment, vary \(n\) and \(p\) and note the shape of the probability density function and the location of the mean. For selected values of \(n\) and \(p\), run the experiment 1000 times and note the apparent convergence of the sample mean to the distribution mean.

Now let \(N\) denote the trial number of the first success. This random variable has the geometric distribution on \(\N_+\) with parameter \(p\), and has probability density function

\[ g(n) = p (1 - p)^{n-1}, \quad n \in \N_+ \]

If \(N\) has the geometric distribution on \(\N_+\) with parameter \(p\) then \(\E(N) = 1 / p\).

Proof:

The key is the formula for the derivative of a geometric series:

\[ \E(N) = \sum_{n=1}^\infty n p (1 - p)^{n-1} = -p \frac{d}{dp} \sum_{n=0}^\infty (1 - p)^n = -p \frac{d}{dp} \frac{1}{p} = p \frac{1}{p^2} = \frac{1}{p}\]

In the negative binomial experiment, select \(k = 1\) to get the geometric distribution. Vary \(p\) and note the shape of the probability density function and the location of the mean. For selected values of \(p\), run the experiment 1000 times and note the apparent convergence of the sample mean to the distribution mean.

The Hypergeometric Distribution

Suppose that a population consists of \(m\) objects; \(r\) of the objects are type 1 and \(m - r\) are type 0. A sample of \(n\) objects is chosen at random, without replacement. Let \(X_i\) denote the type of the \(i\)th object selected. Recall that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a sequence of identically distributed (but not independent) indicator random variables. In fact the sequence is exchangeable.

Let \(Y\) denote the number of type 1 objects in the sample, so that \(Y = \sum_{i=1}^n X_i\). Recall that \(Y\) has the hypergeometric distribution, which has probability density function

\[ f(y) = \frac{\binom{r}{y} \binom{m - r}{n - y}}{\binom{m}{n}}, \quad y \in \{0, 1, \ldots, n\} \]

If \(Y\) has the hypergeometric distribution with parameters \(m\), \(n\), and \(r\) then \(\E(Y) = n \frac{r}{m}\).

Proof:

This result can be proved, of course, from the definition of expected value. A much simpler proof uses the linearity of expected value and the representation of \(Y\) as a sum of indicator variables. The result follows immediately from Theorem 7, since \( \E(X_i) = r / m \) for each \( i \in \{1, 2, \ldots, n\} \).

In the ball and urn experiment, vary \(n\), \(r\), and \(m\) and note the shape of the probability density function and the location of the mean. For selected values of the parameters, run the experiment 1000 times and note the apparent convergence of the sample mean to the distribution mean.

Note that if we select the objects with replacement, then \(\bs{X}\) would be a sequence of Bernoulli trials, and hence \(Y\) would have the binomial distribution with parameters \(n\) and \(p = \frac{r}{m}\). Thus, the mean would still be \(\E(Y) = n \frac{r}{m}\).
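The agreement of the two means can be checked numerically from the probability density functions. This is a minimal sketch with hypothetical parameter values (\(m = 50\), \(r = 20\), \(n = 10\)); the function names are illustrative, and exact rational arithmetic avoids rounding.

```python
from fractions import Fraction
from math import comb

def hypergeometric_mean(m, r, n):
    """E(Y) from the definition, sampling n of m objects without replacement."""
    total = comb(m, n)
    return sum(Fraction(y * comb(r, y) * comb(m - r, n - y), total)
               for y in range(n + 1))

def binomial_mean(n, p):
    """E(Y) from the definition for the binomial distribution (p a Fraction)."""
    return sum(y * comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1))

m, r, n = 50, 20, 10  # assumed values, chosen only for illustration
print(hypergeometric_mean(m, r, n))        # 4
print(binomial_mean(n, Fraction(r, m)))    # 4
```

Both sums reduce to \(n r / m = 4\), even though the two distributions differ in every other respect (for example, in their spread).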

The Poisson Distribution

Recall that the Poisson distribution has density function

\[ f(n) = e^{-a} \frac{a^n}{n!}, \quad n \in \N \]

where \(a \gt 0\) is a parameter. The Poisson distribution is named after Simeon Poisson and is widely used to model the number of random points in a region of time or space; the parameter \(a\) is proportional to the size of the region. The Poisson distribution is studied in detail in the chapter on the Poisson Process.

If \(N\) has the Poisson distribution with parameter \(a\) then \(\E(N) = a\). Thus, the parameter of the Poisson distribution is the mean of the distribution.

Proof:

\[ \E(N) = \sum_{n=0}^\infty n e^{-a} \frac{a^n}{n!} = e^{-a} \sum_{n=1}^\infty \frac{a^n}{(n - 1)!} = e^{-a} a \sum_{n=1}^\infty \frac{a^{n-1}}{(n-1)!} = e^{-a} a e^a = a.\]

In the Poisson experiment, the parameter is \(a = r t\). Vary the parameter and note the shape of the probability density function and the location of the mean. For various values of the parameter, run the experiment 1000 times and note the apparent convergence of the sample mean to the distribution mean.

The Exponential Distribution

Recall that the exponential distribution is a continuous distribution with probability density function

\[ f(t) = r e^{-r t}, \quad 0 \le t \lt \infty \]

where \(r \gt 0\) is the rate parameter. This distribution is widely used to model failure times and other arrival times; in particular, the distribution governs the time between arrivals in the Poisson model. The exponential distribution is studied in detail in the chapter on the Poisson Process.

Suppose that \(T\) has the exponential distribution with rate parameter \(r\). Then \( \E(T) = 1 / r \).

Proof:

This result follows from the definition and an integration by parts:

\[ \E(T) = \int_0^\infty t r e^{-r t} \, dt = -t e^{-r t} \bigg|_0^\infty + \int_0^\infty e^{-r t} \, dt = 0 - \frac{1}{r} e^{-rt} \bigg|_0^\infty = \frac{1}{r} \]

Recall that the mode of \( T \) is 0 and the median of \( T \) is \( \ln(2) / r \). Note how these measures of center are ordered: \(0 \lt \ln(2) / r \lt 1 / r\).

In the special distribution simulator, select the gamma distribution. Set \(k = 1\) to get the exponential distribution. Vary \(r\) with the scroll bar and note the position of the mean relative to the graph of the probability density function. For selected values of \(r\), run the experiment 1000 times and note the apparent convergence of the sample mean to the distribution mean.

Suppose again that \(T\) has the exponential distribution with rate parameter \(r\) and suppose that \(t \gt 0\). Find \(\E(T \mid T \gt t)\).

Answer:

\(t + \frac{1}{r}\)
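One way to arrive at this answer, sketched here from the definition of conditional expected value: since \(\P(T \gt t) = e^{-r t}\), the conditional probability density function of \(T\) given \(T \gt t\) is \(s \mapsto r e^{-r s} / e^{-r t} = r e^{-r (s - t)}\) for \(s \gt t\). Substituting \(u = s - t\) and using \(\E(T) = 1 / r\),

\[ \E(T \mid T \gt t) = \int_t^\infty s \, r e^{-r (s - t)} \, ds = \int_0^\infty (t + u) \, r e^{-r u} \, du = t \int_0^\infty r e^{-r u} \, du + \int_0^\infty u \, r e^{-r u} \, du = t + \frac{1}{r} \]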

The Gamma Distribution

Recall that the gamma distribution is a continuous distribution with probability density function

\[ f(t) = r^n \frac{t^{n-1}}{(n - 1)!} e^{-r t}, \quad 0 \le t \lt \infty\]

where \(n \in \N_+\) is the shape parameter and \(r \gt 0\) is the rate parameter. This distribution is widely used to model failure times and other arrival times. The gamma distribution is studied in detail in the chapter on the Poisson Process. In particular, if \((X_1, X_2, \ldots, X_n)\) is a sequence of independent random variables, each having the exponential distribution with rate parameter \(r\), then \(T = \sum_{i=1}^n X_i\) has the gamma distribution with shape parameter \(n\) and rate parameter \(r\).

Suppose that \(T\) has the gamma distribution with shape parameter \(n\) and rate parameter \(r\). Then \(\E(T) = n / r\).

Proof:

This result can be proved from the definition of expected value, but a better proof uses the representation of \(T\) as a sum of exponential variables. The result follows immediately from Theorem 7.

In the special distribution simulator, select the gamma distribution. Vary the parameters and note the position of the mean relative to the graph of the probability density function. For selected parameter values, run the experiment 1000 times and note the apparent convergence of the sample mean to the distribution mean.

Beta Distributions

The distributions in this subsection belong to the family of beta distributions, which are widely used to model random proportions and probabilities. The beta distribution is studied in detail in the chapter on Special Distributions.

Suppose that \(X\) has probability density function \(f(x) = 3 x^2\) for \(0 \le x \le 1\).

  1. Find the mean of \(X\).
  2. Find the mode of \(X\).
  3. Find the median of \(X\).
  4. Sketch the graph of \(f\) and show the location of the mean, median, and mode on the \(x\)-axis.
Answer:
  1. \(\frac{3}{4}\)
  2. \(1\)
  3. \(\left(\frac{1}{2}\right)^{1/3}\)

In the special distribution simulator, select the beta distribution and set \(a = 3\) and \(b = 1\) to get the distribution in the last exercise. Run the experiment 1000 times and note the apparent convergence of the sample mean to the distribution mean.

Suppose that a sphere has a random radius \(R\) with probability density function \(f(r) = 12 r ^2 (1 - r)\) for \(0 \le r \le 1\). Find the expected value of each of the following:

  1. The volume \(V = \frac{4}{3} \pi R^3\)
  2. The surface area \(A = 4 \pi R^2\)
  3. The circumference \(C = 2 \pi R\)
Answer:
  1. \(\frac{8}{21} \pi\)
  2. \(\frac{8}{5} \pi\)
  3. \(\frac{6}{5} \pi\)

Suppose that \(X\) has probability density function \(f(x) = \frac{1}{\pi \sqrt{x (1 - x)}}\) for \(0 \lt x \lt 1\). This particular beta distribution is also known as the arcsine distribution.

  1. Find the mean of \(X\).
  2. Find the median of \(X\).
  3. Note that \(f\) is unbounded, so \(X\) does not have a mode.
  4. Sketch the graph of \(f\) and show the location of the mean and median on the \(x\)-axis.
Answer:
  1. \(\frac{1}{2}\)
  2. \(\frac{1}{2}\)

The Pareto Distribution

Recall that the Pareto distribution is a continuous distribution with probability density function

\[ f(x) = \frac{a}{x^{a + 1}}, \quad 1 \le x \lt \infty \]

where \(a \gt 0\) is a parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is widely used to model financial variables such as income. The Pareto distribution is studied in detail in the chapter on Special Distributions.

Suppose that \(X\) has the Pareto distribution with shape parameter \(a\). Then

  1. \(\E(X) = \infty\) if \(0 \lt a \le 1\)
  2. \(\E(X) = \frac{a}{a - 1}\) if \(a \gt 1\)
Proof:

If \( a \ne 1 \),

\[ \E(X) = \int_1^\infty x \frac{a}{x^{a+1}} \, dx = \int_1^\infty \frac{a}{x^a} \, dx = \frac{a}{-a + 1} x^{-a + 1} \bigg|_1^\infty = \begin{cases} \infty, & 0 \lt a \lt 1 \\ \frac{a}{a - 1}, & 1 \lt a \lt \infty \end{cases} \]

If \( a = 1 \), \( \E(X) = \int_1^\infty x \frac{1}{x^2} \, dx = \int_1^\infty \frac{1}{x} \, dx = \ln(x) \bigg|_1^\infty = \infty \).

In the special distribution simulator, select the Pareto distribution. For the following values of the shape parameter \(a\), run the experiment 1000 times and note the behavior of the empirical mean.

  1. \(a = 1\)
  2. \(a = 2\)
  3. \(a = 3\).
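For readers without access to the simulator, the behavior of the empirical mean can be sketched in a few lines. This is an illustrative stand-in, not the simulator's method: it assumes sampling by the inverse transform \(X = U^{-1/a}\) with \(U\) uniform on \((0, 1]\), which follows from the distribution function \(F(x) = 1 - x^{-a}\) for \(x \ge 1\). The function name and seed are arbitrary choices.

```python
import random

def pareto_running_means(a, n, seed=0):
    """Return the running sample means of n simulated Pareto(a) variables."""
    rng = random.Random(seed)
    total = 0.0
    means = []
    for i in range(1, n + 1):
        u = 1.0 - rng.random()    # uniform on (0, 1], so u is never zero
        total += u ** (-1.0 / a)  # inverse transform: a Pareto(a) value, >= 1
        means.append(total / i)
    return means

for a in (1, 2, 3):
    means = pareto_running_means(a, 100_000)
    # For a = 2 and a = 3 the means should settle near a / (a - 1);
    # for a = 1 the mean is infinite and the running mean keeps jumping upward.
    print(a, [round(means[k - 1], 3) for k in (100, 10_000, 100_000)])
```

With \(a = 2\) the mean exists but convergence is visibly slower than with \(a = 3\), because the tail is heavier.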

The Cauchy Distribution

Recall that the Cauchy distribution has probability density function

\[ f(x) = \frac{1}{\pi (1 + x^2)}, \quad x \in \R \]

This distribution is named for Augustin Cauchy. The Cauchy distribution is studied in detail in the chapter on Special Distributions.

If \(X\) has the Cauchy distribution then \( \E(X) \) does not exist.

Proof:

By definition,

\[ \E(X) = \int_{-\infty}^\infty x \frac{1}{\pi (1 + x^2)} \, dx = \frac{1}{2 \pi} \ln(1 + x^2) \bigg|_{-\infty}^\infty \]

The two limits in the improper integral must exist independently. But the limits at \( \infty \) and \( -\infty \) are both \( \infty \), so the integral is undefined.

Note that the graph of \( f \) is symmetric about 0 and is unimodal. Thus, the mode and median of \( X \) are both 0. By Theorem 12, if \( X \) had a mean, the mean would be 0 also, but alas the mean does not exist. Moreover, the non-existence of the mean is not just a pedantic technicality. If we think of the probability distribution as a mass distribution, then the moment to the right of \( a \) is \( \int_a^\infty (x - a) f(x) \, dx = \infty \) and the moment to the left of \( a \) is \( \int_{-\infty}^a (x - a) f(x) \, dx = -\infty \) for every \( a \in \R \). The center of mass simply does not exist. Probabilistically, the law of large numbers fails, as you can see in the following simulation exercise:

In the special distribution simulator, select the Cauchy distribution. Run the simulation 1000 times and note the behavior of the empirical mean.

The Normal Distribution

Recall that the standard normal distribution is a continuous distribution with density function

\[ \phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2}, \quad z \in \R \]

Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in detail in the chapter on Special Distributions.

If \(Z\) has the standard normal distribution then \( \E(Z) = 0 \).

Proof:

Using a simple change of variables, we have

\[ \E(Z) = \int_{-\infty}^\infty z \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2} \, dz = - \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2} \bigg|_{-\infty}^\infty = 0 - 0 = 0 \]

The standard normal distribution is unimodal and symmetric about \( 0 \). Thus, the median, mean, and mode all agree.

Suppose again that \(Z\) has the standard normal distribution and that \(\mu \in (-\infty, \infty)\) and \(\sigma \in (0, \infty)\). Recall that \(X = \mu + \sigma Z\) has the normal distribution with location parameter \(\mu\) and scale parameter \(\sigma\). Then \(\E(X) = \mu\), so that the location parameter is the mean.
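
For completeness, the claim follows in one line from the linearity of expected value together with \(\E(Z) = 0\):

\[ \E(X) = \E(\mu + \sigma Z) = \mu + \sigma \E(Z) = \mu + \sigma \cdot 0 = \mu \]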

In the special distribution simulator, select the normal distribution. Vary the parameters and note the location of the mean. For selected parameter values, run the simulation 1000 times and note the apparent convergence of the empirical mean to the true mean.

Additional Exercises

Suppose that \((X, Y)\) has probability density function \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Find the following expected values:

  1. \(\E(X)\)
  2. \(\E(X^2 Y)\)
  3. \(\E(X^2 + Y^2)\)
  4. \(\E(X Y \mid Y \gt X)\)
Answer:
  1. \(\frac{7}{12}\)
  2. \(\frac{17}{72}\)
  3. \(\frac{5}{6}\)
  4. \(\frac{1}{3}\)

Suppose that \(N\) has a discrete distribution with probability density function \(f(n) = \frac{1}{50} n^2 (5 - n)\) for \(n \in \{1, 2, 3, 4\}\). Find each of the following:

  1. The median of \(N\).
  2. The mode of \(N\).
  3. \(\E(N)\).
  4. \(\E(N^2)\).
  5. \(\E(1 / N)\).
  6. \(\E(1 / N^2)\).
Answer:
  1. 3
  2. 3
  3. \(\frac{73}{25}\)
  4. \(\frac{47}{5}\)
  5. \(\frac{2}{5}\)
  6. \(\frac{1}{5}\)
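
The answers above can be checked with exact rational arithmetic. The following is a minimal sketch using Python's `fractions` module; the helper names `pdf`, `expected_value`, and `median` are illustrative assumptions, not part of the text.

```python
from fractions import Fraction

support = [1, 2, 3, 4]

def pdf(n):
    """Probability density function f(n) = n^2 (5 - n) / 50 on {1, 2, 3, 4}."""
    return Fraction(n * n * (5 - n), 50)

def expected_value(g):
    """Return E[g(N)], the sum of g(n) f(n) over the support."""
    return sum(Fraction(g(n)) * pdf(n) for n in support)

def median():
    """Return the smallest n whose cumulative probability is at least 1/2."""
    cumulative = Fraction(0)
    for n in support:
        cumulative += pdf(n)
        if cumulative >= Fraction(1, 2):
            return n

mode = max(support, key=pdf)  # value with the largest probability

print(median(), mode)
print(expected_value(lambda n: n), expected_value(lambda n: n * n))
print(expected_value(lambda n: Fraction(1, n)),
      expected_value(lambda n: Fraction(1, n * n)))
```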

Suppose that \(X\) and \(Y\) are real-valued random variables with \(\E(X) = 5\) and \(\E(Y) = -2\). Find \(\E(3 X + 4 Y - 7)\).

Answer:

0

Suppose that \(X\) and \(Y\) are real-valued, independent random variables, and that \(\E(X) = 5\) and \(\E(Y) = -2\). Find \(\E[(3 X - 4) (2 Y + 7)]\).

Answer:

33

Suppose that there are 5 duck hunters, each a perfect shot. A flock of 10 ducks flies over, and each hunter selects one duck at random and shoots. Find the expected number of ducks killed.

Answer:

Express the number of ducks killed \(N\) as a sum of indicator random variables. Then \(\E(N) = 10 \left[1 - \left(\frac{9}{10}\right)^5\right] \approx 4.095\).

For a more complete analysis of the duck hunter problem, see The Number of Distinct Sample Values in the chapter on Finite Sampling Models.
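
The indicator-variable answer to the duck hunter problem can also be compared with a simulation. The sketch below is illustrative only; the function names, the seed, and the number of trials are assumptions.

```python
import random

def expected_ducks_killed(hunters, ducks):
    """Exact mean: each duck is hit with probability 1 - (1 - 1/ducks)^hunters."""
    return ducks * (1 - (1 - 1 / ducks) ** hunters)

def simulate_ducks_killed(hunters, ducks, trials, rng):
    """Average number of distinct ducks chosen over repeated trials."""
    total = 0
    for _ in range(trials):
        # Each hunter independently picks one duck uniformly at random.
        targets = {rng.randrange(ducks) for _ in range(hunters)}
        total += len(targets)
    return total / trials

rng = random.Random(7)  # arbitrary seed, used only for reproducibility
exact = expected_ducks_killed(5, 10)
estimate = simulate_ducks_killed(5, 10, 100_000, rng)
print(exact, estimate)
```

The set comprehension counts distinct targets, so two hunters choosing the same duck contribute only one kill, exactly as in the indicator-variable argument.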

Additional Properties

Special Results for Nonnegative Variables

The inequality in the next exercise is known as Markov's inequality (named after Andrei Markov). It gives an upper bound for the tail probability of a nonnegative random variable in terms of the expected value of the variable.

If \(X\) is a nonnegative random variable, then

\[ \P(X \ge x) \le \frac{\E(X)}{x}, \quad x \gt 0 \]
Proof:

Note that \(x \bs{1}(X \ge x) \le X\). Taking expected values through this inequality gives \(x \P(X \ge x) \le \E(X)\).
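
As a quick numerical illustration of Markov's inequality, the sketch below checks the bound exactly for one simple case. The choice of distribution (the score on a fair six-sided die) and the helper name `tail` are assumptions made for the example.

```python
from fractions import Fraction

# Assumed example: X is the score on a fair six-sided die.
pdf = {x: Fraction(1, 6) for x in range(1, 7)}
mean = sum(x * p for x, p in pdf.items())  # E(X)

def tail(x):
    """Return P(X >= x)."""
    return sum(p for value, p in pdf.items() if value >= x)

# Compare the tail probability with the Markov bound E(X) / x.
for x in range(1, 8):
    bound = mean / x
    print(x, tail(x), bound, tail(x) <= bound)
```

The bound is crude for small \(x\) (it exceeds 1 when \(x \lt \E(X)\)) but it requires no information about the distribution other than the mean.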

If \(X\) is a nonnegative random variable then

\[ \E(X) = \int_0^\infty \P(X \gt x) \, dx \]
Proof:

A proof can be constructed by expressing \(\P(X \gt x)\) in terms of the probability density function of \(X\), as a sum in the discrete case or an integral in the continuous case. Then in the expression \( \int_0^\infty \P(X \gt x) \, dx \) interchange the integral and the sum (in the discrete case) or the two integrals (in the continuous case). There is a much more elegant proof if we use the fact that we can interchange expected values and integrals when the integrand is nonnegative (this is a special case of Fubini's theorem).

\[ \int_0^\infty \P(X \gt x) \, dx = \int_0^\infty \E[\bs{1}(X \gt x)] \, dx = \E \left(\int_0^\infty \bs{1}(X \gt x) \, dx \right) = \E\left( \int_0^X 1 \, dx \right) = \E(X) \]
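
The tail formula can be checked numerically. The following sketch approximates \( \int_0^\infty \P(X \gt x) \, dx \) by a midpoint rule for two assumed examples, one continuous (uniform on \([0, 2]\)) and one discrete (a fair die). The function names and the number of steps are illustrative choices.

```python
import math

def integrate_tail(tail, upper, steps=6000):
    """Midpoint-rule approximation of the integral of tail(x) over [0, upper]."""
    width = upper / steps
    return width * sum(tail((i + 0.5) * width) for i in range(steps))

def uniform_tail(x):
    """P(X > x) for X uniformly distributed on [0, 2]."""
    return max(0.0, 1.0 - x / 2.0)

def die_tail(x):
    """P(X > x) for X the score on a fair six-sided die."""
    return max(0, 6 - math.floor(x)) / 6

uniform_mean = integrate_tail(uniform_tail, 2.0)  # should be close to E(X) = 1
die_mean = integrate_tail(die_tail, 6.0)          # should be close to E(X) = 7/2
print(uniform_mean, die_mean)
```

Both tail functions vanish beyond the upper limit used, so truncating the integral loses nothing. Note that the same formula handles the discrete and the continuous case without reference to a density.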

We can now prove another special case of the change of variables theorem.

The change of variables formula for expected value holds when the random variable \(X\) has a continuous distribution on \(S \subseteq \R^n\) with probability density function \(f\), and \(r: S \to [0, \infty)\).

Proof:

From Exercise 53,

\[ \E[r(X)] = \int_0^\infty \P[r(X) \gt t] \, dt = \int_0^\infty \int_{r^{-1}(t, \infty)} f(x) \, dx \, dt = \int_S \int_0^{r(x)} f(x) \, dt \, dx = \int_S r(x) f(x) \, dx \]

The following result is similar to Exercise 53, but is specialized to nonnegative integer valued variables:

Suppose that \(N\) has a discrete distribution, taking values in \(\N\). Then

\[ \E(N) = \sum_{n=0}^\infty \P(N \gt n) = \sum_{n=1}^\infty \P(N \ge n) \]
Proof:

First, the two sums on the right are equivalent by a simple change of variables. A proof can be constructed by expressing \(\P(N \gt n)\) as a sum in terms of the probability density function of \(N\). Then in the expression \( \sum_{n=0}^\infty \P(N \gt n) \) interchange the two sums. A more elegant proof using the interchange of expected value and sum is

\[ \sum_{n=1}^\infty \P(N \ge n) = \sum_{n=1}^\infty \E[\bs{1}(N \ge n)] = \E\left(\sum_{n=1}^\infty \bs{1}(N \ge n) \right) = \E\left(\sum_{n=1}^N 1 \right) = \E(N) \]
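
For a concrete check of the tail-sum formula, the sketch below compares \( \sum_n n f(n) \) with \( \sum_{n \ge 1} \P(N \ge n) \) using exact fractions. The binomial distribution with 4 trials and success probability \( \frac{1}{3} \) is an assumed example, chosen only because its support is finite.

```python
from fractions import Fraction
from math import comb

def binomial_pdf(k, trials, p):
    """Probability of exactly k successes in the given number of trials."""
    return comb(trials, k) * p**k * (1 - p) ** (trials - k)

trials, p = 4, Fraction(1, 3)
pdf = {k: binomial_pdf(k, trials, p) for k in range(trials + 1)}

# Direct definition: sum of n f(n).
direct = sum(k * prob for k, prob in pdf.items())

# Tail-sum formula: sum over n >= 1 of P(N >= n).
tail_sum = sum(
    sum(prob for k, prob in pdf.items() if k >= n)
    for n in range(1, trials + 1)
)

print(direct, tail_sum)
```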

A General Definition of Expected Value

The result in Exercise 53 could be used as the basis of a general formulation of expected value that would work for discrete, continuous, or even mixed distributions, and would not require the assumption of the existence of probability density functions. First, the result in Exercise 53 is taken as the definition of \(\E(X)\) if \(X\) is nonnegative. Next, for \(x \in \R\), recall that the positive and negative parts of \(x\) are defined as

\[ x^+ = \max\{x, 0\}, \quad x^- = \max\{0, -x\} \]

For \(x \in \R\),

  1. \(x^+ \ge 0\), \(x^- \ge 0\)
  2. \(x = x^+ - x^-\)
  3. \(|x| = x^+ + x^-\)

Now, if \(X\) is a real-valued random variable, then \(X^+\) and \(X^-\), the positive and negative parts of \(X\), are nonnegative random variables. Thus, assuming that \(\E(X^+) \lt \infty\) or \(\E(X^-) \lt \infty\) we would define (anticipating linearity)

\[ \E(X) = \E(X^+) - \E(X^-) \]

The usual formulas for expected value in terms of the probability density function, for discrete, continuous, or mixed distributions, would now be proven as theorems. Essentially this would be Exercise 53 with the hypotheses and conclusions reversed.
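
The decomposition into positive and negative parts can be illustrated with a small exact computation. In this sketch the distribution (uniform on a few integers) and the helper names are assumptions made for the example.

```python
from fractions import Fraction

def positive_part(x):
    """Return x^+ = max{x, 0}."""
    return max(x, 0)

def negative_part(x):
    """Return x^- = max{0, -x}."""
    return max(0, -x)

# Assumed example: X is uniformly distributed on this set of integers.
values = [-2, -1, 0, 1, 3]
prob = Fraction(1, len(values))

mean = sum(x * prob for x in values)
mean_pos = sum(positive_part(x) * prob for x in values)
mean_neg = sum(negative_part(x) * prob for x in values)
mean_abs = sum(abs(x) * prob for x in values)

print(mean, mean_pos, mean_neg, mean_abs)
```

Here \( \E(X) = \E(X^+) - \E(X^-) \) and \( \E(|X|) = \E(X^+) + \E(X^-) \), mirroring the pointwise identities for \(x^+\) and \(x^-\).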

We can finally finish our proof of the change of variables formula for expected value.

The change of variables formula holds when \(X\) has a continuous distribution on \(S\) with probability density function \(f\), and \(r: S \to \R\).

Proof:

From Exercise 54,

\[ \begin{align} \E[r(X)] & = \E[r^+(X) - r^-(X)] = \E[r^+(X)] - \E[r^-(X)] \\ & = \int_S r^+(x) f(x) \, dx - \int_S r^-(x) f(x) \, dx = \int_S [r^+(x) - r^-(x)] f(x) \, dx = \int_S r(x) f(x) \, dx \end{align} \]

Jensen's Inequality

Our next sequence of exercises will establish an important inequality known as Jensen's inequality, named for Johan Jensen. First we need a definition. A real-valued function \(g\) defined on an interval \(S \subseteq \R\) is said to be convex on \(S\) if for each \(t \in S\), there exist numbers \(a\) and \(b\) (that may depend on \(t\)), such that

\[ a + b t = g(t), \quad a + b x \le g(x) \text{ for all } x \in S \]

The graph of \(x \mapsto a + b x\) is called a supporting line for \( g \) at \(t\). Thus, a convex function has at least one supporting line at each point in the domain.

A convex function

You may be more familiar with convexity in terms of the following theorem from calculus: If \(g\) has a continuous, non-negative second derivative on \(S\), then \(g\) is convex on \(S\) (since the tangent line at \(t\) is a supporting line at \(t\) for each \(t \in S\)). The next exercise gives the single variable version of Jensen's inequality.

If \(X\) takes values in an interval \(S\) and \(g: S \to \R\) is convex on \(S\), then

\[ \E[g(X)] \ge g[\E(X)] \]
Proof:

Note that \( \E(X) \in S \) so let \( y = a + b x \) be a supporting line for \( g \) at \( \E(X) \). Thus \(a + b \E(X) = g[\E(X)]\) and \(a + b \, X \le g(X)\). Taking expected values through the inequality gives

\[ a + b \, \E(X) = g[\E(X)] \le \E[g(X)] \]
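
The proof can be traced through a concrete case. The sketch below uses an assumed example, \(g(x) = x^2\) with \(X\) the score on a fair die, and takes the tangent line at \(\E(X)\) as the supporting line. Exact fractions are used so the comparisons are not affected by rounding.

```python
from fractions import Fraction

def g(x):
    """A convex function on the real line (assumed example)."""
    return x * x

values = range(1, 7)   # scores on a fair six-sided die
prob = Fraction(1, 6)

mean = sum(x * prob for x in values)          # E(X)
mean_of_g = sum(g(x) * prob for x in values)  # E[g(X)]

# Tangent line at t = E(X): slope b = g'(t) = 2 t, intercept a = g(t) - b t.
b = 2 * mean
a = g(mean) - b * mean

# The line touches g at E(X) and lies on or below g elsewhere.
touches = a + b * mean == g(mean)
lies_below = all(a + b * x <= g(x) for x in values)

print(mean_of_g, g(mean), touches, lies_below)
```

The gap \( \E[g(X)] - g[\E(X)] \) in this example is \( \E(X^2) - [\E(X)]^2 \), a quantity that is nonnegative for any such \(X\) by Jensen's inequality.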

Jensen's inequality extends easily to higher dimensions. The 2-dimensional version is particularly important, because it will be used to derive several special inequalities in the next section. First, a subset \(S \subseteq \R^n\) is convex if for every pair of points in \(S\), the line segment connecting those points also lies in \(S\). That is, if \(\bs{x} \in S\), \(\bs{y} \in S\), and \(p \in [0, 1]\) then \(p \bs{x} + (1 - p) \bs{y} \in S\).

A convex set

Next, a real-valued function \(g\) on \(S\) is said to be convex if for each \(\bs{t} \in S\), there exist \(a \in \R\) and \(\bs{b} \in \R^n\) (depending on \(\bs{t}\)) such that

\[ a + \bs{b} \cdot \bs{t} = g(\bs{t}), \quad a + \bs{b} \cdot \bs{x} \le g(\bs{x}) \text{ for all } \bs{x} \in S \]

The graph of \(\bs{x} \mapsto a + \bs{b} \cdot \bs{x}\) is called a supporting hyperplane for \( g \) at \(\bs{t}\) (in \( \R^2 \) it's an ordinary plane). From calculus, if \(g\) has continuous second derivatives on \(S\) and has a positive semi-definite second derivative matrix, then \(g\) is convex on \(S\).

Suppose now that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) takes values in \(S \subseteq \R^n\), and let \(\E(\bs{X}) = (\E(X_1), \E(X_2), \ldots, \E(X_n))\). The general version of Jensen's inequality is given in the next exercise.

If \(S\) is convex and \(g: S \to \R\) is convex on \(S\) then

\[ \E[g(\bs{X})] \ge g[\E(\bs{X})] \]
Proof:

First \( \E(\bs{X}) \in S \), so let \( y = a + \bs{b} \cdot \bs{x} \) be a supporting hyperplane for \( g \) at \( \E(\bs{X}) \). Thus \(a + \bs{b} \cdot \E(\bs{X}) = g[\E(\bs{X})]\) and \(a + \bs{b} \cdot \bs{X} \le g(\bs{X})\). Taking expected values through the inequality gives

\[ a + \bs{b} \cdot \E(\bs{X}) = g[\E(\bs{X})] \le \E[g(\bs{X})] \]

We will study the expected value of random vectors and matrices in more detail in a later section. In both the one and \(n\)-dimensional cases, a function \(g: S \to \R\) is concave if the inequality in the definition is reversed. Jensen's inequality also reverses.

Exercises

Suppose that \(X\) has probability density function \(f(x) = r e^{-r x}\) for \(0 \le x \lt \infty\) where \(r \gt 0\). Thus, \(X\) has the exponential distribution with rate parameter \(r\).

  1. Verify that \(\E(X) = \frac{1}{r}\) using the formula in Exercise 53.
  2. Compute both sides of Markov's inequality.
Answer:
  2. \(e^{-r t} \lt \frac{1}{r t}, \quad t \gt 0\)

Suppose that \(N\) has probability density function \(f(n) = p (1 - p)^{n - 1}\) for \(n \in \N_+\) where \(0 \lt p \le 1\). Thus, \(N\) has the geometric distribution on \(\N_+\) with success parameter \(p\).

  1. Verify that \(\E(N) = \frac{1}{p}\) using the formula in Exercise 54.
  2. Compute both sides of Markov's inequality.
  3. Find \(\E(N \mid N \text{ is even })\).
Answer:
  2. \((1 - p)^{n-1} \le \frac{1}{n p}, \quad n \in \N_+\)
  3. \(\frac{2}{p (2 - p)}\)

Suppose that \(X\) has probability density function \(f(x) = \frac{a}{x^{a + 1}}\) for \(1 \le x \lt \infty\), where \(a \gt 1\). Thus, \(X\) has the Pareto distribution with shape parameter \(a\).

  1. Find \(\E(X)\) using the formula in Exercise 53.
  2. Find \(\E(1 / X)\).
  3. Show that \(x \mapsto 1 / x\) is convex on \((0, \infty)\).
  4. Verify Jensen's inequality by comparing part 2 and the reciprocal of part 1.
Answer:
  1. \(\frac{a}{a - 1}\)
  2. \(\frac{a}{a + 1}\)
  4. \(\frac{a}{a + 1} \gt \frac{a -1}{a}\)

Suppose that \((X, Y)\) has probability density function \(f(x, y) = 2 (x + y)\) for \(0 \le x \le y \le 1\).

  1. Show that the domain of \(f\) is a convex set.
  2. Show that \((x, y) \mapsto x^2 + y^2\) is convex on the domain of \(f\).
  3. Compute \(\E(X^2 + Y^2)\).
  4. Compute \([\E(X)]^2 + [\E(Y)]^2\).
  5. Verify Jensen's inequality by comparing parts 3 and 4.
Answer:
  3. \(\frac{5}{6}\)
  4. \(\frac{53}{72}\)
  5. \(\frac{5}{6} \gt \frac{53}{72}\)

Suppose that \(\{x_1, x_2, \ldots, x_n\}\) is a set of positive numbers. The arithmetic mean is at least as large as the geometric mean:

\[ \left(\prod_{i=1}^n x_i \right)^{1/n} \le \frac{1}{n}\sum_{i=1}^n x_i \]
Proof:

Let \(X\) be uniformly distributed on \(\{x_1, x_2, \ldots, x_n\}\). We apply Jensen's inequality with the concave function \(\ln\) on \((0, \infty)\):

\[ \E[\ln(X)] = \frac{1}{n} \sum_{i=1}^n \ln(x_i) = \ln \left[ \left(\prod_{i=1}^n x_i \right)^{1/n} \right] \le \ln[\E(X)] = \ln \left(\frac{1}{n}\sum_{i=1}^n x_i \right) \]

Taking exponentials of each side gives the inequality.
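
A quick numerical check of the inequality between the arithmetic and geometric means, as a minimal sketch. The function names and the sample data are assumptions; the geometric mean is computed as \( \exp\{\E[\ln(X)]\} \), following the proof.

```python
import math

def arithmetic_mean(xs):
    """Return the ordinary average of the numbers in xs."""
    return sum(xs) / len(xs)

def geometric_mean(xs):
    """Return exp(E[ln X]) for X uniformly distributed on the positive numbers xs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Assumed sample data; every entry must be positive.
samples = [[1, 2, 4], [3, 3, 3], [0.5, 8, 10, 2]]
for xs in samples:
    print(xs, geometric_mean(xs), arithmetic_mean(xs))
```

In the second data set all of the numbers coincide, and the two means agree up to rounding.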