\(\newcommand{\R}{\mathbb{R}}\)
\(\newcommand{\N}{\mathbb{N}}\)
\(\newcommand{\Z}{\mathbb{Z}}\)
\(\newcommand{\E}{\mathbb{E}}\)
\(\newcommand{\P}{\mathbb{P}}\)
\(\newcommand{\var}{\text{var}}\)
\(\newcommand{\sd}{\text{sd}}\)
\(\newcommand{\cov}{\text{cov}}\)
\(\newcommand{\cor}{\text{cor}}\)
\(\newcommand{\bs}{\boldsymbol}\)

The central limit theorem and the law of large numbers are the two fundamental theorems of probability. Roughly, the central limit theorem states that the distribution of the sum (or average) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution. The importance of the central limit theorem is hard to overstate; indeed it is the reason that many statistical procedures work.

Suppose that \(\bs{X} = (X_1, X_2, \ldots)\) is a sequence of independent, identically distributed, real-valued random variables with common probability density function \(f\), mean \(\mu\), and variance \(\sigma^2\). We assume that \(0 \lt \sigma \lt \infty\), so that in particular, the random variables really are *random* and not constants. Let
\[ Y_n = \sum_{i=1}^n X_i, \quad n \in \N \]
Note that by convention, \(Y_0 = 0\), since the sum is over an empty index set. The random process \(\bs{Y} = (Y_0, Y_1, Y_2, \ldots)\) is called the partial sum process associated with \(\bs{X}\). Special types of partial sum processes have been studied in many places in this project; in particular see

- the binomial distribution in the setting of Bernoulli trials
- the negative binomial distribution in the setting of Bernoulli trials
- the gamma distribution in the Poisson process
- the arrival times in a general renewal process

Recall that in statistical terms, the sequence \(\bs{X}\) corresponds to sampling from the underlying distribution. In particular, \((X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the distribution, and the corresponding sample mean is \[ M_n = \frac{Y_n}{n} = \frac{1}{n} \sum_{i=1}^n X_i \] By the law of large numbers, \(M_n \to \mu\) as \(n \to \infty\) with probability 1.

If \(m \le n\) then \(Y_n - Y_m\) has the same distribution as \(Y_{n-m}\). Thus the process \(\bs{Y}\) has stationary increments.

Note that \(Y_n - Y_m = \sum_{i=m+1}^n X_i\) and is the sum of \(n - m\) independent variables, each with the common distribution. Of course, \(Y_{n-m}\) is also the sum of \(n - m\) independent variables, each with the common distribution.

Note however that \(Y_n - Y_m\) and \(Y_{n-m}\) are very different random variables; the theorem simply states that they have the same *distribution*.

If \(n_1 \le n_2 \le n_3 \le \cdots\) then \(\left(Y_{n_1}, Y_{n_2} - Y_{n_1}, Y_{n_3} - Y_{n_2}, \ldots\right)\) is a sequence of independent random variables. Thus the process \(\bs{Y}\) has independent increments.

The terms in the sequence of increments are sums over disjoint collections of terms in the sequence \(\bs{X}\). Since the sequence \(\bs{X}\) is independent, so is the sequence of increments.

Conversely, suppose that \(\bs{V} = (V_0, V_1, V_2, \ldots)\) is a random process with stationary, independent increments. Define \(U_i = V_i - V_{i-1}\) for \(i \in \N_+\). Then \(\bs{U} = (U_1, U_2, \ldots)\) is a sequence of independent, identically distributed variables and \(\bs{V}\) is the partial sum process associated with \(\bs{U}\).

Thus, partial sum processes are the only discrete-time random processes that have stationary, independent increments. An interesting, and much harder problem, is to characterize the continuous-time processes that have stationary independent increments. The Poisson counting process has stationary independent increments, as does the Brownian motion process.

If \(n \in \N\) then

- \(\E(Y_n) = n \mu\)
- \(\var(Y_n) = n \sigma^2\)

The results follow from basic properties of expected value and variance. Expected value is a linear operation so \( \E(Y_n) = \sum_{i=1}^n \E(X_i) = n \mu \). By independence, \(\var(Y_n) = \sum_{i=1}^n \var(X_i) = n \sigma^2\).

If \(n \in \N_+\) and \(m \in \N\) with \(m \le n\) then

- \(\cov(Y_m, Y_n) = m \sigma^2\)
- \(\cor(Y_m, Y_n) = \sqrt{\frac{m}{n}}\)
- \(\E(Y_m Y_n) = m \sigma^2 + m n \mu^2\)

- Note that \(Y_n = Y_m + (Y_n - Y_m)\). This follows from basic properties of covariance and the stationary and independence properties: \[ \cov(Y_m, Y_n) = \cov(Y_m, Y_m) + \cov(Y_m, Y_n - Y_m) = \var(Y_m) + 0 = m \sigma^2 \]
- This result follows from part (a) and the result above for mean and variance: \[ \cor(Y_m, Y_n) = \frac{\cov(Y_m, Y_n)}{\sd(Y_m) \sd(Y_n)} = \frac{m \sigma^2}{\sqrt{m \sigma^2} \sqrt{n \sigma^2}} = \sqrt{\frac{m}{n}} \]
- This result also follows from part (a) and the result above for mean and variance: \(\E(Y_m Y_n) = \cov(Y_m, Y_n) + \E(Y_m) \E(Y_n) = m \sigma^2 + m \mu n \mu\)
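The moment formulas above can be checked empirically. The following is a minimal simulation sketch, not part of the original text: it assumes the uniform distribution on \([0, 1]\) as the sampling distribution (so \(\mu = 1/2\) and \(\sigma^2 = 1/12\)), and the values of `m`, `n`, `runs`, and the seed are arbitrary illustrative choices.

```python
import random
from statistics import fmean

random.seed(1)
m, n, runs = 4, 9, 50_000
mu, sigma2 = 1 / 2, 1 / 12   # mean and variance of the uniform distribution on [0, 1]

ym, yn = [], []
for _ in range(runs):
    x = [random.random() for _ in range(n)]
    ym.append(sum(x[:m]))    # Y_m, the sum of the first m terms
    yn.append(sum(x))        # Y_n, the sum of all n terms

mean_m, mean_n = fmean(ym), fmean(yn)
var_m = fmean((y - mean_m) ** 2 for y in ym)
var_n = fmean((y - mean_n) ** 2 for y in yn)
cov = fmean((a - mean_m) * (b - mean_n) for a, b in zip(ym, yn))
cor = cov / (var_m * var_n) ** 0.5

# Empirical values next to the theoretical ones from the results above.
print(f"E(Y_n)        {mean_n:.3f}  theory {n * mu:.3f}")
print(f"var(Y_n)      {var_n:.3f}  theory {n * sigma2:.3f}")
print(f"cov(Y_m, Y_n) {cov:.3f}  theory {m * sigma2:.3f}")
print(f"cor(Y_m, Y_n) {cor:.3f}  theory {(m / n) ** 0.5:.3f}")
```

The empirical values should be close to, but not exactly equal to, the theoretical ones, since they are estimates from a finite number of runs.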

If \(X\) has moment generating function \(G\) then \(Y_n\) has moment generating function \(G^n\).

This follows from a basic property of generating functions: the generating function of a sum of independent variables is the product of the generating functions of the terms.

Suppose that \(X\) has either a discrete distribution or a continuous distribution with probability density function \(f\). Then the probability density function of \(Y_n\) is \(f^{*n} = f * f * \cdots * f\), the convolution power of \(f\) of order \(n\).

This follows from a basic property of probability density functions: the probability density function of a sum of independent variables is the convolution of the probability density functions of the terms.
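For a discrete distribution on the integers, the convolution power can be computed directly. The sketch below is an illustration only: the function names `convolve` and `convolution_power` are hypothetical, and a fair die is assumed as the sampling distribution.

```python
def convolve(f, g):
    """Convolution of two PDFs on the integers, each given as a dict value -> probability."""
    h = {}
    for x, p in f.items():
        for y, q in g.items():
            h[x + y] = h.get(x + y, 0.0) + p * q
    return h

def convolution_power(f, n):
    """PDF of Y_n: the convolution of n copies of f. Y_0 = 0 has point mass at 0."""
    result = {0: 1.0}
    for _ in range(n):
        result = convolve(result, f)
    return result

die = {k: 1 / 6 for k in range(1, 7)}   # PDF of the score of a fair die
f2 = convolution_power(die, 2)
f3 = convolution_power(die, 3)
print(f2[7])                 # probability that two dice sum to 7
print(min(f3), max(f3))      # smallest and largest possible sums of three dice
```

Starting from the point mass at 0 keeps the convention \(Y_0 = 0\) and makes the loop uniform in \(n\).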

More generally, we can use the stationary and independence properties to find the joint distributions of the partial sum process:

If \(n_1 \lt n_2 \lt \cdots \lt n_k\) then \((Y_{n_1}, Y_{n_2}, \ldots, Y_{n_k})\) has joint probability density function \[ f_{n_1, n_2, \ldots, n_k}(y_1, y_2, \ldots, y_k) = f^{*n_1}(y_1) f^{*(n_2 - n_1)}(y_2 - y_1) \cdots f^{*(n_k - n_{k-1})}(y_k - y_{k-1}), \quad (y_1, y_2, \ldots, y_k) \in \R^k \]

This follows from the multivariate change of variables theorem.

First, let's make the central limit theorem more precise. From Exercise 4, we cannot expect \(Y_n\) itself to have a limiting distribution. Note that \(\var(Y_n) \to \infty\) as \(n \to \infty\) since \(\sigma \gt 0\), and \(\E(Y_n) \to \infty\) as \(n \to \infty\) if \(\mu \gt 0\) while \(\E(Y_n) \to -\infty\) as \(n \to \infty\) if \(\mu \lt 0\). Similarly, we know that \(M_n \to \mu\) as \(n \to \infty\) with probability 1, so the limiting distribution of the sample mean is degenerate. Thus, to obtain a limiting distribution of \(Y_n\) or \(M_n\) that is not degenerate, we need to consider, not these variables themselves, but rather the common standard score. Thus, let \[ Z_n = \frac{Y_n - n \mu}{\sqrt{n} \sigma} = \frac{M_n - \mu}{\sigma \big/ \sqrt{n}} \]

\(Z_n\) has mean 0 and variance 1.

- \(\E(Z_n) = 0\)
- \(\var(Z_n) = 1\)

These results follow from basic properties of expected value and variance, and are true for the standard score associated with any random variable. Recall also that the standard score of a variable is invariant under linear transformations with positive slope. The fact that the standard score of \(Y_n\) and the standard score of \(M_n\) are the same is a special case of this.

The central limit theorem states that the distribution of the standard score \(Z_n\) converges to the standard normal distribution as \(n \to \infty\). Recall that the standard normal distribution has probability density function
\[ \phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2}, \quad z \in \R \]
and is studied in more detail in the chapter on special distributions. A special case of the central limit theorem (for Bernoulli trials) dates to Abraham De Moivre. The term *central limit theorem* was coined by George Pólya in 1920. By definition of convergence in distribution, the central limit theorem states that \(F_n(z) \to \Phi(z)\) as \(n \to \infty\) for each \(z \in \R\), where \(F_n\) is the distribution function of \(Z_n\) and \(\Phi\) is the standard normal distribution function:
\[ \Phi(z) = \int_{-\infty}^z \phi(x) \, dx = \int_{-\infty}^z \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} x^2} \, dx, \quad z \in \R \]

An equivalent statement of the central limit theorem involves convergence of the corresponding characteristic functions. This is the version that we will give and prove, but first we need a generalization of a famous limit from calculus.

Suppose that \((a_1, a_2, \ldots)\) is a sequence of real numbers and that \(a_n \to a\) as \(n \to \infty\). Then \[ \left( 1 + \frac{a_n}{n} \right)^n \to e^a \text{ as } n \to \infty \]
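A quick numerical illustration of this limit, as a sketch only: the limit \(a = -1/2\) and the sequence \(a_n = a + 1 / \sqrt{n}\) are arbitrary choices made here for illustration, and the helper name `power_term` is hypothetical.

```python
import math

a = -0.5   # the limit of the sequence (an arbitrary illustrative choice)

def power_term(n):
    """(1 + a_n / n)^n for the hypothetical sequence a_n = a + 1 / sqrt(n), which converges to a."""
    a_n = a + 1 / math.sqrt(n)
    return (1 + a_n / n) ** n

# The printed values should approach e^a as n grows.
for n in (10, 1_000, 100_000, 10_000_000):
    print(n, power_term(n), math.exp(a))
```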

Now let \(\chi\) denote the characteristic function of the standard score of the sample variable \(X\), and let \(\chi_n\) denote the characteristic function of the standard score \(Z_n\): \[ \chi(t) = \E \left[ \exp\left( i t \frac{X - \mu}{\sigma} \right) \right], \; \chi_n(t) = \E[\exp(i t Z_n)]; \quad t \in \R \] Recall that \(t \mapsto e^{-\frac{1}{2}t^2}\) is the characteristic function of the standard normal distribution. We can now prove the central limit theorem:

The distribution of \(Z_n\) converges to the standard normal distribution as \(n \to \infty\). That is, \(\chi_n(t) \to e^{-\frac{1}{2}t^2}\) as \(n \to \infty\) for each \(t \in \R\).

Note that \(\chi(0) = 1\), \(\chi^\prime(0) = 0\), \(\chi^{\prime \prime}(0) = -1\). Next \[ Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{X_i - \mu}{\sigma} \] From properties of characteristic functions, \(\chi_n(t) = \chi^n (t / \sqrt{n})\) for \(t \in \R\). By Taylor's theorem (named after Brook Taylor), \[ \chi\left(\frac{t}{\sqrt{n}}\right) = 1 + \frac{1}{2} \chi^{\prime\prime}(s_n) \frac{t^2}{n} \text{ where } \left|s_n\right| \le \frac{\left|t\right|}{\sqrt{n}} \] But \(s_n \to 0\) and hence \(\chi^{\prime\prime}(s_n) \to -1\) as \(n \to \infty\). Finally, \[ \chi_n(t) = \left[1 + \frac{1}{2} \chi^{\prime\prime}(s_n) \frac{t^2}{n} \right]^n \to e^{-\frac{1}{2} t^2} \text{ as } n \to \infty \]

The central limit theorem implies that if the sample size \(n\) is "large", then the distribution of the partial sum \(Y_n\) is approximately normal with mean \(n \mu\) and variance \(n \sigma^2\). Equivalently the sample mean \(M_n\) is approximately normal with mean \(\mu\) and variance \(\sigma^2 / n\). The central limit theorem is of fundamental importance, because it means that we can approximate the distribution of certain statistics, even if we know very little about the underlying sampling distribution.

Of course, the term "large" is relative. Roughly, the more "abnormal" the basic distribution, the larger \(n\) must be for normal approximations to work well. The rule of thumb is that a sample size \(n\) of at least 30 will usually suffice if the basic distribution is not too weird; although for many distributions smaller \(n\) will do.
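The effect of the sample size can be seen in a small simulation. The sketch below is an illustration under stated assumptions, not part of the original text: the sampling distribution is taken to be exponential with mean 1 (a skewed distribution with \(\mu = \sigma = 1\)), and the function names, number of runs, and seed are arbitrary choices.

```python
import random
from statistics import NormalDist

random.seed(2)

def standard_score(n, mu=1.0, sigma=1.0):
    """One simulated value of Z_n for exponential terms with mean 1, so mu = sigma = 1."""
    y = sum(random.expovariate(1.0) for _ in range(n))
    return (y - n * mu) / (n ** 0.5 * sigma)

def empirical_cdf(n, z, runs=20_000):
    """Fraction of simulated values of Z_n that are at most z."""
    return sum(standard_score(n) <= z for _ in range(runs)) / runs

points = (-1.0, 0.0, 1.0)
for n in (1, 5, 30):
    print(n, [f"{empirical_cdf(n, z):.3f}" for z in points])

# Values of the standard normal distribution function at the same points.
print("normal", [f"{NormalDist().cdf(z):.3f}" for z in points])
```

As \(n\) increases, the rows should move toward the standard normal row, although with a skewed sampling distribution some discrepancy typically remains visible even at \(n = 30\).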

Let \(Y\) denote the sum of the variables in a random sample of size 30 from the uniform distribution on \([0, 1]\). Find normal approximations to each of the following:

- \(\P(13 \lt Y \lt 18)\)
- The 90th percentile of \(Y\)

- 0.8682
- 17.03
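The computation behind these answers can be sketched with Python's standard library, assuming Python 3.8 or later for `statistics.NormalDist`. The variable names are illustrative.

```python
from statistics import NormalDist

n = 30
mu, sigma2 = 1 / 2, 1 / 12   # mean and variance of the uniform distribution on [0, 1]

# Y is approximately normal with mean n * mu = 15 and variance n * sigma2 = 2.5.
approx = NormalDist(n * mu, (n * sigma2) ** 0.5)

prob = approx.cdf(18) - approx.cdf(13)   # normal approximation to P(13 < Y < 18)
q90 = approx.inv_cdf(0.90)               # normal approximation to the 90th percentile
print(prob, q90)
```

No continuity correction is needed here, since \(Y\) has a continuous distribution. The results should agree with the answers above up to rounding.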

Random variable \(Y\) in the previous exercise has the Irwin-Hall distribution of order 30. The Irwin-Hall distributions are studied in more detail in the chapter on Special Distributions and are named for Joseph Irwin and Phillip Hall.

In the special distribution simulator, select the Irwin-Hall distribution. Vary \(n\) from 1 to 10 and note the shape of the probability density function. With \(n = 10\) run the experiment 1000 times and compare the empirical density function to the true probability density function.

Let \(M\) denote the sample mean of a random sample of size 50 from the distribution with probability density function \(f(x) = \frac{3}{x^4}\) for \(1 \le x \lt \infty\). This is a Pareto distribution, named for Vilfredo Pareto. Find normal approximations to each of the following:

- \(\P(M \gt 1.6)\)
- The 60th percentile of \(M\)

- 0.2071
- 1.531
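The intermediate steps, which are worked out here rather than quoted from the text, can be sketched as follows. For the given density, \[ \E(X) = \int_1^\infty x \, \frac{3}{x^4} \, dx = \frac{3}{2}, \quad \E(X^2) = \int_1^\infty x^2 \, \frac{3}{x^4} \, dx = 3, \quad \var(X) = 3 - \left(\frac{3}{2}\right)^2 = \frac{3}{4} \] Hence \(M\) is approximately normal with mean \(\frac{3}{2}\) and variance \(\frac{3/4}{50} = \frac{3}{200}\), so \(\sd(M) \approx 0.1225\). Then \[ \P(M \gt 1.6) \approx 1 - \Phi\left(\frac{1.6 - 1.5}{0.1225}\right) \approx 1 - \Phi(0.8165) \approx 0.2071 \] and, since the 60th percentile of the standard normal distribution is approximately 0.2533, the 60th percentile of \(M\) is approximately \(1.5 + 0.2533 \times 0.1225 \approx 1.531\).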

A slight technical problem arises when the sampling distribution is discrete. In this case, the partial sum also has a discrete distribution, and hence we are approximating a discrete distribution with a continuous one. Suppose that \(X\) takes integer values (the most common case) and hence so does the partial sum \(Y_n\). For any \(k \in \Z\) and \(h \in [0, 1)\), note that the event \(\{k - h \le Y_n \le k + h\}\) is equivalent to the event \(\{Y_n = k\}\). Different values of \(h\) lead to different normal approximations, even though the events are equivalent. The smallest approximation would be 0 when \(h = 0\), and the approximations increase as \(h\) increases. It is customary to split the difference by using \(h = \frac{1}{2}\) for the normal approximation. This is sometimes called the continuity correction or the histogram correction. The continuity correction is extended to other events in the natural way, using the additivity of probability. If \(j, k \in \Z\) with \(j \le k\) then

- \(\{j \le Y_n \le k\} = \{j - 1 \lt Y_n \lt k + 1\}\). Use \(\{j - \frac{1}{2} \le Y_n \le k + \frac{1}{2}\}\) in the normal approximation.
- \(\{j \le Y_n\} = \{j - 1 \lt Y_n\}\). Use \(\{j - \frac{1}{2} \le Y_n\}\) in the normal approximation.
- \(\{Y_n \le k\} = \{Y_n \lt k + 1\}\). Use \(\{Y_n \le k + \frac{1}{2}\}\) in the normal approximation.

Let \(Y\) denote the sum of the scores of 20 fair dice. Compute the normal approximation to \(\P(60 \le Y \le 75)\).

0.6741

In the dice experiment, set the die distribution to fair, select the sum random variable \(Y\), and set \(n = 20\). Run the simulation 1000 times and find each of the following. Compare with the result in the previous exercise:

- \(\P(60 \le Y \le 75)\)
- The relative frequency of the event \(\{60 \le Y \le 75\}\) (from the simulation)

Recall that the gamma distribution with shape parameter \(k \gt 0\) and scale parameter \(b \gt 0\) has probability density function \[ f(x) = \frac{1}{\Gamma(k) b^k} x^{k-1} e^{-x/b}, \quad 0 \lt x \lt \infty \] The gamma distribution is widely used to model random times (particularly in the context of the Poisson model) and other positive random variables. The general gamma distribution is studied in more detail in the chapter on Special Distributions. In the context of the Poisson model (where \(k \in \N_+\)), the gamma distribution is also known as the Erlang distribution; it is studied in more detail in the chapter on the Poisson Process. Suppose now that \(Y_k\) has the gamma distribution with shape parameter \(k \in \N_+\) and scale parameter \(b \gt 0\). Then \[ Y_k = \sum_{i=1}^k X_i \] where \((X_1, X_2, \ldots)\) is a sequence of independent variables, each having the exponential distribution with scale parameter \(b\). (The exponential distribution is a special case of the gamma distribution with shape parameter 1.) Since \(\E(X_i) = b\) and \(\var(X_i) = b^2\), it follows that if \(k\) is large, the gamma distribution can be approximated by the normal distribution with mean \(k b\) and variance \(k b^2\). The same statement actually holds when \(k\) is not an integer; more precisely, the distribution of the standardized variable below converges to the standard normal distribution as \(k \to \infty\): \[ Z_k = \frac{Y_k - k b}{\sqrt{k} b} \]

In the special distribution simulator, select the gamma distribution. Vary \(k\) and \(b\) and note the shape of the probability density function. With \(k = 10\) and various values of \(b\), run the experiment 1000 times and compare the empirical density function to the true probability density function.

Suppose that \(Y\) has the gamma distribution with shape parameter \(k = 10\) and scale parameter \(b = 2\). Find normal approximations to each of the following:

- \( \P(18 \le Y \le 23) \)
- The 80th percentile of \(Y\)

- 0.3063
- 25.32

Recall that the chi-square distribution with \(n \in \N_+\) degrees of freedom is a special case of the gamma distribution, with shape parameter \(k = n / 2\) and scale parameter \(b = 2\). Thus, the chi-square distribution with \(n\) degrees of freedom has probability density function \[ f(x) = \frac{1}{\Gamma(n/2) 2^{n/2}} x^{n/2 - 1}e^{-x/2}, \quad 0 \lt x \lt \infty \] The chi-square distribution is one of the most important distributions in statistics, because it governs sums of squares of independent standard normal variables. The chi-square distribution is studied in more detail in the chapter on Special Distributions. From the previous gamma distribution discussion, it follows that if \(n\) is large, the chi-square distribution can be approximated by the normal distribution with mean \(n\) and variance \(2 n\). More precisely, if \(Y_n\) has the chi-square distribution with \(n\) degrees of freedom, then the distribution of the standardized variable below converges to the standard normal distribution as \(n \to \infty\): \[ Z_n = \frac{Y_n - n}{\sqrt{2 n}} \]

In the special distribution simulator, select the chi-square distribution. Vary \(n\) and note the shape of the probability density function. With \(n = 20\), run the experiment 1000 times and compare the empirical density function to the probability density function.

Suppose that \(Y\) has the chi-square distribution with \(n = 20\) degrees of freedom. Find normal approximations to each of the following:

- \(\P(18 \lt Y \lt 25)\)
- The 75th percentile of \(Y\)

- 0.4107
- 24.3

Recall that a Bernoulli trials sequence, named for Jacob Bernoulli, is a sequence \( (X_1, X_2, \ldots) \) of independent, identically distributed indicator variables with \(\P(X_i = 1) = p\) for each \(i\), where \(p \in (0, 1)\) is the parameter. In the usual language of reliability, \(X_i\) is the outcome of trial \(i\), where 1 means success and 0 means failure. The common mean is \(p\) and the common variance is \(p (1 - p)\).

Let \(Y_n = \sum_{i=1}^n X_i\), so that \(Y_n\) is the number of successes in the first \(n\) trials. Recall that \(Y_n\) has the binomial distribution with parameters \(n\) and \(p\), and has probability density function \[ f(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k \in \{0, 1, \ldots, n\} \] The binomial distribution is studied in more detail in the chapter on Bernoulli trials.

It follows from the central limit theorem that if \(n\) is large, the binomial distribution with parameters \(n\) and \(p\) can be approximated by the normal distribution with mean \(n p\) and variance \(n p (1 - p)\). The rule of thumb is that \(n\) should be large enough for \(n p \ge 5\) and \(n (1 - p) \ge 5\). (The first condition is the important one when \(p \lt \frac{1}{2}\) and the second condition is the important one when \(p \gt \frac{1}{2}\).) More precisely, the distribution of the standardized variable \(Z_n\) given below converges to the standard normal distribution as \(n \to \infty\): \[ Z_n = \frac{Y_n - n p}{\sqrt{n p (1 - p)}} \]

In the binomial timeline experiment, vary \(n\) and \(p\) and note the shape of the probability density function. With \(n = 50\) and \(p = 0.3\), run the simulation 1000 times and compute the following:

- \(\P(12 \le Y \le 16)\)
- The relative frequency of the event \(\{12 \le Y \le 16\}\) (from the simulation)

- 0.5448

Suppose that \(Y\) has the binomial distribution with parameters \(n = 50\) and \(p = 0.3\). Compute the normal approximation to \( \P(12 \le Y \le 16) \) (don't forget the continuity correction) and compare with the results of the previous exercise.

0.5383
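The comparison between the true probability and the normal approximation can be sketched in code. This is an illustration only, assuming Python 3.8 or later for `math.comb` and `statistics.NormalDist`. The uncorrected approximation is included purely for comparison and is not part of the exercise.

```python
from math import comb
from statistics import NormalDist

n, p = 50, 0.3
j, k = 12, 16

# Exact probability P(j <= Y <= k) from the binomial probability density function.
exact = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(j, k + 1))

# Normal distribution with the same mean n p and variance n p (1 - p).
approx = NormalDist(n * p, (n * p * (1 - p)) ** 0.5)
corrected = approx.cdf(k + 0.5) - approx.cdf(j - 0.5)   # with the continuity correction
uncorrected = approx.cdf(k) - approx.cdf(j)             # without it, for comparison

print(f"exact {exact:.4f}, corrected {corrected:.4f}, uncorrected {uncorrected:.4f}")
```

The corrected value should be much closer to the exact probability than the uncorrected one, which is the practical reason for using \(h = \frac{1}{2}\).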

Recall that the Poisson distribution, named for Simeon Poisson, has probability density function
\[ f(x) = e^{-\theta} \frac{\theta^x}{x!}, \quad x \in \N \]
where \(\theta \gt 0\) is a parameter. The parameter is both the mean and the variance of the distribution. The Poisson distribution is widely used to model the number of "random points" in a region of time or space, and is studied in more detail in the chapter on the Poisson Process. In this context, the parameter is proportional to the size of the region.

Suppose now that \(Y_n\) has the Poisson distribution with parameter \(n \in \N_+\). Then \[ Y_n = \sum_{i=1}^n X_i \] where \((X_1, X_2, \ldots, X_n)\) is a sequence of independent variables, each with the Poisson distribution with parameter 1. It follows from the central limit theorem that if \(n\) is large, the Poisson distribution with parameter \(n\) can be approximated by the normal distribution with mean \(n\) and variance \(n\). The same statement holds when the parameter \(n\) is not an integer; more precisely, the distribution of the standardized variable below converges to the standard normal distribution as \(n \to \infty\):

\[ Z_n = \frac{Y_n - n}{\sqrt{n}} \]

Suppose that \(Y\) has the Poisson distribution with mean 20.

- Compute the true value of \(\P(16 \le Y \le 23)\).
- Compute the normal approximation to \(\P(16 \le Y \le 23)\).

- 0.6310
- 0.6259

In the Poisson experiment, vary the time and rate parameters \(t\) and \(r\) (the parameter of the Poisson distribution in the experiment is the product \(r t\)). Note the shape of the probability density function. With \(r = 5\) and \(t = 4\), run the experiment 1000 times and compare the empirical density function to the true probability density function.

Recall the discussion of Bernoulli trials with success parameter \( p \in (0, 1) \) given above. For \(k \in \N_+\), the trial number of the \(k\)th success has the negative binomial distribution with parameters \(k\) and \(p\), and has probability density function \[ f(n) = \binom{n-1}{k-1} p^k (1 - p)^{n-k}, \quad n \in \{k, k + 1, k + 2, \ldots\} \] The negative binomial distribution is studied in more detail in the chapter on Bernoulli trials.

Suppose now that \(Y_k\) has the negative binomial distribution with trial parameter \(k\) and success parameter \(p\). Then \[ Y_k = \sum_{i=1}^k X_i \] where \((X_1, X_2, \ldots, X_k)\) is a sequence of independent variables, each having the geometric distribution on \(\N_+\) with parameter \(p\). (The geometric distribution is a special case of the negative binomial, with parameters 1 and \(p\).) In the context of the Bernoulli trials, \(X_i\) is the number of trials needed to go from the \((i - 1)\)st success to the \(i\)th success. Since \(\E(X_i) = 1 / p\) and \(\var(X_i) = (1 - p) / p^2\), it follows that if \(k\) is large, the negative binomial distribution can be approximated by the normal distribution with mean \(k / p\) and variance \(k (1 - p) / p^2\). More precisely, the distribution of the standardized variable below converges to the standard normal distribution as \(k \to \infty\). \[ Z_k = \frac{p Y_k - k}{\sqrt{k (1 - p)}} \]

In the negative binomial experiment, vary \(k\) and \(p\) and note the shape of the probability density function. With \(k = 5\) and \(p = 0.4\), run the experiment 1000 times and compare the empirical density function to the true probability density function.

Suppose that \(Y\) has the negative binomial distribution with trial parameter \(k = 10\) and success parameter \(p = 0.4\). Find normal approximations to each of the following:

- \(\P(20 \lt Y \lt 30)\)
- The 80th percentile of \(Y\)

- 0.6318
- 30.1

Our last topic is a bit more esoteric, but still fits with the general setting of this section. Recall that \(\bs{X} = (X_1, X_2, \ldots)\) is a sequence of independent, identically distributed real-valued random variables with common mean \(\mu\) and variance \(\sigma^2\). Suppose now that \(N\) is a random variable (on the same probability space) taking values in \(\N\), also with finite mean and variance. Then \[ Y_N = \sum_{i=1}^N X_i \] is a random sum of the independent, identically distributed variables. That is, the terms are random of course, but so also is the number of terms \(N\). We are primarily interested in the moments of \(Y_N\).

Suppose first that \(N\), the number of terms, is independent of \(\bs{X}\), the sequence of terms. Computing the moments of \(Y_N\) is a good exercise in conditional expectation.

The conditional expected value of \(Y_N\) given \(N\), and the expected value of \(Y_N\) are

- \(\E(Y_N \mid N) = N \mu\)
- \(\E(Y_N) = \E(N) \mu\)

The conditional variance of \(Y_N\) given \(N\) and the variance of \(Y_N\) are

- \(\var(Y_N \mid N) = N \sigma^2\)
- \(\var(Y_N) = \E(N) \sigma^2 + \var(N) \mu^2\)
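One way to obtain these results, sketched here as a worked derivation rather than quoted from the text, is through the law of total variance. Since \(N\) is independent of \(\bs{X}\), given \(N = n\) the random sum \(Y_N\) has the same distribution as \(Y_n\), so \(\E(Y_N \mid N) = N \mu\) and \(\var(Y_N \mid N) = N \sigma^2\). Then \[ \var(Y_N) = \E[\var(Y_N \mid N)] + \var[\E(Y_N \mid N)] = \E(N \sigma^2) + \var(N \mu) = \E(N) \sigma^2 + \var(N) \mu^2 \] The first term reflects the randomness of the terms and the second reflects the randomness of the number of terms. If \(N\) is constant, the second term vanishes and the formula reduces to \(\var(Y_n) = n \sigma^2\).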

Let \(H\) denote the probability generating function of \(N\). Show that the moment generating function of \(Y_N\) is \(H \circ G\).

- \(\E(e^{t Y_N} \mid N) = [G(t)]^N\)
- \(\E(e^{t Y_N}) = H(G(t))\)

The result in Exercise 29 (b) generalizes to the case where the random number of terms \(N\) is a stopping time for the sequence \(\bs{X}\). This means that the event \(\{N = n\}\) depends only on (technically, is measurable with respect to) \((X_1, X_2, \ldots, X_n)\) for each \(n \in \N\). The generalization is known as Wald's equation, and is named for Abraham Wald. Stopping times are studied in much more technical detail in the section on Filtrations and Stopping Times.

If \(N\) is a stopping time for \(\bs{X}\) then \(\E(Y_N) = \E(N) \mu\).

First note that \(Y_N = \sum_{i=1}^\infty X_i \bs{1}(i \le N)\). But \(\{i \le N\} = \{N \lt i\}^c\) depends only on \(\{X_1, \ldots, X_{i-1}\}\) and hence is independent of \(X_i\). Thus \(\E[X_i \bs{1}(i \le N)] = \mu \P(N \ge i)\). Suppose that \(X_i \ge 0\) for each \(i\). Taking expected values term by term gives Wald's equation in this special case. The interchange of sum and expected value is justified by the monotone convergence theorem. Now Wald's equation can be established in general by using the dominated convergence theorem.
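Wald's equation can be illustrated with a stopping time that is clearly not independent of the terms. The sketch below rests on assumptions made only for this illustration: the terms are uniform on \([0, 1]\), and \(N\) is the index of the first term that exceeds a threshold of 0.8, so that \(N\) has the geometric distribution with parameter 0.2 and \(\E(N) \mu = 5 \times 0.5 = 2.5\). The function name, number of runs, and seed are arbitrary.

```python
import random
from statistics import fmean

random.seed(4)
mu = 0.5   # mean of the uniform distribution on [0, 1]

def run(threshold=0.8):
    """Add uniform terms until one exceeds the threshold; return (N, Y_N)."""
    n, total = 0, 0.0
    while True:
        x = random.random()
        n += 1
        total += x
        if x > threshold:   # the decision to stop at n uses only X_1, ..., X_n
            return n, total

results = [run() for _ in range(100_000)]
mean_n = fmean(n for n, _ in results)
mean_y = fmean(y for _, y in results)
print(f"E(N) ~ {mean_n:.3f}, E(Y_N) ~ {mean_y:.3f}, E(N) mu ~ {mean_n * mu:.3f}")
```

Here the last term is always larger than 0.8 and the earlier terms are always at most 0.8, so \(N\) and \(\bs{X}\) are strongly dependent, yet the two estimates should still agree closely.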

Suppose that the number of customers arriving at a store during a given day has the Poisson distribution with parameter 50. Each customer, independently of the others (and independently of the number of customers), spends an amount of money that is uniformly distributed on the interval \([0, 20]\). Find the mean and standard deviation of the amount of money that the store takes in during a day.

500, 81.65
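The answer follows from the formulas for the mean and variance of a random sum, and it can be checked by simulation. The sketch below is illustrative only: the helper `poisson` is a hypothetical function written here because Python's standard library has no Poisson sampler, and the number of runs and the seed are arbitrary.

```python
import math
import random
from statistics import fmean, pstdev

lam = 50                       # E(N) = var(N) = 50 for the Poisson number of customers
mu, sigma2 = 10, 20**2 / 12    # mean and variance of the uniform distribution on [0, 20]

mean_total = lam * mu                               # E(N) mu
sd_total = math.sqrt(lam * sigma2 + lam * mu**2)    # sqrt(E(N) sigma^2 + var(N) mu^2)
print(mean_total, round(sd_total, 2))               # 500 81.65

def poisson(rate):
    """Sample a Poisson variable by counting rate-1 exponential arrivals in [0, rate]."""
    count, t = 0, random.expovariate(1.0)
    while t <= rate:
        count += 1
        t += random.expovariate(1.0)
    return count

random.seed(3)
totals = [sum(random.uniform(0, 20) for _ in range(poisson(lam))) for _ in range(20_000)]
print(fmean(totals), pstdev(totals))   # simulated mean and standard deviation
```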

When a certain critical component in a system fails, it is immediately replaced by a new, statistically identical component. The components are independent, and the lifetime of each (in hours) is exponentially distributed with scale parameter \(b\). During the life of the system, the number of critical components used has a geometric distribution on \(\N_+\) with parameter \(p\). For the total life of the critical component,

- Find the mean.
- Find the standard deviation.
- Find the moment generating function.
- Identify the distribution by name.

- \(b / p\)
- \(b / p\)
- \(t \mapsto \frac{1}{1 - (b/p)t}\)
- Exponential distribution with scale parameter \(b / p\)
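The steps behind these answers can be sketched as follows. The derivation is worked out here rather than quoted from the text. With \(\mu = b\), \(\sigma^2 = b^2\), \(\E(N) = 1 / p\), and \(\var(N) = (1 - p) / p^2\), the formulas for random sums give \[ \E(Y_N) = \frac{b}{p}, \quad \var(Y_N) = \frac{1}{p} b^2 + \frac{1 - p}{p^2} b^2 = \frac{b^2}{p^2} \] For the moment generating function, recall that the exponential distribution has moment generating function \(G(t) = \frac{1}{1 - b t}\) for \(t \lt 1 / b\), and the geometric distribution on \(\N_+\) has probability generating function \(H(s) = \frac{p s}{1 - (1 - p) s}\). Hence \[ H(G(t)) = \frac{p \big/ (1 - b t)}{1 - (1 - p) \big/ (1 - b t)} = \frac{p}{(1 - b t) - (1 - p)} = \frac{p}{p - b t} = \frac{1}{1 - (b / p) t}, \quad t \lt \frac{p}{b} \] which is the moment generating function of the exponential distribution with scale parameter \(b / p\).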