\(\renewcommand{\P}{\mathbb{P}}\)
\(\newcommand{\R}{\mathbb{R}}\)
\(\newcommand{\N}{\mathbb{N}}\)
\(\newcommand{\Q}{\mathbb{Q}}\)
\( \newcommand{\E}{\mathbb{E}} \)

Suppose that \(\left(X_1, X_2, \ldots\right)\) and \(X\) are real-valued random variables with distribution functions \(\left(F_1, F_2, \ldots\right)\) and \(F\), respectively. We say that the distribution of \(X_n\) converges to the distribution of \(X\) as \(n \to \infty\) if
\[ F_n(x) \to F(x) \text{ as } n \to \infty \]
for all \(x\) at which \(F\) is continuous. The first fact to notice is that convergence in distribution, as the name suggests, only involves the *distributions* of the random variables. Thus, the random variables need not even be defined on the same probability space (that is, they need not be defined for the same random experiment), and indeed we don't even need the random variables at all. This is in contrast to the other modes of convergence we have studied or will study: convergence with probability 1, convergence in probability, and convergence in mean.

We will show, in fact, that convergence in distribution is the weakest of all of these modes of convergence. However, strength of convergence should not be confused with importance. Convergence in distribution is one of the most important modes of convergence; the central limit theorem, one of the two fundamental theorems of probability, is a theorem about convergence in distribution.

The examples below show why the definition is given in terms of distribution functions, rather than probability density functions, and why convergence is only required at the points of continuity of the limiting distribution function. To understand the first example, note that if a deterministic sequence converges in the ordinary calculus sense, then naturally we want the sequence (thought of as random variables) to converge in distribution.

Let \(X_n = \frac{1}{n}\) for \(n \in \N_+\) and let \(X = 0\). Let \(f_n\) and \(f\) be the corresponding probability density functions and let \(F_n\) and \(F\) be the corresponding distribution functions. Then

- \(f_n(x) \to 0\) as \(n \to \infty\) for all \(x \in \R\)
- \(F_n(x) \to \begin{cases} 0 & x \le 0 \\ 1 & x \gt 0 \end{cases}\) as \(n \to \infty\)
- \(F_n(x) \to F(x)\) as \(n \to \infty\) for all \(x \ne 0\)

Note that \( f_n(x) = \begin{cases} 1 & x = \frac{1}{n} \\ 0 & x \ne \frac{1}{n} \end{cases}\) and \( F_n(x) = \begin{cases} 0 & x \lt \frac{1}{n} \\ 1 & x \ge \frac{1}{n} \end{cases} \), and \( F(x) = \begin{cases} 0 & x \lt 0 \\ 1 & x \ge 0 \end{cases} \).

For the example below, recall that \( \Q \) denotes the set of rational numbers.

Suppose that \(X_n\) has the discrete uniform distribution on \(\left\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1\right\}\) for each \(n \in \N_+\), and let \( f_n \) denote the probability density function of \( X_n \). Let \(X\) have the continuous uniform distribution on the interval \([0, 1]\). Then

- The distribution of \(X_n\) converges to the distribution of \(X\) as \(n \to \infty\).
- \(\P\left(X_n \in \Q\right) = 1\) for each \(n\) but \(\P(X \in \Q) = 0\).
- \(f_n(x) \to 0\) as \(n \to \infty\) for all \(x \in [0, 1]\).

For part (a), note that the CDF \( F_n \) of \( X_n \) is given by \( F_n(x) = \lfloor n \, x \rfloor / n \) for \( x \in [0, 1] \). But \( n \, x - 1 \le \lfloor n \, x \rfloor \le n \, x \) so \( \lfloor n \, x \rfloor / n \to x \) as \( n \to \infty \). For part (b), note that \( X_n \) takes rational values by definition, while \( X \) has a continuous distribution and \( \Q \) is countable. For part (c) note that \( \left|f_n(x)\right| \le \frac{1}{n} \) for every \( x \in \R \).
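The bound used for part (a) can be checked numerically. The following is a minimal sketch, not part of the original text; the function names `discrete_uniform_cdf` and `max_gap` and the grid size are illustrative assumptions.

```python
import math

def discrete_uniform_cdf(n: int, x: float) -> float:
    """CDF of the discrete uniform distribution on {1/n, 2/n, ..., 1}."""
    if x < 0:
        return 0.0
    if x >= 1:
        return 1.0
    return math.floor(n * x) / n

def max_gap(n: int, grid_size: int = 1000) -> float:
    """Largest |F_n(x) - x| over an evenly spaced grid of points in [0, 1]."""
    grid = [i / grid_size for i in range(grid_size + 1)]
    return max(abs(discrete_uniform_cdf(n, x) - x) for x in grid)

for n in (10, 100, 1000):
    # Since n x - 1 <= floor(n x) <= n x, the gap is at most 1 / n (up to rounding).
    print(n, max_gap(n))
```

The printed gaps shrink roughly like \(1 / n\), in line with the inequality in the proof.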

As Example 2 shows, it is quite possible to have a sequence of discrete distributions converge to a continuous distribution (or the other way around). Recall that probability density functions have very different meanings in the discrete and continuous cases: density with respect to counting measure in the first case, and density with respect to Lebesgue measure in the second case. This is another indication that distribution functions, rather than density functions, are the correct objects of study. However, if probability density functions of a fixed type converge then the distributions converge. The following results are a consequence of Scheffé's theorem, which is given in advanced topics below.

Suppose that \(\left(f_1, f_2, \ldots \right)\) and \(f\) are probability density functions for discrete distributions on a countable set \(S\), and that \(f_n(x) \to f(x)\) as \(n \to \infty\) for each \(x \in S\). Then the distribution defined by \(f_n\) converges to the distribution defined by \(f\) as \(n \to \infty\). Similarly, suppose that \(\left(f_1, f_2, \ldots \right)\) and \(f\) are probability density functions for continuous distributions on \(\R\), and that \(f_n(x) \to f(x)\) as \(n \to \infty\) for all \(x \in \R\) (except perhaps on a set with Lebesgue measure 0). Then the distribution defined by \(f_n\) converges to the distribution defined by \(f\) as \(n \to \infty\).

Suppose that \((X_1, X_2, \ldots)\) and \(X\) are random variables (defined on the same probability space) with distribution functions \((F_1, F_2, \ldots)\) and \(F\), respectively. If \(X_n \to X\) as \(n \to \infty\) in probability, then the distribution of \(X_n\) converges to the distribution of \(X\) as \(n \to \infty\).

Fix \(\epsilon \gt 0\). Note first that \(\P(X_n \le x) = \P(X_n \le x, X \le x + \epsilon) + \P(X_n \le x, X \gt x + \epsilon) \). Hence \(F_n(x) \le F(x + \epsilon) + \P\left(\left|X_n - X\right| \gt \epsilon\right)\). Next, note that \(\P(X \le x - \epsilon) = \P(X \le x - \epsilon, X_n \le x) + \P(X \le x - \epsilon, X_n \gt x)\). Hence \(F(x - \epsilon) \le F_n(x) + \P\left(\left|X_n - X\right| \gt \epsilon\right)\). From the last two results it follows that \[ F(x - \epsilon) - \P\left(\left|X_n - X\right| \gt \epsilon\right) \le F_n(x) \le F(x + \epsilon) + \P\left(\left|X_n - X\right| \gt \epsilon\right) \] Letting \(n \to \infty\) and using convergence in probability gives \[ F(x - \epsilon) \le \liminf_{n \to \infty} F_n(x) \le \limsup_{n \to \infty} F_n(x) \le F(x + \epsilon) \] Finally, letting \(\epsilon \downarrow 0\) we see that if \(F\) is continuous at \(x\) then \(F_n(x) \to F(x)\) as \(n \to \infty\).

Our next example shows that even when the variables are defined on the same probability space, a sequence can converge in distribution, but not in any other way.

Let \(X\) be an indicator variable with \(\P(X = 0) = \P(X = 1) = \frac{1}{2}\), so that \(X\) is the result of tossing a fair coin. Let \(X_n = 1 - X \) for \(n \in \N_+\). Then

- \(1 - X\) has the same distribution as \(X\).
- The distribution of \(X_n\) converges to the distribution of \(X\) as \(n \to \infty\).
- \(\left|X_n - X\right| = 1\) for every \(n \in \N_+\).
- \(\P(X_n \text{ does not converge to } X \text { as } n \to \infty) = 1\).
- \(\P\left(\left|X_n - X\right| \gt \frac{1}{2}\right) = 1\) for each \(n \in \N_+\) so \(X_n \) does not converge to \( X \) as \(n \to \infty\) in probability.
- \(\E\left(\left|X_n - X\right|\right) = 1\) for each \(n \in \N_+\) so \(X_n\) does not converge to \(X\) as \(n \to \infty\) in mean.

The critical fact that makes this counterexample work is part (a): \(1 - X\) has the same distribution as \(X\). Any random variable with this property would work just as well, so if you prefer a counterexample with continuous distributions, let \(X\) have probability density function \(f\) given by \(f(x) = 6 x (1 - x)\) for \(0 \le x \le 1\).

To summarize, we have the following implications for the various modes of convergence; no other implications hold in general.

- Convergence with probability 1 implies convergence in probability.
- Convergence in mean implies convergence in probability.
- Convergence in probability implies convergence in distribution.

It follows that convergence with probability 1, convergence in probability, and convergence in mean all imply convergence in distribution, so the latter mode of convergence is indeed the weakest. However, the following exercise gives an important converse to the last implication in the list above, when the limiting variable is a constant. Of course, a constant can be viewed as a random variable defined on any probability space.

Suppose that \((X_1, X_2, \ldots)\) is a sequence of random variables (defined on the same probability space) and that the distribution of \(X_n\) converges to the distribution of the constant \(c\) as \(n \to \infty\). Then \(X_n \to c\) as \(n \to \infty\) in probability.

Note first that \(\P(X_n \le x) \to 0\) as \(n \to \infty\) if \(x \lt c\) and \(\P(X_n \le x) \to 1\) as \(n \to \infty\) if \(x \gt c\). It follows that \(\P\left(\left|X_n - c\right| \le \epsilon\right) \to 1\) as \(n \to \infty\) for every \(\epsilon \gt 0\).

There are several important cases where a special distribution converges to another special distribution as a parameter approaches a limiting value. Indeed, such convergence results are part of the reason why such distributions are *special* in the first place.

Recall that the hypergeometric distribution with parameters \(m\), \(r\), and \(n\) is the distribution that governs the number of type 1 objects in a sample of size \(n\), drawn without replacement from a population of \(m\) objects with \(r\) of type 1. It has discrete probability density function \[ f(k) = \frac{\binom{r}{k} \binom{m - r}{n - k}}{\binom{m}{n}}, \quad k \in \{0, 1, \ldots, n\} \] The parameters \(m\), \(r\), and \(n\) are positive integers with \(n \le m\) and \(r \le m\). The hypergeometric distribution is studied in more detail in the chapter on Finite Sampling Models.

Recall also that the binomial distribution with parameters \(n \in \N_+\) and \(p \in [0, 1]\) is the distribution of the number of successes in \(n\) Bernoulli trials, when \(p\) is the probability of success on a trial. This distribution has probability density function
\[ g(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k \in \{0, 1, \ldots, n\} \]
The binomial distribution is studied in more detail in the chapter on Bernoulli Trials. Note that the binomial distribution with parameters \(n\) and \(p = r / m\) is the distribution that governs the number of type 1 objects in a sample of size \(n\), drawn *with* replacement from a population of \(m\) objects with \(r\) of type 1.

Suppose that \(r_m\) depends on \(m\), and that \(r_m \big/ m \to p\) as \(m \to \infty\). For fixed \(n\), the hypergeometric distribution with parameters \(m\), \(r_m\), and \(n\) converges to the binomial distribution with parameters \(n\) and \(p\) as \(m \to \infty\).

Recall that for \( a \in \R \) and \( j \in \N \), we let \( a^{(j)} = a \, (a - 1) \cdots [a - (j - 1)] \) denote the falling power of \( a \) of order \( j \). The hypergeometric PDF can be written as \[ f_m(k) = \binom{n}{k} \frac{r_m^{(k)} (m - r_m)^{(n - k)}}{m^{(n)}}, \quad k \in \{0, 1, \ldots, n\} \] In the fraction above, the numerator and denominator both have \( n \) factors. Suppose that we group the \( k \) factors in \( r_m^{(k)} \) with the first \( k \) factors of \( m^{(n)} \) and the \( n - k \) factors of \( (m - r_m)^{(n-k)} \) with the last \( n - k \) factors of \( m^{(n)} \) to form a product of \( n \) fractions. The first \( k \) fractions have the form \( (r_m - j) \big/ (m - j) \) for some \( j \) that does not depend on \( m \). Each of these converges to \( p \) as \( m \to \infty \). The last \( n - k \) fractions have the form \( (m - r_m - j) \big/ (m - k - j) \) for some \( j \) that does not depend on \( m \). Each of these converges to \( 1 - p \) as \( m \to \infty \). Hence \( f_m(k) \to \binom{n}{k} p^k (1 - p)^{n - k} \) as \( m \to \infty \) for each \( k \in \{0, 1, \ldots, n\} \).
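The convergence of the hypergeometric PDF to the binomial PDF can be illustrated numerically. The sketch below is an illustration under stated assumptions, not part of the original text: it takes \( r_m \) to be \( p m \) rounded to the nearest integer, and the function names are hypothetical.

```python
from math import comb

def hypergeometric_pmf(m: int, r: int, n: int, k: int) -> float:
    """PDF of the number of type 1 objects when sampling without replacement."""
    return comb(r, k) * comb(m - r, n - k) / comb(m, n)

def binomial_pmf(n: int, p: float, k: int) -> float:
    """PDF of the number of successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def max_pmf_gap(m: int, p: float = 0.3, n: int = 10) -> float:
    """Largest pointwise gap between the two PDFs, with r_m = round(p * m) (an assumption)."""
    r = round(p * m)
    return max(abs(hypergeometric_pmf(m, r, n, k) - binomial_pmf(n, p, k))
               for k in range(n + 1))

for m in (100, 1000, 10000):
    # The sample size n is fixed while the population size m grows.
    print(m, max_pmf_gap(m))
```

With the sample size fixed at \( n = 10 \), the largest gap should shrink as the population size grows.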

From a practical point of view, the last result means that if the population size \(m\) is large compared to sample size \(n\), then the hypergeometric distribution with parameters \(m\), \(r\), and \(n\) (which corresponds to sampling without replacement) is well approximated by the binomial distribution with parameters \(n\) and \(p = r / m\) (which corresponds to sampling with replacement). This is often a useful result, because the binomial distribution has fewer parameters than the hypergeometric distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the limiting binomial distribution, we do not need to know the population size \(m\) and the number of type 1 objects \(r\) *individually*, but only in the *ratio* \(r / m\).

In the ball and urn experiment, set \(m = 100\) and \(r = 30\). For each of the following values of \(n\) (the sample size), switch between *sampling without replacement* (the hypergeometric distribution) and *sampling with replacement* (the binomial distribution). Note the difference in the probability density functions. Run the simulation 1000 times for each sampling mode and note the agreement between the relative frequency function and the probability density function.

- 10
- 20
- 30
- 40
- 50

Recall again that the binomial distribution with parameters \(n \in \N_+\) and \(p \in [0, 1]\) is the distribution of the number of successes in \(n\) Bernoulli trials, when \(p\) is the probability of success on a trial. This distribution has probability density function
\[ f(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k \in \{0, 1, \ldots, n\} \]
Recall also that the Poisson distribution with parameter \(r \gt 0\) has probability density function
\[g(k) = e^{-r} \frac{r^k}{k!}, \quad k \in \N\]
The distribution is named for Simeon Poisson and governs the number of random points in a region of time or space, under certain ideal conditions. The parameter \(r\) is proportional to the size of the region of time or space. The Poisson distribution is studied in more detail in the chapter on the Poisson Process.

Suppose now that \(p_n\) depends on \(n\) and that \(n p_n \to r \gt 0\) as \(n \to \infty\). The binomial distribution with parameters \(n\) and \(p_n\) converges to the Poisson distribution with parameter \(r\) as \(n \to \infty\).

For \( k, \, n \in \N \) with \( k \le n \), the binomial PDF can be written as \[ f_n(k) = \frac{n^{(k)}}{k!} p_n^k (1 - p_n)^{n - k} = \frac{1}{k!} (n p_n) [(n - 1) p_n] \cdots [(n - k + 1) p_n] \left(1 - \frac{n p_n}{n} \right)^{n - k} \] Since \( p_n \to 0 \), each factor of the form \( (n - j) p_n \) converges to \( r \) as \( n \to \infty \). By a famous limit from calculus, \( \left(1 - n p_n \big/ n\right)^n \to e^{-r} \) as \( n \to \infty \), while \( (1 - p_n)^{-k} \to 1 \). Hence \( f_n(k) \to e^{-r} r^k \big/ k! \) as \( n \to \infty \) for each \( k \in \N \).
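As a numerical illustration of this limit, the sketch below compares the binomial PDF with \( p_n = r / n \) to the Poisson PDF with parameter \( r \). The choice \( p_n = r / n \), the truncation point `k_max`, and the function names are assumptions made for the example.

```python
from math import comb, exp, factorial

def binomial_pmf(n: int, p: float, k: int) -> float:
    """Binomial PDF with parameters n and p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(r: float, k: int) -> float:
    """Poisson PDF with parameter r."""
    return exp(-r) * r**k / factorial(k)

def max_pmf_gap(n: int, r: float = 5.0, k_max: int = 30) -> float:
    """Largest pointwise gap over k = 0, ..., min(n, k_max), using p_n = r / n."""
    p = r / n
    return max(abs(binomial_pmf(n, p, k) - poisson_pmf(r, k))
               for k in range(min(n, k_max) + 1))

for n in (10, 100, 1000, 10000):
    # n * p_n = r is held fixed while n grows.
    print(n, max_pmf_gap(n))
```

The gap should decrease as \( n \) grows with \( n p_n = 5 \) held fixed.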

From a practical point of view, the convergence of the binomial distribution to the Poisson means that if the number of trials \(n\) is large and the probability of success \(p\) small, so that \(n p^2\) is small, then the binomial distribution with parameters \(n\) and \(p\) is well approximated by the Poisson distribution with parameter \(r = n p\). This is often a useful result, because the Poisson distribution has fewer parameters than the binomial distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the approximating Poisson distribution, we do not need to know the number of trials \(n\) and the probability of success \(p\) *individually*, but only in the *product* \(n p\). As we will see in the next chapter, the condition that \(n p^2\) be small means that the variance of the binomial distribution, namely \(n p (1 - p) = n p - n p^2\), is approximately \(r = n p\), which is the variance of the approximating Poisson distribution.

In the binomial timeline experiment, set the parameter values as follows, and observe the graph of the probability density function. (Note that \(n p = 5\) in each case.) Run the experiment 1000 times in each case and note the agreement between the relative frequency function and the probability density function. Note also the successes represented as random points in discrete time.

- \(n = 10\), \(p = 0.5\)
- \(n = 20\), \(p = 0.25\)
- \(n = 100\), \(p = 0.05\)

In the Poisson experiment, set \(r = 5\) and \(t = 1\), to get the Poisson distribution with parameter 5. Note the shape of the probability density function. Run the experiment 1000 times and observe the agreement between the relative frequency function and the probability density function. Note the similarity between this experiment and the one in the previous exercise.

Recall that the geometric distribution on \(\N_+\) with success parameter \(p \in (0, 1]\) has probability density function \[ f(k) = p (1 - p)^{k-1}, \quad k \in \N_+\] The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials.

Suppose that \(U\) has the geometric distribution on \(\N_+\) with success parameter \(p \in (0, 1]\). For \( n \in \N_+ \), the conditional distribution of \( U \) given \( U \le n \) converges to the uniform distribution on \(\{1, 2, \ldots, n\}\) as \(p \downarrow 0\).

The CDF of \( U \) is \( F(k) = 1 - (1 - p)^k \) for \( k \in \N_+ \). Hence the conditional CDF of \( U \) given \( U \le n \) is \[ F_n(k) = \P(U \le k \mid U \le n) = \frac{\P(U \le k)}{\P(U \le n)} = \frac{1 - (1 - p)^k}{1 - (1 - p)^n}, \quad k \in \{1, 2, \ldots, n\} \] Using L'Hospital's rule gives \( F_n(k) \to k / n \) as \( p \downarrow 0 \), which is the CDF of the uniform distribution on \( \{1, 2, \ldots, n\} \).
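The limit can also be seen numerically. The following sketch evaluates the conditional CDF for decreasing values of \( p \), assuming \( 0 \lt p \lt 1 \); the function name is hypothetical.

```python
import math

def conditional_geometric_cdf(p: float, n: int, k: int) -> float:
    """P(U <= k | U <= n) for U geometric on {1, 2, ...} with success parameter 0 < p < 1."""
    # 1 - (1 - p)^k is computed as -expm1(k * log1p(-p)) to limit cancellation for small p.
    numerator = -math.expm1(k * math.log1p(-p))
    denominator = -math.expm1(n * math.log1p(-p))
    return numerator / denominator

n = 10
for p in (0.5, 0.1, 0.001, 1e-6):
    row = [round(conditional_geometric_cdf(p, n, k), 4) for k in range(1, n + 1)]
    print(p, row)
```

As \( p \) decreases, each row should approach \( (0.1, 0.2, \ldots, 1) \), the CDF of the uniform distribution on \( \{1, 2, \ldots, 10\} \).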

Next, recall that the exponential distribution with rate parameter \(r \gt 0\) has distribution function
\[ G(t) = 1 - e^{-r t}, \quad 0 \le t \lt \infty \]
The exponential distribution governs the time between arrivals in the Poisson model of random points in time.

Suppose that \(U_n\) has the geometric distribution on \(\N_+\) with success parameter \(p_n \in (0, 1]\) for each \(n \in \N_+\). Moreover, suppose that \(n p_n \to r \gt 0\) as \(n \to \infty\). The distribution of \(U_n / n\) converges to the exponential distribution with parameter \(r\) as \(n \to \infty\).

Let \( F_n \) denote the CDF of \( U_n / n \). Then for \( x \in [0, \infty) \) \[ F_n(x) = \P\left(\frac{U_n}{n} \le x\right) = \P(U_n \le n x) = \P\left(U_n \le \lfloor n x \rfloor\right) = 1 - (1 - p_n)^{\lfloor n x \rfloor} \] But we showed in the proof of Theorem 8 that \( (1 - p_n)^n \to e^{-r} \) as \( n \to \infty \). Since \( \lfloor n x \rfloor \big/ n \to x \), it follows that \( (1 - p_n)^{\lfloor n x \rfloor} = \left[(1 - p_n)^n\right]^{\lfloor n x \rfloor / n} \to e^{-r x} \). Hence \( F_n(x) \to 1 - e^{-r x} \) as \( n \to \infty \), which is the CDF of the exponential distribution with rate parameter \( r \).
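A short numerical check of this limit is sketched below, under the assumption \( p_n = r / n \) with \( n \gt r \). The function names and the particular values of \( r \) and \( x \) are illustrative.

```python
import math

def scaled_geometric_cdf(n: int, r: float, x: float) -> float:
    """CDF of U_n / n at x >= 0, where U_n is geometric with p_n = r / n (assumes n > r)."""
    p = r / n
    return 1 - (1 - p) ** math.floor(n * x)

def exponential_cdf(r: float, x: float) -> float:
    """CDF of the exponential distribution with rate parameter r."""
    return 1 - math.exp(-r * x)

r, x = 2.0, 0.75
for n in (10, 100, 10000, 1000000):
    print(n, scaled_geometric_cdf(n, r, x), exponential_cdf(r, x))
```

The first printed column of CDF values should approach the second as \( n \) grows.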

Note that the limiting condition on \(n\) and \(p\) in the last exercise is precisely the same as the condition for the convergence of the binomial distribution to the Poisson discussed above. For a deeper interpretation of both of these results, see the section on the Bernoulli trials and the Poisson process.

In the negative binomial experiment, set \(k = 1\) to get the geometric distribution. Then decrease the value of \(p\) and note the shape of the probability density function. With \(p = 0.5\) run the experiment 1000 times and note the agreement between the relative frequency function and the probability density function.

In the gamma experiment, set \(k = 1\) to get the exponential distribution, and set \(r = 5\). Note the shape of the probability density function. Run the experiment 1000 times and note the agreement between the empirical density function and the probability density function. Compare this experiment with the one in the previous exercise, and note the similarity, up to a change in scale.

Consider a random permutation \((X_1, X_2, \ldots, X_n)\) of the elements in the set \(\{1, 2, \ldots, n\}\). We say that a match occurs at position \(i\) if \(X_i = i\).

\(\P\left(X_i = i\right) = \frac{1}{n}\) for each \(i \in \{1, 2, \ldots, n\}\).

Thus, the matching events all have the same probability, which varies inversely with the number of trials.

\(\P\left(X_i = i, X_j = j\right) = \frac{1}{n (n - 1)}\) for \(i, \; j \in \{1, 2, \ldots, n\}\) with \(i \ne j\).

Thus, the matching events are dependent, and in fact are positively correlated. In particular, the matching events do not form a sequence of Bernoulli trials. The matching problem is studied in detail in the chapter on Finite Sampling Models. In particular, the number of matches \(N_n\) has the following probability density function: \[ f_n(k) = \frac{1}{k!} \sum_{j=0}^{n-k} \frac{(-1)^j}{j!}, \quad k \in \{0, 1, \ldots, n\} \]

The distribution of \(N_n\) converges to the Poisson distribution with parameter 1 as \(n \to \infty\).

For \( k \in \N \), \[ f_n(k) = \frac{1}{k!} \sum_{j=0}^{n-k} \frac{(-1)^j}{j!} \to \frac{1}{k!} \sum_{j=0}^\infty \frac{(-1)^j}{j!} = \frac{1}{k!} e^{-1} \] which is the PDF of the Poisson distribution with parameter 1.
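The speed of this convergence is easy to see numerically, since the partial sums of the alternating series converge very quickly. The sketch below uses the PDF of \( N_n \) given above; the function names are hypothetical.

```python
from math import exp, factorial

def matches_pmf(n: int, k: int) -> float:
    """PDF of the number of matches in a random permutation of {1, ..., n}."""
    partial_sum = sum((-1) ** j / factorial(j) for j in range(n - k + 1))
    return partial_sum / factorial(k)

def poisson1_pmf(k: int) -> float:
    """PDF of the Poisson distribution with parameter 1."""
    return exp(-1) / factorial(k)

for n in (3, 5, 10, 20):
    gap = max(abs(matches_pmf(n, k) - poisson1_pmf(k)) for k in range(n + 1))
    print(n, gap)
```

Already for moderate \( n \) the two PDFs should agree to many decimal places.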

In the matching experiment, increase \(n\) and note the apparent convergence of the probability density function for the number of matches. With selected values of \(n\), run the experiment 1000 times and note the agreement between the relative frequency function and the probability density function.

Suppose that \((X_1, X_2, \ldots)\) is a sequence of independent random variables, each with the standard exponential distribution. Thus, recall that the common distribution function is \[ G(x) = 1 - e^{-x}, \quad 0 \le x \lt \infty \]

The distribution of \(Y_n = \max\{X_1, X_2, \ldots, X_n\} - \ln(n) \) converges to the distribution with the following distribution function as \(n \to \infty\): \[ F(x) = e^{-e^{-x}}, \quad x \in \R\]

Let \( X_{(n)} = \max\{X_1, X_2, \ldots, X_n\} \) and recall that \( X_{(n)} \) has CDF \( G^n \). Let \( F_n \) denote the CDF of \( Y_n \). For \( x \in \R \) and \( n \) sufficiently large that \( x + \ln(n) \ge 0 \), \[ F_n(x) = \P(Y_n \le x) = \P\left[X_{(n)} \le x + \ln(n)\right] = G^n[x + \ln(n)] = \left[1 - e^{-(x + \ln(n))}\right]^n = \left(1 - \frac{e^{-x}}{n} \right)^n \] By our famous limit from calculus again, \( F_n(x) \to e^{-e^{-x}} \) as \( n \to \infty \).
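The limit can be checked numerically from the closed form of \( F_n \). In the sketch below the function names are hypothetical, and \( F_n(x) \) is set to 0 when \( x + \ln(n) \lt 0 \), since the maximum of nonnegative variables is nonnegative.

```python
import math

def centered_max_cdf(n: int, x: float) -> float:
    """CDF of max(X_1, ..., X_n) - ln(n) at x, for independent standard exponential X_i."""
    t = x + math.log(n)
    if t < 0:
        return 0.0  # the maximum of nonnegative variables cannot be negative
    return (1 - math.exp(-t)) ** n

def gumbel_cdf(x: float) -> float:
    """CDF of the type 1 extreme value (Gumbel) distribution."""
    return math.exp(-math.exp(-x))

for n in (10, 1000, 1000000):
    gap = max(abs(centered_max_cdf(n, x) - gumbel_cdf(x)) for x in (-2.0, -1.0, 0.0, 1.0, 3.0))
    print(n, gap)
```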

The limiting distribution in the last exercise is the type 1 extreme value distribution, also known as the Gumbel distribution in honor of Emil Gumbel. Extreme value distributions are studied in detail in the chapter on Special Distributions.

Recall that the Pareto distribution with shape parameter \(a \gt 0\) has distribution function \[F(x) = 1 - \frac{1}{x^a}, \quad 1 \le x \lt \infty\] The Pareto distribution is named for Vilfredo Pareto and is studied in more detail in the chapter on Special Distributions.

Suppose that \(X_n\) has the Pareto distribution with parameter \(n\) for each \(n \in \N_+\). Then

- The distribution of \(X_n\) converges to the distribution of the constant 1 as \(n \to \infty\).
- The distribution of \(Y_n = nX_n - n\) converges to the standard exponential distribution as \(n \to \infty\).

- The CDF of \( X_n \) is \( F_n(x) = 1 - 1 / x^n \) for \( x \ge 1 \). Hence \( F_n(x) = 0 \) for \( n \in \N_+ \) and \( x \le 1 \) while \( F_n(x) \to 1 \) as \( n \to \infty \) for \( x \gt 1 \). Thus the limit of \( F_n \) agrees with the CDF of the constant 1, except at 1, the point of discontinuity.
- Let \( G_n \) denote the CDF of \( Y_n \). For \( x \ge 0 \), \[ G_n(x) = \P(Y_n \le x) = \P(X_n \le 1 + x / n) = 1 - \frac{1}{(1 + x / n)^n} \] By our famous theorem from calculus again, it follows that \( G_n(x) \to 1 - 1 / e^x = 1 - e^{-x} \) as \( n \to \infty \), which is the CDF of the standard exponential distribution.
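Both parts of the result can be illustrated with a few evaluations of the Pareto CDF. This is a minimal sketch with hypothetical function names; the evaluation points are arbitrary.

```python
import math

def pareto_cdf(a: float, x: float) -> float:
    """CDF of the Pareto distribution with shape parameter a on [1, infinity)."""
    return 1 - x ** (-a) if x >= 1 else 0.0

def scaled_pareto_cdf(n: int, x: float) -> float:
    """CDF of Y_n = n X_n - n at x, where X_n is Pareto with shape parameter n."""
    return pareto_cdf(n, 1 + x / n)

for n in (10, 1000, 1000000):
    # First value: F_n(1.1), which tends to 1. Second: G_n(2). Third: the limit 1 - e^{-2}.
    print(n, pareto_cdf(n, 1.1), scaled_pareto_cdf(n, 2.0), 1 - math.exp(-2.0))
```

Note that \( F_n(1) = 0 \) for every \( n \), so the limit disagrees with the CDF of the constant 1 only at the point of discontinuity.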

The two fundamental theorems of basic probability theory, the law of large numbers and the central limit theorem, are studied in detail in the chapter on Random Samples. For this reason we will simply state the results in this section.

Suppose that \((X_1, X_2, \ldots)\) is a sequence of independent, identically distributed, real-valued random variables (defined on the same probability space) with mean \(\mu \in (-\infty, \infty)\) and standard deviation \(\sigma \in (0, \infty)\). Let \[ Y_n = \sum_{i=1}^n X_i \] denote the sum of the first \(n\) variables. A weak version of the law of large numbers states that the distribution of the average \( M_n = Y_n / n \) converges to the point mass distribution at \(\mu\) as \(n \to \infty\). From Theorem 5, the convergence is also in probability. In fact the convergence is with probability 1 (much stronger). On the other hand, the central limit theorem states that the distribution of the standard score \[ Z_n = \frac{Y_n - n \mu}{\sqrt{n} \, \sigma}\] converges to the standard normal distribution as \(n \to \infty\).
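As an illustration of the central limit theorem, the sketch below takes the \( X_i \) to be Bernoulli variables with success parameter \( p \) (an assumption chosen because \( Y_n \) is then binomial and its CDF can be computed exactly). The function names are hypothetical.

```python
import math

def binomial_cdf(n: int, p: float, c: float) -> float:
    """P(Y_n <= c) for Y_n binomial(n, p) with 0 < p < 1, summing the PDF in log space."""
    total = 0.0
    for k in range(0, min(n, math.floor(c)) + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log(1 - p))
        total += math.exp(log_pmf)
    return total

def standard_score_cdf(n: int, p: float, z: float) -> float:
    """P(Z_n <= z), where Z_n is the standard score of a sum of n Bernoulli(p) variables."""
    mu, sigma = p, math.sqrt(p * (1 - p))
    return binomial_cdf(n, p, n * mu + z * math.sqrt(n) * sigma)

def normal_cdf(z: float) -> float:
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

for n in (10, 100, 10000):
    print(n, standard_score_cdf(n, 0.3, 1.0), normal_cdf(1.0))
```

The two columns should move closer together as \( n \) grows, although the convergence is slower than in the earlier examples.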

The theorem below is an important result known as the Skorohod representation theorem.

Suppose that \((F_1, F_2, \ldots)\) and \(F\) are distribution functions, and that \(F_n \to F\) as \(n \to \infty\) in the sense of convergence of distribution. Then there exist random variables \((X_1, X_2, \ldots)\) and \(X\) (defined on the same probability space) such that

- \(X_n\) has distribution function \(F_n\) for each \(n \in \N_+\).
- \(X\) has distribution function \(F\).
- \(X_n \to X\) as \(n \to \infty\) with probability 1.

Let \(U\) be uniformly distributed on the interval \((0, 1)\). Define \(X_n = F_n^{-1}(U)\) and \(X = F^{-1}(U)\) where \(F_n^{-1}\) and \(F^{-1}\) are the quantile functions of \(F_n\) and \(F\) respectively. Recall that \(X_n\) has distribution function \(F_n\) for each \(n \in \N_+\), and \(X\) has distribution function \(F\). Let \(\epsilon \gt 0\) and let \(u \in (0, 1)\). Pick a continuity point \(x\) of \(F\) such that \(F^{-1}(u) - \epsilon \lt x \lt F^{-1}(u)\). Then \(F(x) \lt u\) and hence \(F_n(x) \lt u\) for \(n\) sufficiently large. It follows that \(F^{-1}(u) - \epsilon \lt x \lt F_n^{-1}(u)\) for \(n\) sufficiently large. Let \(n \to \infty\) and \(\epsilon \downarrow 0\) to conclude that \(F^{-1}(u) \le \liminf_{n \to \infty} F_n^{-1}(u)\). Next, let \(v\) satisfy \(0 \lt u \lt v \lt 1\) and let \(\epsilon \gt 0\). Pick a continuity point \(x\) of \(F\) such that \(F^{-1}(v) \lt x \lt F^{-1}(v) + \epsilon\). Then \(u \lt v \le F(x)\) and hence \(u \lt F_n(x)\) for \(n\) sufficiently large. It follows that \(F_n^{-1}(u) \le x \lt F^{-1}(v) + \epsilon\) for \(n\) sufficiently large. Let \(n \to \infty\) and \(\epsilon \downarrow 0\) to conclude that \(\limsup_{n \to \infty} F_n^{-1}(u) \le F^{-1}(v)\). Letting \(v \downarrow u\) it follows that \(\limsup_{n \to \infty} F_n^{-1}(u) \le F^{-1}(u)\) if \(u\) is a point of continuity of \(F^{-1}\). Therefore \(F_n^{-1}(u) \to F^{-1}(u)\) as \(n \to \infty\) if \(u\) is a point of continuity of \(F^{-1}\). Recall from analysis that since \(F^{-1}\) is increasing, the set \(D \subseteq (0, 1)\) of discontinuities of \(F^{-1}\) is countable. Since \( U \) has a continuous distribution, \(\P(U \in D) = 0\). Finally, it follows that \(\P(X_n \to X \text{ as } n \to \infty) = 1\).
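The quantile construction in the proof can be sketched in code. The example below is an illustration chosen here, not part of the original text: it applies the construction to the scaled geometric distributions with \( p_n = r / n \lt 1 \) and their exponential limit, feeding the same uniform values into both quantile functions. The function names, the seed, and the parameter values are assumptions.

```python
import math
import random

def scaled_geometric_quantile(n: int, r: float, u: float) -> float:
    """Quantile function of U_n / n at u, where U_n is geometric with p_n = r / n < 1."""
    p = r / n
    # Smallest k with 1 - (1 - p)^k >= u; the max guards the boundary value u = 0.
    trials = max(1, math.ceil(math.log1p(-u) / math.log1p(-p)))
    return trials / n

def exponential_quantile(r: float, u: float) -> float:
    """Quantile function of the exponential distribution with rate parameter r."""
    return -math.log1p(-u) / r

rng = random.Random(0)  # fixed seed so the run is reproducible
r = 2.0
uniforms = [rng.random() for _ in range(5)]
for n in (10, 1000, 100000):
    # X_n = F_n^{-1}(U) and X = F^{-1}(U) are built from the same values of U.
    gaps = [abs(scaled_geometric_quantile(n, r, u) - exponential_quantile(r, u))
            for u in uniforms]
    print(n, max(gaps))
```

Because every \( X_n \) and \( X \) are functions of the same \( U \), the convergence is pointwise in \( U \), which is the content of the theorem in this special case.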

The following result illustrates the value of the Skorohod representation.

Suppose that \((X_1, X_2, \ldots)\) and \(X\) are real-valued random variables (not necessarily defined on the same probability space) such that the distribution of \(X_n\) converges to the distribution of \(X\) as \(n \to \infty\). If \(g: \R \to \R\) is continuous, then the distribution of \(g(X_n)\) converges to the distribution of \(g(X)\) as \(n \to \infty\).

By the Skorohod representation theorem, there exist random variables \((Y_1, Y_2, \ldots)\) and \(Y\), defined on the same probability space, such that \(Y_n\) has the same distribution as \(X_n\) for each \(n \in \N_+\), \(Y\) has the same distribution as \(X\), and \(Y_n \to Y\) as \(n \to \infty\) with probability 1. Since \( g \) is continuous, \(g(Y_n) \to g(Y)\) as \(n \to \infty\) with probability 1. Hence the distribution of \(g(Y_n)\) converges to the distribution of \(g(Y)\) as \(n \to \infty\). But \(g(Y_n)\) has the same distribution as \(g(X_n)\) and \(g(Y)\) has the same distribution as \(g(X)\).

In this subsection, we give an important result known as Scheffé's theorem, named after Henry Scheffé. To give the theorem in full generality, so that it applies to discrete and continuous distributions, we need to use topics from other advanced sections in this chapter: the integral with respect to a positive measure, properties of the integral, and density functions. In turn, these sections depend on measure theory developed in the chapters on Foundations and Probability Measures.

To state our theorem, suppose that \( (S, \mathscr{S}, \mu) \) is a measure space, so that \( S \) is a set, \( \mathscr{S} \) is a \( \sigma \)-algebra of subsets of \( S \), and \( \mu \) is a positive measure on \( (S, \mathscr{S}) \). Further, suppose that \( P_n \) is a probability measure on \( (S, \mathscr{S}) \) that has density function \( f_n \) with respect to \( \mu \) for each \( n \in \N_+ \), and that \( P \) is a probability measure on \( (S, \mathscr{S}) \) that has density function \( f \) with respect to \( \mu \).

If \(f_n(x) \to f(x)\) as \(n \to \infty\) for almost all \( x \in S \) (with respect to \( \mu \)) then \(P_n(A) \to P(A)\) as \(n \to \infty\) uniformly in \(A \in \mathscr{S}\).

From basic properties of the integral it follows that for \( A \in \mathscr{S} \), \[\left|P(A) - P_n(A)\right| = \left|\int_A f \, d\mu - \int_A f_n \, d\mu \right| = \left| \int_A (f - f_n) \, d\mu\right| \le \int_A \left|f - f_n\right| \, d\mu \le \int_S \left|f - f_n\right| \, d\mu\] Let \(g_n = f - f_n\), and let \(g_n^+\) denote the positive part of \(g_n\) and \(g_n^-\) the negative part of \(g_n\). Note that \(g_n^+ \le f\) and \(g_n^+ \to 0\) as \(n \to \infty\) almost everywhere on \( S \). Since \( f \) is a probability density function, it is trivially integrable, so by the dominated convergence theorem, \(\int_S g_n^+ \, d\mu \to 0\) as \(n \to \infty\). But \(\int_S g_n \, d\mu = 0\) so \(\int_S g_n^+ \, d\mu = \int_S g_n^- \, d\mu\). Therefore \(\int_S \left|g_n\right| \, d\mu = 2 \int_S g_n^+ \, d\mu \to 0\) as \(n \to \infty\). Hence \(P_n(A) \to P(A)\) as \(n \to \infty\) uniformly in \(A \in \mathscr{S}\).
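The quantities in this proof can be computed explicitly in a discrete case. The sketch below is an illustration under stated assumptions: it takes \( f_n \) to be the binomial PDF with parameters \( n \) and \( r / n \) (with \( n \gt r \)), \( f \) the Poisson PDF with parameter \( r \), counting measure on \( \N \), and truncates the sums at a hypothetical cutoff `k_max` beyond which both PDFs are negligible for the values used.

```python
from math import comb, exp, factorial

def binomial_pmf(n: int, p: float, k: int) -> float:
    """Binomial PDF; comb(n, k) is 0 for k > n, so the PDF vanishes there."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(r: float, k: int) -> float:
    """Poisson PDF with parameter r."""
    return exp(-r) * r**k / factorial(k)

def scheffe_integrals(n: int, r: float = 5.0, k_max: int = 60):
    """Return (sum of g_n^+, sum of |g_n|) for g_n = f - f_n, truncated at k_max."""
    p = r / n
    diffs = [poisson_pmf(r, k) - binomial_pmf(n, p, k) for k in range(k_max + 1)]
    positive_part = sum(d for d in diffs if d > 0)
    l1_distance = sum(abs(d) for d in diffs)
    return positive_part, l1_distance

for n in (10, 100, 1000):
    positive_part, l1_distance = scheffe_integrals(n)
    # The proof shows that the integral of |g_n| equals twice the integral of g_n^+.
    print(n, positive_part, l1_distance)
```

The second column bounds \( \left|P(A) - P_n(A)\right| \) for every set \( A \) simultaneously, which is why its convergence to 0 gives uniform convergence.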

Of course, the most important special cases of Scheffé's theorem are to discrete and continuous distributions:

- Suppose that \( S \) is a countable set, \( \mathscr{S} \) is the collection of all subsets of \( S \), and \( \mu = \# \) is counting measure on \( (S, \mathscr{S}) \). Then \( P_n \) and \( P \) are discrete distributions, and \( f_n \) and \( f \) are the ordinary discrete probability density functions of \( P_n \) and \( P \), respectively.
- Suppose that \( S \subseteq \R^n \) is Lebesgue measurable, \( \mathscr{S} \) the collection of Lebesgue measurable subsets of \( S \), and that \( \mu = \lambda_n \) is \( n \)-dimensional Lebesgue measure on \( (S, \mathscr{S}) \). Then \( P_n \) and \( P \) are absolutely continuous distributions with density functions \( f_n \) and \( f \), respectively.

Generating functions are studied in the chapter on Expected Value. In part, the importance of generating functions stems from the fact that ordinary (pointwise) convergence of a sequence of generating functions corresponds to the convergence of the distributions in the sense of this section. Often it is easier to show convergence in distribution using generating functions than directly from the definition.