\(\newcommand{\var}{\text{var}}\) \(\newcommand{\sd}{\text{sd}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\)

6. Generating Functions

As usual, our starting point is a random experiment with probability measure \(\P\) on an underlying sample space. A generating function of a real-valued random variable is an expected value of a certain transformation of the random variable involving another (deterministic) variable. Most generating functions share four important properties:

  1. Under mild conditions, the generating function completely determines the distribution of the random variable.
  2. The generating function of a sum of independent variables is the product of the generating functions.
  3. The moments of the random variable can be obtained from the derivatives of the generating function.
  4. Ordinary (pointwise) convergence of a sequence of generating functions corresponds to convergence of the corresponding distributions, in the sense of convergence in distribution.

Property 1 is most important. Often a random variable is shown to have a certain distribution by showing that the generating function has a certain form. The process of recovering the distribution from the generating function is known as inversion. Property 2 is frequently used to determine the distribution of a sum of independent variables. By contrast, recall that the probability density function of a sum of independent variables is the convolution of the individual density functions, a much more complicated operation. Property 3 is useful because often computing moments from the generating function is easier than computing the moments directly from the definition. The last property is known as the continuity theorem. Often it is easier to show the convergence of the generating functions than to prove convergence of the distributions directly.

The numerical value of the generating function at a particular value of the free variable is of no interest, and so generating functions can seem rather unintuitive at first. But the important point is that the generating function as a whole encodes all of the information in the probability distribution in a very useful way. Generating functions are important and valuable tools in probability, as they are in other areas of mathematics, from combinatorics to differential equations.

We will study the three generating functions in the list below, which correspond to increasing levels of generality. The first is the most restrictive, but also by far the simplest, since the theory reduces to basic facts about power series that you will remember from calculus. The third is the most general and the one for which the theory is most complete and elegant, but it also requires basic knowledge of complex analysis. The one in the middle is perhaps the one most commonly used, and suffices for most distributions in applied probability.

  1. the probability generating function
  2. the moment generating function
  3. the characteristic function

We will also study the characteristic function for multivariate distributions, although analogous results hold for the other two types. In the basic theory below, be sure to try the proofs yourself before reading the ones in the text.

Basic Theory

The Probability Generating Function

Suppose that \(N\) is a random variable taking values in \(\N\). The probability generating function \(P\) of \(N\) is defined as follows, for all values \(t \in \R\) for which the expected value exists: \[ P(t) = \E\left(t^N\right) \] Let \(f\) denote the probability density function of \(N\), so that \(f(n) = \P(N = n)\) for \(n \in \N\).

The probability generating function can be obtained from the probability density function as follows: \[ P(t) = \sum_{n=0}^\infty f(n) t^n \]

Proof:

This follows from the discrete change of variables theorem for expected value.

Thus, \(P(t)\) is a power series in \(t\), with the values of the probability density function as the coefficients. In the language of combinatorics, \(P\) is the ordinary generating function of \(f\). Recall from calculus that there exists \(r \in [0, \infty]\) such that the series converges absolutely for \(\left|t\right| \lt r\) and diverges for \(\left|t\right| \gt r\). The number \(r\) is the radius of convergence of the series. Of course, if \( N \) just takes a finite set of values in \( \N \) then \( r = \infty \).
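For example, suppose that \( N \) is the score when a standard, fair die is thrown, so that \( f(n) = \frac{1}{6} \) for \( n \in \{1, 2, 3, 4, 5, 6\} \). Then \[ P(t) = \frac{1}{6}\left(t + t^2 + t^3 + t^4 + t^5 + t^6\right), \quad t \in \R \] a polynomial in \( t \), so \( r = \infty \).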

\(P(1) = 1\) and hence \(r \ge 1\).

Proof:

\( P(1) = \E\left(1^N\right) = \sum_{n=0}^\infty f(n) = 1 \)

Recall from calculus that a power series can be differentiated term by term, just like a polynomial. Each derivative series has the same radius of convergence as the original series. We denote the derivative of order \(n\) by \(P^{(n)}\). Recall also that if \(n \in \N\) and \(k \in \N\) with \(k \le n\), then the number of permutations of size \(k\) chosen from a population of \(n\) objects is \[ n^{(k)} = n (n - 1) \cdots (n - k + 1) \] The following theorem is the inversion result for probability generating functions.

The probability generating function \(P\) completely determines the distribution of \(N\): \[ f(k) = \frac{P^{(k)}(0)}{k!}, \quad k \in \N \]

Proof:

This is a standard result from the theory of power series. Differentiating \( k \) times gives \( P^{(k)}(t) = \sum_{n=k}^\infty n^{(k)} f(n) t^{n-k} \) for \( t \in (-r, r) \). Hence \( P^{(k)}(0) = k^{(k)} f(k) = k! f(k) \)
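The inversion formula is easy to check symbolically. The following minimal sketch uses the SymPy library (assumed available) on a small illustrative distribution:

```python
# Recover a probability density function from its PGF via f(k) = P^(k)(0) / k!.
# Test case: f(0) = 1/2, f(1) = 1/3, f(2) = 1/6, so P(t) = 1/2 + t/3 + t^2/6.
import sympy as sp

t = sp.symbols('t')
P = sp.Rational(1, 2) + t / 3 + t**2 / 6

for k in range(3):
    f_k = sp.diff(P, t, k).subs(t, 0) / sp.factorial(k)
    print(k, f_k)   # prints 1/2, 1/3, 1/6
```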

Our next result is not particularly important, but has a certain curiosity.

\(\P(N \text{ is even}) = \frac{1}{2}\left[1 + P(-1)\right]\).

Proof:

Note that \[ P(1) + P(-1) = \sum_{n=0}^\infty f(n) + \sum_{n=0}^\infty (-1)^n f(n) = 2 \sum_{k=0}^\infty f(2 k) = 2 \P(N \text{ is even }) \]

Recall that the factorial moment of \( N \) of order \( k \in \N \) is \( \E\left[N^{(k)}\right] \). The factorial moments can be computed from the derivatives of the probability generating function. The factorial moments, in turn, determine the ordinary moments about 0.

Suppose that the radius of convergence \(r \gt 1\). Then \(P^{(k)}(1) = \E\left[N^{(k)}\right]\) for \(k \in \N\). In particular, \(N\) has finite moments of all orders.

Proof:

As before, \( P^{(k)}(t) = \sum_{n=k}^\infty n^{(k)} f(n) t^{n-k} \) for \( t \in (-r, r) \). Hence if \( r \gt 1 \) then \( P^{(k)}(1) = \sum_{n=k}^\infty n^{(k)} f(n) = \E\left[N^{(k)}\right] \)

Suppose again that \( r \gt 1 \). Then

  1. \(\E(N) = P^\prime(1)\)
  2. \(\var(N) = P^{\prime \prime}(1) + P^\prime(1)\left[1 - P^\prime(1)\right]\)
Proof:
  1. \( \E(N) = \E\left[N^{(1)}\right] = P^\prime(1) \).
  2. \( \E\left(N^2\right) = \E[N (N - 1)] + \E(N) = \E\left[N^{(2)}\right] + \E(N) = P^{\prime\prime}(1) + P^\prime(1) \). Hence from (a), \( \var(N) = P^{\prime\prime}(1) + P^\prime(1) - \left[P^\prime(1)\right]^2 \).
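As an illustration, the following SymPy sketch (assuming the sympy package is available) computes the mean and variance of the fair die score from the PGF in the example above:

```python
# Mean and variance of a fair die score from its PGF P(t) = (t + ... + t^6)/6,
# using E(N) = P'(1) and var(N) = P''(1) + P'(1)[1 - P'(1)].
import sympy as sp

t = sp.symbols('t')
P = sum(t**k for k in range(1, 7)) / 6

P1 = sp.diff(P, t).subs(t, 1)            # P'(1)
P2 = sp.diff(P, t, 2).subs(t, 1)         # P''(1)
print(P1)                                # E(N) = 7/2
print(sp.simplify(P2 + P1 * (1 - P1)))   # var(N) = 35/12
```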

Suppose that \(N_1\) and \(N_2\) are independent random variables taking values in \(\N\), with probability generating functions \(P_1\) and \(P_2\) having radii of convergence \( r_1 \) and \( r_2 \), respectively. Then the probability generating function \( P \) of \(N_1 + N_2\) is given by \(P(t) = P_1(t) P_2(t)\) for \( \left|t\right| \lt r_1 \wedge r_2 \).

Proof:

Recall that the expected product of independent variables is the product of the expected values. Hence \[ P(t) = \E\left(t^{N_1 + N_2}\right) = \E\left(t^{N_1} t^{N_2}\right) = \E\left(t^{N_1}\right) \E\left(t^{N_2}\right) = P_1(t) P_2(t), \quad \left|t\right| \lt r_1 \wedge r_2 \]
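Combining the product rule with the inversion formula, the next sketch expands the square of the die PGF to recover the probability density function of the sum of the scores of two independent fair dice:

```python
# The PGF of the sum of two independent fair die scores is the square of the die PGF;
# the coefficients of the expanded square are the PDF of the sum.
import sympy as sp

t = sp.symbols('t')
P = sum(t**k for k in range(1, 7)) / 6
P_sum = sp.expand(P**2)

for k in range(2, 13):
    print(k, P_sum.coeff(t, k))   # 1/36, 1/18, 1/12, ..., 1/6, ..., 1/36
```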

The Moment Generating Function

Suppose that \(X\) is a real-valued random variable. The moment generating function of \(X\) is the function \(M\) defined by \[ M(t) = \E\left(e^{tX}\right), \quad t \in \R \] Note that since \(e^{t X} \ge 0\) with probability 1, \(M(t)\) exists, as a real number or \(\infty\), for any \(t \in \R\).

Suppose that \(X\) has a continuous distribution on \(\R\) with probability density function \(f\). Then \[ M(t) = \int_{-\infty}^\infty e^{t x} f(x) \, dx \]

Proof:

This follows from the change of variables theorem for expected value.

Thus, the moment generating function of \(X\) is closely related to the Laplace transform of the probability density function \(f\). The Laplace transform is named for Pierre Simon Laplace, and is widely used in many areas of applied mathematics. The basic inversion theorem for moment generating functions (similar to the inversion theorem for Laplace transforms) states that if \(M(t) \lt \infty\) for \(t\) in some open interval about 0, then \(M\) completely determines the distribution of \(X\). Thus, if two distributions on \(\R\) have moment generating functions that are equal (and finite) in an open interval about 0, then the distributions are the same.

Suppose that \(X\) has moment generating function \(M\) that is finite in some open interval \( I \) about 0. Then \(X\) has moments of all orders and \[ M(t) = \sum_{n=0}^\infty \frac{\E\left(X^n\right)}{n!} t^n, \quad t \in I \]

Proof:

Under the hypotheses, the expected value operator can be interchanged with the infinite series for the exponential function: \[ M(t) = \E\left(e^{t X}\right) = \E\left(\sum_{n=0}^\infty \frac{X^n}{n!} t^n\right) = \sum_{n=0}^\infty \frac{\E(X^n)}{n!} t^n, \quad t \in I \] For more details about the justification for the interchange, see the advanced section on properties of the integral in the chapter on Distributions.

\(M^{(n)}(0) = \E\left(X^n\right)\) for \(n \in \N\)

Proof:

This follows by the same argument as given above for the PGF: \( M^{(n)}(0) \big/ n! \) is the coefficient of order \( n \) in the power series in the previous theorem, namely \( \E\left(X^n\right) \big/ n! \). Hence \( M^{(n)}(0) = \E\left(X^n\right) \).

Thus, the derivatives of the moment generating function at 0 determine the moments of the variable (hence the name). In the language of combinatorics, the moment generating function is the exponential generating function of the sequence of moments. In particular, a random variable that does not have finite moments of all orders cannot have a moment generating function that is finite in an open interval about 0. Even when a random variable does have moments of all orders, the moment generating function may fail to be finite in any open interval about 0. A counterexample is given below.

Next we consider what happens to the moment generating function under some simple transformations of the random variables.

Suppose that \(X\) is a real-valued random variable with moment generating function \(M\) and that \(a\) and \(b\) are constants. The moment generating function \( N \) of \(Y = a + b X\) is given by \(N(t) = e^{a t} M(b t)\) for \( t \in \R \).

Proof:

\( \E\left[e^{t (a + b X)}\right] = \E\left(e^{t a} e^{t b X}\right) = e^{t a} \E\left[e^{(t b) X}\right] = e^{a t} M(b t) \).

Suppose that \(X_1\) and \(X_2\) are independent, real-valued random variables with moment generating functions \(M_1\) and \(M_2\) respectively. The moment generating function \( M \) of \(Y = X_1 + X_2\) is given by \(M(t) = M_1(t) M_2(t)\) for \( t \in \R \).

Proof:

As with the PGF, the proof for the MGF relies on the law of exponents and the fact that the expected value of a product of independent variables is the product of the expected values: \[ \E\left[e^{t (X_1 + X_2)}\right] = \E\left(e^{t X_1} e^{t X_2}\right) = \E\left(e^{t X_1}\right) \E\left(e^{t X_2}\right) = M_1(t) M_2(t) \]

The probability generating function of a variable can easily be converted into the moment generating function of the variable.

Suppose that \(X\) is a random variable taking values in \(\N\) with probability generating function \(G\) having radius of convergence \( r \). The moment generating function \( M \) of \(X\) is given by \(M(t) = G\left(e^t\right)\) for \( t \lt \ln(r) \).

Proof:

\( M(t) = \E\left(e^{t X}\right) = \E\left[\left(e^t\right)^X\right] = G\left(e^t\right) \) for \( e^t \lt r \).

The following theorem gives the Chernoff bounds, named for the mathematician Herman Chernoff. These are upper bounds on the tail probabilities of a random variable.

If \(X\) is a real-valued random variable with moment generating function \(M\), then for \(x \in \R\)

  1. \(\P(X \ge x) \le e^{-t x} M(t)\) for \(t \gt 0\)
  2. \(\P(X \le x) \le e^{-t x} M(t)\) for \(t \lt 0\)
Proof:
  1. From Markov's inequality, \(\P(X \ge x) = \P\left(e^{t X} \ge e^{t x}\right) \le \E\left(e^{t X}\right) \big/ e^{t x} = e^{-t x} M(t) \) if \(t \gt 0\).
  2. Similarly, \(\P(X \le x) = \P\left(e^{t X} \ge e^{t x}\right) \le e^{-t x} M(t) \) if \(t \lt 0\).

Naturally, the best Chernoff bound (in either (a) or (b)) is obtained by finding \(t\) that minimizes \(e^{-t x} M(t)\).
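For example, for the standard normal distribution the minimization can be done in closed form: the MGF is \( M(t) = e^{t^2/2} \) (derived below), so \( e^{-t x} M(t) = e^{t^2/2 - t x} \) is minimized at \( t = x \) for \( x \gt 0 \), giving \( \P(X \ge x) \le e^{-x^2/2} \). The following NumPy/SciPy sketch (assuming these packages are available) compares this bound with the exact tail probability:

```python
# Optimized Chernoff bound for the standard normal distribution.  With M(t) = exp(t^2/2),
# minimizing exp(-t x) M(t) over t > 0 gives t = x, so the bound on P(X >= x) is exp(-x^2/2).
import numpy as np
from scipy.stats import norm

for x in [1.0, 2.0, 3.0]:
    bound = np.exp(-x**2 / 2)     # optimized Chernoff bound
    exact = norm.sf(x)            # exact tail probability P(X >= x)
    print(f"x = {x}: Chernoff bound {bound:.5f}, exact tail {exact:.5f}")
```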

The Characteristic Function

From a mathematical point of view, the nicest of the generating functions is the characteristic function which is defined for a real-valued random variable \(X\) by \[ \chi(t) = \E\left(e^{i t X}\right) = \E\left[\cos(t X)\right] + i \E\left[\sin(t X)\right], \quad t \in \R \] Note that \(\chi\) is a complex valued function, and so this subsection requires some basic knowledge of complex analysis. The function \(\chi\) is defined for all \(t \in \R\) because the random variable in the expected value is bounded in magnitude. Indeed, \(\left|e^{i t X}\right| = 1\) for all \(t \in \R\). Many of the properties of the characteristic function are more elegant than the corresponding properties of the probability or moment generating functions, because the characteristic function always exists.

If \(X\) has a continuous distribution on \(\R\) with probability density function \(f\) then \[ \chi(t) = \int_{-\infty}^{\infty} e^{i t x} f(x) dx, \quad t \in \R \]

Proof:

This follows from the change of variables theorem for expected value, albeit a complex version.

Thus, the characteristic function of \(X\) is closely related to the Fourier transform of the probability density function \(f\). The Fourier transform is named for Joseph Fourier, and is widely used in many areas of applied mathematics.

As with other generating functions, the characteristic function completely determines the distribution. That is, random variables \(X\) and \(Y\) have the same distribution if and only if they have the same characteristic function. Indeed, the general inversion formula given next is a formula for computing certain combinations of probabilities from the characteristic function.

If \( a, \, b \in \R \) and \(a \lt b\) then \[ \int_{-n}^n \frac{e^{-i a t} - e^{- i b t}}{2 \pi i t} \chi(t) \, dt \to \P(a \lt X \lt b) + \frac{1}{2}\left[\P(X = a) + \P(X = b)\right] \text{ as } n \to \infty \]

The probability combinations on the right side completely determine the distribution of \(X\). A special inversion formula holds for continuous distributions:

Suppose that \(X\) has a continuous distribution with probability density function \(f\). At every point \(x \in \R\) where \(f\) is differentiable, \[ f(x) = \frac{1}{2 \pi} \int_{-\infty}^\infty e^{-i t x} \chi(t) \, dt \]
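As a numerical illustration (a SciPy quadrature sketch, assuming the package is available), the standard normal density can be recovered from the characteristic function \( \chi(t) = e^{-t^2/2} \), which follows from the normal MGF and the relation \( \chi(t) = M(i t) \) given below:

```python
# Recover the standard normal density from its characteristic function chi(t) = exp(-t^2/2)
# via the inversion integral (1 / 2 pi) * integral of exp(-i t x) chi(t) dt.
# Only the cosine part survives; the sine part integrates to zero by symmetry.
import numpy as np
from scipy.integrate import quad

def inverted_density(x):
    integrand = lambda t: np.cos(t * x) * np.exp(-t**2 / 2)
    val, _ = quad(integrand, -np.inf, np.inf)
    return val / (2 * np.pi)

for x in [0.0, 1.0, 2.0]:
    exact = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    print(x, inverted_density(x), exact)   # the last two columns agree
```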

This formula is essentially the inverse Fourier transform. As with the other generating functions, the characteristic function can be used to find the moments of \(X\). Moreover, this can be done even when only some of the moments exist.

If \(\E\left(\left|X^n\right|\right) \lt \infty\) then \[ \chi(t) = \sum_{k=0}^n \frac{\E\left(X^k\right)}{k!} (i t)^k + o(t^n) \] and therefore \(\chi^{(n)}(0) = i^n \E\left(X^n\right)\) for \( n \in \N \).

Next we consider how the characteristic function is changed under some simple transformations of the variables.

Suppose that \(X\) is a real-valued random variable with characteristic function \(\chi\) and that \(a\) and \(b\) are constants. The characteristic function \( \psi \) of \(Y = a + b X\) is \(\psi(t) = e^{i a t} \chi(b t)\) for \( t \in \R \).

Proof:

The proof is just like the one for the MGF: \( \psi(t) = \E\left[e^{i t (a + b X)}\right] = \E\left(e^{i t a} e^{i t b X}\right) = e^{i t a} \E\left[e^{i (t b) X}\right] = e^{i a t} \chi(b t) \).

Suppose that \(X_1\) and \(X_2\) are independent, real-valued random variables with characteristic functions \(\chi_1\) and \(\chi_2\) respectively. The characteristic function \( \chi \) of \(Y = X_1 + X_2\) is given by \(\chi(t) = \chi_1(t) \chi_2(t)\) for \( t \in \R \).

Proof:

Again, the proof is just like the one for the MGF: \[ \chi(t) = \E\left[e^{i t (X_1 + X_2)}\right] = \E\left(e^{i t X_1} e^{i t X_2}\right) = \E\left(e^{i t X_1}\right) \E\left(e^{i t X_2}\right) = \chi_1(t) \chi_2(t) \]

The characteristic function of a random variable can be obtained from the moment generating function, under the basic existence condition that we saw earlier.

Suppose that \(X\) is a real-valued random variable with moment generating function \(M\) that satisfies \(M(t) \lt \infty\) for \(t\) in some open interval \(I\) about 0. Then the characteristic function \(\chi\) of \(X\) satisfies \(\chi(t) = M(i t)\) for \(t \in I\).

The final important property of characteristic functions that we will discuss relates to convergence in distribution. Suppose that \((X_1, X_2, \ldots)\) is a sequence of real-valued random variables with characteristic functions \((\chi_1, \chi_2, \ldots)\) respectively. Since we are only concerned with distributions, the random variables need not be defined on the same probability space.

The continuity theorem:

  1. If the distribution of \(X_n\) converges to the distribution of a random variable \(X\) as \(n \to \infty\) and \(X\) has characteristic function \(\chi\), then \(\chi_n(t) \to \chi(t)\) as \(n \to \infty\) for all \(t \in \R\).
  2. Conversely, if \(\chi_n(t)\) converges to a function \(\chi(t)\) as \(n \to \infty\) for \(t\) in some open interval about 0, and if \(\chi\) is continuous at 0, then \(\chi\) is the characteristic function of a random variable \(X\), and the distribution of \(X_n\) converges to the distribution of \(X\) as \(n \to \infty\).

There are analogous versions of the continuity theorem for probability generating functions and moment generating functions. The continuity theorem can be used to prove the central limit theorem, one of the fundamental theorems of probability. Also, the continuity theorem has a straightforward generalization to distributions on \(\R^n\).

The Joint Characteristic Function

All of the generating functions that we have discussed have multivariate extensions. However, we will discuss the extension only for the characteristic function, the most important and versatile of the generating functions. There are analogous results for the other generating functions. Thus, suppose that \((X, Y)\) is a random vector for an experiment, taking values in \(\R^2\). The (joint) characteristic function of \((X, Y)\) is defined by \[ \chi(s, t) = \E\left[\exp(i s X + i t Y)\right], \quad (s, t) \in \R^2 \] Once again, the most important fact is that \(\chi\) completely determines the distribution: two random vectors taking values in \(\R^2\) have the same characteristic function if and only if they have the same distribution.

The joint moments can be obtained from the derivatives of the characteristic function.

If \(m, \, n \in \N\) and \(\E\left(\left|X^m Y^n\right|\right) \lt \infty\) then \[ \chi^{(m, n)}(0, 0) = i^{m + n} \E\left(X^m Y^n\right) \]

The marginal characteristic functions and the characteristic function of the sum can be easily obtained from the joint characteristic function:

Let \(\chi_1\), \(\chi_2\), and \(\chi_+\) denote the characteristic functions of \(X\), \(Y\), and \(X + Y\), respectively. For \(t \in \R\)

  1. \(\chi(t, 0) = \chi_1(t)\)
  2. \(\chi(0, t) = \chi_2(t)\)
  3. \(\chi(t, t) = \chi_+(t)\)
Proof:

All three results follow immediately from the definitions.

\(X\) and \(Y\) are independent if and only if \(\chi(s, t) = \chi_1(s) \chi_2(t)\) for all \((s, t) \in \R^2\).

Naturally, the results for bivariate characteristic functions have analogies in the general multivariate case. Only the notation is more complicated.

Examples and Applications

As always, be sure to try the computational problems yourself before reading the solutions and answers in the text.

Bernoulli Trials

Suppose \(X\) is an indicator random variable with \(p = \P(X = 1)\), where \(p \in [0, 1]\) is a parameter. Then \(X\) has probability generating function \(P(t) = 1 - p + p t\) for \(t \in \R\).

Proof:

\( P(t) = \E\left(t^X\right) = t^0 (1 - p) + t^1 p = 1 - p + p t \) for \( t \in \R \).

Recall that a Bernoulli trials process is a sequence \((X_1, X_2, \ldots)\) of independent, identically distributed indicator random variables. In the usual language of reliability, \(X_i\) denotes the outcome of trial \(i\), where 1 denotes success and 0 denotes failure. The probability of success \(p = \P(X_i = 1)\) is the basic parameter of the process. The process is named for Jacob Bernoulli. A separate chapter on the Bernoulli Trials explores this process in more detail.

The number of successes in the first \(n\) trials is \(Y_n = \sum_{i=1}^n X_i\). Recall that this random variable has the binomial distribution with parameters \(n\) and \(p\), which has probability density function \( f_n \) given by \[ f_n(y) = \binom{n}{y} p^y (1 - p)^{n - y}, \quad y \in \{0, 1, \ldots, n\} \]

\(Y_n\) has probability generating function \(P_n(t) = (1 - p + p t)^n\) for \( t \in \R \).

Proof:

This follows immediately from the PGF of an indicator variable above and the general result above for sums.

If \(Y_n\) has the binomial distribution with parameters \(n\) and \(p\) then

  1. \(\E\left[Y_n^{(k)}\right] = n^{(k)} p^k\)
  2. \(\E\left(Y_n\right) = n p\)
  3. \(\var\left(Y_n\right) = n p (1 - p)\)
  4. \(\P(Y_n \text{ is even}) = \frac{1}{2}\left[1 + (1 - 2 p)^n\right]\)
Proof:
  1. Repeated differentiation gives \( P_n^{(k)}(t) = n^{(k)} p^k (1 - p + p t)^{n-k} \). Hence \( P_n^{(k)}(1) = n^{(k)} p^k \), which is \( \E\left[Y_n^{(k)}\right] \) by the general moment result above
  2. This follows from the general result above for the mean.
  3. This follows from the general result above for the variance.
  4. This follows from the general result above for the probability of an even value.
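These formulas are easy to check numerically. The following NumPy/SciPy sketch (the values \( n = 10 \) and \( p = 0.3 \) are arbitrary) compares the mean, variance, and even-value probability computed from the PDF with the closed forms above:

```python
# Check the binomial mean, variance, and even-value probability against the formulas,
# directly from the PDF, for n = 10 and p = 0.3.
import numpy as np
from scipy.stats import binom

n, p = 10, 0.3
k = np.arange(n + 1)
f = binom.pmf(k, n, p)

mean = np.dot(k, f)
var = np.dot(k**2, f) - mean**2
even = f[k % 2 == 0].sum()

print(mean, n * p)                          # both 3.0
print(var, n * p * (1 - p))                 # both 2.1
print(even, 0.5 * (1 + (1 - 2 * p)**n))     # both about 0.50005
```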

Suppose that \(U\) has the binomial distribution with parameters \(m\) and \(p\), \(V\) has the binomial distribution with parameters \(n\) and \(q\), and that \(U\) and \(V\) are independent.

  1. If \(p = q\) then \(U + V\) has the binomial distribution with parameters \(m + n\) and \(p\).
  2. If \(p \ne q\) then \(U + V\) does not have a binomial distribution.
Proof:

From the general result above for sums, note that the probability generating function of \(U + V\) is \(P(t) = (1 - p + p \, t)^m (1 - q + q \, t)^n\) for \(t \in \R\).

  1. If \( p = q \) then \( U + V \) has PGF \( P(t) = (1 - p + p t)^{m + n} \), which is the PGF of the binomial distribution with parameters \( m + n \) and \( p \).
  2. On the other hand, if \( p \ne q \), the PGF \( P \) does not have the functional form of a binomial PGF.

Suppose now that \( p \in (0, 1] \). The trial number \( N \) of the first success in the sequence of Bernoulli trials has the geometric distribution on \( \N_+ \) with success parameter \( p \). The probability density function is given by \[h(n) = p (1 - p)^{n-1}, \quad n \in \N_+\] The geometric distribution is studied in more detail in the chapter on Bernoulli trials.

Suppose that \(N\) has the geometric distribution on \( \N_+ \) with success parameter \( p \in (0, 1] \). Let \(Q\) denote the probability generating function of \(N\). Then

  1. \(Q(t) = \frac{p t}{1 - (1 - p)t}\) for \(-\frac{1}{1 - p} \lt t \lt \frac{1}{1 - p}\)
  2. \(\E\left[N^{(k)}\right] = k! \frac{(1 - p)^{k-1}}{p^k}\) for \( k \in \N_+ \)
  3. \(\E(N) = \frac{1}{p}\)
  4. \(\var(N) = \frac{1 - p}{p^2}\)
  5. \(\P(N \text{ is even}) = \frac{1 - p}{2 - p}\)
Proof:
  1. Using the formula for the sum of a geometric series, \[ Q(t) = \sum_{n=1}^\infty (1 - p)^{n-1} p t^n = p t \sum_{n=1}^\infty [(1 - p) t]^{n-1} = \frac{p t}{1 - (1 - p) t}, \quad \left|(1 - p) t\right| \lt 1 \]
  2. Repeated differentiation gives \( Q^{(k)}(t) = k! p (1 - p)^{k-1} \left[1 - (1 - p) t\right]^{-(k+1)} \) and then the result follows from the general result above for moments.
  3. This follows from (b) and the general result above for the mean.
  4. This follows from (b) and the general result above for the variance.
  5. This follows from the general result above for the probability of an even value.
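The formula for \(\P(N \text{ is even})\) can be checked numerically by truncating the geometric series (a short NumPy sketch; the value \( p = 0.3 \) is arbitrary, and the truncation is harmless since the terms decay geometrically):

```python
# P(N is even) for the geometric distribution: truncated series versus (1 - p)/(2 - p).
import numpy as np

p = 0.3
n = np.arange(1, 200)
f = p * (1 - p)**(n - 1)
print(f[n % 2 == 0].sum(), (1 - p) / (2 - p))   # both about 0.411765
```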

The probability that \( N \) is even comes up in the alternating coin tossing game with two players.

The Poisson Distribution

Recall that the Poisson distribution has probability density function \[ f(n) = e^{-a} \frac{a^n}{n!}, \quad n \in \N \] where \(a \gt 0\) is a parameter. The Poisson distribution is named after Simeon Poisson and is widely used to model the number of random points in a region of time or space; the parameter is proportional to the size of the region of time or space. The Poisson distribution is studied in more detail in the chapter on the Poisson Process.

Suppose that \(N\) has Poisson distribution with parameter \(a \gt 0\). Let \(P_a\) denote the probability generating function of \(N\). Then

  1. \(P_a(t) = e^{a (t - 1)}\) for \(t \in \R\)
  2. \(\E\left[N^{(k)}\right] = a^k\)
  3. \(\E(N) = a\)
  4. \(\var(N) = a\)
  5. \(\P(N \text{ is even}) = \frac{1}{2}\left(1 + e^{-2 a}\right)\)
Proof:
  1. Using the exponential series, \[ P_a(t) = \sum_{n=0}^\infty e^{-a} \frac{a^n}{n!} t^n = e^{-a} \sum_{n=0}^\infty \frac{(a t)^n}{n!} = e^{-a} e^{a t}, \quad t \in \R \]
  2. Repeated differentiation gives \( P_a^{(k)}(t) = e^{a (t - 1)} a^k \), so the result follows from the general result above for moments.
  3. This follows from (b) and the general result above for the mean.
  4. This follows from (b) and the general result above for the variance.
  5. This follows from the general result above for the probability of an even value.

Suppose that \(X\) has the Poisson distribution with parameter \(a \gt 0\), \(Y\) has the Poisson distribution with parameter \(b \gt 0\), and that \(X\) and \(Y\) are independent. Then \(X + Y\) has the Poisson distribution with parameter \(a + b\).

Proof:

In the notation of the previous result, note that \( P_a P_b = P_{a+b} \).

Suppose that \(N\) has the Poisson distribution with parameter \(a \gt 0\). Then \[ \P(N \ge n) \le e^{n - a} \left(\frac{a}{n}\right)^n, \quad n \gt a \]

Proof:

The PGF of \( N \) is \( P(t) = e^{a (t - 1)} \) and hence the MGF is \( P\left(e^t\right) = \exp\left(a e^t - a\right) \). From the Chernoff bounds we have \[ \P(N \ge n) \le e^{-t n} \exp\left(a e^t - a\right) = \exp\left(a e^t - a - tn\right) \] If \( n \gt a \) the expression on the right is minimized when \( t = \ln(n / a) \). Substituting gives the upper bound.
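The following SciPy sketch (assuming the package is available) compares this bound with the exact tail probability for \( a = 5 \) and several values of \( n \):

```python
# Chernoff bound P(N >= n) <= e^{n - a} (a / n)^n for the Poisson distribution,
# compared with the exact tail probability, for a = 5 and several n > a.
import numpy as np
from scipy.stats import poisson

a = 5.0
for n in [8, 12, 16]:
    bound = np.exp(n - a) * (a / n)**n
    exact = poisson.sf(n - 1, a)   # P(N >= n)
    print(f"n = {n}: bound {bound:.5f}, exact {exact:.5f}")
```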

The following theorem gives an important convergence result that is explored in more detail in the chapter on the Poisson process.

Suppose that \( p_n \in (0, 1) \) for \( n \in \N_+ \) and that \( n p_n \to a \gt 0 \) as \( n \to \infty \). Then the binomial distribution with parameters \( n \) and \( p_n \) converges to the Poisson distribution with parameter \( a \) as \( n \to \infty \).

Proof:

Let \(P_n\) denote the probability generating function of the binomial distribution with parameters \(n\) and \(p_n\). From the binomial PGF above we have \( P_n(t) = \left[1 + p_n (t - 1)\right]^n\) for \( t \in \R \). Since \( n p_n (t - 1) \to a (t - 1) \) as \( n \to \infty \), the famous limit \( \left(1 + c_n / n\right)^n \to e^c \) when \( c_n \to c \) gives \( P_n(t) \to e^{a (t - 1)} \) as \( n \to \infty \). But this is the PGF of the Poisson distribution with parameter \( a \), so the result follows from the continuity theorem.
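The convergence of the PGFs is easy to see numerically. The following sketch evaluates the binomial PGF with \( p_n = a / n \) at a fixed point and compares it with the Poisson PGF (the choices \( a = 2 \) and \( t = 0.5 \) are arbitrary):

```python
# Pointwise convergence of the binomial PGF [1 + p_n (t - 1)]^n, with p_n = a / n,
# to the Poisson PGF e^{a (t - 1)}, evaluated at t = 0.5 with a = 2.
import numpy as np

a, t = 2.0, 0.5
for n in [10, 100, 1000, 10000]:
    p_n = a / n
    print(n, (1 + p_n * (t - 1))**n)
print("limit:", np.exp(a * (t - 1)))   # e^{-1}, approximately 0.367879
```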

The Exponential Distribution

Recall that the exponential distribution is a continuous distribution with probability density function \[ f(t) = r e^{-r t}, \quad 0 \le t \lt \infty \] where \(r \gt 0\) is the rate parameter. This distribution is widely used to model failure times and other random times, and in particular governs the time between arrivals in the Poisson model. The exponential distribution is studied in more detail in the chapter on the Poisson Process.

Suppose that \(T\) has the exponential distribution with rate parameter \(r \gt 0\). Let \(M\) denote the moment generating function of \(T\).

  1. \(M(s) = \frac{r}{r - s}\) for \(-\infty \lt s \lt r\).
  2. \(\E(T^n) = n! \big/ r^n\) for \(n \in \N\)
Proof:
  1. \( M(s) = \E\left(e^{s T}\right) = \int_0^\infty e^{s t} r e^{-r t} \, dt = \int_0^\infty r e^{(s - r) t} \, dt = \frac{r}{r - s} \) for \( s \lt r \).
  2. \( M^{(n)}(s) = \frac{r \, n!}{(r - s)^{n+1}} \) for \( n \in \N \) and \( s \lt r \), so \( M^{(n)}(0) = n! \big/ r^n \)

Suppose that \((T_1, T_2, \ldots)\) is a sequence of independent random variables, each having the exponential distribution with rate parameter \(r \gt 0\). The moment generating function \( M_n \) of \(U_n = \sum_{i=1}^n T_i\) is \[ M_n(s) = \left(\frac{r}{r - s}\right)^n, \quad s \in (-\infty, r) \]

Proof:

This follows from the previous exercise and the general result above for sums.

Random variable \(U_n\) has the Erlang distribution with shape parameter \(n\) and rate parameter \(r\), named for Agner Erlang. This distribution governs the \( n \)th arrival time in the Poisson model. The Erlang distribution is studied in more detail in the chapter on the Poisson Process.

Uniform Distributions

Suppose that \( a, \, b \in \R \) and \( a \lt b \). Recall that the continuous uniform distribution on the interval \( [a, b] \) has probability density function \( f \) given by \[ f(x) = \frac{1}{b - a}, \quad x \in [a, b] \] The distribution corresponds to selecting a point at random from the interval. Continuous uniform distributions arise in geometric probability and a variety of other applied problems.

Suppose that \(X\) is uniformly distributed on the interval \([a, b]\). Let \(M\) denote the moment generating function of \(X\). Then

  1. \(M(t) = \frac{e^{b t} - e^{a t}}{(b - a)t}\) if \( t \ne 0 \) and \( M(0) = 1 \)
  2. \(\E\left(X^n\right) = \frac{b^{n+1} - a^{n + 1}}{(n + 1)(b - a)}\) for \(n \in \N\)
Proof:
  1. \( M(t) = \int_a^b e^{t x} \frac{1}{b - a} \, dx = \frac{e^{b t} - e^{a t}}{(b - a)t}\) if \( t \ne 0 \). Trivially \( M(0) = 1 \)
  2. This is a case where the MGF is not helpful, and it's much easier to compute the moments directly: \( \E\left(X^n\right) = \int_a^b x^n \frac{1}{b - a} \, dx = \frac{b^{n+1} - a^{n + 1}}{(n + 1)(b - a)} \)

Suppose that \((X, Y)\) is uniformly distributed on the triangle \(T = \{(x, y) \in \R^2: 0 \le x \le y \le 1\}\). Compute each of the following:

  1. The joint moment generating function of \((X, Y)\).
  2. The moment generating function of \(X\).
  3. The moment generating function of \(Y\).
  4. The moment generating function of \(X + Y\).
Answer:
  1. \(M(s, t) = 2 \frac{e^{s+t} - 1}{s (s + t)} - 2 \frac{e^t - 1}{s t}\) if \( s \ne 0, \; t \ne 0\). \( M(0, 0) = 1 \)
  2. \(M_1(s) = 2 \left(\frac{e^s}{s^2} - \frac{1}{s^2} - \frac{1}{s}\right)\) if \(s \ne 0\). \( M_1(0) = 1 \)
  3. \(M_2(t) = 2 \frac{t e^t - e^t + 1}{t^2}\) if \( t \ne 0\). \( M_2(0) = 1 \)
  4. \(M_+(t) = \frac{e^{2 t} - 1}{t^2} - 2 \frac{e^t - 1}{t^2}\) if \(t \ne 0\). \( M_+(0) = 1 \)

A Bivariate Distribution

Suppose that \( (X, Y) \) has probability density function \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Compute each of the following:

  1. The joint moment generating function \( (X, Y) \).
  2. The moment generating function of \(X\).
  3. The moment generating function of \(Y\).
  4. The moment generating function of \(X + Y\).
Answer:
  1. \(M(s, t) = \frac{e^{s+t}(2 s t - s - t) + \left(e^s + e^t\right)(s + t - s t) - s - t}{s^2 t^2}\) if \(s \ne 0, \, t \ne 0\). \( M(0, 0) = 1 \)
  2. \(M_1(s) = \frac{3 s e^s - 2 e^s - s + 2}{2 s^2}\) if \(s \ne 0\). \( M_1(0) = 1 \)
  3. \(M_2(t) = \frac{3 t e^t - 2 e^t - t + 2}{2 t^2}\) if \(t \ne 0\). \( M_2(0) = 1 \)
  4. \(M_+(t) = \frac{2 (t - 1) e^{2 t} + 2 (2 - t) e^t - 2}{t^3}\) if \(t \ne 0\). \( M_+(0) = 1 \)

The Normal Distribution

Recall that the standard normal distribution is a continuous distribution with probability density function \[ \phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2}, \quad z \in \R \] Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in more detail in the chapter on Special Distributions.

Suppose that \(Z\) has the standard normal distribution. Let \(M\) denote the moment generating function of \(Z\). Then

  1. \(M(t) = e^{\frac{1}{2} t^2}\) for \(t \in \R\)
  2. \(\E\left(Z^n\right) = 1 \cdot 3 \cdots (n - 1)\) if \( n \) is even and \(\E\left(Z^n\right) = 0\) if \(n\) is odd.
Proof:
  1. First, \[ M(t) = \E\left(e^{t Z}\right) = \int_{-\infty}^\infty e^{t z} \frac{1}{\sqrt{2 \pi}} e^{-z^2 / 2} \, dz = \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{z^2}{2} + t z\right) \, dz \] Completing the square in \( z \) gives \(\exp\left(-\frac{z^2}{2} + t z\right) = \exp\left[\frac{1}{2} t^2 - \frac{1}{2}(z - t)^2 \right] = e^{\frac{1}{2} t^2} \exp\left[-\frac{1}{2} (z - t)^2\right] \). Hence \[ M(t) = e^{\frac{1}{2} t^2} \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi}} \exp\left[-\frac{1}{2} (z - t)^2\right] \, dz = e^{\frac{1}{2} t^2} \] because the function of \( z \) in the last integral is the probability density function for the normal distribution with mean \( t \) and variance 1.
  2. Note that \( M^\prime(t) = t M(t) \). Thus, repeated differentiation gives \( M^{(n)}(t) = p_n(t) M(t) \) for \( n \in \N \), where \( p_n \) is a polynomial of degree \( n \) satisfying \( p_{n+1}(t) = t p_n(t) + p_n^\prime(t) \). Since \( p_0 = 1 \), it's easy to see that \( p_n \) has only even or only odd terms, depending on whether \( n \) is even or odd, respectively. Thus, \( \E\left(Z^n\right) = M^{(n)}(0) = p_n(0) \). This is 0 if \( n \) is odd, and is the constant term \( 1 \cdot 3 \cdots (n - 1) \) if \( n \) is even. Of course, we can also see that the odd order moments must be 0 by symmetry.
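These values can be confirmed with SciPy's moment routine (assuming the scipy package is available):

```python
# Standard normal moments: 0 for odd orders and 1, 3, 15, 105 for orders 2, 4, 6, 8.
from scipy.stats import norm

print([round(norm.moment(n), 6) for n in range(1, 9)])   # [0, 1, 0, 3, 0, 15, 0, 105]
```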

More generally, for \(\mu \in \R\) and \(\sigma \in (0, \infty)\), recall that the normal distribution with mean \(\mu\) and standard deviation \(\sigma\) is a continuous distribution with probability density function \( f \) given by \[ f(x) = \frac{1}{\sqrt{2 \pi} \sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], \quad x \in \R \] Moreover, if \( Z \) has the standard normal distribution, then \( X = \mu + \sigma Z \) has the normal distribution with location parameter \( \mu \) and scale parameter \( \sigma \). Thus, we can easily find the moment generating function of \( X \):

Suppose that \(X\) has the normal distribution with mean \(\mu\) and standard deviation \(\sigma\). The moment generating function of \(X\) is \[M(t) = \exp\left(\mu t + \frac{1}{2} \sigma^2 t^2\right), \quad t \in \R\]

Proof:

This follows easily from the MGF of the standard normal distribution and the general result above for location-scale transformations: \( X = \mu + \sigma Z \) where \( Z \) has the standard normal distribution. Hence \[ M(t) = \E\left(e^{t X}\right) = e^{\mu t} \E\left(e^{\sigma t Z}\right) = e^{\mu t} e^{\frac{1}{2} \sigma^2 t ^2}, \quad t \in \R \]

If \(X\) and \(Y\) are independent, normally distributed random variables then \(X + Y\) has a normal distribution.

Proof:

Suppose that \( X \) has the normal distribution with mean \( \mu \) and standard deviation \( \sigma \), and that \( Y \) has the normal distribution with mean \( \nu \) and standard deviation \( \tau \). By the general result above for sums, the MGF of \( X + Y \) is \[ M_{X+Y}(t) = M_X(t) M_Y(t) = \exp\left(\mu t + \frac{1}{2} \sigma^2 t^2\right) \exp\left(\nu t + \frac{1}{2} \tau^2 t^2\right) = \exp\left[(\mu + \nu) t + \frac{1}{2}\left(\sigma^2 + \tau^2\right) t^2 \right] \] which we recognize as the MGF of the normal distribution with mean \( \mu + \nu \) and variance \( \sigma^2 + \tau^2 \). Of course, we already knew that \( \E(X + Y) = \E(X) + \E(Y) \), and since \( X \) and \( Y \) are independent, \( \var(X + Y) = \var(X) + \var(Y) \), so the new information is that the distribution is also normal.

The Pareto Distribution

Recall that the Pareto distribution is a continuous distribution with probability density function \[ f(x) = \frac{a}{x^{a + 1}}, \quad 1 \le x \lt \infty \] where \(a \gt 0\) is the shape parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is widely used to model financial variables such as income. The Pareto distribution is studied in more detail in the chapter on Special Distributions.

Suppose that \(X\) has the Pareto distribution with shape parameter \( a \), and let \( M \) denote the moment generating function of \( X \). Then

  1. \(\E\left(X^n\right) = \frac{a}{a - n}\) if \(n \lt a\) and \(\E\left(X^n\right) = \infty\) if \(n \ge a\)
  2. \(M(t) = \infty\) for \(t \gt 0\)
Proof:
  1. We have seen this computation before. \( \E\left(X^n\right) = \int_1^\infty x^n \frac{a}{x^{a+1}} \, dx = \int_1^\infty x^{n - a - 1} \, dx \). The integral evaluates to \( \frac{a}{a - n} \) if \( n \lt a \) and \( \infty \) if \( n \ge a \).
  2. This follows from part (a). Since \( X \ge 1 \), \( M(t) \) is increasing in \( t \). Thus \( M(t) \le 1 \) if \( t \lt 0 \). If \( M(t) \lt \infty \) for some \( t \gt 0 \), then \( M(t) \) would be finite for \( t \) in an open interval about 0, in which case \( X \) would have finite moments of all orders. Of course, it's also easy to see directly from the integral that \( M(t) = \infty \) for \( t \gt 0 \)

On the other hand, like all distributions on \( \R \), the Pareto distribution has a characteristic function. However, the characteristic function of the Pareto distribution does not have a simple, closed form.

The Cauchy Distribution

Recall that the (standard) Cauchy distribution is a continuous distribution with probability density function \[ f(x) = \frac{1}{\pi \left(1 + x^2\right)}, \quad x \in \R \] and is named for Augustin Cauchy. The Cauchy distribution is studied in more generality in the chapter on Special Distributions. The graph of \(f\) is known as the Witch of Agnesi, named for Maria Agnesi.

Suppose that \( X \) has the standard Cauchy distribution, and let \(M\) denote the moment generating function of \(X\). Then

  1. \(\E(X)\) does not exist.
  2. \(M(t) = \infty\) for \(t \ne 0\).
Proof:
  1. We have seen this computation before. \( \int_a^\infty \frac{x}{\pi (1 + x^2)} \, dx = \infty \) and \( \int_{-\infty}^a \frac{x}{\pi (1 + x^2)} \, dx = -\infty \) for every \( a \in \R \), so \( \int_{-\infty}^\infty \frac{x}{\pi (1 + x^2)} \, dx \) does not exist.
  2. Note that \( \int_0^\infty \frac{e^{t x}}{\pi (1 + x^2) } \, dx = \infty\) if \( t \ge 0 \) and \( \int_{-\infty}^0 \frac{e^{t x}}{\pi (1 + x^2)} \, dx = \infty \) if \( t \le 0 \).

Once again, all distributions on \( \R \) have characteristic functions, and the standard Cauchy distribution has a particularly simple one.

Let \(\chi\) denote the characteristic function of \(X\). Then \(\chi(t) = e^{-\left|t\right|}\) for \(t \in \R\).

Proof:

The proof of this result requires contour integrals in the complex plane, and is given in the section on the Cauchy distribution in the chapter on special distributions.

Counterexample

For the Pareto distribution, only some of the moments are finite; of course, the moment generating function cannot be finite in an open interval about 0. We will now give an example of a distribution for which all of the moments are finite, yet the moment generating function is still not finite in any open interval about 0. Furthermore, we will see two different distributions that have the same moments of all orders.

Suppose that \(Z\) has the standard normal distribution and let \(X = e^Z\). The distribution of \(X\) is known as the (standard) lognormal distribution. The lognormal distribution is studied in more generality in the chapter on Special Distributions. This distribution has finite moments of all orders, but its moment generating function is infinite at every \(t \gt 0\).

\(X\) has probability density function \[ f(x) = \frac{1}{\sqrt{2 \pi} x} \exp\left(-\frac{1}{2} \ln^2(x)\right), \quad x \gt 0 \]

  1. \(\E\left(X^n\right) = e^{\frac{1}{2}n^2}\) for \(n \in \N\).
  2. \(\E\left(e^{t X}\right) = \infty\) for \(t \gt 0\).
Proof:

We use the change of variables theorem. The transformation is \( x = e^z \) so the inverse transformation is \( z = \ln(x) \) for \( x \in (0, \infty) \) and \( z \in \R \). Letting \( \phi \) denote the PDF of \( Z \), it follows that the PDF of \( X \) is \( f(x) = \phi(z) \, dz / dx = \phi\left[\ln(x)\right] \big/ x \) for \( x \gt 0 \).

  1. We use the moment generating function of the standard normal distribution given above: \( \E\left(X^n\right) = \E\left(e^{n Z}\right) = e^{n^2 / 2}\).
  2. Note that \[ \E\left(e^{t X}\right) = \E\left[\sum_{n=0}^\infty \frac{(t X)^n}{n!}\right] = \sum_{n=0}^\infty \frac{\E(X^n)}{n!} t^n = \sum_{n=0}^\infty \frac{e^{n^2 / 2}}{n!} t^n = \infty, \quad t \gt 0 \] The interchange of expected value and sum is justified since \( X \) is nonnegative. See the advanced section on properties of the integral in the chapter on Distributions for more details.

Next we construct a different distribution with the same moments as \( X \).

Now let \(h(x) = \sin\left[2 \pi \ln(x)\right]\) for \(x \gt 0\) and let \(g(x) = f(x)\left[1 + h(x)\right]\) for \(x \gt 0\). Then

  1. \(g\) is a probability density function.
  2. If \( Y \) has probability density function \( g \) then \(\E\left(Y^n\right) = e^{\frac{1}{2} n^2}\) for \(n \in \N\)
Proof:

Note first that \( g(x) \ge 0 \) for \( x \gt 0 \). Next, let \( U \) have the normal distribution with mean \( n \) and variance 1. Using the change of variables \(u = \ln(x)\) and completing the square shows that for \(n \in \N\), \[ \int_0^\infty x^n f(x) h(x) \, dx = e^{-\frac{1}{2}n^2} \E\left[\sin(2 \pi U)\right] \] From symmetry it follows that \( \int_0^\infty x^n f(x) h(x) \, dx = 0 \) for \( n \in \N \). Therefore \[ \int_0^\infty x^n g(x) \, dx = \int_0^\infty x^n f(x) \, dx + \int_0^\infty x^n f(x) h(x) \, dx = \int_0^\infty x^n f(x) \, dx \] Letting \( n = 0 \) shows that \( g \) is a PDF, and then more generally, the moments of \( Y \) are the same as the moments of \( X \).
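The equality of the moments can also be checked numerically by quadrature, at least for the first few orders (a SciPy sketch, assuming the package is available; accuracy degrades for higher orders because of the heavy tails):

```python
# Check that f (the lognormal density) and g = f (1 + h) have the same moments,
# comparing both with the closed form e^{n^2 / 2}, for the first few orders.
import numpy as np
from scipy.integrate import quad

f = lambda x: np.exp(-0.5 * np.log(x)**2) / (np.sqrt(2 * np.pi) * x)
g = lambda x: f(x) * (1 + np.sin(2 * np.pi * np.log(x)))

for n in range(4):
    mf, _ = quad(lambda x, n=n: x**n * f(x), 0, np.inf, limit=200)
    mg, _ = quad(lambda x, n=n: x**n * g(x), 0, np.inf, limit=200)
    print(n, mf, mg, np.exp(n**2 / 2))   # the three columns agree for each n
```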

Figure: The graphs of \( f \) and \( g \), probability density functions of two distributions with the same moments of all orders.