\(\newcommand{\R}{\mathbb{R}}\)
\(\newcommand{\N}{\mathbb{N}}\)
\(\newcommand{\E}{\mathbb{E}}\)
\(\newcommand{\P}{\mathbb{P}}\)
\(\newcommand{\var}{\text{var}}\)
\(\newcommand{\sd}{\text{sd}}\)
\(\newcommand{\cov}{\text{cov}}\)
\(\newcommand{\cor}{\text{cor}}\)
\(\newcommand{\vc}{\text{vc}}\)
\(\newcommand{\bs}{\boldsymbol}\)

- Virtual Laboratories
- 4. Special Distributions
- The Multivariate Normal Distribution

The multivariate normal distribution is among the most important of multivariate distributions, particularly in statistical inference and the study of Gaussian processes such as Brownian motion. The distribution arises naturally from linear transformations of independent normal variables. In this section, we consider the bivariate normal distribution first, because explicit results can be given and because graphical interpretations are possible. Then, with the aid of matrix notation, we discuss the general multivariate distribution.

Recall that the probability density function \( \phi \) of the standard normal distribution is given by \[ \phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-z^2/2}, \quad z \in \R\] The corresponding distribution function is denoted \( \Phi \) and is considered a special function in mathematics: \[ \Phi(z) = \int_{-\infty}^z \phi(x) \, dx = \int_{-\infty}^z \frac{1}{\sqrt{2 \pi}} e^{-x^2/2} \, dx, \quad z \in \R \] Finally, the moment generating function \( m \) is given by \[ m(t) = \E\left(e^{t Z}\right) = \exp\left[\frac{1}{2} \var(t Z)\right] = e^{t^2/2}, \quad t \in \R \]

Suppose that \( Z \) and \( W \) are independent random variables, each with the standard normal distribution. The distribution of \( (Z, W) \) is known as the standard bivariate normal distribution.

The basic properties of the standard bivariate normal distribution follow easily from independence and properties of the (univariate) normal distribution. Recall first that the graph of a function \( f: \R^2 \to \R \) is a surface. For \( c \in \R \), the set of points \( \left\{(x, y) \in \R^2: f(x, y) = c\right\} \) is the level curve of \( f \) at level \( c \). The graph of \( f \) can be understood by means of the level curves.

The probability density function \( \phi_2 \) of the standard bivariate normal distribution is given by \[ \phi_2(z, w) = \frac{1}{2 \pi} e^{-\frac{1}{2}\left(z^2 + w^2\right)}, \quad (z, w) \in \R^2 \]

- The level curves of \( \phi_2 \) are circles centered at the origin.
- The mode of the distribution is \( (0, 0) \).
- \( \phi_2 \) is concave downward on \( \left\{(z, w) \in \R^2: z^2 + w^2 \lt 1\right\} \)

By independence, \( \phi_2(z, w) = \phi(z) \phi(w) \). Parts (a) and (b) are clear. For part (c), the second derivative matrix of \( \phi_2 \) is \[ \left[\begin{matrix} \phi_2(z, w)\left(z^2 - 1\right) & \phi_2(z, w) z w \\ \phi_2(z, w) z w & \phi_2(z, w)\left(w^2 - 1\right) \end{matrix}\right]\] with determinant \( \phi_2^2(z, w) \left(1 - z^2 - w^2\right) \). So the matrix is positive definite on the circular region \( \left\{(z, w) \in \R^2: z^2 + w^2 \lt 1\right\} \).

Clearly \( \phi \) has a number of symmetry properties as well: \( \phi_2(z, w) \) is symmetric in \( z \) about 0 so that \(\phi_2(-z, w) = \phi_2(z, w)\); \( \phi_2(z, w) \) is symmetric in \( w \) about 0 so that \( \phi_2(z, -w) = \phi_2(z, w) \); \( \phi_2(z, w) \) is symmetric in \( (z, w) \) so that \( \phi_2(z, w) = \phi_2(w, z) \). In short, \( \phi_2 \) has the classical bell shape

that we associate with normal distributions.

Open the bivariate normal experiment, keep the default settings to get the standard bivariate normal distribution. Run the experiment 1000 times. Observe the cloud of points in the scatterplot, and compare the empirical density functions to the probability density functions.

Suppose that \( (Z, W) \) has the standard bivariate normal distribution. The moment generating function \( m_2 \) of \( (Z, W) \) is given by \[ m_2(s, t) = \E\left[\exp(s Z + t W)\right] = \exp\left[\frac{1}{2} \var(s Z + t W)\right] = \exp\left[\frac{1}{2}\left(s^2 + t^2\right)\right], \quad (s, t) \in \R^2 \]

By independence, \( m_2(s, t) = m(s) m(t) \) for \( (s, t) \in \R^2 \) where \( m \) is the standard normal MGF.

The general bivariate normal distribution can be constructed by means of an affine transformation on a standard bivariate normal vector. The distribution has 5 parameters. As we will see, two are location parameters, two are scale parameters, and one is a correlation parameter.

Suppose that \( (Z, W) \) standard bivariate normal distribution. Let \(\mu, \, \nu \in \R\); \(\sigma, \, \tau \in (0, \infty)\); and \(\rho \in (-1, 1)\), and let \(X\) and \(Y\) be new random variables defined by \begin{align} X & = \mu + \sigma \, Z \\ Y & = \nu + \tau \, \rho \, Z + \tau \sqrt{1 - \rho^2} \, W \end{align} The joint distribution of \((X, Y)\) is called the bivariate normal distribution with parameters \((\mu, \nu, \sigma, \tau, \rho)\).

The following theorem gives basic properties of the bivariate normal distribution.

Suppose that \((X, Y)\) has the bivariate normal distribution with parameters \((\mu, \nu, \sigma, \tau, \rho)\). Then

- \(X\) is normally distributed with mean \(\mu\) and standard deviation \(\sigma\).
- \(Y\) is normally distributed with mean \(\nu\) and standard deviation \(\tau\).
- \(\cor(X, Y) = \rho\).
- \( X \) and \( Y \) are independent if and only if \( \rho = 0 \).

- This is a basic property of the normal distribution, and indeed is the way that the general normal variable is constructed from a standard normal variable.
- Similarly, \( \tau \rho Z \) and \( \tau \sqrt{1 - \rho^2} W \) are independent and each has a normal distribution. By another basic property of the normal distribution, \( \tau \rho Z + \tau \sqrt{1 - \rho^2} W \) also has a normal distribution, and thus so does \( Y = \nu + \tau \rho Z + \tau \sqrt{1 - \rho^2} W \). Since \( Z \) and \( W \) have mean 0, it follows from the linear property of expected value that \( \E(Y) = \nu \). Similarly, since \( Z \) and \( W \) have variance 1, it follows from basic properties of variance that \( \var(Y) = \tau^2 \rho^2 + \tau^2 (1 - \rho^2) = \tau^2\).
- Using bi-linear property of covariance we have \( \cov(X, Y) = \rho \tau \sigma \cov(Z, Z) = \rho \tau \sigma \), and hence from (a) and (b), \( \cor(X, Y) = \rho \).
- As a general property, recall that if \( X \) and \( Y \) are independent then \( \cor(X, Y) = 0 \). Conversely, if \( \rho = 0 \) then \( X = \mu + \sigma Z \) and \( Y = \nu + \tau W \). Since \( Z \) and \( W \) are independent, so are \( X \) and \( Y \).

Thus, for two random variables with a joint normal distribution, the random variables are independent if and only if they are uncorrelated.

In the bivariate normal experiment, change the standard deviations of \(X\) and \(Y\) with the scroll bars. Watch the change in the shape of the probability density functions. Now change the correlation with the scroll bar and note that the probability density functions do *not* change. For various values of the parameters, run the experiment 1000 times. Observe the cloud of points in the scatterplot, and compare the empirical density functions to the probability density functions.

We can use the change of variables formula to find the joint probability density function of \((X, Y)\).

The joint probability density function of \((X, Y)\) is \[ f(x, y) = \frac{1}{2 \, \pi \, \sigma \, \tau \, \sqrt{1 - \rho^2}} \exp \left\{ -\frac{1}{2 \, (1 - \rho^2)} \left[ \frac{(x - \mu)^2}{\sigma^2} - 2 \, \rho \frac{(x - \mu)(y - \nu)}{\sigma \, \tau} + \frac{(y - \nu)^2}{\tau^2} \right] \right\}, \quad (x, y) \in \R^2 \]

- The level curves of \( f \) are ellipses centered at \((\mu, \nu)\).
- The mode is \( (\mu, \nu) \).

Consider the transformation that defines \((x, y)\) from \((z, w)\) in the definition. The inverse transformation is given by \begin{align} z & = \frac{x - \mu}{\sigma} \\ w & = \frac{x - \mu}{\sigma \, \sqrt{1 - \rho^2}} - \rho \, \frac{y - \nu}{\tau \, \sqrt{1 - \rho^2}} \end{align} The Jacobian of the inverse transformation is \[ \frac{\partial(z, w)}{\partial(x, y)} = \frac{1}{\sigma \, \tau \, \sqrt{1 - \rho^2}} \] Note that the Jacobian is a constant, because the transformation is affine. The result now follows from the independence of \(Z\) and \(W\), and the change of variables formula

Note that \( f \) has the form \( f(x, y) = a \exp\left[- b g(x, y)\right] \) where \( a \) and \( b \) are positive constants and \[ g(x, y) = \frac{(x - \mu)^2}{\sigma^2} - 2 \, \rho \frac{(x - \mu)(y - \nu)}{\sigma \, \tau} + \frac{(y - \nu)^2}{\tau^2} \] The graph of \( g \) is a paraboloid opening upward. The level curves of \( f \) are the same as the level curves of \( g \) (but at different levels of course).

In the case of perfect correlation (\( \rho = 1 \) or \( \rho = -1 \)), the distribution of \( (X, Y) \) is also said to be bivariate normal, but degenerate. In this case, we know from our study of covariance and correlation that \( (X, Y) \) takes values in the regression line \( \left\{(x, y) \in \R^2: y = \nu + \rho \frac{\tau}{\sigma} (x - \mu)\right\} \), and hence does not have a probability density function (with respect to Lebesgue measure on \( \R^2 \)). Degenerate normal distributions will be discussed in more detail later.

In the bivariate normal experiment, run the experiment 1000 times with the values of \( \rho \) given below and selected values of \( \sigma \) and \( \tau \). Observe the cloud of points in the scatterplot and compare the empirical density functions to the probability density functions.

- \( \rho \in \{0, 0.3, 0.5, 0.7, 1\} \)
- \( \rho \in \{-0.3, -0.5, -0.7, -1\} \)

You may be perplexed by the lack of symmetry in how \( (X, Y) \) is defined in terms of \( (Z, W) \) in the definition. Note however that the distribution is completely determined by the 5 parameters. If we define \( X^\prime = \mu + \sigma \rho Z + \sigma \sqrt{1 - \rho^2} W \) and \( Y^\prime = \nu + \tau Z \) then \( (X^\prime, Y^\prime) \) has the same distribution as \( (X, Y) \), namely the bivariate normal distribution with parameters \( (\mu, \nu, \sigma, \tau, \rho) \) (although, of course \( (X^\prime, Y^\prime) \) and \( (X, Y) \) are different random vectors). There are other ways to define the same distribution as an affine transformation of \( (Z, W) \)—the situation will be clarified in the next subsection.

Suppose that \( (X, Y) \) has the bivariate normal distribution with parameters \( (\mu, \nu, \sigma, \tau, \rho) \). Then \( (X, Y) \) has moment generating function \( M \) given by \[ M(s, t) = \E\left[\exp\left(sX + tY\right)\right] = \exp\left[\E(s X + t Y) + \frac{1}{2} \var(s X + t Y)\right] = \exp\left[\mu s + \nu t + \frac{1}{2} \left( \sigma^2 s^2 + 2 \rho \sigma \tau s t + \tau^2 t^2\right) \right], \quad (s, t) \in \R^2 \]

Using the representation of \( (X, Y) \) in terms of the standard bivariate normal vector \( (Z, W) \) in the definition and collecting terms gives \[ M(s, t) = \E\left(\exp\left[(\mu s + \nu t) + (\sigma s + \rho \tau t) Z + \tau \sqrt{1 - \rho^2} t W\right] \right)\] Hence from independence we have \[ M(s, t) = \exp(\mu s + \nu t) m(\sigma s + \rho \tau t) m\left(\tau \sqrt{1 - \rho^2} t\right) \] where \( m \) is the standard normal MGF. Substituting and simplifying gives the result.

Like its univariate counterpart, the family of bivariate normal distributions is preserved under two types of transformations on the underlying random vector: affine transformations and sums of independent vectors. We start with a preliminary result on affine transformations that should help clarify the definition above

Suppose that \( (Z, W) \) has the standard bivariate normal distribution. Let \( X = a_1 + b_1 Z + c_1 W \) and \( Y = a_2 + b_2 Z + c_2 W \) where the coefficients are in \( \R \) and \( b_1 c_2 - c_1 b_2 \ne 0 \). Then \( (X, Y) \) has a bivariate normal distribution with parameters given by

- \( \E(X) = a_1 \)
- \( \E(Y) = a_2 \)
- \( \var(X) = b_1^2 + c_1^2 \)
- \( \var(Y) = b_2^2 + c_2^2 \)
- \( \cov(X, Y) = b_1 b_2 + c_1 c_2 \)

A direct proof using the change of variables formula is possible, but our goal is to show that \( (X, Y) \) can be written in the form given above in the definition. First, parts (a)–(e) follow from basic properties of expected value, variance, and covariance. So, in the notation used in the definition, we have \( \mu = a_1 \), \( \nu = a_2 \), \( \sigma = \sqrt{b_1^2 + c_1^2} \), \( \tau = \sqrt{b_2^2 + c_2^2} \), and \[ \rho = \frac{b_1 b_2 + c_1 c_2}{\sqrt{b_1^2 + c_1^2} \sqrt{b_2^2 + c_2^2}} \] (Note from the assumption on the coefficients that \( b_1^2 + c_1^2 \gt 0 \) and \( b_2^2 + c_2^2 \gt 0 \)). Simple algebra shows that \[ \sqrt{1 - \rho^2} = \frac{b_1 c_2 - c_1 b_2}{\sqrt{b_1^2 + c_1^2} \sqrt{b_2^2 + c_2^2}} \] Next we define \begin{align} U & = \frac{b_1 Z + c_1 W}{\sqrt{b_1^2 + c_1^2}} \\ V & = \frac{c_1 Z - b_1 W}{\sqrt{b_1^2 + c_1^2}} \end{align} The transformation that defines \( (u, v) \) from \( (z, w) \) is its own inverse, and has Jacobian 1. Hence it follows that \( (U, V) \) has the same joint distribution as \( (Z, W) \), namely the standard bivariate normal distribution. Simple algebra shows that \begin{align} X & = a_1 + \sqrt{b_1^2 + c_1^2} U = \mu + \sigma U \\ Y & = a_2 + \frac{b_1 b_2 + c_1 c_2}{\sqrt{b_1^2 + c_1^2}} U + \frac{c_1 b_2 - b_1 c_2}{\sqrt{b_1^2 + c_1^2}} V = \nu + \tau \rho U + \tau \sqrt{1 - \rho^2} V \end{align} This is the form given in the definition above, so it follows that \( (X, Y) \) has a bivariate normal distribution.

Now it is easy to show more generally that the bivariate normal distribution is closed with respect to affine transformations.

Suppose that \( (X, Y) \) has the bivariate normal distribution with parameters \( (\mu, \nu, \sigma, \tau, \rho) \). Define \(S = a_1 + b_1 X + c_1 Y\) and \(T = a_2 + b_2 X + c_2 Y\), where the coefficients are in \(\R\) and \(b_1 c_2 - c_1 b_2 \ne 0\). Then \((S, T)\) has a bivariate normal distribution with parameters as follows:

- \(\E(S) = a_1 + b_1 \mu + c_1 \nu\)
- \(\E(T) = a_2 + b_2 \mu + c_2 \nu\)
- \(\var(S) = b_1^2 \sigma^2 + c_1^2 \tau^2 + 2 b_1 c_1 \rho \sigma \tau\)
- \(\var(T) = b_2^2 \sigma^2 + c_2^2 \tau^2 + 2 b_2 c_2 \rho \sigma \tau\)
- \(\cov(S, T) = b_1 b_2 \sigma^2 + c_1 c_2 \tau^2 + (b_1 c_2 + b_2 c_1) \rho \sigma \tau \)

From our original construction we can write \( X = \mu + \sigma Z \) and \( Y = \nu + \tau \rho Z + \tau \sqrt{1 - \rho^2} W \) where \( (Z, W) \) has the standard bivariate normal distribution. Then by simple substitution, \( S = A_1 + B_1 Z + C_1 W \) and \( T = A_2 + B_2 Z + C_2 W \) where \(A_i = a_i + b_i \mu + c_i\nu \), \(B_i = b_i \sigma + c_i \tau \rho \), \(C_i = c_i \tau \sqrt{1 - \rho^2} \) for \( i \in \{1, 2\} \). By simple algebra, \[ B_1 C_2 - C_1 B_2 = \sigma \tau \sqrt{1 - \rho^2}(b_1 c_2 - c_1 b_2) \ne 0 \] Hence \( (S, T) \) has a bivariate normal distribution from the previous theorem. Parts (a)–(e) follow from basic properties of expected value, variance, and covariance.

The bivariate normal distribution is preserved with respect to sums of independent variables.

Suppose that for \( i \in \{1, 2\} \), \( (X_i, Y_i) \) has the bivariate normal distribution with parameters \( (\mu_i, \nu_i, \sigma_i, \tau_i, \rho_i) \) and that \( (X_1, Y_1) \) and \( (X_2, Y_2) \) are independent. Then \( (X_1 + X_2, Y_1 + Y_2) \) has the bivariate normal distribution with parameters given by

- \( \E(X_1 + X_2) = \mu_1 + \mu_2 \)
- \( \E(Y_1 + Y_2) = \nu_1 + \nu_2 \)
- \( \var(X_1 + X_2) = \sigma_1^2 + \sigma_2^2 \)
- \( \var(Y_1 + Y_2) = \tau_1^2 + \tau_2^2 \)
- \( \cov(X_1 + X_2, Y_1 + Y_2) \) = \( \rho_1 \sigma_1 \tau_1 + \rho_2 \sigma_2 \tau_2 \)

Let \( M_i \) denote the MGF of \( (X_i, Y_i) \) for \( i \in \{1, 2\} \) and let \( M \) denote the MGF of \( (X_1 + X_2, Y_1 + Y_2) \). By independence, \( M(s, t) = M_1(s, t) M_2(s, t) \) for \( (s, t) \in \R^2 \). Using the MGF of the bivariate normal distribution above, and basic properties of the exponential function, \[ M(s, t) = \exp\left(\E\left[s(X_1 + X_2) + t(Y_1 + Y_2)\right] + \frac{1}{2} \var\left[s(X_1 + X_2) + t(Y_1 + Y_2)\right]\right), \quad (s, t) \in \R^2 \] Of course from basic properties of expected value, variance, and covariance, \begin{align} \E\left[s(X_1 + X_2) + t(Y_1 + Y_2)\right] & = s(\mu_1 + \mu_2) + t(\nu_1 + \nu_2) \\ \var\left[s(X_1 + X_2) + t(Y_1 + Y_2)\right] & = s(\sigma_1^2 + \sigma_2^2) + t(\tau_1^2 + \tau_2^2) + 2 s t (\rho_1 \sigma_1 \tau_1 + \rho_2 \sigma_2 \tau_2) \end{align} Substituting gives the result.

Normal distributions are also perserved under conditioning.

Suppose that \( (X, Y) \) has the bivariate normal distribution with parameters \( (\mu, \nu, \sigma, \tau, \rho) \). Then for \( x \in \R \), the conditional distribution of \(Y\) given \(X = x\) is normal with mean and variance given by

- \(\E(Y \mid X = x) = \nu + \rho \tau \frac{x - \mu}{\sigma}\)
- \(\var(Y \mid X = x) = \tau^2 \left(1 - \rho^2\right)\)

The conditional PDF of \( Y \) given \( X = x \) is \( f(x, y) \big/ g(x) \) where \( f \) is the joint PDF given above, and where \( g \) is the PDF of \( X \), namely the normal PDF with mean \( \mu \) and standard deviation \( \sigma \). The result then follows after some algebra.

From the definition of \((X, Y)\) in terms of the standard bivariate normal vector \( (Z, W) \) note that \[ Y = \nu + \rho \tau \frac{X - \mu}{\sigma} + \tau \sqrt{1 - \rho^2} W \] Since that \(X\) and \(W\) are independent, the result follows by substitution and properties of the normal distribution.

Note that the conditional variance does not depend on \(x\)

In the bivariate normal experiment, set the standard deviation of \(X\) to 1.5, the standard deviation of \(Y\) to 0.5, and the correlation to 0.7.

- Run the experiment 100 times.
- For each run, compute \(\E(Y \mid X = x)\) the predicted value of \(Y\) for the given the value of \(X\).
- Over all 100 runs, compute the square root of the average of the squared errors between the predicted value of \(Y\) and the true value of \(Y\).

The following result is important in the simulation of normal variables.

Suppose that \( (Z, W) \) has the standard bivariate normal distribution. Define the polar coordinates \((R, \Theta)\) of \((Z, W)\) by the equations \(Z = R \, \cos(\Theta)\), \(W = R \, \sin(\Theta)\) where \(R \ge 0\) and \(0 \le \Theta \lt 2 \, \pi\). Then

- \(R\) has probability density function \(g(r) = r \, e^{-\frac{1}{2} r^2}\) for \(r \in [0, \infty)\).
- \(\Theta\) is uniformly distributed on \([0, 2 \, \pi)\).
- \(R\) and \(\Theta\) are independent.

The Jacobian of the polar coordinate transformation that gives \( (z, w) \) from \( (r, \theta) \) is \( r \), as we all remember from calculus. Hence by the change of variables theorem, the PDF \( g \) of \( (R, \Theta) \) in terms of the PDF \( \phi_2 \) of \( (Z, W) \) is given by \[ g(r, \theta) = \phi_2\left(r \cos(\theta), r \sin(\theta)\right) r = \frac{1}{2 \pi} r e^{-r^2 /2}, \quad (r, \theta) \in [0, \infty) \times [0, 2 \pi) \] The result then follows from the factorization theorem for independent random variables.

The distribution of \(R\) is known as the standard Rayleigh distribution, named for William Strutt, Lord Rayleigh. The Rayleigh distribution studied in more detail in a separate section.

Since the quantile function \( \Phi^{-1} \) of the normal distribution cannot be given in a simple, closed form, we cannot use the usual quantile method of simulating a normal random variable. However, the quantile method works quite well to simulate a Rayleigh variable, and of course simulating uniform variables is trivial. Hence we have a way of simulating a standard bivariate normal vector with a pair of random numbers (which, you will recall are independent random variables, each with the standard uniform distribution, that is, the uniform distribution on \( [0, 1) \)).

Suppose that \( U \) and \( V \) are independent random variables, each with the standard uniform distribution. Let \( R = \sqrt{-2 \ln(U)} \) and \( \Theta = 2 \pi V \). Define \( Z = R \cos(\Theta) \) and \( W = R \sin(\Theta) \). Then \( (Z, W) \) has the standard bivariate normal distribution.

The Rayleigh distribution function is \( F(r) = 1 - e^{-r^2 / 2} \) for \( r \in [0, \infty) \) and hence the quantile function is \( F^{-1}(p) = \sqrt{-2 \ln(1 - p)} \) for \( p \in [0, 1) \). Hence if \( U \) has the standard uniform distribution, then \( \sqrt{-2 \ln (1 - U)} \) has the Rayleigh distribution. But \( 1 - U \) also has the standard uniform distribution so \( R = \sqrt{-2 \ln(U)} \) aslo has the Rayleigh distribution. If \( V \) has the standard uniform distribution then of course \( 2 \pi V \) is uniformly distributed on \( [0, 2 \pi) \). If \( U \) and \( V \) are independent, then so are \( R \) and \( \Theta \). By the previous theorem, if \( Z = R \cos(\Theta) \) and \( W = R \sin(\Theta) \), then \( (Z, W) \) has the standard bivariate normal distribution.

Of course, if we can simulate \( (Z, W) \) with a standard bivariate normal distribution, then we can simulate \( (X, Y) \) with the general bivariate normal distribution, with parameter \( (\mu, \nu, \sigma, \tau, \rho) \) by the definition: \( X = \mu + \sigma Z \), \( Y = \nu + \tau \rho Z + \tau \sqrt{1 - \rho^2} W \).

The general multivariate normal distribution is a natural generalization of the bivariate normal distribution studied above. The exposition is very compact and elegant using expected value and covariance matrices, and would be horribly complex without these tools. Thus, this section requires some prerequisite knowledge of linear algebra. In particular, recall that \(\boldsymbol{A}^T\) denotes the transpose of a matrix \(\boldsymbol{A}\) and that we identify a vector in \(\R^n\) with the \(n \times 1\) column vector.

Suppose that \(\bs{Z} = (Z_1, Z_2, \ldots, Z_n)\) is a vector of independent random variables, each with the standard normal distribution. Then \(\bs{Z}\) is said to have the \(n\)-dimensional standard normal distribution.

\(\E(\bs{Z}) = \bs{0}\) (the zero vector in \(\R^n\)).

\(\vc(\bs{Z}) = I\) (the \(n \times n\) identity matrix).

\(\bs{Z}\) has probability density function \[ \phi_n(\bs{z}) = \frac{1}{(2 \pi)^{n/2}} \exp \left( -\frac{1}{2} \bs{z} \cdot \bs{z} \right) = \frac{1}{(2 \pi)^{n/2}} \exp\left(-\frac{1}{2} \sum_{i=1}^n z_i^2\right), \quad \bs{z} = (z_1, z_2, \ldots, z_n) \in \R^n \]

By independence, \( \phi_n(\bs{z}) = \phi(z_1) \phi(z_2) \cdots \phi(z_n) \).

\(\bs{Z}\) has moment generating function given by \[ \E\left[\exp(\bs{t} \cdot \bs{Z})\right] = \exp\left[\frac{1}{2} \var(\bs{t} \cdot \bs{Z})\right] = \exp \left( \frac{1}{2} \bs{t} \cdot \bs{t} \right) = \exp\left(\frac{1}{2} \sum_{i=1}^n t_i^2\right), \quad \bs{t} = (t_1, t_2, \ldots, t_n) \in \R^n \]

By independence, \( \E\left[\exp(\bs{t} \cdot \bs{Z})\right] = m(t_1) m(t_2) \cdots m(t_n) \) where \( m \) is the standard normal MGF.

Suppose that \(\bs{Z}\) has the \(n\)-dimensional standard normal distribution. Suppose also that \(\bs{\mu} \in \R^n\) and that \(\bs{A} \in \R^{n \times n}\) is invertible. The random vector \(\bs{X} = \bs{\mu} + \bs{A} \, \bs{Z}\) is said to have an \(n\)-dimensional normal distribution.

\(\E(\bs{X}) = \bs{\mu}\).

From the linear property of expected value, \( \E(\bs{X}) = \bs{\mu} + \E(\bs{Z}) = \bs{\mu} \)

\(\vc(\bs{X}) = \bs{A} \, \bs{A}^T\).

From basic properties of the variance-covariance matrix, \( \vc(\bs{X}) = \bs{A} \bs{A}^T \vc(\bs{Z}) = \bs{A} \bs{A}^T \)

Let \(\bs{V} = \vc(\bs{X}) = \bs{A} \, \bs{A}^T\), and recall that \( \bs{V} \) is symmetric and positive definite (and hence also invertible). We will now see that the multivariate normal distribution is completely determined by the expected value vector \( \bs{\mu} \) and the variance-covariance matrix \( \bs{V} \), and hence these give the basic parameters of the distribution.

\(\bs{X}\) has probability density function

\[ f(\bs{x}) = \frac{1}{(2 \, \pi)^{n/2} \sqrt{\det(\bs{V})}} \exp \left[ -\frac{1}{2} (\bs{x} - \bs{\mu}) \cdot \bs{V}^{-1} (\bs{x} - \bs{\mu}) \right], \quad \bs{x} \in \R^n \]The inverse of the transformation \( \bs{x} = \bs{\mu} + \bs{A} \bs{z} \) is \( \bs{z} = \bs{A}^{-1}(\bs{x} - \bs{\mu}) \) and hence the Jacobian of the inverse transformation is \( \det\left(\bs{A}^{-1}\right) = 1 \big/ \det(\bs{A}) \). Using the multivariate change of variables theorem, \[ f(\bs{x}) = \frac{1}{\left|\det(\bs{A})\right|} \phi_n\left[\bs{A}^{-1}(\bs{x} - \bs{\mu})\right] = \frac{1}{(2 \pi)^{n/2} \left|\det(\bs{A})\right|} \exp\left[-\frac{1}{2} \bs{A}^{-1}(\bs{x} - \bs{\mu}) \cdot \bs{A}^{-1}(\bs{x} - \bs{\mu})\right], \quad \bs{x} \in \R^n \] But \( \det(\bs{V}) = \det\left(\bs{A} \bs{A}^T\right) = \det(\bs{A}) \det\left(\bs{A}^T\right) = \left[\det(\bs{A})\right]^2 \) and hence \( \left|\det(\bs{A})\right| = \sqrt{\det(\bs{V})} \). Also, \begin{align} \bs{A}^{-1}(\bs{x} - \bs{\mu}) \cdot \bs{A}^{-1}(\bs{x} - \bs{\mu}) & = \left[\bs{A}^{-1}(\bs{x} - \bs{\mu})\right]^T \bs{A}^{-1}(\bs{x} - \bs{\mu}) = (\bs{x} - \bs{\mu})^T \left(\bs{A}^{-1}\right)^T \bs{A}^{-1} (\bs{x} - \bs{\mu}) \\ & = (\bs{x} - \bs{\mu})^T \left(\bs{A}^T\right)^{-1} \bs{A}^{-1} (\bs{x} - \bs{\mu}) = (\bs{x} - \bs{\mu})^T \left(\bs{A} \bs{A}^T\right)^{-1} (\bs{x} - \bs{\mu})\\ & = (\bs{x} - \bs{\mu})^T \bs{V}^{-1} (\bs{x} - \bs{\mu}) = (\bs{x} - \bs{\mu}) \cdot \bs{V}^{-1} (\bs{x} - \bs{\mu}) \end{align}

\(\bs{X}\) has moment generating function given by \[ \E\left[\exp(\bs{t} \cdot \bs{X})\right] = \exp\left[\E(\bs{t} \cdot \bs{X}) + \frac{1}{2} \var(\bs{t} \cdot \bs{X})\right] = \exp \left( \bs{t} \cdot \bs{\mu} + \frac{1}{2} \bs{t} \cdot \bs{V} \, \bs{t} \right), \quad \bs{t} \in \R^n \]

From the representation \( \bs{X} = \bs{\mu} + \bs{A} \bs{X} \) we have \(\E\left[\exp(\bs{t} \cdot \bs{X}\right] = \exp(\bs{t} \cdot \bs{\mu}) \E\left[\exp(\bs{t} \cdot \bs{A} \bs{Z})\right] \). But \( \bs{t} \cdot \bs{A} \bs{Z} = \left(\bs{A}^T \bs{t}\right) \cdot Z \) so using the MGF of \( \bs{Z} \) given above, \[ \E\left[\exp(\bs{t} \cdot \bs{A} \bs{Z})\right] = \exp\left[\frac{1}{2} \left(\bs{A}^T \bs{t}\right) \cdot \left(\bs{A}^T \bs{t}\right)\right] = \exp\left[\frac{1}{2} \bs{t}^T \bs{A} \bs{A}^T \bs{t}\right] = \exp\left[\frac{1}{2} \bs{t} \cdot \bs{V} \bs{t}\right] \]

Of course, the moment generating function completely determines the distribution. Thus, if a random vector \( \bs{X} \) in \( \R^n \) has a moment generating function of the form given above, for some \( \bs{\mu} \in \R^n \) and symmetric, positive definite \( \bs{V} \in \R^{n \times n} \), then \( \bs{X} \) has the \( n \)-dimensional normal distribution with mean \( \bs{\mu} \) and variance-covariance matrix \( \bs{V} \).

Note again that in the representation \( \bs{X} = \bs{\mu} + \bs{A} \bs{Z} \), the distribution of \( \bs{X} \) is uniquely determined by the expected value vector \( \bs{\mu} \) and the variance-covariance matrix \(\bs{V = \bs{A} \bs{A}^T}\), but not by \( \bs{\mu} \) and \( \bs{A} \). In general, for a given positive definite matrix \(\bs{V}\), there are many invertible matrices \(\bs{A}\) such that \( \bs{V} = \bs{A} \bs{A}^T\). A theorem in matrix theory states that there is a unique lower triangular matrix \(\bs{L}\) with this property. The representation \( \bs{X} + \bs{\mu} + \bs{L} \bs{Z} \) is the canonical representation of \( \bs{X} \).

If \( \bs{X} = (X, Y) \) has bivariate normal distribution with parameters \( (\mu, \nu, \sigma, \tau, \rho) \), then the lower triangular matrix \( \bs{L} \) such that \( \bs{L} \bs{L}^T = \vc(\bs{X}) \) is \[ \bs{L} = \left[\begin{matrix} \sigma & 0 \\ \tau \rho & \tau \sqrt{1 - \rho^2} \end{matrix} \right] \]

Note that \[ \bs{L} \bs{L}^T = \left[\begin{matrix} \sigma^2 & \sigma \tau \rho \\ \sigma \tau \rho & \tau^2 \end{matrix}\right] = \vc(X, Y) \]

Note that the matrix \( \bs{L} \) above gives the canonical representation of \( (X, Y) \) in terms of the standard normal vector \( (Z, W) \): \( X = \mu + \sigma Z \), \( Y = \nu + \tau \rho Z + \tau \sqrt{1 - \rho^2} W \).

If the matrix \( \bs{A} \in \R^{n \times n} \) in the definition is not invertible, then the variance-covariance matrix \( \bs{V} = \bs{A} \bs{A}^T\) is symmetric, but only positive semi-definite. The random vector \( \bs{X} = \bs{\mu} + \bs{A} \bs{Z} \) takes values in a lower dimensional affine subspace of \( \R^n \) that has measure 0 relative to \( n \)-dimensional Lebesgue measure \( \lambda_n \). Thus, \( \bs{X} \) does not have a probability density function relative to \( \lambda_n \), and so the distribution is degenerate. However, the formula for the moment generating function still holds. Degenerate normal distributions are discussed in more detail below.

The multivariate normal distribution is invariant under two basic types of transformations on the underlying random vectors: affine transformations (with linearly independent rows), and concatenation of independent vectors. As simple corollaries of these two results, the normal distribution is also invariant with respect to subsequences of the random vector, re-orderings of the terms in the random vector, and sums of independent random vectors. The main tool that we will use is the moment generating function. We start with the first main result on affine transformations.

Suppose that \(\bs{X}\) has the \(n\)-dimensional normal distribution with mean vector \(\bs{\mu}\) and variance-covariance matrix \(\bs{V}\). Suppose also that \(\bs{a} \in \R^m\) and that \(\bs{A} \in \R^{m \times n}\) has linearly independent rows (thus, \(m \le n\)). Then \(\bs{Y} = \bs{a} + \bs{A} \, \bs{X}\) has an \(m\)-dimensional normal distribution, with

- \(\E(\bs{Y}) = \bs{a} + \bs{A} \bs{\mu}\)
- \(\vc(\bs{Y}) = \bs{A} \bs{V} \bs{A}^T\)

For \( \bs{t} \in \R^m \), \( \E\left[\exp(\bs{t} \cdot \bs{Y})\right] = \exp(\bs{t} \cdot \bs{a}) \E\left[\bs{t} \cdot \bs{A} \bs{X}\right] \). but \( \bs{t} \cdot \bs{A} \bs{X} = \left(\bs{A}^T \bs{t}\right) \cdot \bs{X} \), so using the MGF of \( \bs{X} \) above, we have \[ \E\left[\exp(\bs{t} \cdot \bs{A} \bs{X})\right] = \exp\left[\left(\bs{A}^T \bs{t}\right) \cdot \bs{\mu} + \frac{1}{2}\left(\bs{A}^T \bs{t}\right) \cdot \bs{V} \left(\bs{A}^T \bs{t}\right)\right] \] But \( \left(\bs{A}^T \bs{t}\right) \cdot \bs{\mu} = \bs{t} \cdot \bs{A} \bs{\mu} \) and \(\left(\bs{A}^T \bs{t}\right) \cdot \bs{V} \left(\bs{A}^T \bs{t}\right) = \bs{t} \cdot \left(\bs{A} \bs{V} \bs{A}^T\right) \bs{t}\), so letting \( \bs{b} = \bs{a} + \bs{A} \bs{\mu} \) and \( \bs{U} = \bs{A} \bs{V} \bs{A}^T\) and putting the pieces together, we have \( \E\left[\exp( \bs{t} \cdot \bs{Y})\right] = \exp\left[ \bs{b} \cdot \bs{t} + \frac{1}{2} \bs{t} \cdot \bs{U} \bs{t} \right] \).

A clearly important special case is \( m = n \), which generalizes the original definition of the multivariate normal distribution. Thus, if \( \bs{a} \in \R^n \) and \( \bs{A} \in \R^{n \times n}\) is invertible, then \( \bs{Y} = \bs{a} + \bs{A} \bs{X}\) has an \( n \)-dimensional normal distribution. Here are some other corollaries:

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) has an \(n\)-dimensional normal distribution. If \( \{i_1, i_2, \ldots, i_m\} \) is a set of distinct indices, then \( \bs{Y} = \left(X_{i_1}, X_{i_2}, \ldots, X_{i_m}\right) \) has an \( m \)-dimensional normal distribution.

Let \( A \in \R^{m \times n}\) be the matrix defined by the condition that for \( m \in \{1, 2, \ldots, m\} \), row \( j \) has 1 in position \( i_j \) and has 0 in all other positions. Then \( \bs{A} \) has linearly independent rows (since the \( i_j \) are distinct in \( j \)) and \( \bs{Y} = \bs{A} \bs{X} \). Thus the result follows from the fundamental result above on affine transformations.

In the context of the previous result, if \( \bs{X} \) has mean vector \( \bs{\mu} \) and variance-covariance matrix \( \bs{V} \), then \( \bs{Y} \) has mean vector \( \bs{A} \bs{\mu} \) and variance-covariance matrix \( \bs{A} \bs{V} \bs{A}^T \), where \( \bs{A} \) is the 0-1 matrix defined in the proof. As simple corollaries, note that if \( \bs{X} = (X_1, x_2, \ldots, X_n) \) has an \( n \)-dimensional normal distribution, then any permutation of the coordinates of \( \bs{X} \) also has an \( n \)-dimensional normal distribution, and \( (X_1, X_2, \ldots, X_m) \) has an \( m \)-dimensional normal distribution for any \( m \le n \). Here is a slight extension of the last statement.

Suppose that \(\bs{X}\) is a random vector in \(\R^m\), \(\bs{Y}\) is a random vector in \(\R^n\), and that \((\bs{X}, \bs{Y})\) has an \((m + n)\)-dimensional normal distribution. Then

- \( \bs{X} \) has an \( m \)-dimensional normal distribution.
- \( \bs{Y} \) has an \( n \)-dimensional normal distribution.
- \(\bs{X}\) and \(\bs{Y}\) are independent if and only if \(\cov(\bs{X}, \bs{Y}) = \bs{0}\) (the \(m \times n\) zero matrix).

As we already noted, parts (a) and (b) are a simple consequence of the result above on sub vectors of a normally distributed vector. Thus, we just need to verify (c). In block form, note that \[ \vc(\bs{X}, \bs{Y}) = \left[\begin{matrix} \vc(\bs{X}) & \cov(\bs{X}, \bs{Y}) \\ \cov(\bs{Y}, \bs{X}) & \vc(\bs{Y})\end{matrix} \right] \] Now let \( M \) denote the moment generating function of \( (\bs{X}, \bs{Y}) \), \( M_1 \) the MGF of \( \bs{X} \), and \( M_2 \) the MGF of \( \bs{Y} \). From the form of the MGF above, note that \( M(\bs{s}, \bs{t}) = M_1(\bs{s}) M_2(\bs{t}) \) for all \( \bs{s} \in \R^m \), \( \bs{t} \in \R^n \) if and only if \( \cov(\bs{X}, \bs{Y}) = \bs{0} \), the \( m \times n \) zero matrix.

Next is the converse to part (c) of the previous result: concatenating independent normally distributed vectors produces another normally distributed vector.

Suppose that \( \bs{X} \) has the \( m \)-dimensional normal distribution with mean vector \( \bs{\mu} \) and variance-covariance matrix \( \bs{U} \), \( \bs{Y} \) has the \( n \)-dimensional normal distribution with mean vector \( \bs{\nu} \) and variance-covariance matrix \( \bs{V} \), and that \( \bs{X} \) and \( \bs{Y} \) are independent. Then \( \bs{Z} = (\bs{X}, \bs{Y})\) has the \(m + n\)-dimensional normal distribution with

- \( \E(\bs{X}, \bs{Y}) = (\bs{\mu}, \bs{\nu}) \)
- \( \vc(\bs{X}, \bs{Y}) = \left[\begin{matrix} \vc(\bs{X}) & \bs{0} \\ \bs{0}^T & \vc(\bs{Y})\end{matrix}\right]\) where \( \bs{0} \) is the \( m \times n \) zero matrix.

For \( \bs{t} \in \R^{m + n} \), write \( \bs{t} \) in block form as \( \bs{t} = (\bs{r}, \bs{s}) \) where \( \bs{r} \in \R^m \) and \(\bs{s} \in \R^n\). By independence, the MGF of \( (\bs{X}, \bs{Y}) \) is \[ \E\left(\exp\left[\bs{t} \cdot (\bs{X}, \bs{Y})\right]\right) = \E\left[\bs{r} \cdot \bs{X} + \bs{s} \cdot \bs{Y}\right] = \E\left[\exp(\bs{r} \cdot \bs{X})\right] \E\left[\exp(\bs{s} \cdot \bs{Y})\right]\] Using the formula for the normal MGF above we have \[ \E\left(\exp\left[\bs{t} \cdot (\bs{X}, \bs{Y})\right]\right) = \exp \left( \bs{r} \cdot \bs{\mu} + \frac{1}{2} \bs{r} \cdot \bs{U} \, \bs{r} \right) \exp \left( \bs{s} \cdot \bs{\nu} + \frac{1}{2} \bs{s} \cdot \bs{V} \, \bs{s} \right) = \exp\left[(\bs{r} \cdot \bs{\mu} + \bs{s} \cdot \bs{\nu}) + \frac{1}{2} (\bs{r} \cdot \bs{U} \bs{r} + \bs{s} \cdot \bs{V} \bs{s})\right] \] But \( \bs{r} \cdot \bs{\mu} + \bs{s} \cdot \bs{\nu} = \bs{t} \cdot (\bs{\mu}, \bs{\nu}) \) and \( \bs{r} \cdot \bs{U} \bs{r} + \bs{s} \cdot \bs{V} \bs{s} = \bs{t} \cdot \left[\begin{matrix} \vc(\bs{X}) & \bs{0} \\ \bs{0}^T & \vc(\bs{Y})\end{matrix}\right] \bs{t} \) so the proof is complet

Just as in the univariate case, the normal family of distributions is closed with respect to sums of independent variables. The proof follows easily from the previous result.

Suppose that \( \bs{X} \) has the \( n \)-dimensional normal distribution with mean vector \( \bs{\mu} \) and variance-covariance matrix \( \bs{U} \), \( \bs{Y} \) has the \( n \)-dimensional normal distribution with mean vector \( \bs{\nu} \) and variance-covariance matrix \( \bs{V} \), and that \( \bs{X} \) and \( \bs{Y} \) are independent. Then \( \bs{X} + \bs{Y}\) has the \( n \)-dimensional normal distribution with

- \( \E(\bs{X} + \bs{Y}) = \bs{\mu} + \bs{\nu}\)
- \( \vc(\bs{X} + \bs{Y}) = \bs{U} + \bs{V} \)

From the previous result \( (\bs{X}, \bs{Y}) \) has a \( 2 n \)-dimensional normal distribution. Moreover, \( \bs{X} + \bs{Y} = \bs{A}(\bs{X}, \bs{Y}) \) where \( \bs{A} \) is the \( n \times 2 n \) matrix defined by the condition that for \( i \in \{1, 2, \ldots, n\} \), row \( i \) has 1 in positions \( i \) and \( n + i \) and \( 0 \) in all other positions. The matrix \( A \) has linearly independent rows and thus the result follows from the fundamental result above on affine transformations. Parts (a) and (b) are standard results for sums of independent random vectors.

We close with a trivial corollary to the general result above on affine transformations, but this corollary points the way to a further generalization of the multivariate normal distribution that includes the degenerate distributions.

Suppose that \(\bs{X}\) has an \(n\)-dimensional normal distribution with mean vector \(\bs{\mu}\) and variance-covariance matrix \(\bs{V}\), and that \(\bs{a} \in \R^n\) with \( \bs{a} \ne \bs{0} \). Then \(Y = \bs{a} \cdot \bs{X} \) has a (univariate) normal distribution with

- \(\E(Y) = \bs{a} \cdot \bs{\mu}\)
- \(\var(Y) = \bs{a} \cdot \bs{V} \bs{a}\)

Note again that \( \bs{a} \cdot \bs{X} = \bs{a}^T \bs{X}\). Since \( \bs{a} \ne \bs{0} \), the single row of \( \bs{a}^T \) is linearly independent and hence the result follows from the general result above on affine transformations.

The last result can be used to give a simple, elegant definition of the multivariate normal distribution that includes the degenerate distributions as well as the ones we have considered so far. First we will adopt our general definition of the univariate normal distribution that includes constant random variables.

A random variable \( \bs{X} \) that takes values in \( \R^n \) has an \( n \)-dimensional normal distribution if and only if \( \bs{a} \cdot \bs{X} \) has a univariate normal distribution for every \( \bs{a} \in \R^n \).

Although an \( n \)-dimensional normal distribution may not have a probability density function with respect to \( n \)-dimensional Lebesgue measure \( \lambda_n \), the form of the moment generating function is unchanged.

Suppose that \( \bs{X} \) has mean vector \( \bs{\mu} \) and variance-covariance matrix \( \bs{V} \), and that \( \bs{X} \) has an \( n \)-dimensional normal distribution. The moment generating function of \(\bs{X}\) is given by \[ \E\left[\exp(\bs{t} \cdot \bs{X})\right] = \exp\left[\E(\bs{t} \cdot \bs{X}) + \frac{1}{2} \var(\bs{t} \cdot \bs{X})\right] = \exp \left( \bs{t} \cdot \bs{\mu} + \frac{1}{2} \bs{t} \cdot \bs{V} \, \bs{t} \right), \quad \bs{t} \in \R^n \]

If \( \bs{t} \in \R^n \), then by definition, \( \bs{t} \cdot \bs{X} \) has a univariate normal distribution. Thus \( \E\left[\exp(\bs{t} \cdot \bs{X})\right] \) is simply the moment generating function of \( \bs{t} \cdot \bs{X} \), evaluated at the argument 1. The results then follow from the univariate MGF.

Our new general definition really is a generalization.

Suppose that \( \bs{X} \) has an \( n \)-dimensional normal distribution in the sense of this subsection, and that the distribution of \( \bs{X} \) has a probability density function on \( \R^n \) with respect to Lebesgue measure \( \lambda_n \). Then \( \bs{X} \) has an \( n \)-dimensional normal distribution in our previous sense.

This follows from our previous results, since both the MGF and the PDF completely determine the distribution of \( \bs{X} \).