\(\newcommand{\var}{\text{var}}\) \(\newcommand{\cov}{\text{cov}}\) \(\newcommand{\cor}{\text{cor}}\) \(\newcommand{\vc}{\text{vc}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\) \(\newcommand{\bs}{\boldsymbol}\)

6. Expected Value and Covariance Matrices

The main purpose of this section is a discussion of expected value and covariance for random matrices and vectors. These topics are particularly important in multivariate statistical models and the multivariate normal distribution. This section requires some prerequisite knowledge of linear algebra.

Basic Theory

We will let \(\R^{m \times n}\) denote the space of all \(m \times n\) matrices of real numbers. In particular, we will identify \(\R^n\) with \(\R^{n \times 1}\), so that an ordered \(n\)-tuple can also be thought of as an \(n \times 1\) column vector. The transpose of a matrix \(\bs{A}\) is denoted \(\bs{A}^T\). As usual, our starting point is a random experiment with a probability measure \(\P\) on an underlying sample space.

Expected Value of a Random Matrix

Suppose that \(\bs{X}\) is an \(m \times n\) matrix of real-valued random variables, whose \((i, j)\) entry is denoted \(X_{i\,j}\). Equivalently, \(\bs{X}\) can be thought of as a random \(m \times n\) matrix. It is natural to define the expected value \(\E(\bs{X})\) to be the \(m \times n\) matrix whose \((i, j)\) entry is \(\E(X_{i\,j})\), the expected value of \(X_{i\,j}\).

Many of the basic properties of expected value of random variables have analogs for expected value of random matrices, with matrix operations replacing the ordinary ones.

\(\E(\bs{X} + \bs{Y}) = \E(\bs{X}) + \E(\bs{Y})\) if \(\bs{X}\) and \(\bs{Y}\) are random \(m \times n\) matrices.

\(\E(\bs{A}\,\bs{X}) = \bs{A} \, \E(\bs{X})\) if \(\bs{A}\) is a non-random \(m \times n\) matrix and \(\bs{X}\) is a random \(n \times k\) matrix.

\(\E(\bs{X}\,\bs{Y}) = \E(\bs{X}) \, \E(\bs{Y})\) if \(\bs{X}\) is a random \(m \times n\) matrix, \(\bs{Y}\) is a random \(n \times k\) matrix, and \(\bs{X}\) and \(\bs{Y}\) are independent.
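
The product rule for independent random matrices can be checked exactly on a small example. The following Python sketch is illustrative only: the two-point distributions and the helper names `mat_mul` and `expect` are assumptions made for this example and do not come from the text.

```python
from fractions import Fraction as F
from itertools import product

def mat_mul(A, B):
    """Ordinary matrix product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def expect(dist):
    """Entrywise expected value of a random matrix given as [(prob, matrix), ...]."""
    rows, cols = len(dist[0][1]), len(dist[0][1][0])
    return [[sum(p * M[i][j] for p, M in dist) for j in range(cols)]
            for i in range(rows)]

# Hypothetical distributions: X is 2 x 2 and Y is 2 x 1, each taking two values.
X_dist = [(F(1, 3), [[1, 0], [2, 1]]), (F(2, 3), [[0, 3], [1, 1]])]
Y_dist = [(F(1, 4), [[1], [2]]), (F(3, 4), [[-1], [0]])]

# Independence: the joint probability of (x, y) is the product of the marginals.
XY_dist = [(p * q, mat_mul(x, y)) for (p, x), (q, y) in product(X_dist, Y_dist)]

lhs = expect(XY_dist)                          # E(X Y)
rhs = mat_mul(expect(X_dist), expect(Y_dist))  # E(X) E(Y)
print(lhs == rhs)
```

Exact rational arithmetic is used so that the two sides can be compared with `==` rather than with a floating-point tolerance.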

Covariance Matrices

Suppose now that \(\bs{X}\) is a random vector in \(\R^m\) and \(\bs{Y}\) is a random vector in \(\R^n\). The covariance matrix of \(\bs{X}\) and \(\bs{Y}\) is the \(m \times n\) matrix \(\cov(\bs{X}, \bs{Y})\) whose \((i,j)\) entry is \(\cov(X_i,Y_j)\), the covariance of \(X_i\) and \(Y_j\).

\(\cov(\bs{X}, \bs{Y}) = \E\{[\bs{X} - \E(\bs{X})][\bs{Y} - \E(\bs{Y})]^T\}\).

\(\cov(\bs{X},\bs{Y}) = \E(\bs{X}\,\bs{Y}^T) - \E(\bs{X})\,[\E(\bs{Y})]^T\).

\(\cov(\bs{Y}, \bs{X}) = [\cov(\bs{X}, \bs{Y})]^T\).

\(\cov(\bs{X}, \bs{Y}) = \bs{0}\) if and only if \(\cov(X_i, Y_j) = 0\) for each \(i\) and \(j\), so that each coordinate of \(\bs{X}\) is uncorrelated with each coordinate of \(\bs{Y}\) (in particular, this holds if \(\bs{X}\) and \(\bs{Y}\) are independent).

\(\cov(\bs{X} + \bs{Y}, \bs{Z}) = \cov(\bs{X}, \bs{Z}) + \cov(\bs{Y}, \bs{Z})\) if \(\bs{X}\) and \(\bs{Y}\) are random vectors in \(\R^m\) and \(\bs{Z}\) is a random vector in \(\R^n\).

\(\cov(\bs{X}, \bs{Y} + \bs{Z}) = \cov(\bs{X}, \bs{Y}) + \cov(\bs{X}, \bs{Z})\) if \(\bs{X}\) is a random vector in \(\R^m\), and \(\bs{Y}\) and \(\bs{Z}\) are random vectors in \(\R^n\).

\(\cov(\bs{A}\,\bs{X}, \bs{Y}) = \bs{A}\,\cov(\bs{X}, \bs{Y})\) if \(\bs{X}\) is a random vector in \(\R^m\), \(\bs{Y}\) is a random vector in \(\R^n\), and \(\bs{A}\) is a non-random matrix in \(\R^{k \times m}\).

\(\cov(\bs{X}, \bs{A}\,\bs{Y}) = \cov(\bs{X}, \bs{Y})\,\bs{A}^T\) if \(\bs{X}\) is a random vector in \(\R^m\), \(\bs{Y}\) is a random vector in \(\R^n\), and \(\bs{A}\) is a non-random matrix in \(\R^{k \times n}\).
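
As an illustration of how these properties follow from the definition, here is a short derivation of the scaling property in the first argument. It uses only the representation of \(\cov(\bs{X}, \bs{Y})\) as the expected value of an outer product, together with the linearity of expected value:

\[ \cov(\bs{A}\,\bs{X}, \bs{Y}) = \E\{[\bs{A}\,\bs{X} - \bs{A}\,\E(\bs{X})][\bs{Y} - \E(\bs{Y})]^T\} = \bs{A}\,\E\{[\bs{X} - \E(\bs{X})][\bs{Y} - \E(\bs{Y})]^T\} = \bs{A}\,\cov(\bs{X}, \bs{Y}) \]

The property for the second argument follows in the same way, with the transpose arising from \([\bs{A}\,\bs{Y} - \bs{A}\,\E(\bs{Y})]^T = [\bs{Y} - \E(\bs{Y})]^T \bs{A}^T\).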

Variance-Covariance Matrices

Suppose now that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random vector in \(\R^n\). The covariance matrix of \(\bs{X}\) with itself is called the variance-covariance matrix of \(\bs{X}\):

\[ \vc(\bs{X}) = \cov(\bs{X}, \bs{X}) \]

\(\vc(\bs{X})\) is a symmetric \(n \times n\) matrix with \((\var(X_1), \var(X_2), \ldots, \var(X_n))\) on the diagonal.

\(\vc(\bs{X} + \bs{Y}) = \vc(\bs{X}) + \cov(\bs{X}, \bs{Y}) + \cov(\bs{Y}, \bs{X}) + \vc(\bs{Y})\) if \(\bs{X}\) and \(\bs{Y}\) are random vectors in \(\R^n\).

\(\vc(\bs{A}\,\bs{X}) = \bs{A}\,\vc(\bs{X})\,\bs{A}^T\) if \(\bs{X}\) is a random vector in \(\R^n\) and \(\bs{A}\) is a non-random matrix in \(\R^{m \times n}\).

If \(\bs{a} \in \R^n\), note that \(\bs{a}^T\,\bs{X}\) is simply the inner product or dot product of \(\bs{a}\) with \(\bs{X}\), and is a linear combination of the coordinates of \(\bs{X}\):

\[ \bs{a}^T \bs{X} = \sum_{i=1}^n a_i X_i \]

\(\var(\bs{a}^T\,\bs{X}) = \bs{a}^T\,\vc(\bs{X})\,\bs{a}\) if \(\bs{X}\) is a random vector in \(\R^n\) and \(\bs{a} \in \R^n\). Since a variance is nonnegative, it follows that \(\vc(\bs{X})\) is either positive semi-definite or positive definite. In particular, the eigenvalues and the determinant of \(\vc(\bs{X})\) are nonnegative.
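
The identity for the variance of a linear combination can be verified exactly on a small discrete distribution. In the Python sketch below, the three-point distribution `dist` and the coefficient vector `a` are hypothetical values chosen only for illustration.

```python
from fractions import Fraction as F

# Hypothetical discrete random vector in R^2, given as (probability, (x1, x2)).
dist = [(F(1, 2), (0, 1)), (F(1, 4), (2, 0)), (F(1, 4), (1, 3))]

def mean(g):
    """E[g(X)] for the discrete distribution above."""
    return sum(p * g(x) for p, x in dist)

n = 2
mu = [mean(lambda x, i=i: x[i]) for i in range(n)]

# vc(X)[i][j] = E(X_i X_j) - E(X_i) E(X_j)
vc = [[mean(lambda x, i=i, j=j: x[i] * x[j]) - mu[i] * mu[j]
       for j in range(n)] for i in range(n)]

a = [3, -2]  # arbitrary non-random coefficient vector

def lin(x):
    """The linear combination a^T x."""
    return sum(ai * xi for ai, xi in zip(a, x))

# Left side: variance of the scalar a^T X, computed directly.
var_direct = mean(lambda x: lin(x) ** 2) - mean(lin) ** 2

# Right side: the quadratic form a^T vc(X) a.
quad_form = sum(a[i] * vc[i][j] * a[j] for i in range(n) for j in range(n))

print(var_direct == quad_form, quad_form >= 0)
```

Both sides are computed with exact fractions, so the comparison tests the identity itself rather than a numerical approximation of it.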

\(\vc(\bs{X})\) is positive semi-definite (but not positive definite) if and only if there exist a nonzero \(\bs{a} \in \R^n\) and \(c \in \R\) such that, with probability 1,

\[ \bs{a}^T \bs{X} = \sum_{i=1}^n a_i X_i = c \]

Thus, if \(\vc(\bs{X})\) is positive semi-definite but not positive definite, then one of the coordinates of \(\bs{X}\) can be written as an affine function of the other coordinates (and hence can usually be eliminated in the underlying model). By contrast, if \(\vc(\bs{X})\) is positive definite, then this cannot happen; \(\vc(\bs{X})\) has positive eigenvalues and determinant and is invertible.
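
For a concrete illustration of the degenerate case (an assumed example, not one taken from the text), suppose that \(U\) is a real-valued random variable with \(\var(U) = \sigma^2 \gt 0\) and let \(\bs{X} = (U, 1 - U)\). Then

\[ \vc(\bs{X}) = \left(\begin{matrix} \sigma^2 & -\sigma^2 \\ -\sigma^2 & \sigma^2 \end{matrix}\right), \quad \det[\vc(\bs{X})] = 0 \]

With \(\bs{a} = (1, 1)\) we have \(\bs{a}^T \bs{X} = U + (1 - U) = 1\) with probability 1, and correspondingly \(\bs{a}^T\,\vc(\bs{X})\,\bs{a} = 0\). The second coordinate is an affine function of the first, so it carries no additional information.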

Best Linear Predictors

Suppose again that \(\bs{X} = (X_1, X_2, \ldots, X_m)\) is a random vector in \(\R^m\) and that \(\bs{Y} = (Y_1, Y_2, \ldots, Y_n)\) is a random vector in \(\R^n\). We are interested in finding the linear (technically affine) function of \(\bs{X}\), \(\bs{a} + \bs{B}\,\bs{X}\) where \(\bs{a} \in \R^n\) and \(\bs{B} \in \R^{n \times m}\), that is closest to \(\bs{Y}\) in the mean square sense. This problem is of fundamental importance in statistics when the random vector \(\bs{X}\), the predictor vector, is observable, but the random vector \(\bs{Y}\), the response vector, is not. Our discussion here generalizes the one-dimensional case, when \(X\) and \(Y\) are random variables. That problem was solved in the section on Covariance and Correlation. We will assume that \(\vc(\bs{X})\) is positive definite, so that none of the coordinates of \(\bs{X}\) can be written as an affine function of the other coordinates.

\(\E[\|\bs{Y} - (\bs{a} + \bs{B}\,\bs{X})\|^2]\) is minimized when \(\bs{B} = \cov(\bs{Y}, \bs{X})\,[\vc(\bs{X})]^{-1}\) and \(\bs{a} = \E(\bs{Y}) - \cov(\bs{Y}, \bs{X})\,[\vc(\bs{X})]^{-1}\,\E(\bs{X})\).

Thus, the linear function of \(\bs{X}\) that is closest to \(\bs{Y}\) in the mean square sense is the random vector

\[ L(\bs{Y}|\bs{X}) = \E(\bs{Y}) + \cov(\bs{Y},\bs{X})[\vc(\bs{X})]^{-1}[\bs{X} - \E(\bs{X})] \]

The function of \(\bs{x}\) given by

\[ L(\bs{Y}|\bs{X} = \bs{x}) = \E(\bs{Y}) + \cov(\bs{Y},\bs{X})[\vc(\bs{X})]^{-1}[\bs{x} - \E(\bs{X})] \]

is known as the (distribution) linear regression function. If we observe \(\bs{x}\) then \(L(\bs{Y} | \bs{X} = \bs{x})\) is our prediction of \(\bs{Y}\).
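
The regression formula can be turned into a small exact computation. The Python sketch below is restricted, as a simplifying assumption, to a real-valued response, so that \(\bs{B}\) reduces to a row of coefficients; the helper names `solve` and `best_linear_predictor` are hypothetical. The numerical inputs are the mean and variance-covariance entries of the vector \((X, Y, Z)\) that is uniformly distributed on \(0 \le x \le y \le z \le 1\), as listed in the examples at the end of this section, and the response is \(Z\).

```python
from fractions import Fraction as F

def solve(M, b):
    """Solve M w = b exactly by Gauss-Jordan elimination (M square and invertible)."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(b[i])] for i, row in enumerate(M)]
    for c in range(n):
        # Find a row with a nonzero pivot in column c and move it into place.
        piv = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[piv] = aug[piv], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vp for vr, vp in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def best_linear_predictor(mean_x, vc_x, mean_y, cov_yx):
    """Return (a, b) with L(Y | X = x) = a + b . x for a real-valued response Y.

    Since vc(X) is symmetric, b solves vc(X) b = cov(X, Y), which is the
    transpose of b^T = cov(Y, X) vc(X)^{-1}; then a = E(Y) - b . E(X).
    """
    b = solve(vc_x, cov_yx)
    a = F(mean_y) - sum(bi * mi for bi, mi in zip(b, mean_x))
    return a, b

# Predictors (X, Y), response Z.
mean_x = [F(1, 4), F(1, 2)]
vc_x = [[F(3, 80), F(1, 40)],
        [F(1, 40), F(1, 20)]]
mean_z = F(3, 4)
cov_zx = [F(1, 80), F(1, 40)]  # cov(Z, X), cov(Z, Y)

a, b = best_linear_predictor(mean_x, vc_x, mean_z, cov_zx)
print(a, b)  # intercept, then the coefficients of X and Y
```

The computed intercept is \(\frac{1}{2}\) and the coefficients of \(X\) and \(Y\) are \(0\) and \(\frac{1}{2}\), in agreement with the listed answer for \(L(Z | X, Y)\). For a vector-valued response, the same function could be applied to each coordinate separately.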

Non-linear regression with a single, real-valued predictor variable can be thought of as a special case of multiple linear regression. Thus, suppose that \(X\) is the predictor variable, \(Y\) is the response variable, and that \((g_1, g_2, \ldots, g_n)\) is a sequence of real-valued functions. We can apply the best linear predictor result above to find the linear function of \((g_1(X), g_2(X), \ldots, g_n(X))\) that is closest to \(Y\) in the mean square sense. We just replace \(X_i\) with \(g_i(X)\) for each \(i\).
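
As a sketch of this substitution, the following Python code computes the best predictor of \(Y\) that is quadratic in \(X\), taking \(g_1(x) = x\) and \(g_2(x) = x^2\), for the density \(f(x, y) = 15\,x^2\,y\) on \(0 \le x \le y \le 1\) that appears in the examples below. The closed form used in `moment` is obtained by integrating the density directly and is stated here as an assumption of the sketch; the helper names are hypothetical.

```python
from fractions import Fraction as F

def moment(i, j):
    """E(X^i Y^j) for the density 15 x^2 y on 0 <= x <= y <= 1.

    Integrating 15 x^(i+2) y^(j+1) over 0 <= x <= y, then over 0 <= y <= 1,
    gives 15 / ((i + 3) (i + j + 5)).
    """
    return F(15, (i + 3) * (i + j + 5))

def solve(M, b):
    """Solve M w = b exactly by Gauss-Jordan elimination (M square and invertible)."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(b[i])] for i, row in enumerate(M)]
    for c in range(n):
        piv = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[piv] = aug[piv], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vp for vr, vp in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

powers = [1, 2]  # predictors g_1(X) = X and g_2(X) = X^2

# Mean vector and variance-covariance matrix of (X, X^2).
mean_g = [moment(p, 0) for p in powers]
vc_g = [[moment(p + q, 0) - moment(p, 0) * moment(q, 0) for q in powers]
        for p in powers]

# Covariances of Y with each predictor.
cov_gy = [moment(p, 1) - moment(p, 0) * moment(0, 1) for p in powers]

b = solve(vc_g, cov_gy)  # coefficients of X and X^2
a = moment(0, 1) - sum(bi * mi for bi, mi in zip(b, mean_g))  # intercept
print(a, b)
```

The result is the intercept \(\frac{49}{76}\) with coefficients \(\frac{10}{57}\) and \(\frac{7}{38}\), which agrees with the answer listed for \(L(Y|X, X^2)\) in the examples.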

Examples and Applications

Suppose that \((X, Y)\) has probability density function \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Find each of the following:

  1. \(\E(X, Y)\)
  2. \(\vc(X, Y)\)
Answer:
  1. \(\left(\frac{7}{12}, \frac{7}{12}\right)\)
  2. \(\left(\begin{matrix} \frac{11}{144} & -\frac{1}{144} \\ -\frac{1}{144} & \frac{11}{144}\end{matrix}\right)\)
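
These values can be checked with exact arithmetic. The Python sketch below assumes the closed form for the mixed moments obtained by integrating \(x^i y^j (x + y)\) term by term over the unit square; the function name `moment` is hypothetical.

```python
from fractions import Fraction as F

def moment(i, j):
    """E(X^i Y^j) for the density x + y on the unit square.

    The integrand x^(i+1) y^j + x^i y^(j+1) is integrated term by term.
    """
    return F(1, (i + 2) * (j + 1)) + F(1, (i + 1) * (j + 2))

mean = (moment(1, 0), moment(0, 1))
cov_xy = moment(1, 1) - mean[0] * mean[1]
vc = [[moment(2, 0) - mean[0] ** 2, cov_xy],
      [cov_xy, moment(0, 2) - mean[1] ** 2]]

print(mean)
print(vc)
```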

Suppose that \((X, Y)\) has probability density function \(f(x, y) = 2\,(x + y)\) for \(0 \le x \le y \le 1\). Find each of the following:

  1. \(\E(X, Y)\)
  2. \(\vc(X, Y)\)
Answer:
  1. \(\left(\frac{5}{12}, \frac{3}{4}\right)\)
  2. \(\left(\begin{matrix} \frac{43}{720} & \frac{1}{48} \\ \frac{1}{48} & \frac{3}{80} \end{matrix} \right)\)

Suppose that \((X, Y)\) has probability density function \(f(x, y) = 6\,x^2\,y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Find each of the following:

  1. \(\E(X, Y)\)
  2. \(\vc(X, Y)\)
Answer:

Note that \(X\) and \(Y\) are independent.

  1. \(\left(\frac{3}{4}, \frac{2}{3}\right)\)
  2. \(\left(\begin{matrix} \frac{3}{80} & 0 \\ 0 & \frac{1}{18} \end{matrix} \right)\)

Suppose that \((X, Y)\) has probability density function \(f(x, y) = 15\,x^2\,y\) for \(0 \le x \le y \le 1\). Find each of the following:

  1. \(\E(X, Y)\)
  2. \(\vc(X, Y)\)
  3. \(L(Y|X)\)
  4. \(L(Y|X, X^2)\)
  5. Sketch the regression curves on the same set of axes.
Answer:
  1. \(\left( \frac{5}{8}, \frac{5}{6} \right)\)
  2. \(\left( \begin{matrix} \frac{17}{448} & \frac{5}{336} \\ \frac{5}{336} & \frac{5}{252} \end{matrix} \right)\)
  3. \(\frac{10}{17} + \frac{20}{51} X\)
  4. \(\frac{49}{76} + \frac{10}{57} X + \frac{7}{38} X^2\)

Suppose that \((X, Y, Z)\) is uniformly distributed on the region \(\{(x, y, z) \in \R^3: 0 \le x \le y \le z \le 1\}\). Find each of the following:

  1. \(\E(X, Y, Z)\)
  2. \(\vc(X, Y, Z)\)
  3. \(L(Z | X, Y)\)
  4. \(L(Y | X, Z)\)
  5. \(L(X | Y, Z)\)
Answer:
  1. \(\left(\frac{1}{4}, \frac{1}{2}, \frac{3}{4}\right)\)
  2. \(\left(\begin{matrix} \frac{3}{80} & \frac{1}{40} & \frac{1}{80} \\ \frac{1}{40} & \frac{1}{20} & \frac{1}{40} \\ \frac{1}{80} & \frac{1}{40} & \frac{3}{80} \end{matrix}\right)\)
  3. \(\frac{1}{2} + \frac{1}{2} Y\). Note that there is no \(X\) term.
  4. \(\frac{1}{2} X + \frac{1}{2} Z\). Note that this is the midpoint of the interval \([X, Z]\).
  5. \(\frac{1}{2} Y\). Note that there is no \(Z\) term.

Suppose that \(X\) is uniformly distributed on \((0, 1)\), and that given \(X\), random variable \(Y\) is uniformly distributed on \((0, X)\). Find each of the following:

  1. \(\E(X, Y)\)
  2. \(\vc(X, Y)\)
Answer:
  1. \(\left(\frac{1}{2}, \frac{1}{4}\right)\)
  2. \(\left(\begin{matrix} \frac{1}{12} & \frac{1}{24} \\ \frac{1}{24} & \frac{7}{144} \end{matrix} \right)\)
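
One route to these values, sketched here as a check, is to condition on \(X\). Given \(X\), the variable \(Y\) is uniformly distributed on \((0, X)\), so \(\E(Y \mid X) = X / 2\) and \(\E(Y^2 \mid X) = X^2 / 3\). Since \(\E(X) = \frac{1}{2}\) and \(\E(X^2) = \frac{1}{3}\),

\[ \E(Y) = \frac{1}{2} \E(X) = \frac{1}{4}, \quad \E(Y^2) = \frac{1}{3} \E(X^2) = \frac{1}{9}, \quad \E(X Y) = \frac{1}{2} \E(X^2) = \frac{1}{6} \]

and therefore

\[ \var(Y) = \frac{1}{9} - \frac{1}{16} = \frac{7}{144}, \quad \cov(X, Y) = \frac{1}{6} - \frac{1}{2} \cdot \frac{1}{4} = \frac{1}{24} \]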