A Markov process is a random process in which the future is independent of the past, given the present. Markov processes, named for Andrei Markov, are among the most important of all random processes. In a sense, they are the stochastic analogs of differential equations and recurrence relations, which are, of course, among the most important deterministic processes.
Suppose that \( \bs{X} = (X_t: t \in T) \) is a random process on a probability space \( (\Omega, \mathscr{F}, \P) \) where \( X_t \) is a random variable taking values in \( S \) for each \( t \in T \). We think of \( X_t \) as the state of a system at time \( t \). The state space \( S \) is usually either a countable set or a nice region of \( \R^k \) for some \( k \in \N_+ \). The time space \( T \) is either \( \N \) or \( [0, \infty) \).
For \( t \in T \), let \( \mathscr{F}_t \) denote the \( \sigma \)-algebra of events generated by \( (X_s: s \in T, s \le t) \). Intuitively, \( \mathscr{F}_t \) contains the events that can be defined in terms of \( X_s \) for \( s \le t \). In other words, if we are allowed to observe the random variables \( X_s \) for \( s \le t \), then we can tell whether or not a given event in \( \mathscr{F}_t \) has occurred. The family of \(\sigma\)-algebras \(\mathfrak{F} = \{\mathscr{F}_t: t \in T\}\) is known as a filtration.
The random process \( \bs{X} \) is a Markov process if the following property (known as the Markov property) holds: For every \( s, \; t \in T \) with \( s \le t \), and for every \( H \in \mathscr{F}_s \) and \( x \in S \), the conditional distribution of \( X_t \) given \( H \) and \( X_s = x \) is the same as the conditional distribution of \( X_t \) just given \( X_s = x \):
\[ \P(X_t \in A \mid H, X_s = x) = \P(X_t \in A \mid X_s = x), \quad A \subseteq S \]In the statement of the Markov property, think of \( s \) as the present time, so that \( t \) is a time in the future. Thus, \( x \) is the present state and \( H \) is an event that has occurred in the past. If we know the present state, then any additional knowledge of events in the past is irrelevant in terms of predicting the future.
The complexity of Markov processes depends greatly on whether the time space and the state space are discrete or continuous. In this chapter, we assume both are discrete; that is, we assume that \( T = \N \) and that \( S \) is countable (and hence the state variables have discrete distributions). In this setting, Markov processes are known as Markov chains. The theory of Markov chains is very beautiful and very complete.
The Markov property for a Markov chain \( \bs{X} = (X_0, X_1, \ldots) \) can be stated as follows: for any sequence of states \( (x_0, x_1, \ldots, x_{n-1}, x, y) \),
\[ \P(X_{n+1} = y \mid X_0 = x_0, X_1 = x_1, \ldots, X_{n-1} = x_{n-1}, X_n = x) = \P(X_{n+1} = y \mid X_n = x) \]Suppose that \( \bs{X} = (X_0, X_1, \ldots) \) is a Markov chain with state space \( S \). As before, let \( \mathscr{F}_n \) be the \( \sigma \)-algebra generated by \( (X_0, X_1, \ldots, X_n) \) for each \( n \in \N \). A random variable \( \tau \) taking values in \( \N \cup \{\infty\} \) is a stopping time or a Markov time for \( \bs{X} \) if \( \{\tau = n\} \in \mathscr{F}_n \) for each \( n \in \N \). Intuitively, we can tell whether or not \( \tau = n \) by observing the chain up to time \( n \). In a sense, a stopping time is a random time that does not require that we see into the future.
The quintessential example of a stopping time is the hitting time to a nonempty set of states \( A \subseteq S \):
\[ \tau_A = \min\{n \in \N_+: X_n \in A\} \]where as usual, we define \( \min(\emptyset) = \infty \). This random variable gives the first positive time that the chain is in \( A \). For more information see the section on filtrations and stopping times.
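As a small illustration of the definition (a sketch, not part of the original development), the hitting time can be computed from an observed path. The function name `hitting_time` and the sample paths are hypothetical. Time 0 is skipped, since \( \tau_A \) is the first positive time that the chain is in \( A \).

```python
import math

def hitting_time(path, A):
    """Return min{n >= 1 : path[n] in A}, or math.inf if there is no such n.

    `path` is a finite observed segment (x_0, x_1, ..., x_m) of the chain,
    so math.inf only means "A was not hit within the observed segment".
    """
    for n in range(1, len(path)):
        if path[n] in A:
            return n
    return math.inf

# Hypothetical observed paths on the state space {0, 1, 2}
tau_1 = hitting_time([0, 1, 2, 1], {2})   # first visit to state 2 after time 0
tau_2 = hitting_time([2, 0, 0, 1], {2})   # starting in A does not count
```

The loop decides whether the hitting time equals \( n \) by looking only at `path[0]` through `path[n]`, which is the stopping time property in concrete form.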
The strong Markov property states that the future is independent of the past, given the present, when the present time is a stopping time. For a Markov chain, the ordinary Markov property implies the strong Markov property. Thus if \( \tau \) is a stopping time, then for any sequence of states \( (x_0, x_1, \ldots, x, y) \),
\[ \P(X_{\tau+1} = y \mid X_0 = x_0, X_1 = x_1, \ldots, X_\tau = x) = \P(X_{\tau+1} = y \mid X_\tau = x) \]The study of Markov chains is simplified by the use of operator notation that is analogous to operations on vectors and matrices. Suppose that \( U: S \times S \to \R \) and \( V: S \times S \to \R \) and that \( f: S \to \R \) and \( g: S \to \R \). Define \( U\,V: S \times S \to \R \), \( U\,f: S \to \R \), \( f\,U: S \to \R \), and \( f\,g \in \R \) as follows:
\[ \begin{align} (U\,V)(x,z) & = \sum_{y \in S} U(x, y)\,V(y, z), \quad (x, z) \in S \times S \\ (U\,f)(x) & = \sum_{y \in S} U(x, y)\,f(y), \quad x \in S \\ (f\,U)(y) & = \sum_{x \in S} f(x)\,U(x, y), \quad y \in S\\ f\,g & = \sum_{x \in S} f(x)\,g(x) \end{align} \]In all of the definitions, we assume that if \( S \) is infinite then either the terms in the sums are nonnegative or the sums converge absolutely, so that the order of the terms in the sums does not matter. We will often refer to a function \( U: S \times S \to \R \) as a matrix on \( S \). Indeed, if \( S \) is finite, then \( U \) really is a matrix in the usual sense, with rows and columns labeled by the elements of \( S \). The product \( U\,V \) is ordinary matrix multiplication; the product \( U\,f \) is the product of the matrix \( U \) and the column vector \( f \); the product \( f\,U \) is the product of the row vector \( f \) and the matrix \( U \); and the product \( f\,g \) is the product of \( f \) as a row vector and \( g \) as a column vector, or equivalently, the inner product of the vectors \( f \) and \( g \). The sum of two matrices on \( S \) or two functions on \( S \) is defined in the usual way, as is a real multiple of a matrix or function.
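For a finite state space, the four products can be written directly as sums. The following is a minimal sketch under stated assumptions: states are labeled \( 0, 1, \ldots, k - 1 \), a matrix on \( S \) is stored as a list of rows, and the helper names `mat_mat`, `mat_vec`, `vec_mat`, and `inner` are hypothetical. Exact fractions are used so that the results can be compared without rounding.

```python
from fractions import Fraction

def mat_mat(U, V):
    """(U V)(x, z) = sum over y of U(x, y) V(y, z)."""
    return [[sum(U[x][y] * V[y][z] for y in range(len(V)))
             for z in range(len(V[0]))] for x in range(len(U))]

def mat_vec(U, f):
    """(U f)(x) = sum over y of U(x, y) f(y): matrix times column vector."""
    return [sum(U[x][y] * f[y] for y in range(len(f))) for x in range(len(U))]

def vec_mat(f, U):
    """(f U)(y) = sum over x of f(x) U(x, y): row vector times matrix."""
    return [sum(f[x] * U[x][y] for x in range(len(f))) for y in range(len(U[0]))]

def inner(f, g):
    """f g = sum over x of f(x) g(x)."""
    return sum(a * b for a, b in zip(f, g))

# Hypothetical example on S = {0, 1}
U = [[Fraction(1, 2), Fraction(1, 2)], [Fraction(1, 4), Fraction(3, 4)]]
V = [[Fraction(0), Fraction(1)], [Fraction(1), Fraction(0)]]
f = [Fraction(1), Fraction(2)]
g = [Fraction(3), Fraction(4)]

UV = mat_mat(U, V)
Uf = mat_vec(U, f)
fU = vec_mat(f, U)
fg = inner(f, g)
```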
In the following exercises, suppose that \( U \), \( V \), and \( W \) are matrices on \( S \) and that \( f \) and \( g \) are functions on \( S \). Assume also that the sums exist.
The associative property holds whenever the operations make sense. In particular,
The distributive property holds whenever the operations make sense. In particular,
The commutative property does not hold in general. Give examples where
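For finite \( S \), the three properties can be spot-checked numerically. The sketch below uses hypothetical \( 2 \times 2 \) integer matrices; it illustrates one instance of each property and is not a proof.

```python
def mat_mat(U, V):
    """Ordinary matrix product of two matrices stored as lists of rows."""
    return [[sum(U[x][y] * V[y][z] for y in range(len(V)))
             for z in range(len(V[0]))] for x in range(len(U))]

def mat_add(U, V):
    """Entrywise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(row_u, row_v)] for row_u, row_v in zip(U, V)]

# Hypothetical matrices on S = {0, 1}
U = [[1, 2], [3, 4]]
V = [[0, 1], [1, 0]]
W = [[2, 0], [1, 1]]

# Associativity: U (V W) = (U V) W
assoc_holds = mat_mat(U, mat_mat(V, W)) == mat_mat(mat_mat(U, V), W)

# Distributivity: U (V + W) = U V + U W
distrib_holds = mat_mat(U, mat_add(V, W)) == mat_add(mat_mat(U, V), mat_mat(U, W))

# Commutativity fails here: U V swaps the columns of U, V U swaps the rows
commutes = mat_mat(U, V) == mat_mat(V, U)
```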
If \( U \) is a matrix on \( S \) we denote \( U^n = U \, U \cdots U \), the \( n \)-fold (matrix) power of \( U \) for \( n \in \N \). By convention, \( U^0 = I \), the identity matrix on \( S \), defined by
\[ I(x, y) = \bs{1}(x = y) = \begin{cases} 1, & x = y \\ 0, & x \ne y \end{cases}, \quad (x, y) \in S \times S \]If \( U \) is a matrix on \( S \) and \( A \subseteq S \), we will let \( U_A \) denote the restriction of \( U \) to \( A \times A \). Similarly, if \( f \) is a function on \( S \), then \( f_A \) denotes the restriction of \( f \) to \( A \).
If \( f: S \to [0, 1] \), then \( f \) is a discrete probability density function (or probability mass function) on \( S \) if
\[ \sum_{x \in S} f(x) = 1 \]We have studied these in detail! On the other hand, if \( P: S \times S \to [0, 1] \), then \( P \) is a transition probability matrix or stochastic matrix on \( S \) if
\[ \sum_{y \in S} P(x, y) = 1, \quad x \in S \]Thus, if \( P \) is a transition matrix, then \( y \mapsto P(x, y) \) is a probability density function on \( S \) for each \( x \in S \). In matrix terminology, the row sums are 1.
If \( P \) is a transition probability matrix on \( S \) then \( P \, \bs{1} = \bs{1} \) where \( \bs{1} \) is the constant function 1 on \( S \). Thus, in the language of linear algebra, \( \bs{1} \) is a right eigenvector of \( P \), corresponding to the eigenvalue 1.
Suppose that \( P \) and \( Q \) are transition probability matrices on \( S \) and that \( f \) is a probability density function on \( S \). Then
Suppose that \( f \) is the probability density function of a random variable \( X \) taking values in \( S \) and that \( g: S \to \R \). Then \( f\,g = \E[g(X)] \) (assuming that the expected value exists).
A function \( f: S \to \R \) is said to be left-invariant for a transition probability matrix \( P \) if \( f\,P = f \). In the language of linear algebra, \( f \) is a left eigenvector of \( P \) corresponding to the eigenvalue 1.
If \( f: S \to \R \) is left-invariant for \( P \) then \( f\,P^n = f \) for each \( n \in \N \).
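The two eigenvector statements, \( P \, \bs{1} = \bs{1} \) and \( f\,P^n = f \) for left-invariant \( f \), can be checked on a small example. The matrix `P` and the function `f` below are hypothetical choices on a three-point state space, and the helper names are assumptions made for this sketch.

```python
from fractions import Fraction

def mat_vec(U, f):
    """(U f)(x) = sum over y of U(x, y) f(y)."""
    return [sum(U[x][y] * f[y] for y in range(len(f))) for x in range(len(U))]

def vec_mat(f, U):
    """(f U)(y) = sum over x of f(x) U(x, y)."""
    return [sum(f[x] * U[x][y] for x in range(len(f))) for y in range(len(U[0]))]

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Hypothetical transition probability matrix on S = {0, 1, 2}
P = [[half, half, 0],
     [quarter, half, quarter],
     [0, half, half]]

one = [1, 1, 1]
P_one = mat_vec(P, one)          # the row sums of P

# Hypothetical probability density function, chosen so that f P = f
f = [quarter, half, quarter]

# Apply P repeatedly on the right: history[n - 1] holds f P^n
f_n = f
history = []
for n in range(1, 6):
    f_n = vec_mat(f_n, P)
    history.append(f_n)
```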
Suppose again that \( \bs{X} = (X_0, X_1, X_2, \ldots) \) is a Markov chain with state space \( S \). For \( m, \; n \in \N \) with \( m \le n \), let
\[ P_{m,n}(x, y) = \P(X_n = y \mid X_m = x), \quad (x, y) \in S \times S \]The matrix \( P_{m,n} \) is the transition probability matrix from time \( m \) to time \( n \). The result in the next exercise is known as the Chapman-Kolmogorov equation, named for Sydney Chapman and Andrei Kolmogorov. It gives the basic relationship between the transition matrices.
Suppose that \( k, \; m, \; n \in \N \) with \( k \le m \le n \). Then \( P_{k,m} P_{m,n} = P_{k,n} \).
This follows from basic properties of conditional probability and the Markov property.
It follows immediately that all of the transition probability matrices for \( \bs{X} \) can be obtained from the one-step transition probability matrices: if \( m, \; n \in \N \) with \( m \le n \) then
\[ P_{m,n} = P_{m,m+1} P_{m+1,m+2} \cdots P_{n-1,n} \]Suppose that \( m, \; n \in \N \) with \( m \le n \). If \( X_m \) has probability density function \( f_m \), then \( X_n \) has probability density function \( f_n = f_m P_{m,n} \).
Combining the last two results, it follows that the distribution of \( X_0 \) (the initial distribution) and the one-step transition matrices determine the distribution of \( X_n \) for each \( n \in \N \). Actually, these basic quantities determine the joint distributions of the process, a much stronger result.
Suppose that \( X_0 \) has probability density function \( f_0 \). For any sequence of states \( (x_0, x_1, \ldots, x_n) \),
\[ \P(X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n) = f_0(x_0) P_{0,1}(x_0, x_1) \cdots P_{n-1,n}(x_{n-1},x_n) \]Computations of this sort are the reason for the term chain in the name Markov chain.
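The chain formula for the joint distribution is consistent with the formula \( f_n = f_0 P_{0,n} \) for the marginal distribution: summing the joint probabilities over all paths that end in a given state recovers \( f_n \). The sketch below checks this for a hypothetical inhomogeneous chain on \( S = \{0, 1\} \) with two time steps; the names `one_step`, `f0`, and `path_probability` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical one-step matrices P_{0,1} and P_{1,2} on S = {0, 1}
one_step = [
    [[Fraction(1, 2), Fraction(1, 2)], [Fraction(1, 3), Fraction(2, 3)]],
    [[Fraction(1, 4), Fraction(3, 4)], [Fraction(1), Fraction(0)]],
]
f0 = [Fraction(1, 5), Fraction(4, 5)]   # hypothetical initial distribution

def path_probability(path):
    """f_0(x_0) P_{0,1}(x_0, x_1) ... P_{n-1,n}(x_{n-1}, x_n)."""
    prob = f0[path[0]]
    for k in range(len(path) - 1):
        prob *= one_step[k][path[k]][path[k + 1]]
    return prob

def vec_mat(f, U):
    """(f U)(y) = sum over x of f(x) U(x, y)."""
    return [sum(f[x] * U[x][y] for x in range(len(f))) for y in range(len(U[0]))]

n = len(one_step)
states = range(len(f0))

# Marginal of X_n obtained by summing the joint probabilities over all paths
marginal = [Fraction(0)] * len(f0)
for path in product(states, repeat=n + 1):
    marginal[path[-1]] += path_probability(path)

# Marginal of X_n obtained as f_0 P_{0,1} P_{1,2}
f_n = f0
for step_matrix in one_step:
    f_n = vec_mat(f_n, step_matrix)
```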
A Markov chain \( \bs{X} = (X_0, X_1, X_2, \ldots) \) is said to be time homogeneous if the transition matrix from time \( m \) to time \( n \) depends only on the difference \( n - m \) for \( m, \; n \in \N \) with \( m \le n \). That is,
\[ \P(X_n = y \mid X_m = x) = \P(X_{n-m} = y \mid X_0 = x), \quad (x, y) \in S \times S \]It follows that there is a single one-step transition probability matrix \( P \) given by
\[ P(x, y) = \P(X_{n+1} = y \mid X_n = x), \quad (x, y) \in S \times S \]and all other transition matrices can be expressed as powers of \( P \). Indeed if \( m, \; n \in \N \) with \( m \le n \) then \( P_{m,n} = P^{n-m} \), and the Chapman-Kolmogorov equation is simply the law of exponents for matrix powers. From Exercise 11, if \( X_m \) has probability density function \( f_m \) then \( f_n = f_m P^{n-m} \) is the probability density function of \( X_n \). In particular, if \( X_0 \) has probability density function \( f_0 \) then \( f_n = f_0 P^n\) is the probability density function of \( X_n \). The joint distribution in Exercise 12 above also simplifies: if \( X_0 \) has probability density function \( f_0 \), then for any sequence of states \( (x_0, x_1, \ldots, x_n) \),
\[ \P(X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n) = f_0(x_0) P(x_0, x_1) \cdots P(x_{n-1},x_n) \]Suppose that \( X_0 \) has probability density function \( f \) and that \( f \) is left-invariant for \( P \). Then \( X_n \) has probability density function \( f \) for each \( n \in \N \) and hence the sequence of random variables \( \bs{X} \) is identically distributed (although certainly not independent in general).
In the context of the previous exercise, the probability distribution on \( S \) associated with \( f \) is said to be invariant or stationary for \( P \) or for the Markov chain \( \bs{X} \). Stationary distributions turn out to be of fundamental importance in the study of the limiting behavior of Markov chains.
The assumption of time homogeneity is not as restrictive as might first appear. The following exercise shows that any Markov chain can be turned into a homogeneous chain by enlarging the state space with a time component.
Suppose that \( \bs{X} = (X_0, X_1, \ldots) \) is an inhomogeneous Markov chain with state space \( S \), and let \( Q_n(x, y) = \P(X_{n+1} = y \mid X_n = x) \) denote the one-step transition probability at time \( n \in \N \), for \( (x, y) \in S \times S \). Suppose that \( N_0 \) is a random variable taking values in \( \N \), independent of \( \bs{X} \). Let \( N_n = N_0 + n \) and let \( Y_n = (X_{N_n}, N_n) \) for \( n \in \N \). Then \( \bs{Y} = (Y_0, Y_1, Y_2, \ldots) \) is a homogeneous Markov chain on \( S \times \N \) with transition probability matrix \( P \) given by
\[ P[(x, m), (y, n)] = \begin{cases} Q_m(x, y), & n = m + 1 \\ 0, & n \ne m + 1 \end{cases}, \quad x, \; y \in S, \quad m, \; n \in \N \]From now on, unless otherwise noted, the term Markov chain will mean homogeneous Markov chain.
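The space-time construction can be sketched as a function on pairs (state, time). The family \( Q_m \) below is a hypothetical example on \( S = \{0, 1\} \), chosen only for illustration; since \( S \times \N \) is infinite, the transition matrix is represented as a function rather than as an array.

```python
from fractions import Fraction

def Q(m, x, y):
    """Hypothetical one-step probability Q_m(x, y) on S = {0, 1}:
    at time m the chain switches state with probability 1 / (m + 2)."""
    switch = Fraction(1, m + 2)
    return switch if x != y else 1 - switch

def P(state, next_state):
    """Transition probability of the space-time chain on S x N.

    The time component must advance by exactly one step; the state
    component then moves according to Q_m.
    """
    (x, m), (y, n) = state, next_state
    return Q(m, x, y) if n == m + 1 else Fraction(0)

S = (0, 1)

# Only the states (y, m + 1) can carry mass in the row of (x, m),
# so this finite sum is the full row sum for the state (0, 3)
row_sum = sum(P((0, 3), (y, 4)) for y in S)
```

Note that `P` itself has no dependence on an external clock: the time index has become part of the state, which is what makes the enlarged chain homogeneous.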
Suppose that \( \bs{X} = (X_0, X_1, X_2, \ldots) \) is a Markov chain with state space \( S \) and transition probability matrix \( P \). For fixed \( k \in \N \), \( (X_0, X_k, X_{2\,k}, \ldots) \) is a Markov chain on \( S \) with transition probability matrix \( P^k \).
The following exercise also uses the basic trick of enlarging the state space to turn a random process into a Markov chain.
Suppose that \( \bs{X} = (X_0, X_1, X_2, \ldots) \) is a random process on \( S \) in which the future depends stochastically on the last two states. Specifically, suppose that for any sequence of states \( (x_0, x_1, \ldots, x_{n-1}, x, y, z) \),
\[ \P(X_{n+2} = z \mid X_0 = x_0, X_1 = x_1, \ldots, X_{n-1} = x_{n-1}, X_n = x, X_{n+1} = y) = \P(X_{n+2} = z \mid X_n = x, X_{n+1} = y) \]We also assume that this probability is independent of \( n \), and we denote it by \( Q(x, y, z) \). Let \( Y_n = (X_n, X_{n+1}) \) for \( n \in \N \). Then \( \bs{Y} = (Y_0, Y_1, Y_2, \ldots) \) is a Markov chain on \( S \times S \) with transition probability matrix \( P \) given by
\[ P[(x, y), (w, z)] = \begin{cases} Q(x, y, z), & y = w \\ 0, & y \ne w \end{cases}, \quad x, \; y, \; z, \; w \in S \]The result in the last exercise generalizes in a completely straightforward way to the case where the future of the random process depends stochastically on the last \( k \) states, for some fixed \( k \in \N \).
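For a finite state space, the pair-chain matrix can be built explicitly from \( Q \). The sketch below uses a hypothetical second-order kernel on \( S = \{0, 1\} \) and checks that each row of the resulting matrix on \( S \times S \) sums to 1.

```python
from fractions import Fraction
from itertools import product

S = (0, 1)

def Q(x, y, z):
    """Hypothetical second-order kernel on S = {0, 1}: given the last two
    states x and y, the next state is 1 with probability (1 + x + y) / 4."""
    p_one = Fraction(1 + x + y, 4)
    return p_one if z == 1 else 1 - p_one

pairs = list(product(S, repeat=2))

# P[(x, y), (w, z)] = Q(x, y, z) if y == w, and 0 otherwise
P = {(a, b): (Q(a[0], a[1], b[1]) if a[1] == b[0] else Fraction(0))
     for a in pairs for b in pairs}

row_sums = {a: sum(P[(a, b)] for b in pairs) for a in pairs}
```

The condition `a[1] == b[0]` encodes the overlap between consecutive pairs: \( Y_n = (X_n, X_{n+1}) \) and \( Y_{n+1} = (X_{n+1}, X_{n+2}) \) must share the middle state.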
Suppose again that \( \bs{X} = (X_0, X_1, X_2, \ldots) \) is a Markov chain with state space \( S \) and transition probability matrix \( P \). The directed graph associated with \( \bs{X} \) has vertex set \( S \) and edge set \( \{(x, y) \in S^2: P(x, y) \gt 0\} \). That is, there is a directed edge from \( x \) to \( y \) if and only if state \( x \) leads to state \( y \) in one step. Note that the graph may well have loops, since a state can certainly lead back to itself in one step.
If \( A \) is a nonempty subset of \( S \) then
\[ P_A^n(x, y) = \P(X_1 \in A, X_2 \in A, \ldots, X_{n-1} \in A, X_n = y \mid X_0 = x), \quad (x, y) \in A \times A \]That is, \( P_A^n(x, y) \) is the probability of going from state \( x \) to \( y \) in \( n \) steps, remaining in \( A \) all the while. In graphical terms, it is the sum of products of probabilities along paths of length \( n \) from \( x \) to \( y \) that stay inside \( A \). Note that \( P_A^n \) means \( (P_A)^n \), not \( (P^n)_A \); in general these matrices are different.
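The difference between \( (P_A)^n \) and \( (P^n)_A \) shows up already for \( n = 2 \). In the hypothetical three-state example below, with \( A = \{0, 1\} \), the two matrices differ in the entry for \( (1, 1) \), because the path \( 1 \to 2 \to 1 \) leaves \( A \) and is counted only in \( (P^2)_A \).

```python
from fractions import Fraction

def mat_mat(U, V):
    """Ordinary matrix product of two matrices stored as lists of rows."""
    return [[sum(U[x][y] * V[y][z] for y in range(len(V)))
             for z in range(len(V[0]))] for x in range(len(U))]

def restrict(U, A):
    """Restriction of the matrix U to A x A, for a list A of state indices."""
    return [[U[x][y] for y in A] for x in A]

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Hypothetical transition probability matrix on S = {0, 1, 2}
P = [[half, half, 0],
     [quarter, half, quarter],
     [0, half, half]]
A = [0, 1]

restricted_then_squared = mat_mat(restrict(P, A), restrict(P, A))   # (P_A)^2
squared_then_restricted = restrict(mat_mat(P, P), A)                # (P^2)_A

# The gap in the (1, 1) entry is the probability of the path 1 -> 2 -> 1
gap = squared_then_restricted[1][1] - restricted_then_squared[1][1]
```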
Perhaps the simplest, non-trivial Markov chain has two states, say \( S = \{0, 1\} \) and the transition probability matrix given below, where \( p \in (0, 1) \) and \( q \in (0, 1) \) are parameters.
\[ P = \left[ \begin{matrix} 1 - p & p \\ q & 1 - q \end{matrix} \right] \]For \( n \in \N \),
\[ P^n = \frac{1}{p + q} \left[ \begin{matrix} q + p(1 - p - q)^n & p - p(1 - p - q)^n \\ q - q(1 - p - q)^n & p + q(1 - p - q)^n \end{matrix} \right] \]The eigenvalues of \( P \) are 1 and \( 1 - p - q \). Next, \( B^{-1} P B = D \) where
\[ B = \left[ \begin{matrix} 1 & - p \\ 1 & q \end{matrix} \right], \quad D = \left[ \begin{matrix} 1 & 0 \\ 0 & 1 - p - q \end{matrix} \right] \]Hence \( P^n = B D^n B^{-1} \).
As \( n \to \infty \),
\[ P^n \to \frac{1}{p + q} \left[ \begin{matrix} q & p \\ q & p \end{matrix} \right] \]The only invariant probability density function for the chain is
\[ f = \left( \frac{q}{p + q}, \frac{p}{p + q} \right) \]In spite of its simplicity, the two state chain illustrates some of the basic limiting behavior and the connection with invariant distributions that we will study in general in a later section.
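The closed form for \( P^n \) and the invariance of \( f \) can be confirmed with exact arithmetic for particular parameter values. The values \( p = \frac{1}{4} \) and \( q = \frac{1}{2} \) below are hypothetical choices; the check compares the formula with repeated matrix multiplication for the first few \( n \).

```python
from fractions import Fraction

def mat_mat(U, V):
    """Ordinary matrix product of two matrices stored as lists of rows."""
    return [[sum(U[x][y] * V[y][z] for y in range(len(V)))
             for z in range(len(V[0]))] for x in range(len(U))]

p, q = Fraction(1, 4), Fraction(1, 2)    # hypothetical parameter values
P = [[1 - p, p], [q, 1 - q]]
r = 1 - p - q                             # the eigenvalue other than 1

def closed_form(n):
    """The stated formula for P^n."""
    c = 1 / (p + q)
    return [[c * (q + p * r**n), c * (p - p * r**n)],
            [c * (q - q * r**n), c * (p + q * r**n)]]

# Compare with direct multiplication, starting from P^0 = I
power = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]]
matches = []
for n in range(8):
    matches.append(power == closed_form(n))
    power = mat_mat(power, P)

# The invariant probability density function and the product f P
f = [q / (p + q), p / (p + q)]
f_P = [f[0] * P[0][0] + f[1] * P[1][0], f[0] * P[0][1] + f[1] * P[1][1]]
```

Since \( |1 - p - q| \lt 1 \) when \( p, \, q \in (0, 1) \), the terms involving \( (1 - p - q)^n \) vanish as \( n \to \infty \), and both rows of \( P^n \) converge to \( f \).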
Suppose that \( \bs{X} = (X_0, X_1, X_2, \ldots) \) is a sequence of independent random variables taking values in a countable set \( S \), and that \( (X_1, X_2, \ldots) \) are identically distributed with (discrete) probability density function \( f \).
\( \bs{X} \) is a Markov chain on \( S \) with transition probability matrix \( P \) given by \( P(x, y) = f(y) \) for \( (x, y) \in S \times S \). Also, \( f \) is invariant for \( P \).
As a Markov chain, the process \( \bs{X} \) is not very interesting, although it is very interesting in other ways. Suppose now that \( S = \Z \), the set of integers, and consider the partial sum process (or random walk) \( \bs{Y} \) associated with \( \bs{X} \):
\[ Y_n = \sum_{i=0}^n X_i \]\( \bs{Y} \) is a Markov chain on \( \Z \) with transition probability matrix \( Q \) given by \( Q(x, y) = f(y - x) \) for \( (x, y) \in \Z \times \Z \).
Consider the special case where \( f(1) = p \) and \( f(-1) = 1 - p \), where \( p \in (0, 1) \). The transition probability matrix is given by
\[ Q(x, x - 1) = 1 - p, \; Q(x, x + 1) = p, \quad x \in \Z \]This special case is the simple random walk on \( \Z \). When \( p = \frac{1}{2} \) we have the simple, symmetric random walk. Simple random walks are studied in more detail in the chapter on Bernoulli Trials.
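The \( n \)-step transition probabilities of the simple random walk can be computed by applying \( Q \) repeatedly to a point mass. The sketch below starts the walk at state 0 and compares the result with the binomial distribution of the number of steps to the right, a standard fact that is not derived in this section. The values of `p` and `n` are hypothetical.

```python
from fractions import Fraction
from math import comb

def step(dist, p):
    """Apply one step of the simple random walk to a distribution.

    `dist` maps states to probabilities; the result is the row vector
    dist Q, where Q(x, x + 1) = p and Q(x, x - 1) = 1 - p.
    """
    new = {}
    for x, prob in dist.items():
        new[x + 1] = new.get(x + 1, 0) + prob * p
        new[x - 1] = new.get(x - 1, 0) + prob * (1 - p)
    return new

p = Fraction(1, 3)
n = 4

# Start from a point mass at 0 and take n steps
dist = {0: Fraction(1)}
for _ in range(n):
    dist = step(dist, p)

# With k steps to the right out of n, the walk ends at 2k - n
binomial = {2 * k - n: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
```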
A matrix \( P \) on \( S \) is doubly stochastic if it is nonnegative and if the row and column sums are 1:
\[ \sum_{u \in S} P(x, u) = 1, \; \sum_{u \in S} P(u, y) = 1, \quad (x, y) \in S \times S \]Suppose that \( \bs{X} \) is a Markov chain on a finite state space \( S \) with doubly stochastic transition probability matrix \( P \). Then the uniform distribution on \( S \) is invariant.
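A quick check of this result on a hypothetical doubly stochastic matrix that is not symmetric: the product of the uniform density and \( P \) is again uniform.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Hypothetical doubly stochastic (but not symmetric) matrix on S = {0, 1, 2}
P = [[half, half, 0],
     [0, half, half],
     [half, 0, half]]
k = len(P)

row_sums = [sum(P[x][u] for u in range(k)) for x in range(k)]
col_sums = [sum(P[u][y] for u in range(k)) for y in range(k)]

# Uniform density on S and the row-vector product (uniform P)
uniform = [Fraction(1, k)] * k
uniform_P = [sum(uniform[x] * P[x][y] for x in range(k)) for y in range(k)]
```

Each entry of `uniform_P` is \( 1 / k \) times the corresponding column sum, which is why column sums of 1 are exactly what invariance of the uniform distribution requires.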
A matrix \( P \) on \( S \) is symmetric if \( P(x, y) = P(y, x) \) for all \( (x, y) \in S \times S \).
The Markov chains in the following exercises model important processes that are studied in separate sections. We will refer to these chains frequently.
Read the introduction to the Ehrenfest chains.
Read the introduction to the Bernoulli-Laplace chain.
Read the introduction to the reliability chains.
Read the introduction to the branching chain.
Read the introduction to the queuing chains.
Read the introduction to random walks on graphs.
Read the introduction to birth-death chains.