As usual, our starting point is a random experiment with probability measure $\mathbb{P}$ on a sample space $\Omega$. Suppose that $X$ is a random variable taking values in a set $S$ and that $Y$ is a random variable taking values in $T \subseteq \mathbb{R}$. In this section, we will study the conditional expected value of $Y$ given $X$, a concept of fundamental importance in probability. As we will see, the expected value of $Y$ given $X$ is the function of $X$ that best approximates $Y$ in the mean square sense. Note that in general, $X$ will be vector-valued. In this section, we will assume that all real-valued random variables occurring in expected values have finite second moment.
Note that we can think of $(X, Y)$ as a random variable that takes values in a subset of $S \times T$. Suppose first that $(X, Y)$ has a (joint) continuous distribution with probability density function $f$. Recall that the (marginal) probability density function $g$ of $X$ is given by
\[ g(x) = \int_T f(x, y) \, dy, \quad x \in S \]
and that the conditional probability density function of $Y$ given $X = x$ is given by
\[ h(y \mid x) = \frac{f(x, y)}{g(x)}, \quad y \in T \]
Thus, the conditional expected value of $Y$ given $X = x$ is simply the mean computed relative to the conditional distribution:
\[ \mathbb{E}(Y \mid X = x) = \int_T y \, h(y \mid x) \, dy \]
Of course, the conditional mean of $Y$ depends on the given value $x$ of $X$. Temporarily, let $v$ denote the function from $S$ into $\mathbb{R}$ defined by
\[ v(x) = \mathbb{E}(Y \mid X = x), \quad x \in S \]
The function $v$ is sometimes referred to as the regression function of $Y$ based on $X$. The random variable $v(X)$ is called the conditional expected value of $Y$ given $X$ and is denoted $\mathbb{E}(Y \mid X)$. The results and definitions above would be exactly the same if $(X, Y)$ has a joint discrete distribution, except that sums would replace the integrals. Intuitively, we treat $X$ as known, and therefore not random, and we then average $Y$ with respect to the probability distribution that remains.
The random variable $\mathbb{E}(Y \mid X)$ satisfies a critical property that characterizes it among all functions of $X$.
Suppose that $r$ is a function from $S$ into $\mathbb{R}$. Use the change of variables theorem for expected value to show that
\[ \mathbb{E}[r(X) \, \mathbb{E}(Y \mid X)] = \mathbb{E}[r(X) \, Y] \]
In fact, the result in Exercise 1 can be used as a definition of conditional expected value, regardless of the type of the distribution of $(X, Y)$. Thus, generally we define $\mathbb{E}(Y \mid X)$ to be the random variable that satisfies the condition in Exercise 1 and is of the form $v(X)$ for some function $v$ from $S$ into $\mathbb{R}$. Then we define $\mathbb{E}(Y \mid X = x)$ to be $v(x)$ for $x \in S$. (More technically, $\mathbb{E}(Y \mid X)$ is required to be measurable with respect to $X$.)
Our first consequence of Exercise 1 is a formula for computing the expected value of $Y$.
By taking $r$ to be the constant function 1 in Exercise 1, show that
\[ \mathbb{E}[\mathbb{E}(Y \mid X)] = \mathbb{E}(Y) \]
Aside from the theoretical interest, the result in Exercise 2 is often a good way to compute $\mathbb{E}(Y)$ when we know the conditional distribution of $Y$ given $X$. We say that we are computing the expected value of $Y$ by conditioning on $X$.
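As a small check of computing an expected value by conditioning, the sketch below works with exact fractions. The setup is an assumption made only for illustration: two fair dice, with $X$ the minimum score and $Y$ the sum of the scores. It computes $\mathbb{E}[\mathbb{E}(Y \mid X)]$ by weighting each conditional mean by $\mathbb{P}(X = x)$ and compares the result with $\mathbb{E}(Y)$ computed directly.

```python
from fractions import Fraction
from itertools import product

# Assumed example: two fair dice, X = minimum score, Y = sum of scores.
outcomes = list(product(range(1, 7), repeat=2))
prob = Fraction(1, len(outcomes))  # each of the 36 outcomes is equally likely


def conditional_mean(x):
    """E(Y | X = x): average of the sum over outcomes whose minimum is x."""
    matching = [a + b for a, b in outcomes if min(a, b) == x]
    return Fraction(sum(matching), len(matching))


def marginal(x):
    """P(X = x) for the minimum score."""
    return sum(prob for a, b in outcomes if min(a, b) == x)


# E[E(Y | X)]: weight each conditional mean by the probability of X = x.
mean_by_conditioning = sum(marginal(x) * conditional_mean(x) for x in range(1, 7))

# E(Y) computed directly from the joint distribution.
mean_direct = sum(prob * (a + b) for a, b in outcomes)

print(mean_by_conditioning, mean_direct)
```

Both quantities should agree exactly, since the arithmetic is done with rational numbers rather than floating point.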
The next exercise shows that the condition in Exercise 1 characterizes $\mathbb{E}(Y \mid X)$.
Suppose that $u(X)$ and $v(X)$ satisfy the condition in Exercise 1 and hence also the results in Exercise 2 and Exercise 3. Show that $u(X)$ and $v(X)$ are equivalent:
\[ \mathbb{P}[u(X) = v(X)] = 1 \]
Suppose that $s: S \to \mathbb{R}$. Use the characterization in Exercise 1 to show that
\[ \mathbb{E}[s(X) \, Y \mid X] = s(X) \, \mathbb{E}(Y \mid X) \]
This result makes intuitive sense: if we know $X$, then we know any (deterministic) function of $X$. Any such function acts like a constant in terms of the conditional expected value with respect to $X$. The following rule generalizes this result and is sometimes referred to as the substitution rule for conditional expected value.
Suppose that $s: S \times T \to \mathbb{R}$. Show that
\[ \mathbb{E}[s(X, Y) \mid X = x] = \mathbb{E}[s(x, Y) \mid X = x] \]
In particular, it follows from Exercise 5 or Exercise 6 that $\mathbb{E}[s(X) \mid X] = s(X)$. At the opposite extreme, we have the result in the next exercise. If $X$ and $Y$ are independent, then knowledge of $X$ gives no information about $Y$ and so the conditional expected value with respect to $X$ is the same as the ordinary (unconditional) expected value of $Y$.
Suppose that $X$ and $Y$ are independent. Use the characterization in Exercise 1 to show that
\[ \mathbb{E}(Y \mid X) = \mathbb{E}(Y) \]
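Use the general definition to establish the properties in the following exercises, where $Y$ and $Z$ are real-valued random variables and $c$ is a constant. Note that these are analogues of basic properties of ordinary expected value. Every type of expected value must satisfy two critical properties: linearity and monotonicity.
Show that $\mathbb{E}(Y + Z \mid X) = \mathbb{E}(Y \mid X) + \mathbb{E}(Z \mid X)$.
Show that $\mathbb{E}(c \, Y \mid X) = c \, \mathbb{E}(Y \mid X)$.
Show that if $Y \ge 0$ with probability 1, then $\mathbb{E}(Y \mid X) \ge 0$ with probability 1.
Show that if $Y \le Z$ with probability 1, then $\mathbb{E}(Y \mid X) \le \mathbb{E}(Z \mid X)$ with probability 1.
Show that $|\mathbb{E}(Y \mid X)| \le \mathbb{E}(|Y| \mid X)$ with probability 1.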
Suppose now that $Z$ is real-valued and that $X$ and $Y$ are random variables (all defined on the same probability space, of course). The following exercise gives a consistency condition of sorts. Iterated conditional expected values reduce to a single conditional expected value with respect to the minimum amount of information:
Show that
\[ \mathbb{E}[\mathbb{E}(Z \mid X, Y) \mid X] = \mathbb{E}(Z \mid X) = \mathbb{E}[\mathbb{E}(Z \mid X) \mid X, Y] \]
The conditional probability of an event $A$, given random variable $X$, is a special case of the conditional expected value. As usual, let $\mathbf{1}_A$ denote the indicator random variable of $A$. We define
\[ \mathbb{P}(A \mid X) = \mathbb{E}(\mathbf{1}_A \mid X) \]
The properties above for conditional expected value, of course, have special cases for conditional probability.
Show that $\mathbb{P}(A) = \mathbb{E}[\mathbb{P}(A \mid X)]$.
Again, the result in the previous exercise is often a good way to compute $\mathbb{P}(A)$ when we know the conditional probability of $A$ given $X$. We say that we are computing the probability of $A$ by conditioning on $X$. This is a very compact and elegant version of the conditioning result given first in the section on Conditional Probability in the chapter on Probability Spaces and later in the section on Discrete Distributions in the chapter on Distributions.
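The next two exercises show that, of all functions of $X$, $\mathbb{E}(Y \mid X)$ is the best predictor of $Y$, in the sense of minimizing the mean square error. This is fundamentally important in statistical problems where the predictor vector $X$ can be observed but not the response variable $Y$.
Let $v(X) = \mathbb{E}(Y \mid X)$ and let $u: S \to \mathbb{R}$. By adding and subtracting $v(X)$, expanding, and using the result of Exercise 3, show that
\[ \mathbb{E}\left([Y - u(X)]^2\right) = \mathbb{E}\left([Y - v(X)]^2\right) + \mathbb{E}\left([v(X) - u(X)]^2\right) \]
Use the result of the last exercise to show that if $u: S \to \mathbb{R}$, then
\[ \mathbb{E}\left([\mathbb{E}(Y \mid X) - Y]^2\right) \le \mathbb{E}\left([u(X) - Y]^2\right) \]
and equality holds if and only if $u(X) = \mathbb{E}(Y \mid X)$ with probability 1.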
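The mean square error decomposition can be checked exactly on a small example. The sketch below assumes, purely for illustration, that $X$ is the score on the first of two fair dice and $Y$ is the sum of both scores, so that the regression function is $v(x) = x + 7/2$. The competing predictor $u(x) = 2x$ is an arbitrary choice, not one taken from the text.

```python
from fractions import Fraction
from itertools import product

# Assumed example: X = first die, Y = sum of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
prob = Fraction(1, 36)


def expect(g):
    """Expected value of g(x, y), where x is the first die and y is the sum."""
    return sum(prob * g(a, a + b) for a, b in outcomes)


def v(x):
    """Regression function E(Y | X = x) = x + 7/2 (second die is independent of the first)."""
    return x + Fraction(7, 2)


def u(x):
    """An arbitrary competing predictor, used only for comparison."""
    return 2 * x


mse_v = expect(lambda x, y: (y - v(x)) ** 2)  # error of the conditional mean
mse_u = expect(lambda x, y: (y - u(x)) ** 2)  # error of the competitor
gap = expect(lambda x, y: (v(x) - u(x)) ** 2)  # mean square distance between predictors

print(mse_v, mse_u, gap)
```

The error of the competitor should equal the error of the conditional mean plus the mean square distance between the two predictors, so the conditional mean can never do worse.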
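Suppose now that $X$ is real-valued. In the section on covariance and correlation, we found that the best linear predictor of $Y$ based on $X$ is
\[ L(Y \mid X) = \mathbb{E}(Y) + \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} [X - \mathbb{E}(X)] \]
On the other hand, $\mathbb{E}(Y \mid X)$ is the best predictor of $Y$ among all functions of $X$. It follows that if $\mathbb{E}(Y \mid X)$ happens to be a linear function of $X$ then it must be the case that
\[ \mathbb{E}(Y \mid X) = L(Y \mid X) \]
Show that
\[ \operatorname{cov}[X, \mathbb{E}(Y \mid X)] = \operatorname{cov}(X, Y) \]
Show directly that if $\mathbb{E}(Y \mid X) = a X + b$ then
\[ a = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}, \quad b = \mathbb{E}(Y) - a \, \mathbb{E}(X) \]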
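The conditional variance of $Y$ given $X$ is naturally defined as follows:
\[ \operatorname{var}(Y \mid X) = \mathbb{E}\left([Y - \mathbb{E}(Y \mid X)]^2 \mid X\right) \]
Show that $\operatorname{var}(Y \mid X) = \mathbb{E}(Y^2 \mid X) - [\mathbb{E}(Y \mid X)]^2$.
Show that $\operatorname{var}(Y) = \mathbb{E}[\operatorname{var}(Y \mid X)] + \operatorname{var}[\mathbb{E}(Y \mid X)]$.
Again, the result in the previous exercise is often a good way to compute $\operatorname{var}(Y)$ when we know the conditional distribution of $Y$ given $X$. We say that we are computing the variance of $Y$ by conditioning on $X$.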
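Computing a variance by conditioning can be illustrated with exact arithmetic. The sketch below again assumes two fair dice with $X$ the minimum score and $Y$ the sum, which is a choice made only for illustration. It computes the mean of the conditional variance and the variance of the conditional mean, then compares their total with $\operatorname{var}(Y)$ computed directly.

```python
from fractions import Fraction
from itertools import product

# Assumed example: two fair dice, X = minimum score, Y = sum of scores.
outcomes = list(product(range(1, 7), repeat=2))


def cond_moments(x):
    """Return P(X = x), E(Y | X = x), and var(Y | X = x)."""
    sums = [a + b for a, b in outcomes if min(a, b) == x]
    p = Fraction(len(sums), len(outcomes))
    mean = Fraction(sum(sums), len(sums))
    var = Fraction(sum(s * s for s in sums), len(sums)) - mean ** 2
    return p, mean, var


rows = [cond_moments(x) for x in range(1, 7)]

mean_y = sum(p * m for p, m, _ in rows)  # E[E(Y | X)] = E(Y)
mean_of_cond_var = sum(p * cv for p, _, cv in rows)  # E[var(Y | X)]
var_of_cond_mean = sum(p * (m - mean_y) ** 2 for p, m, _ in rows)  # var[E(Y | X)]

# var(Y) computed directly from the 36 equally likely outcomes.
var_y = Fraction(sum((a + b) ** 2 for a, b in outcomes), 36) - mean_y ** 2

print(var_y, mean_of_cond_var + var_of_cond_mean)
```

The two printed values should coincide, reflecting the split of the total variance into a within-group part and a between-group part.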
Let us return to the study of predictors of the real-valued random variable $Y$, and compare the three predictors we have studied in terms of mean square error.
Suppose that has probability density function .
Suppose that has probability density function .
Suppose that has probability density function .
Suppose that has probability density function .
In the bivariate uniform experiment, select the square in the list box. Run the simulation 2000 times, updating every 10 runs. Note the relationship between the cloud of points and the graph of the regression function.
In the bivariate uniform experiment, select the triangle in the list box. Run the simulation 2000 times, updating every 10 runs. Note the relationship between the cloud of points and the graph of the regression function.
Suppose that is uniformly distributed on , and that given , is uniformly distributed on . Find each of the following:
A pair of fair dice are thrown, and the scores recorded. Let denote the sum of the scores and the minimum score. Find each of the following:
A box contains 10 coins, labeled 0 to 9. The probability of heads for coin is . A coin is chosen at random from the box and tossed. Find the probability of heads. This problem is an example of Laplace's rule of succession.
Suppose that $X = (X_1, X_2, \ldots)$ is a sequence of independent and identically distributed real-valued random variables. We will denote the common mean, variance, and moment generating function, respectively, by $\mu = \mathbb{E}(X_i)$, $\sigma^2 = \operatorname{var}(X_i)$, and $G(t) = \mathbb{E}[\exp(t X_i)]$. Let
\[ Y_n = \sum_{i=1}^n X_i, \quad n \in \mathbb{N} \]
so that $(Y_0, Y_1, \ldots)$ is the partial sum process associated with $X$. Suppose now that $N$ is a random variable taking values in $\mathbb{N}$, independent of $X$. Then
\[ Y_N = \sum_{i=1}^N X_i \]
is a random sum of random variables; the terms in the sum are random, and the number of terms is random. This type of variable occurs in many different contexts. For example, $N$ might represent the number of customers who enter a store in a given period of time, and $X_i$ the amount spent by customer $i$.
Show that
\[ \mathbb{E}(Y_N \mid N) = N \mu, \quad \mathbb{E}(Y_N) = \mathbb{E}(N) \, \mu \]
Wald's equation, named for Abraham Wald, is a generalization of the result in the previous exercise to the case where $N$ is not necessarily independent of $X$, but rather is a stopping time for $X$. Wald's equation is discussed in the section on Partial Sums in the chapter on Random Samples.
Show that
\[ \operatorname{var}(Y_N \mid N) = N \sigma^2, \quad \operatorname{var}(Y_N) = \mathbb{E}(N) \, \sigma^2 + \operatorname{var}(N) \, \mu^2 \]
Let $H$ denote the probability generating function of $N$. Show that the moment generating function of $Y_N$ is $H \circ G$.
In the die-coin experiment, a fair die is rolled and then a fair coin is tossed the number of times showing on the die. Let $N$ denote the die score and $Y$ the number of heads.
Run the die-coin experiment 1000 times, updating every 10 runs. Note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.
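The distribution mean and variance in the die-coin experiment can be computed exactly and compared with the values obtained by conditioning on the die score. In the sketch below, the names `N` (die score) and `Y` (number of heads) in the comments are labels chosen for this illustration. The joint probability of a die score $n$ and $y$ heads is $\frac{1}{6} \binom{n}{y} \left(\frac{1}{2}\right)^n$.

```python
from fractions import Fraction
from math import comb

half = Fraction(1, 2)


def joint(n, y):
    """P(N = n, Y = y): a fair die shows n, then y heads occur in n fair tosses."""
    return Fraction(1, 6) * comb(n, y) * half ** n


pairs = [(n, y) for n in range(1, 7) for y in range(n + 1)]

# Mean and variance of the number of heads, directly from the joint distribution.
mean_heads = sum(joint(n, y) * y for n, y in pairs)
var_heads = sum(joint(n, y) * y * y for n, y in pairs) - mean_heads ** 2

# Conditioning on the die score: E(Y | N) = N / 2 and var(Y | N) = N / 4.
mean_n = Fraction(7, 2)  # mean of a fair die
var_n = Fraction(35, 12)  # variance of a fair die
mean_by_conditioning = mean_n / 2
var_by_conditioning = mean_n / 4 + var_n / 4  # E[var(Y | N)] + var[E(Y | N)]

print(mean_heads, var_heads)
```

The direct and conditional computations should agree, giving reference values for the empirical mean and standard deviation observed in the simulation.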
The number of customers entering a store in a given hour is a random variable with mean 20 and standard deviation 3. Each customer, independently of the others, spends a random amount of money with mean $50 and standard deviation $5. Find the mean and standard deviation of the amount of money spent during the hour.
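One way to approach the store problem is to treat the total spent as a random sum and condition on the number of customers. The sketch below assumes that the number of customers is independent of the amounts spent, which the problem statement suggests but does not state outright. The helper name `random_sum_moments` is hypothetical. It uses mean $= \mathbb{E}(N) \mu$ and variance $= \mathbb{E}(N) \sigma^2 + \operatorname{var}(N) \mu^2$, where $\mu$ and $\sigma$ are the mean and standard deviation of a single purchase.

```python
from math import sqrt


def random_sum_moments(mean_n, sd_n, mean_x, sd_x):
    """Mean and standard deviation of a sum of N i.i.d. terms, assuming N is independent of the terms."""
    mean = mean_n * mean_x
    # E[var(sum | N)] + var[E(sum | N)]
    variance = mean_n * sd_x ** 2 + sd_n ** 2 * mean_x ** 2
    return mean, sqrt(variance)


mean_total, sd_total = random_sum_moments(mean_n=20, sd_n=3, mean_x=50, sd_x=5)
print(mean_total, round(sd_total, 2))
```

Under these assumptions the mean is \$1000 and the variance is 23000, so the standard deviation is a little under \$152. Most of the variance comes from the uncertainty in the number of customers rather than from the individual purchases.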
A coin has a random probability of heads and is tossed a random number of times . Suppose that is uniformly distributed on ; has the Poisson distribution with parameter ; and and are independent. Let denote the number of heads. Compute the following:
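Suppose that $X = (X_1, X_2, \ldots)$ is a sequence of real-valued random variables. Let $\mu_i = \mathbb{E}(X_i)$, $\sigma_i^2 = \operatorname{var}(X_i)$, and $M_i(t) = \mathbb{E}[\exp(t X_i)]$ for $i \in \mathbb{N}_+$. Suppose also that $N$ is a random variable taking values in $\mathbb{N}_+$, independent of $X$. Denote the probability density function of $N$ by $p_i = \mathbb{P}(N = i)$ for $i \in \mathbb{N}_+$. The distribution of the random variable $X_N$ is a mixture of the distributions of $X_1, X_2, \ldots$, with the distribution of $N$ as the mixing distribution.
Show that
\[ \mathbb{E}(X_N \mid N) = \mu_N \]
Show that $\mathbb{E}(X_N) = \sum_{i=1}^\infty p_i \mu_i$.
Show that $\operatorname{var}(X_N) = \sum_{i=1}^\infty p_i (\sigma_i^2 + \mu_i^2) - \left(\sum_{i=1}^\infty p_i \mu_i\right)^2$.
Show that
\[ \mathbb{E}[\exp(t X_N)] = \sum_{i=1}^\infty p_i M_i(t) \]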
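The mean and variance of a mixture follow from conditioning on the mixing variable: the mean is the weighted average of the component means, and the second moment is the weighted average of the component second moments. The sketch below illustrates this with a hypothetical two-component mixture. The function name `mixture_moments` and the numerical values are assumptions made only for illustration.

```python
from fractions import Fraction


def mixture_moments(p, mu, sigma2):
    """Mean and variance of a mixture with weights p, component means mu, and component variances sigma2."""
    mean = sum(pi * mi for pi, mi in zip(p, mu))
    # The second moment of each component is its variance plus its squared mean.
    second = sum(pi * (si + mi ** 2) for pi, mi, si in zip(p, mu, sigma2))
    return mean, second - mean ** 2


# Hypothetical two-component mixture, chosen only for illustration.
p = [Fraction(1, 4), Fraction(3, 4)]
mu = [Fraction(0), Fraction(4)]
sigma2 = [Fraction(1), Fraction(2)]

mean, variance = mixture_moments(p, mu, sigma2)
print(mean, variance)
```

The same variance can be obtained as the weighted average of the component variances plus the variance of the component means under the mixing distribution.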
In the coin-die experiment, a biased coin is tossed with probability of heads . If the coin lands tails, a fair die is rolled; if the coin lands heads, an ace-six flat die is rolled (faces 1 and 6 have probability each, and faces 2, 3, 4, 5 have probability each). Find the mean and standard deviation of the die score.
Run the coin-die experiment 1000 times, updating every 10 runs. Note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.
Conditional expectation can be interpreted in terms of vector space concepts. This connection can help illustrate many of the properties of conditional expectation from a different point of view.
Recall that the vector space $\mathcal{L}_2$ consists of all real-valued random variables defined on a fixed sample space $\Omega$ (that is, relative to the same random experiment) that have finite second moment. Recall that two random variables are equivalent if they are equal with probability 1. We consider two such random variables as the same vector, so that technically, our vector space consists of equivalence classes under this equivalence relation. The addition operator corresponds to the usual addition of two real-valued random variables, and the operation of scalar multiplication corresponds to the usual multiplication of a real-valued random variable by a real (non-random) number.
Recall also that $\mathcal{L}_2$ is an inner product space, with inner product given by
\[ \langle U, V \rangle = \mathbb{E}(U V) \]
Suppose now that $X$ is a general random variable defined on the sample space $\Omega$, and that $Y$ is a real-valued random variable in $\mathcal{L}_2$.
Show that the set below is a subspace of $\mathcal{L}_2$:
\[ \mathcal{U} = \{U \in \mathcal{L}_2 : U = u(X) \text{ for some real-valued function } u\} \]
Reconsider Exercise 3 to show that $\mathbb{E}(Y \mid X)$ is the projection of $Y$ onto the subspace $\mathcal{U}$.
Suppose now that $X \in \mathcal{L}_2$. Recall that the set
\[ \mathcal{W} = \{W \in \mathcal{L}_2 : W = a X + b \text{ for some } a, b \in \mathbb{R}\} \]
is also a subspace of $\mathcal{L}_2$, and in fact is clearly also a subspace of $\mathcal{U}$. We showed that the best linear predictor $L(Y \mid X)$ is the projection of $Y$ onto $\mathcal{W}$.