As usual, our starting point is a random experiment with an underlying sample space and a probability measure $\mathbb{P}$. In the basic statistical model, we have an observable random variable $\boldsymbol{X}$ taking values in a set $S$. In general, $\boldsymbol{X}$ can have quite a complicated structure. For example, if the experiment is to sample $n$ objects from a population and record various measurements of interest, then
$$\boldsymbol{X} = (X_1, X_2, \ldots, X_n)$$
where $X_i$ is the vector of measurements for the $i$th object. The most important special case occurs when $X_1, X_2, \ldots, X_n$ are independent and identically distributed. In this case, we have a random sample of size $n$ from the common distribution.
The purpose of this section is to define and discuss the basic concepts of statistical hypothesis testing. Collectively, these concepts are sometimes referred to as the Neyman-Pearson framework, in honor of Jerzy Neyman and Egon Pearson, who first formalized them.
A statistical hypothesis is a statement about the distribution of the data variable $\boldsymbol{X}$. Equivalently, a statistical hypothesis specifies a set of possible distributions of $\boldsymbol{X}$ (namely, the set of distributions for which the statement is true). In hypothesis testing, the goal is to determine whether there is sufficient statistical evidence to reject a presumed null hypothesis in favor of a conjectured alternative hypothesis. The null hypothesis is usually denoted $H_0$ while the alternative hypothesis is usually denoted $H_1$. A hypothesis that specifies a single distribution for $\boldsymbol{X}$ is called simple; a hypothesis that specifies more than one distribution for $\boldsymbol{X}$ is called composite.
A hypothesis test is a statistical decision; the conclusion will either be to reject the null hypothesis in favor of the alternative, or to fail to reject the null hypothesis. The decision that we make must, of course, be based on the data vector $\boldsymbol{X}$. Thus, we will find a subset $R$ of the sample space $S$ and reject $H_0$ if and only if $\boldsymbol{X} \in R$. The set $R$ is known as the rejection region or the critical region. Note the asymmetry between the null and alternative hypotheses. This asymmetry arises because we assume the null hypothesis, in a sense, and then see if there is sufficient evidence in $\boldsymbol{X}$ to overturn this assumption in favor of the alternative.
Often, the critical region is defined in terms of a statistic $w(\boldsymbol{X})$, known as a test statistic. As usual, the use of a statistic allows data reduction when the dimension of the statistic is much smaller than the dimension of the data vector.
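The ultimate decision may be correct or may be in error. There are two types of errors, depending on which of the hypotheses is actually true:
1. A type 1 error is rejecting the null hypothesis when it is true.
2. A type 2 error is failing to reject the null hypothesis when it is false.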
Similarly, there are two ways to make a correct decision: we could reject the null hypothesis when it is false or we could fail to reject the null hypothesis when it is true. The possibilities are summarized in the following table:
| State / Decision | Fail to reject $H_0$ | Reject $H_0$ |
|---|---|---|
| $H_0$ true | Correct | Type 1 error |
| $H_0$ false | Type 2 error | Correct |
If $H_0$ is true (that is, the distribution of $\boldsymbol{X}$ is specified by $H_0$), then $\mathbb{P}(\boldsymbol{X} \in R)$ is the probability of a type 1 error for this distribution. If $H_0$ is composite, then $H_0$ specifies a variety of different distributions for $\boldsymbol{X}$ and thus there is a set of type 1 error probabilities. The maximum probability of a type 1 error is known as the significance level of the test or the size of the critical region, which we will denote by $\alpha$. Usually, the rejection region is constructed so that the significance level is a prescribed, small value (typically 0.1, 0.05, or 0.01).
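As a minimal sketch of how the significance level arises from a composite null hypothesis, the following example assumes a setting that is not part of the discussion above: $n$ independent observations from a normal distribution with unknown mean $\mu$ and known standard deviation, the null hypothesis $H_0: \mu \le 0$, and a rejection region of the form "sample mean greater than $c$". All names and numerical values are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical setting (an assumption for illustration): n observations from
# N(mu, sigma^2) with sigma known, H0: mu <= 0, and rejection region
# R = {sample mean > c}.
n, sigma, alpha = 25, 1.0, 0.05
se = sigma / sqrt(n)  # standard deviation of the sample mean

# Choose c so that the type 1 error probability at the boundary mu = 0 is alpha.
c = NormalDist(0.0, se).inv_cdf(1 - alpha)

def type1_error_prob(mu):
    """P(X in R) = P(sample mean > c) when the true mean is mu."""
    return 1.0 - NormalDist(mu, se).cdf(c)

# A few of the distributions specified by the composite null hypothesis.
null_means = [-1.0, -0.5, -0.1, 0.0]
error_probs = [type1_error_prob(mu) for mu in null_means]

# The error probability increases with mu, so the maximum over H0 is
# attained at the boundary value mu = 0.
significance_level = max(error_probs)

for mu, p in zip(null_means, error_probs):
    print(f"mu = {mu:5.2f}: P(type 1 error) = {p:.6f}")
print(f"significance level = {significance_level:.4f}")
```

In this sketch the set of type 1 error probabilities is indexed by $\mu \le 0$, and the cutoff $c$ is chosen so that the largest of them equals the prescribed $\alpha$.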
If $H_1$ is true (that is, the distribution of $\boldsymbol{X}$ is specified by $H_1$), then $\mathbb{P}(\boldsymbol{X} \notin R)$ is the probability of a type 2 error for this distribution. Again, if $H_1$ is composite then $H_1$ specifies a variety of different distributions for $\boldsymbol{X}$, and thus there will be a set of type 2 error probabilities. Generally, there is a tradeoff between the type 1 and type 2 error probabilities. If we reduce the probability of a type 1 error by making the rejection region $R$ smaller, we increase (or at best leave unchanged) the probability of a type 2 error, because the complementary region $S \setminus R$ is larger.
If $H_1$ is true (that is, the distribution of $\boldsymbol{X}$ is specified by $H_1$), then $\mathbb{P}(\boldsymbol{X} \in R)$, the probability of rejecting $H_0$ (and thus making a correct decision), is known as the power of the test for the distribution.
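The tradeoff between the two error probabilities, and its relation to power, can be sketched numerically. The example below assumes a hypothetical setting with simple hypotheses $H_0: \mu = 0$ and $H_1: \mu = 0.5$ for the mean of a normal sample with unit variance, and a rejection region "sample mean greater than $c$". The cutoffs and parameter values are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical setting (an assumption for illustration): n observations from
# N(mu, 1), simple hypotheses H0: mu = 0 versus H1: mu = 0.5, and rejection
# region R = {sample mean > c}.
n, mu0, mu1 = 16, 0.0, 0.5
se = 1.0 / sqrt(n)  # standard deviation of the sample mean

def error_probabilities(c):
    """Return (type 1 error prob, type 2 error prob, power) for cutoff c."""
    type1 = 1.0 - NormalDist(mu0, se).cdf(c)  # P(X in R) under H0
    type2 = NormalDist(mu1, se).cdf(c)        # P(X not in R) under H1
    power = 1.0 - type2                       # P(X in R) under H1
    return type1, type2, power

# Increasing the cutoff shrinks the rejection region.
cutoffs = [0.2, 0.3, 0.4, 0.5]
results = [error_probabilities(c) for c in cutoffs]

for c, (type1, type2, power) in zip(cutoffs, results):
    print(f"c = {c:.1f}: type 1 = {type1:.4f}, type 2 = {type2:.4f}, power = {power:.4f}")
```

As the cutoff grows, the type 1 error probability falls while the type 2 error probability rises, so the power falls as well.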
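Suppose that we have two tests, corresponding to rejection regions $R_1$ and $R_2$, respectively, each having significance level $\alpha$. The test with region $R_1$ is uniformly more powerful than the test with region $R_2$ if
$$\mathbb{P}(\boldsymbol{X} \in R_1) \ge \mathbb{P}(\boldsymbol{X} \in R_2) \text{ for every distribution of } \boldsymbol{X} \text{ specified by } H_1$$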
Naturally, in this case, we would prefer the first test. Often, however, two tests will not be uniformly ordered; one test will be more powerful for some distributions specified by $H_1$ while the other test will be more powerful for other distributions specified by $H_1$. Finally, if a test has significance level $\alpha$ and is uniformly more powerful than any other test with significance level $\alpha$, then the test is said to be a uniformly most powerful test at level $\alpha$. Clearly, such a test is the best we can do.
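To make the comparison of two tests concrete, here is a small sketch under assumed conditions: a normal sample with unit variance, $H_0: \mu = 0$ versus $H_1: \mu > 0$, and two hypothetical tests at the same level. The first uses the sample mean of all $n$ observations; the second uses only the first observation. Neither test is taken from the text; they are chosen only because their power has a simple closed form.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical setting (an assumption for illustration): X_1, ..., X_n i.i.d.
# N(mu, 1), with H0: mu = 0 versus H1: mu > 0.
n, alpha = 10, 0.05
std_normal = NormalDist()
z = std_normal.inv_cdf(1 - alpha)  # upper alpha quantile of N(0, 1)

def power_mean_test(mu):
    """Test 1: reject when sqrt(n) * (sample mean) > z."""
    return 1.0 - std_normal.cdf(z - sqrt(n) * mu)

def power_single_obs_test(mu):
    """Test 2: reject when X_1 > z, ignoring the remaining observations."""
    return 1.0 - std_normal.cdf(z - mu)

# Both tests have type 1 error probability alpha at mu = 0. For mu > 0,
# sqrt(n) * mu >= mu, so test 1 is at least as powerful at every alternative;
# the grid below only illustrates this.
alternatives = [0.1, 0.25, 0.5, 1.0, 2.0]
comparison = [(mu, power_mean_test(mu), power_single_obs_test(mu))
              for mu in alternatives]

for mu, p1, p2 in comparison:
    print(f"mu = {mu:4.2f}: power of test 1 = {p1:.4f}, power of test 2 = {p2:.4f}")
```

In this sketch the first test is uniformly more powerful than the second, which matches the intuition that discarding data should not help.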
In most cases, we have a general procedure that allows us to construct a test (that is, a rejection region $R_\alpha$) for any given significance level $\alpha \in (0, 1)$. Typically, $R_\alpha$ decreases (in the subset sense) as $\alpha$ decreases. In this context, the $P$-value of the data variable $\boldsymbol{X}$, denoted $P(\boldsymbol{X})$, is defined to be the smallest $\alpha$ for which $\boldsymbol{X} \in R_\alpha$; that is, the smallest significance level for which $H_0$ is rejected, given $\boldsymbol{X}$. Knowing $P(\boldsymbol{X})$ allows us to test $H_0$ at any significance level, for the given data: if $P(\boldsymbol{X}) \le \alpha$ then we reject $H_0$ at significance level $\alpha$; if $P(\boldsymbol{X}) \gt \alpha$ then we fail to reject $H_0$ at significance level $\alpha$. Note that $P(\boldsymbol{X})$ is a statistic.
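The relationship between the $P$-value and the family of rejection regions can be sketched as follows. The setting is an assumption made for illustration: a normal sample with unit variance, $H_0: \mu \le 0$ versus $H_1: \mu > 0$, and rejection regions of the form $R_\alpha = \{z \ge z_{1-\alpha}\}$, where $z$ is the standardized sample mean and $z_{1-\alpha}$ is the standard normal quantile. The observed summary values are made up.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical setting (an assumption for illustration): n observations from
# N(mu, 1), H0: mu <= 0 versus H1: mu > 0, with rejection regions
# R_alpha = {z >= z_(1 - alpha)}, where z = sqrt(n) * (sample mean).
std_normal = NormalDist()

def z_statistic(sample_mean, n):
    return sqrt(n) * sample_mean

def rejects(sample_mean, n, alpha):
    """Is the data in the rejection region R_alpha?"""
    return z_statistic(sample_mean, n) >= std_normal.inv_cdf(1 - alpha)

def p_value(sample_mean, n):
    """Smallest alpha for which the data falls in R_alpha."""
    return 1.0 - std_normal.cdf(z_statistic(sample_mean, n))

sample_mean, n = 0.4, 25  # made-up observed summary of the data
p = p_value(sample_mean, n)
print(f"P-value = {p:.4f}")

# Comparing the P-value with alpha gives the same decision as checking
# membership in R_alpha directly.
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: reject via region = {rejects(sample_mean, n, alpha)}, "
          f"reject via P-value = {p <= alpha}")
```

The point of the sketch is that a single number computed from the data replaces a separate check against each rejection region.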
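Hypothesis testing is a very general concept, but an important special class occurs when the distribution of the data variable $\boldsymbol{X}$ depends on a parameter $\theta$ taking values in a parameter space $\Theta$. The parameter may be vector-valued, so that $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ and $\Theta \subseteq \mathbb{R}^k$ for some $k \in \mathbb{N}_+$. The hypotheses generally take the form
$$H_0: \theta \in \Theta_0 \text{ versus } H_1: \theta \notin \Theta_0$$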
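where $\Theta_0$ is a prescribed subset of the parameter space $\Theta$. In this setting, the probabilities of making an error or a correct decision depend on the true value of $\theta$. If $R$ is the rejection region, then the power function $Q$ is given by
$$Q(\theta) = \mathbb{P}_\theta(\boldsymbol{X} \in R), \quad \theta \in \Theta$$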
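Show that
1. $Q(\theta)$ is the probability of a type 1 error when $\theta \in \Theta_0$.
2. $\max\{Q(\theta): \theta \in \Theta_0\}$ is the significance level of the test.

Show that
1. $1 - Q(\theta)$ is the probability of a type 2 error when $\theta \notin \Theta_0$.
2. $Q(\theta)$ is the power of the test when $\theta \notin \Theta_0$.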
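A power function can be computed in closed form in simple models. The sketch below assumes a hypothetical two-sided test for the mean $\mu$ of a normal sample with known standard deviation, with $H_0: \mu = \mu_0$ and rejection region $|z| > z_{1 - \alpha/2}$, where $z$ is the standardized sample mean. The model, the function name `power`, and the numerical values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical setting (an assumption for illustration): n observations from
# N(mu, sigma^2) with sigma known, H0: mu = mu0 versus H1: mu != mu0, and
# rejection region R = {|z| > z_crit}, z = sqrt(n) * (sample mean - mu0) / sigma.
n, sigma, mu0, alpha = 20, 1.0, 0.0, 0.05
std_normal = NormalDist()
z_crit = std_normal.inv_cdf(1 - alpha / 2)

def power(mu):
    """Q(mu) = P_mu(X in R). Under mu, z is normal with mean `shift`, variance 1."""
    shift = sqrt(n) * (mu - mu0) / sigma
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)  # P(z > z_crit)
    lower_tail = std_normal.cdf(-z_crit - shift)       # P(z < -z_crit)
    return upper_tail + lower_tail

# At mu = mu0 the power function gives the type 1 error probability, which is
# the significance level here because the null hypothesis is simple.
# For mu != mu0, Q(mu) is the power and 1 - Q(mu) the type 2 error probability.
for mu in (-0.6, -0.3, 0.0, 0.3, 0.6):
    print(f"mu = {mu:5.2f}: Q(mu) = {power(mu):.4f}, 1 - Q(mu) = {1 - power(mu):.4f}")
```

In this example the power function is symmetric about $\mu_0$ and increases as $\mu$ moves away from $\mu_0$ in either direction.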
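Suppose that we have two tests, corresponding to rejection regions $R_1$ and $R_2$, respectively, each having significance level $\alpha$, and with power functions $Q_1$ and $Q_2$. The test with rejection region $R_1$ is uniformly more powerful than the test with rejection region $R_2$ if
$$Q_1(\theta) \ge Q_2(\theta), \quad \theta \notin \Theta_0$$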
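Most hypothesis tests of an unknown real parameter $\theta$ fall into three special cases:
1. $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$
2. $H_0: \theta \ge \theta_0$ versus $H_1: \theta \lt \theta_0$
3. $H_0: \theta \le \theta_0$ versus $H_1: \theta \gt \theta_0$

where $\theta_0$ is a specified value. Case 1 is known as the two-sided test; case 2 is known as the left-tailed test; and case 3 is known as the right-tailed test (each named after the conjectured alternative). There may be other unknown parameters besides $\theta$ (known as nuisance parameters).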
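The names of the three cases are easiest to see through typical rejection regions. The sketch below is an illustration under an assumption not made in the text: the test statistic $z$ has a standard normal distribution when $\theta = \theta_0$, as it would for a standardized sample mean with known variance. The function name `rejects` and the labels for the alternatives are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()

def rejects(z, alpha, alternative):
    """Decide whether to reject H0 given a statistic z that is assumed to be
    standard normal when theta = theta_0 (an assumption for illustration)."""
    if alternative == "two-sided":   # H1: theta != theta_0
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    if alternative == "left":        # H1: theta < theta_0
        return z < std_normal.inv_cdf(alpha)
    if alternative == "right":       # H1: theta > theta_0
        return z > std_normal.inv_cdf(1 - alpha)
    raise ValueError(f"unknown alternative: {alternative!r}")

# The same observed statistic can lead to different decisions in the three cases.
for z in (1.8, -2.5):
    for alternative in ("two-sided", "left", "right"):
        print(f"z = {z:5.2f}, {alternative:9s}: reject = {rejects(z, 0.05, alternative)}")
```

The rejection region sits in both tails, the left tail, or the right tail, according to where the alternative places $\theta$ relative to $\theta_0$.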
There is an equivalence between hypothesis tests and confidence sets for a parameter $\theta$.
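Suppose that $C(\boldsymbol{X})$ is a $1 - \alpha$ level confidence set for $\theta$. Show that the test below has significance level $\alpha$ for the hypothesis $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$:
$$\text{Reject } H_0 \text{ if and only if } \theta_0 \notin C(\boldsymbol{X})$$
Equivalently, we fail to reject $H_0$ at significance level $\alpha$ if and only if $\theta_0$ is in the corresponding $1 - \alpha$ level confidence set.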
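In particular, show that this equivalence applies to interval estimates of a real parameter $\theta$ and the common tests for $\theta$. In each case below, the confidence interval has confidence level $1 - \alpha$ and the test has significance level $\alpha$.
1. Suppose that $[L(\boldsymbol{X}), U(\boldsymbol{X})]$ is a two-sided confidence interval for $\theta$. Reject $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$ if and only if $\theta_0 \lt L(\boldsymbol{X})$ or $\theta_0 \gt U(\boldsymbol{X})$.
2. Suppose that $L(\boldsymbol{X})$ is a confidence lower bound for $\theta$. Reject $H_0: \theta \le \theta_0$ versus $H_1: \theta \gt \theta_0$ if and only if $\theta_0 \lt L(\boldsymbol{X})$.
3. Suppose that $U(\boldsymbol{X})$ is a confidence upper bound for $\theta$. Reject $H_0: \theta \ge \theta_0$ versus $H_1: \theta \lt \theta_0$ if and only if $\theta_0 \gt U(\boldsymbol{X})$.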
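The equivalence between a two-sided confidence interval and a two-sided test can be checked numerically. The sketch below assumes a normal sample with known standard deviation and the usual interval "sample mean plus or minus $z_{1-\alpha/2}$ standard errors"; the observed sample mean and the candidate values of $\mu_0$ are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical setting (an assumption for illustration): n observations from
# N(mu, sigma^2) with sigma known; the observed sample mean is made up.
sample_mean, n, sigma, alpha = 0.3, 16, 1.0, 0.05
se = sigma / sqrt(n)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Two-sided confidence interval with confidence level 1 - alpha.
lower = sample_mean - z_crit * se
upper = sample_mean + z_crit * se

def reject_by_interval(mu0):
    """Reject H0: mu = mu0 when mu0 lies outside the confidence interval."""
    return mu0 < lower or mu0 > upper

def reject_by_statistic(mu0):
    """Reject H0: mu = mu0 when the standardized statistic is extreme."""
    return abs((sample_mean - mu0) / se) > z_crit

candidates = [-0.5, -0.1, 0.0, 0.3, 0.7, 1.0]
decisions = [(mu0, reject_by_interval(mu0), reject_by_statistic(mu0))
             for mu0 in candidates]

print(f"confidence interval: [{lower:.3f}, {upper:.3f}]")
for mu0, by_interval, by_statistic in decisions:
    print(f"mu0 = {mu0:5.2f}: reject via interval = {by_interval}, "
          f"reject via statistic = {by_statistic}")
```

The two decision rules agree for every candidate value, so the confidence interval can be read as the set of values $\mu_0$ that the test would fail to reject.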
Recall that confidence sets of an unknown parameter $\theta$ are often constructed through a pivot variable, that is, a random variable $W(\boldsymbol{X}, \theta)$ that depends on the data vector $\boldsymbol{X}$ and the parameter $\theta$, but whose distribution does not depend on $\theta$. In this case, a natural test statistic is $W(\boldsymbol{X}, \theta_0)$.
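The defining property of a pivot variable can be illustrated by simulation. The sketch below assumes a normal sample with known standard deviation $\sigma$ and uses the standardized sample mean, $\sqrt{n}\,(\bar{X} - \mu)/\sigma$, as the pivot; the function name `simulate_pivot`, the seeds, and the parameter values are illustrative choices.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

def simulate_pivot(mu, seed, n=10, sigma=2.0, reps=4000):
    """Simulate W(X, mu) = sqrt(n) * (sample mean - mu) / sigma for samples of
    size n from N(mu, sigma^2). The normal model with known sigma is an
    assumption made for illustration."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        values.append(sqrt(n) * (fmean(sample) - mu) / sigma)
    return values

# The simulated pivot should look standard normal whatever the true mean is.
summary = {}
for seed, mu in enumerate((-3.0, 5.0), start=1):
    w = simulate_pivot(mu, seed)
    summary[mu] = (fmean(w), pstdev(w))
    print(f"mu = {mu:5.1f}: mean of W = {summary[mu][0]:.3f}, "
          f"sd of W = {summary[mu][1]:.3f}")
```

Because the distribution of the pivot does not depend on $\mu$, replacing $\mu$ with the hypothesized value $\mu_0$ gives a statistic whose distribution is known when $H_0: \mu = \mu_0$ is true, which is what makes it usable as a test statistic.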