]> Order Statistics
  1. Virtual Laboratories
  2. 11. Finite Sampling Models
  3. 1
  4. 2
  5. 3
  6. 4
  7. 5
  8. 6
  9. 7
  10. 8
  11. 9

4. Order Statistics

Basic Theory

Random Variables

Suppose that the objects in our population are numbered from 1 to m , so that D 1 2 m . For example, the population might consist of manufactured items, and the labels might correspond to serial numbers. As in the basic sampling model we select n objects at random, without replacement from D :

X X 1 X 2 X n

where X i D is the i th object chosen. Recall that X is uniformly distributed over the set of permutations of size n chosen from D . Recall also that

W X 1 X 2 X n

is the unordered sample, which is uniformly distributed on the set of combinations of size n chosen from D .

For i 1 2 n let

X n i i  smallest element of  X 1 X 2 X n

The random variable X n i is known as the order statistic of order i for the sample X . Note that in particular, the extreme order statistics are

X n 1 X 1 X 2 X n X n n X 1 X 2 X n

Show that X n i takes values in i i 1 m n 1 .

We will denote the vector of order statistics by

Y X n 1 X n 2 X n n

Note that Y takes values in L x 1 x 2 x n D n x 1 x 2 x n .

Run the order statistic experiment. Note that you can vary the population size m and the sample size n . The order statistics are recorded on each update.

Distributions

Show that L has m n elements and that Y is uniformly distributed on L . Hint: Y x 1 x 2 x n if and only if W x 1 x 2 x n if and only if X is one of the n permutations of x 1 x 2 x n .

Use a combinatorial argument to show that the probability density function of X n i is

X n i k k 1 i 1 m k n i m n ,  k i i 1 m n i

In the order statistic experiment, vary the parameters and note the shape and location of the probability density function. For selected values of the parameters, run the experiment 1000 times, updating very 10 runs. Note the apparent convergence of the relative frequency function to the probability density function.

Moments

The probability density function in Exercise 4 can be used to obtain an interesting identity involving the binomial coefficients. This identity, in turn, can be used to find the mean and variance of X n i .

Show that for 1 i n m ,

k i m n i k 1 i 1 m k n i m n

Show that

X n i i m 1 n 1
  1. Start with the definition of expected value.
  2. Show that k k 1 i 1 i k i
  3. Use the identity in the Exercise 6 with m replaced with m 1 , n replaced with n 1 , and i replaced with i 1 .
  4. Simplify the result.

Use the identity in Exercise 6 to show that

X n i i n i 1 m 1 m n n 1 2 n 2

In the order statistic experiment, vary the parameters and note the size and location of the mean/standard deviation bar. For selected values of the parameters, run the experiment 1000 times, updating every 10 runs. Note the apparent convergence of the sample mean and standard deviation to the distribution mean and standard deviation.

Estimators of m based on the order statistics

Use the result of Exercise 7 to show that for i 1 2 n , the following statistic is an unbiased estimator of m :

U n i n 1 i X n i 1

Since U n i is unbiased, its variance is the mean square error, a measure of the quality of the estimator.

Use the result of Exercise 8 to show that

U n i m 1 m n n i 1 i n 2

Show that for fixed m and n , U n i decreases as i increases. Thus, the estimators improve as i increases; in particular, U n n is the best and U n 1 the worst.

Verify the following ratio, known as the relative efficiency of U n j with respect to U n i :

U n i U n j j n i 1 i n j 1 .

Note that the relative efficiency depends only on the orders i and j and the sample size n , but not on the population size m . In particular, the relative efficiency of U n n with respect to U n 1 is n 2 .

Usually, we hope that an estimator improves (in the sense of mean square error) as the sample size n increases (the more information we have, the better our estimate should be). This general idea is known as consistency.

Verify the following result. Thus, U n n decreases to 0 as n increases from 1 to m , and so U n n is consistent:

U n n m 1 m n n n 2

Show that for fixed i , U n i at first increases and then decreases to 0 as n increases from i to m . Thus, U n i is inconsistent.

The following graph shows U n 1 as a function of n for m 100 .

The variance of Un,1 as a function of n

An Estimator of m based on the sample mean

In this subsection, we will derive another estimator of the parameter m based on the average of the sample variables M n 1 n i 1 n X n , (the sample mean) and compare this estimator with the estimator based on the maximum of the variables (the largest order statistic).

Show that M n m 1 2 .

  1. Recall that X i is uniformly distributed on D for each i .
  2. Show or recall that X i m 1 2 .

It follows that V n 2 M n 1 is an unbiased estimator of m . Moreover, it seems that superficially at least, V n uses more information from the sample (since it involves all of the sample variables) than U n n . Could it be better? To find out, we need to compute the variance of the estimator (which, since it is unbiased, is the mean square error). This computation is a bit complicated since the sample variables are dependent. We will compute the variance of the sum as the sum of all of the pairwise covariances.

Show that X i X j m 1 12 for i j .

  1. First recall that given X i k , X j is uniformly distributed on D k .
  2. Now show that X j X i k m m 1 2 m 1 k m 1 .
  3. Now use a conditioning argument to show that X i X j m 1 3 m 2 12 .
  4. Finally use the results in (c) and 16(b), and the standard formula for covariance.

Recall or show that X i m 2 1 12 .

Show that M n m 1 m n 12 n .

  1. In the sum of the pairwise covariances, there are n terms with the value given in Exericse 18.
  2. In the sum of the pairwise covariances, there are n 2 n terms with the value given in Exercise 17.

Finally, show that V n m 1 m n 3 n .

The variance in Exercise 20 is decreasing with n , so the estimator V n is also consistent. Let's compute the relative efficiency of the estimator based on the maximum to the estimator based on the mean.

Show that V n U n n n 2 3 .

Thus, once again, the estimator based on the maximum is better.

Sampling with Replacement

If the sampling is with replacement, then the sample X X 1 X 2 X n is a sequence of independent and identically distributed random variables. The order statistics from such samples are studied in the chapter on Random Samples.

Examples and Applications

Suppose that in a lottery, tickets numbered from 1 to 25 are placed in a bowl. Five tickets are chosen at random and without replacement.

  1. Find the probability density function of X 5 3 .
  2. Find X 5 3 .
  3. Find X 5 3 .

The German Tank Problem

The estimator U n n was used by the Allies during World War II to estimate the number of German tanks m that had been produced. German tanks had serial numbers, and captured German tanks and records formed the sample data. The statistical estimates turned out to be much more accurate than intelligence estimates. Some of the data are given in the table below.

German Tank Data
Date Statistical Estimate Intelligence Estimate German Records
June 1940 169 1000 122
June 1941 244 1550 271
August 1942 327 1550 342

One of the morals, evidently, is not to put serial numbers on your weapons!

Suppose that in a certain war, 5 enemy tanks have been captured. The serial numbers are 51, 3, 27, 82, 65. Compute the estimate of m , the total number of tanks, using all of the estimators discussed above.

In the order statistic experiment, and set m 100 , n 10 . Run the experiment 50 times, updating after each run. For each run, compute the estimate of m based on each order statistic. For each estimator, compute the square root of the average of the squares of the errors over the 50 runs. Based on these empirical error estimates, rank the estimators of m in terms of quality.

Suppose that in a certain war, 10 enemy tanks have been captured. The serial numbers are 304, 125, 417, 226, 192, 340, 468, 499, 87, 352. Compute the estimate of m , the total number of tanks, using the estimator based on the maximum and the estimator based on the sum.