Suppose that our random experiment is to perform a sequence of Bernoulli trials :
\[ \bs{X} = (X_1, X_2, \ldots) \]Recall that \(\bs{X}\) is a sequence of independent indicator random variables with common probability of success \(p \in [0, 1]\), the basic parameter of the process. In statistical terms, the first \(n\) trails \((X_1, X_2, \ldots, X_n)\) form a random sample of size \(n\) from the Bernoulli distribution. In this section we will study the random variable that gives the number of successes in the first \(n\) trials and the random variable that gives the proportion of successes in the first \(n\) trials. The underlying distribution, the binomial distribution, is one of the most important in probability theory.
The number of successes in the first \(n\) trials is the random variable
\[ Y_n = \sum_{i=1}^n X_i \]Thus \(\bs{Y} = (Y_0, Y_1, \ldots)\) is the partial sum process associated with the Bernoulli trials sequence \(\bs{X}\).
The probability density function of \(Y_n\) is given by
\[ \P(Y_n = y) = \binom{n}{y} p^y (1 - p)^{n-y}, \quad y \in \{0, 1, \ldots, n\} \]Recall that if \((x_1, x_2 \ldots) \in \{0, 1\}^n\) with \(\sum_{i=1}^n x_i = y\) (that is, a bit string of length \(n\) with exactly \(y\) 1's), then by independence,
\[ \P[(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n)] = p^y (1 - p)^{n - y} \]Moreover, the number of bit strings of length \(n\) with exactly \(y\) 1's is the binomial coefficient \(\binom{n}{y}\).
The distribution with this probability density function is known as the binomial distribution with parameters \(n\) and \(p\).
In the binomial coin experiment, vary \(n\) and \(p\) with the scrollbars, and note the shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and note the apparent convergence of the relative frequency function to the probability density function.
The binomial theorem provides an additional check that the binomial probability density function really is a probability density function.
The binomial distribution is unimodal:
Thus, the density function at first increases and then decreases, reaching its maximum value at \(\lfloor (n + 1) p \rfloor\). This integer is a mode of the distribution. In the case that \(m = (n + 1) p\) is an integer between 1 and \(n\), there are two consecutive modes, at \(m - 1\) and \(m\).
If \(U\) is a random variable having the binomial distribution with parameters \(n\) and \(p\), then \(n - U\) has the binomial distribution with parameters \(n\) and \(1 - p\).
A simple probabilistic proof is based on the interpretation of \(U\) in terms of Bernoulli trials: If \(U\) is the number of successes in \(n\) Bernoulli trials, then \(n - U\) is the number of failures. A simple analytic proof can be constructed using the probability density function of \(U\): \(\P(n - U = k) = \P(U = n - k)\) for \(k \in \{0, 1, \ldots, n\}\)
The mean, variance and other moments of the binomial distribution can be computed in several different ways. Again let \(Y_n = \sum_{i=1}^n X_i\) where \(\bs{X} = (X_1, X_2, \ldots)\) is a sequence of Bernoulli trials with success parameter \(p\).
\(E(Y_n) = n \, p \).
This result follows immediately from the additive property of expected value and the representation of \(Y_n\) as a sum. Recall that \(\E(X_i) = p\) for each \(i\). The result can also be proven from the probability density function and the definition of expected value. For this proof, the identity \(y \binom{n}{y} = n \binom{n - 1}{y - 1}\) is useful.
The expected value of \(Y_n\) also makes intuitive sense, since \(p\) should be approximately the proportion of successes in a large number of trials. We will discuss the point further in the subsection on the proportion of successes.
\(\var(Y_n) = n \, p \, (1 - p)\)
This result follows from a basic property of variance: the variance of a sum of independent variables is the sum of the variances. Recall that \(\var(X_i) = p \, (1- p)\) for each \(i\). The result can also be proven from the definition of variance and the probability density function of \(Y_n\), although this proof is much more complicated.
The graph of \(\var(Y_n)\) as a function of \(p \in [0, 1]\) is parabola opening downward. In particular the maximum value of the variance is \(n / 4\) when \(p = 1 / 2\), and the minimm value is 0 when \(p = 0\) or \(p = 1\).
In the binomial coin experiment, vary \(n\) and \(p\) with the scrollbars and note the location and size of the mean/standard deviation bar. For selected values of the parameters, run the simulation 1000 times and note the apparent convergence of the sample mean and standard deviation to the distribution mean and standard deviation.
The probability generating function of \(Y_n\) is \( P_n(t) := \E \left(t^{Y_n} \right) = [(1 - p) + p \, t]^n, \quad t \in \R \).
Again, there are several ways to prove this. The simplest is to use the fact that the probability generating function of a sum of indpendent variables is the product of the probability generating functions of the terms. Recall that \( P(t) := \E(t^{X_i}) = (1 - p) + p \, t\) for each \(i\), so \( P_n(t) = P^n(t) \). A direct proof, using the probability density function requires the binomial theorm.
The probability generating function provides another way to compute the mean and variance.
Recall that \( P_n^{(k)}(1) = \E\left[Y_n^{(k)}\right] \)
The following is a recursive equation for the moments of the binomial distribution:
\[ \E (Y_n^k) = n \, p \, \E \left[ (Y_{n-1} + 1)^{k-1} \right], \quad n, \; k \in \N_+ \]Use the identity \(y \binom{n}{y} = n \binom{n - 1}{y - 1}\).
The recursion result in the previous exercise gives yet one more derivation of the mean and variance of the binomial distribution.
The proportion of successes in the first \(n\) trials is the random variable
\[ M_n = \frac{Y_n}{n} = \frac{1}{n} \sum_{i=1}^n X_i \]In statistical terms, \(M_n\) is the sample mean of the random sample \((X_1, X_2, \ldots, X_n)\). The proportion of successes \(M_n\) is typically used to estimate the probability of success \(p\) when this probability is unknown. It is basic to the very notion of probability that if the number of trials is large, then \(M_n\)should be close to \(p\). The mathematical formulation of this idea is a special case of the law of large numbers.
It is easy to express the probability density function of the proportion of successes \(M_n\) in terms of the probability density function of the number of successes \(Y_n\). First, note that \(M_n\) takes the values \(k / n\) where \(k \in \{0, 1, \dots, n\}\).
The probabiliy density function of \(M_n\) is given by
\[ \P \left( M_n = \frac{k}{n} \right) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k \in \{0, 1, \ldots, n\} \]In the binomial coin experiment, select the proportion of heads. Vary \(n\) and \(p\) with the scroll bars and note the shape of the probability density function. For selected values of the parameters, run the experiment 1000 timesand watch the apparent convergence of the relative frequency function to the probability density function.
\(\E(M_n) = p\). In statistical terms, this means that \(M_n\) is an unbiased estimator of \(p\).
\(\var(M_n) = \frac{p (1 - p)}{n}\). In particular, \(var(M_n) \le \frac{1}{4 \, n}\) for any \(p \in [0, 1]\).
Note that \(\var(M_n) \to 0\) as \(n \to \infty\) and the convergence is uniform in \(p \in [0, 1]\). Thus, the estimate improves as \(n\) increases; in statistical terms, this is known as consistency.
For every \(\epsilon \gt 0\), \(\P(|M_n - p| \ge \epsilon) \to 0\) as \(n \to \infty\) and the convergence is uniform in \(p \in [0, 1]\).
This follows from the last exercise and Chebyshev's inequality.
The result in the last exercise is a special case of the weak law of large numbers and means that \(M_n \to p\) as \(n \to \infty\) in probability. The strong law of large numbers states that the convergence actually holds with probability 1. See Estimation in the Bernoulli Model in the chapter on Set Estimation for a different approach to the problem of estimating \(p\).
In the binomial coin experiment, select the proportion of heads. Vary \(n\) and \(p\) and note the size and location of the mean/standard deviation bar. For selected values of the parameters, run the experiment 1000 times and note the apparent convergence of the empirical moments to the distribution moments.
Several important properties of the random process \(\bs{Y} = (Y_1, Y_2, \ldots)\) stem from the fact that it is a partial sum process corresponding to the sequence \(\bs{X} = (X_1, X_2, \ldots)\) of independent, identically distributed indicator variables.
\(\bs{Y}\) has stationary, independent increments:
Actually, any partial sum process corresponding to an independent, identically distributed sequence will have stationary, independent increments.
Suppose that \(U\) and \(V\) are independent random variables, and that \(U\) has the binomial distribution with parameters \(m\) and \(p\), and \(V\) has the binomial distribution with parameters \(n\) and \(p\). Then \(U + V\) has the binomial distribution with parameters \(m + n\) and \(p\).
There are several ways to prove this. The simplest is to use the representation in terms of Bernoulli trials. In the standard notation above, we can identify \(U\) with \(Y_m\) and \(V\) with \(Y_{m+n} - Y_m\). Another simple proof uses probability generating function. Recall again that the probability generating function of \(U + V\) is the product of the probability generating functions of \(U\) and \(V\). Finally, there is a direct proof using probability density functions. Recall that the PDF of \(U + V\) is the convolution of the PDFs of \(U\) and \(V\).
The joint probability density functions of the \(\bs{Y}\) are given as follows: if \(n_1 \lt n_2 \lt \cdots \lt n_k\) and \(y_1 \le y_2 \le \cdots \le y_k\) then
\[ \P(Y_{n_1} = y_1, Y_{n+2} = y_2, \ldots, Y_{n_k} = y_k) = \binom{n_1}{y_1} \binom{n_2 - n_1}{y_2 - y_1} \cdots \binom{n_k - n_{k-1}}{y_k - y_{k-1}} p^{y_k} (1 - p)^{n_k - y_k} \]where as always, we use the standard conventions for binomial coefficients: \(\binom{a}{b} = 0\) if \(b \lt 0\) or \(b \gt a\).
Suppose that \(m, \; n, \; k \in \N_+\) with \(m \le n\) and \(k \le n\) then
\[ \P(Y_m = j \mid Y_n = k) = \frac{\binom{m}{j} \binom{m - n}{k - j}}{\binom{n}{k}}, \quad j \in \{0, 1, \ldots, n\} \]Interestingly, the probability distribution in the last exercise is independent of \(p\). It is known as the hypergeometric distribution with parameters \(n\), \(m\), and \(k\), and governs the number of type 1 objects in a sample of size \(k\) chosen at random and without replacement from a population of \(n\) objects of which \(m\) are type 1 (and the remainder type 0). Try to interpret this result probabilistically.
Open the binomial timeline experiment. For selected values of \(p \in (0, 1)\), start with \(n = 1\) and successively increase \(n\) by 1. For each value of \(n\), Note the shape of the probability density function of the number of successes and the proportion of successes. With \(n = 100\), run the experiment 1000 times and note the apparent convergence of the empirical density function to the probability density function for the number of successes and the proportion of successes
The characteristic bell shape that you should observe in the previous exercise is an example of the central limit theorem, because the binomial variable can be written as a sum of \(n\) independent, identically distributed random variables (the indicator variables).
The standard score of \(Y_n\) is the same as the standard score of \(M_n\):
\[ \frac{Y_n - n \, p}{\sqrt{n \, p \, (1 - p)}} = \frac{M_n - p}{\sqrt{p \, (1 - p) / n}} \]The distribution of the standard score given in the previous exercise converges to the standard normal distribution as \(n \to \infty\).
This version of the central limit theorem is known as the DeMoivre-Laplace theorem, and is named after Abraham DeMoivre and Simeon Laplace. From a practical point of view, this result means that, for large \(n\), the distribution of \(Y_n\) is approximately normal, with mean \(n \, p\) and standard deviation \(\sqrt{n \, p \, (1 - p)}\) and the distribution of \(M_n\) is approximately normal, with mean \(p\) and standard deviation \(\sqrt{p \, (1 - p) / n}\). Just how large \(n\) needs to be for the normal approximation to work well depends on the value of \(p\). The rule of thumb is that we need \(n \, p \ge 5\) and \(n \, (1 - p) \ge 5\). Finally, when using the normal approximation, we should remember to use the continuity correction, since the binomial is a discrete distribution.
A student takes a multiple choice test with 20 questions, each with 5 choices (only one of which is correct). Suppose that the student blindly guesses. Let \(X\) denote the number of questions that the student answers correctly. Find each of the following:
A certain type of missile has failure probability 0.02. Let \(Y\) denote the number of failures in 50 tests. Find each of the following:
Suppose that in a certain district, 40% of the registered voters prefer candidate \(A\). A random sample of 50 registered voters is selected. Let \(Z\) denote the number in the sample who prefer \(A\). Find each of the following:
Recall that a standard die is a six-sided die. A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 have probability \(\frac{1}{4}\) each, and faces 2, 3, 4, and 5 have probability \(\frac{1}{8}\).
A standard, fair die is tossed 10 times. Let \(N\) denote the number of aces. Find each of the following:
A coin is tossed 100 times and results in 30 heads. Find the probability density function of the number of heads in the first 20 tosses.
An ace-six flat die is rolled 1000 times. Let \(Z\) denote the number of times that a score of 1 or 2 occurred. Find each of the following:
In the binomial coin experiment, select the proportion of heads. Set \(n = 10\) and \(p = 0.4\). Run the experiment 100 times. Over all 100 runs, compute the square root of the average of the squares of the errors, when \(M\) used to estimate \(p\). This number is a measure of the quality of the estimate.
In the binomial coin experiment, select the number of heads \(Y\), and set \(p = 0.5\) and \(n = 15\). Run the experiment 1000 times with and compute the following:
In the binomial coin experiment, select the proportion of heads \(M\) and set \(n = 30\), \(p = 0.6\). Run the experiment 1000 times and compute each of the following:
In 1693, Samuel Pepys asked Isaac Newton whether it is more likely to get at least one ace in 6 rolls of a die or at least two aces in 12 rolls of a die. This problems is known a Pepys' problem; naturally, Pepys had fair dice in mind.
Guess the answer to Pepys' problem based on empirical data. With fair dice and \(n = 6\), run the simulation of the dice experiment 500 times and compute the relative frequency of at least one ace. Now with \(n = 12\), run the simulation 500 times and compute the relative frequency of at least two aces. Compare the results.
Solve Pepys' problem using the binomial distribution.
Which is more likely: at least one ace with 4 throws of a fair die or at least one double ace in 24 throws of two fair dice? This is known as DeMere's problem, named after Chevalier De Mere.
In the cicada data, compute the proportion of males in the entire sample, and the proportion of males of each species in the sample.
In the M&M data, pool the bags to create a large sample of M&Ms. Now compute the sample proportion of red M&Ms.
The Galton board is a triangular array of pegs. The rows are numbered by the natural numbers \(\N = \{0, 1, \ldots\}\) from top downward. Row \(n\) has \(n + 1\) pegs numbered from left to right by the integers \(\{0, 1, \ldots, n\}\). Thus a peg can be uniquely identified by the ordered pair \((n, k)\) where \(n\) is the row number and \(k\) is the peg number in that row. The Galton board is named after Francis Galton.
Now suppose that a ball is dropped from above the top peg \((0, 0)\). Each time the ball hits a peg, it bounces to the to the right with probability \(p\) and to the left with probability \(1 - p\), independently from bounce to bounce.
The number of the peg that the ball hits in row \(n\) is the has the binomial distribution with parameters \(n\) and \(p\).
In the Galton board experiment, select random variable \(Y\) (the number of moves right). Vary the parameters \(n\) and \(p\) and note the shape and location of the probability density function and the mean/standard deviation bar. For selected values of the parameters, click single step several times and watch the ball fall through the pegs. Then run the experiment 1000 times and watch the path of the ball. Note the apparent convergence of the relative frequency function and empirical moments to the probability density function and distribution moments, respectively.
Recall the discussion of structural reliability given in the last section on Bernoulli trials. In particular, we have a system of \(n\) similar components that function independently, each with reliability \(p\). Suppose now that the system as a whole functions properly if and only if at least \(k\) of the \(n\) components are good. Such a systems is called, appropriately enough, a \(k\) out of \(n\) system. Note that the series and parallel systems considered in the previous section are \(n\) out of \(n\) and 1 out of \(n\) systems, respectively.
Consider the \(k\) out of \(n\) system.
In the binomial coin experiment, set \(n = 10\) and \(p = 0.9\) and run the simulation 1000 times. Compute the empirical reliability and compare with the true reliability in each of the following cases:
Consider a system with \(n = 4\) components. Sketch the graphs of \(r_{4,1}\), \(r_{4,2}\), \(r_{4,3}\), and \(r_{4,4}\) on the same set of axes.
An \(n\) out of \(2 \, n - 1\) system is a majority rules system.
In the binomial coin experiment, compute the empirical reliability, based on 100 runs, in each of the following cases. Compare your results to the true probabilities.
For the majority rules system, \(r_{2 \, n - 1, n} \left( \frac{1}{2} \right) = \frac{1}{2}\).
The Weierstrass Approximation Theorem, named after Karl Weierstrass, states that any real-valued function that is continuous on a closed, bounded interval can be uniformly approximated on that interval, to any degree of accuracy, with a polynomial. The theorem is important, since polynomials are simple and basic functions, and a bit surprising, since continuous functions can be quite strange.
In 1911, Sergi Bernstein gave an explicit construction of polynomials that uniformly approximate a given continuous function, using Bernoulli trials. Bernstein's result is a beautiful example of the probabilistic method, the use of probability theory to obtain results in other areas of mathematics that are seemingly unrelated to probability.
Suppose that \(f\) is a real-valued function that is continuous on the interval \([0, 1]\). The Bernstein polynomial of degree \(n\) for \(f\) is defined by
\[ b_n(p) = \E_p[f(M_n)], \quad p \in [0, 1] \]where \(M_n\) is the proportion of successes in the first \(n\) Bernoulli trials with success parameter \(p\), as defined earlier. Note that we are emphasizing the dependence on \(p\) in the expected value operator. The next exercise gives a more explicit representation, and shows that the Bernstein polynomial is, in fact, a polynomial
The Bernstein polynomial of degree \(n\) can be written as follows:
\[ b_n(p) = \sum_{k=0}^n f \left(\frac{k}{n} \right) \binom{n}{k} p^k (1 - p)^{n-k}, \quad p \in [0, 1] \]This follows from the change of variables theorem for expected value.
The Bernstein polynomials satisfy the following properties:
From part (a), the graph of \(b_n\) passes through the endpoints \((0, f(0))\) and \((1, f(1))\). From part (b), the graph of \(b_1\) is a line connecting the endpoints. From (c), the graph of \(b_2\) is parabola passing through the endpoints and the point \(\left( \frac{1}{2}, \frac{1}{4} \, f(0) + \frac{1}{2} \, f\left(\frac{1}{2}\right) + \frac{1}{4} \, f(1) \right)\).
Bernstein's theorem: \(b_n \to f\) as \(n \to \infty\) uniformly on \([0, 1]\).
Since \(f\) is continuous on the closed, bounded interval \([0, 1]\), it is bounded on this interval. Thus, there exists a constant \(C\) such that \(|f(p)| \le C\) for all \(p \in [0, 1]\). Also, \(f\) it is uniformly continuous on \([0, 1]\). Thus, for any \(\epsilon \gt 0\) there exists \(\delta \gt 0\) such that if \(p \in [0, 1]\), \(q \in [0, 1]\), and \(|p - q| \lt \delta\) then \(|f(p) - f(q)| \lt \epsilon\). From basic properties of expected value,
\[ |b_n(p) - f(p)| \le \E_p[|f(M_n) - f(p)|, \; |M_n - p| \lt \delta] + \E_p[|f(M_n) - f(p)|, \; |M_n - p| \ge \delta] \]Hence \(|b_n(p) - f(p)| \le \epsilon + 2 \, C \, \P_p(|M_n - p| \ge \delta)\) for any \(p \in [0, 1]\). But by Exercise 19, \(\P_p(|M_n - p| \ge \delta) \to 0\) as \(n \to \infty\) uniformly in \(p \in [0, 1]\).
Compute the Bernstein polynomials of orders 1, 2, and 3 for the function \(f\) defined by \(f(x) = \cos(\pi \, x), \quad x \in [0, 1]\). Graph \(f\) and the three polynomials on the same set of axes.
Use a computer algebra system to compute the Bernstein polynomials of orders 10, 20, and 30 for the function \(f\) defined below. Use the CAS to graph the function and the three polynomials on the same axes.
\[ f(x) = \begin{cases} 0, & x = 0 \\ x \, \sin\left(\frac{\pi}{x}\right), & x \in (0, 1] \end{cases} \]