Suppose that we have a random experiment with sample space \(S\). Intuitively, the probability of an event is a measure of how likely the event is to occur when we run the experiment. Mathematically, probability is a function on the collection of events that satisfies certain axioms.
A probability measure (or probability distribution) \(\P\) for a random experiment is a real-valued function, defined on the collection of events, that satisifes the following axioms:
Axiom (c) is known as countable additivity, and states that the probability of a union of a finite or countably infinite collection of disjoint events is the sum of the corresponding probabilities. The axioms are known as the Kolmogorov axioms, in honor of Andrei Kolmogorov.
Axioms (a) and (b) are really just a matter of convention; we choose to measure the probability of an event with a number between 0 and 1 (as opposed, say, to a number between \(-5\) and \(7\)). Axiom (c) however, is fundamental and inescapable. It is required for probability for precisely the same reason that it is required for other measures of the size
of a set, such as
In all these cases, the size of a set that is composed of countably many disjoint pieces is the sum of the sizes of the pieces. For more on general measures, see the advanced section on measure theory.
On the other hand, uncountable additivity (the extension of axiom (c) to an uncountable index set \(I\)) is unreasonable for probability, just as it is for other measures. For example, an interval of positive length in \(\R\) is a union of uncountably many points, each of which has length 0.
We now have defined the three essential ingredients for the model a random experiment:
Together these define a probability space \((S, \mathscr{S}, \P)\).
Intuitively, the probability of an event is supposed to measure the long-term relative frequency of the event—in fact, this concept was taken as the definition of probability by Richard Von Mises. Specifically, suppose that we repeat the experiment indefinitely. (Note that this actually creates a new, compound experiment.) For an event \(A\) in the basic experiment, let \(N_n(A)\) denote the number of times \(A\) occurred (the frequency of \(A\)) in the first \(n\) runs. (Note that this is a random variable in the compound experiment.) Thus, \[P_n(A) = \frac{N_n(A)}{n}\] is the relative frequency of \(A\) in the first \(n\) runs (it is also a random variable in the compound experiment). If we have chosen the correct probability measure for the experiment, then in some sense we expect that the relative frequency of each event should converge to the probability of the event: \[P_n(A) \to \P(A) \text{ as } n \to \infty\] The precise statement of this is the law of large numbers or law of averages, one of the fundamental theorems in probability. To emphasize the point, note that in general there will be lots of possible probability measures for an experiment, in the sense of the axioms. However, only the probability measure that models the experiment correctly will satisfy the law of large numbers.
It follows that if we have the data from \(n\) runs of the experiment, the observed relative frequency \(P_n(A)\) can be used as an approximation for \(\P(A)\); this approximation is called the empirical probability of \(A\).
The function \(P_n\) satisfies the axioms of a probability measure (given the data from \(n\) runs of the experiment).
Suppose that \(X\) is a random variable for the experiment, taking values in a set \(T\).
The function \(B \mapsto \P(X \in B)\) defines a probability measure on \(T\).
Recall that the inverse image preserves all set operations. In particular, \(\P(X \in T) = \P(S) = 1\). If \(\{B_i: i \in I\}\) is a countable collection of disjoint subsets of \(T\) then \(\left\{\{X \in B_i\}: i \in I\right\}\) is a countable collection of disjoint subsets of \(S\), and \(\left\{X \in \bigcup_{i \in I} B_i\right\} = \bigcup_{i \in I} \{X \in B_i\}\). Hence \[ \P\left(X \in \bigcup_{i \in I} B_i\right) = \P\left(\bigcup_{i \in I} \{X \in B_i\}\right) = \sum_{i \in I} \P(X \in B_i) \]
The probability measure in the previous exercise is called the probability distribution of \(X\). Thus, any random variable \(X\) for an experiment defines a new probability space:
Moreover, recall that the outcome of the experiment itself can be thought of as a random variable. Specifically, if we take \(X\) to be the identity function on \(S\), then \(X\) is a random variable and \[\P(X \in A) = \P(A)\] Thus, any probability measure can be thought of as the distribution of a random variable.
How can we construct probability measures? As noted briefly above, there are other measures of the size
of sets; in many cases, these can be converted into probability measures. First, a (nonnegative) measure \(\mu\) on \(S\) is a real-valued function defined on the collection of events \(\mathscr{S}\) that satisfies axioms (a) and (c) above. In general, \(\mu(A)\) is allowed to be infinite for a subset \(A\). However, if \(\mu(S)\) is positive and finite, then \(\mu\) can easily be re-scaled into a probability measure.
If \(\mu\) is a measure on \(S\) with \(0 \lt \mu(S) \lt \infty\) then \(\P\) defined below is a probability measure on \(S\). \[\P(A) = \frac{\mu(A)}{\mu(S)}, \quad A \subseteq S\]
Clearly \(\P(A) \ge 0\) since \(\mu(A) \ge 0\) and \(0 \lt \mu(S) \lt \infty\). If \(\{A_i: i \in I\}\) is a countable collection of disjoint events then \[ \P\left(\bigcup_{i \in I} A_i \right) = \frac{1}{\mu(S)} \mu\left(\bigcup_{i \in I} A_i \right) = \frac{1}{\mu(S)} \sum_{i \in I} \mu(A_i) = \sum_{i \in I} \frac{\mu(A_i)}{\mu(S)} = \sum_{i \in I} \P(A_i) \] Trivially axiom (b) is satisifed since \(\P(S) = \mu(S) \big/ \mu(S) = 1\).
In this context, \(\mu(S)\) is called the normalizing constant. In the next two subsections, we consider some very important special cases.
Suppose that \( S \) is a finite set. Recall that \( \#(A) \) denotes the number of elements in \( A \subseteq S \), and that \( \# \) is counting measure on \( S \). The probability measure corresponding to \( \# \) is particularly important in combinatorial and sampling experiments.
Suppose that \(S\) is a finite, nonempty set. The discrete uniform distribution on \(S\) is given by \[\P(A) = \frac{\#(A)}{\#(S)}, \quad A \subseteq S\]
The underlying model is refereed to as the classical probability model, because historically the very first problems in probability (involving coins and dice) fit this model. More generally, suppose that the sample space \(S\) is countable. If \(\P\) is a probability measure on \(S\), then clearly from the countable additivity axiom (c), \(\P\) is completely determined by its values on the singleton events. Specifically, if we define \(f(x) = \P\left(\{x\}\right)\) for \(x \in S\), then \(\P(A) = \sum_{x \in A} f(x)\) for every \(A \subseteq S\). By axiom (a), \(f(x) \ge 0\) for \(x \in S\) and by axiom (b), \(\sum_{x \in S} f(x) = 1\). Conversely, we can give a general construction for defining a probability measure on \(S\).
Suppose that \( g: S \to [0, \infty) \). Then \(\mu\) defined below is a measure on \(S\): \[\mu(A) = \sum_{x \in A} g(x), \quad A \subseteq S\]
Trivially \(\mu(A) \ge 0\) for \(A \subseteq S\) since \(g\) is nonnegative. The countable additivity property holds since the terms in a sum of nonnegative numbers can be rearranged in any way without altering the sum. Thus let \(\{A_i: i \in I\}\) be a countable collection of disjoint subsets of \(S\), and let \(A = \bigcup_{i \in I} A_i\) then \[ \mu(A) = \sum_{x \in A} g(x) = \sum_{i \in I} \sum_{x \in A_i} g(x) = \sum_{i \in I} \mu(A_i) \]
Thus, if \(0 \lt \mu(S) \lt \infty\) then \[ \P(A) = \frac{\mu(A)}{\mu(S)} = \frac{\sum_{x \in A} g(x)}{\sum_{x \in S} g(x)}, \quad A \subseteq S\] defines a probability measure by the result above. In the context of our previous remarks, \(f(x) = g(x) \big/ \mu(S) = g(x) \big/ \sum_{y \in S} g(y)\) for \( x \in S \). Distributions of this type are said to be discrete. Discrete distributions are studied in detail in the chapter on Distributions.
In the setting of previous result, if \(S\) is finite and \(g\) is a constant function, then the corresponding probability measure \(\P\) is the discrete uniform distribution on \(S\).
Suppose that \(g(x) = c\) for \(x \in S\) where \(c \gt 0\). Then \(\mu(A) = c \#(A) \) and hence \(\P(A) = \mu(A) \big/ \mu(S) = \#(A) \big/ \#(S)\).
We define the standard \(n\)-dimensional measure \(\lambda_n\) on \(\R^n\) (called Lebesgue measure, in honor of Henri Lebesgue) by \[\lambda_n(A) = \int_A 1 d \bs{x}\] Technically, the integral is more general than the one defined in calculus, however the standard calculus integral will suffice for our discussion in this section. If you are interested in the advanced treatment, see the section on measure theory in this chapter, and the section on the abstract integral in the chapter on Distributions. Note that if \(n \gt 1\), the integral above is a multiple integral; \(\bs{x} = (x_1, x_2, \ldots, x_n)\) and \(d \bs{x} = dx_1 dx_2 \cdots dx_n\). The countable additivity axiom holds because of a basic property of integrals. In particular, note from calculus that
When \(n \gt 3\), \(\lambda_n(A)\) is sometimes called the \(n\)-dimensional volume of \(A\). The probability measure associated with \( \lambda_n \) on a set with positive, finite \( n \)-dimensional volume is particularly important.
Suppose that \(S \subseteq \R^n\) with \(0 \lt \lambda_n(S) \lt \infty\). The continuous uniform distribution on \( S \) is defined by \[\P(A) = \frac{\lambda_n(A)}{\lambda_n(S)}, \quad A \subseteq S\]
Note that the continuous uniform distribution is analogous to the discrete uniform distribution defined above, but with Lebesgue measure \( \lambda_n \) replacing counting measure \( \# \). We can generalize this construction to produce many other distributions.
Suppose that \( S \subseteq \R^n \) and that \( g: S \to [0, \infty) \). Then \( \mu \) defined below is a measure on \( S \): \[\mu(A) = \int_A g(x) \, dx, \quad A \subseteq S\]
Thus if \(0 \lt \mu(S) \lt \infty\), then \[\P(A) = \frac{\mu(A)}{\mu(S)} = \frac{\int_A g(x) \, dx}{\int_S g(x) \, dx}, \quad A \subseteq S\] defines a probability measure by the result above. Distributions of this type are said to be continuous. Continuous distributions are studied in detail in the chapter on Distributions. Note that the continuous distribution above is analogous to the discrete distribution defined earlier, but with integrals replacing sums.
It is important to note again that, unlike many other areas of mathematics, the low-dimensional spaces (\(n = 1, 2, 3\)) do not play a special role in probability, except for exposition. For example in the Cicada data, some of the variables recorded are body weight, body length, wing width, and wing length. A probability model for these variables would specify a distribution on a subset of \(\R^4\).
Suppose that we have a random experiment with sample space \(S\) and probability measure \(\P\). In the following results, \(A\) and \(B\) are events. These results follow easily from the axioms of probability, so be sure to try the proofs yourself before reading the ones in the text.
\(\P(A^c) = 1 - \P(A)\). This is known as the complement rule.
Note that \(A\) and \(A^c\) are disjoint and \(S = A \cup A^c\). Hence \(1 = \P(S) = \P(A) + \P(A^c)\).
\(\P(\emptyset) = 0\).
This follows from the the complement rule applied to \(A = S\).
\(\P(B \setminus A) = \P(B) - \P(A \cap B)\). This is known as the difference rule.
Note that \(A \cap B\) and \(B \setminus A\) are disjoint and \(B = (A \cap B) \cup (B \setminus A)\). Hence \(\P(B) = \P(A \cap B) + \P(B \setminus A)\).
If \(A \subseteq B\) then \(\P(B \setminus A) = \P(B) - \P(A)\).
This result is a corollary of the difference rule. Note that \(A \cap B = A\).
If \(A \subseteq B\) then \(\P(A) \le \P(B)\).
This result is a corollary of the previous result. Note that \( \P(B \setminus A) \ge 0 \) and hence \( \P(B) - \P(A) \ge 0 \).
Thus, \(\P\) is an increasing function, relative to the subset partial order on the collection of events, and the ordinary order on \(\R\). In particular, it follows that \(\P(A) \le 1\) for any event \(A\).
Suppose that \(A \subseteq B\).
The next result is known as Boole's inequality, named after George Boole. The inequality gives a simple upper bound on the probability of a union.
If \(\{A_i: i \in I\}\) is a countable collection of events then \[\P\left( \bigcup_{i \in I} A_i \right) \le \sum_{i \in I} \P(A_i)\]
Without loss of generality, we can assume that \(I = \{1, 2, \ldots\}\). Define \(B_1 = A_1\) and \(B_n = A_n \setminus \left(\bigcup_{i=1}^{n-1} A_i\right) \) for \(n \in \{2, 3, \ldots\}\). Note that \(\{B_1, B_2, \ldots\}\) is a pairwise disjoint collection and has the same union as \(\{A_1, A_2, \ldots\}\). Note also that \( B_i \subseteq A_i \) for each \( i \). Thus, from the additivity axiom and the increasing property, \[ \P\left(\bigcup_{i=1}^\infty A_i\right) = \P\left(\bigcup_{i=1}^\infty B_i\right) = \sum_{i=1}^\infty \P(B_i) \le \sum_{i=1}^\infty \P(A_i) \]
Intuitively, Boole's inequality holds because parts of the union have been measured more than once in the sum of the probabilities on the right. Of course, the sum of the probabilities may be greater than 1, in which case Boole's inequality is not helpful. The following result is a simple consequence of Boole's inequality.
If \(\{A_i: i \in I\}\) is a countable collection of events with \(\P(A_i) = 0\) for each \(i \in I\), then \[\P\left( \bigcup_{i \in I} A_i \right) = 0\]
An event \(A\) with \(\P(A) = 0\) is said to be null. Thus, a countable union of null events is still a null event.
The next result is known as Bonferroni's inequality, named after Carlo Bonferroni. The inequality gives a simple lower bound for the probability of an intersection.
If \(\{A_i: i \in I\}\) is a countable collection of events then \[\P\left( \bigcap_{i \in I} A_i \right) \ge 1 - \sum_{i \in I}\left[1 - \P(A_i)\right]\]
By De Morgan's law, \( \left(\bigcap_{i \in I} A_i\right)^c = \bigcup_{i \in I} A_i^c \). Hence by Boole's inequality, \[ \P\left[\left(\bigcap_{i \in I} A_i\right)^c\right] \le \sum_{i \in I} \P(A_i^c) = \sum_{i \in I} \left[1 - \P(A_i)\right] \] Using the complement rule again gives Bonferroni's inequality.
Of course, the lower bound in Bonferroni's inequality may be less than or equal to 0, in which case it's not helpful. The following result is a simple consequence of Bonferroni's inequality.
If \(\{A_i: i \in I\}\) is a countable collection of events with \(\P(A_i) = 1\) for each \(i \in I\), then \[\P\left( \bigcap_{i \in I} A_i \right) = 1\]
An event \(A\) with \(\P(A) = 1\) is sometimes called almost sure or almost certain. Thus, a countable intersection of almost sure events is still almost sure.
Suppose that \(A\) and \(B\) are events in an experiment.
Suppose that \(\{A_i: i \in I\}\) is a countable collection of events that partition the sample space \(S\). Recall that this means that the events are disjoint and their union is \(S\). For any event \(B\), \[\P(B) = \sum_{i \in I} \P(A_i \cap B)\]
This follows from the countable additivity axiom, since \(\{A_i \cap B: i \in I\}\) is a partition of \(B\). That is, these events are disjoint and their union is \(B\).
Naturally, this result is useful when the probabilities of the intersections are known. Partitions usually arise in connection with a random variable. Suppose that \(X\) is a random variable taking values in a countable set \(T\), and that \(B\) is an event. Then \[\P(B) = \sum_{x \in T} \P(X = x, B)\] In this formula, note that the comma acts like the intersection symbol in the previous formula.
The inclusion-exclusion formulas provide a method for computing the probability of a union of events in terms of the probabilities of the intersections of the events. The formula is useful because often the probabilities of the intersections are easier to compute.
\(\P(A \cup B) = \P(A) + \P(B) - \P(A \cap B)\) for any events \(A\) and \(B\).
Note first that \(A \cup B = A \cup (B \setminus A)\) and the latter two sets are disjoint. From the additivity axiom and the difference rule, \[ \P(A \cup B) = \P\left[A \cup (B \setminus A)\right] = \P(A) + \P(B \setminus A) = \P(A) + \P(B) - \P(A \cap B) \]
\(\P(A \cup B \cup C) = \P(A) + \P(B) + \P(C) - \P(A \cap B) - \P(A \cap C) - \P(B \cap C) + \P(A \cap B \cap C)\) for any events \(A\), \(B\), and \(C\).
First note that \( A \cup B \cup C = (A \cup B) \cup [C \setminus (A \cup B)] \). The event in parentheses and the event in square brackets are disjoint. Thus, using the additivity axiom and the difference rule, \[ \P(A \cup B \cup C) = \P(A \cup B) + \P(C) - \P\left[C \cap (A \cup B)\right] = \P(A \cup B) + \P(C) - \P\left[(C \cap A) \cup (C \cap B)\right] \] Using the inclusion-exclusion rule for two events (twice) we have \[ \P(A \cup B \cup C) = \P(A) + \P(B) - \P(A \cap B) + \P(C) - \left[\P(C \cap A) + \P(C \cap B) - \P(A \cap B \cap C)\right] \]
Note that \( A \cup B \cup C \) is the union of the 7 disjoint, colored events in the picture below (not counting the gray event). We will argue that each of these events is measured precisely once in the inclusion-exclusion formula.
The last two results can be generalized to a union of \(n\) events. For the remainder of this subsection, suppose that \( \{A_i: i \in I\} \) is a collection of events where \( I \) is an index set with \( \#(I) = n \).
The general inclusion-exclusion formula. \[\P\left( \bigcup_{i \in I} A_i \right) = \sum_{k = 1}^n (-1)^{k - 1} \sum_{J \subseteq I, \; \#(J) = k} \P\left( \bigcap_{j \in J} A_j \right)\]
The proof is by induction on \(n\). We have already established the formula for \( n = 2 \) and \( n = 3 \). Thus, suppose that the inclusion-exclusion formula holds for a given \( n \), and suppose that \( (A_1, A_2, \ldots, A_{n+1}) \) is a sequence of \( n + 1 \) events. Then \[ \bigcup_{i=1}^{n + 1} A_i = \left(\bigcup_{i=1}^n A_i \right) \cup \left[ A_{n+1} \setminus \left(\bigcup_{i=1}^n A_i\right) \right] \] As before, the event in parentheses and the event in square brackets are disjoint. Thus using the additivity axiom, the difference rule, and the distributive rule we have \[ \P\left(\bigcup_{i=1}^{n+1} A_i\right) = \P\left(\bigcup_{i=1}^n A_i\right) + \P(A_{n+1}) - \P\left(\bigcup_{i=1}^n (A_{n+1} \cap A_i) \right) \] By the induction hypothesis, the inclusion-exclusion formula holds for each union of \( n \) events on the right. Applying the formula and simplifying gives the inclusion-exclusion formula for \( n + 1 \) events.
This is the general version of the same argument we used above for 3 events. \( \bigcup_{i \in I} A_i \) is the union of the disjoint events of the form \( \left(\bigcap_{i \in K} A_i\right) \cap \left(\bigcap_{i \in K^c} A_i\right)\) where \( K \) is a nonempty subset of the index set \( I \). In the inclusion-exclusion formula, the event corresponding to a given \( K \) is measured in \( \P\left(\bigcap_{j \in J} A_j\right) \) for every nonempty \( J \subseteq K \). Suppose that \( \#(K) = k \). Accounting for the positive and negative signs, the net measurement is \( \sum_{j=1}^k (-1)^{j-1} \binom{k}{j} = 1 \).
The general inclusion-exclusion formula is not worth remembering in detail, but only in pattern. We start with the sum of the probabilities of the events, then subtract the probabilities of all of the paired intersections, then add the probabilities of the third-order intersections, and so forth, alternating signs, until we get to the probability of the intersection of all of the events.
The general Bonferroni inequalities state that if sum on the right in the general inclusion-exclusion formula is truncated, then the truncated sum is an upper bound or a lower bound for the probability of the union, depending on whether the last term has a positive or negative sign.
Suppose that \( m \in \{1, 2, \ldots, n - 1\} \). Then
Let \( P_k = \sum_{J \subseteq I, \; \#(J) = k} \P\left( \bigcap_{j \in J} A_j \right) \), the absolute value of the \( k \)th term in the inclusion-exclusion formula. The result follows since the inclusion-exclusion formula is an alternating series, and \( P_k \) is decreasing in \( k \).
More elegant proofs of the inclusion-exclusion formula and the Bonferroni inequalities can be constructed using expected value.
Note that there is a term in the inclusion-exclusion formula for every nonempty subset \( J \) of the index set \( I \), namely \( \P\left(\bigcap_{j \in J} A_j\right) \) with a positive or negative sign, and hence there are \( 2^n - 1 \) terms. The probabilities of these intersections suffice to compute the probability of any event that can be constructed from the given events, not just the union.
The probability of any event that can be constructed from \( \{A_i: i \in I\} \) can be computed from the \( 2^n - 1 \) probabilities of the form \( \P\left(\bigcap_{j \in J} A_j\right) \) where \( J \) is a nonempty subset of \( I \).
If you go back and look at your proofs of the rules of probability above, you will see that they hold for any finite measure \(\mu\), not just probability. The only change is that the number 1 is replaced by \(\mu(S)\). In particular, the inclusion-exclusion rule is as important in combinatorics (the study of counting measure) as it is in probability.
Suppose that \(A\) and \(B\) are events in an experiment with \(\P(A) = \frac{1}{3}\), \(\P(B) = \frac{1}{4}\), \(\P(A \cap B) = \frac{1}{10}\). Express each of the following events in the language of the experiment and find its probability:
Suppose that \(A\), \(B\), and \(C\) are events in an experiment with \(\P(A) = 0.3\), \(\P(B) = 0.2\), \(\P(C) = 0.4\), \(\P(A \cap B) = 0.04\), \(\P(A \cap C) = 0.1\), \(\P(B \cap C) = 0.1\), \(\P(A \cap B \cap C) = 0.01\). Express each of the following events in set notation and find its probability:
Suppose that \(A\) and \(B\) are events in an experiment with \(\P(A \setminus B) = \frac{1}{6}\), \(\P(B \setminus A) = \frac{1}{4}\), and \(\P(A \cap B) = \frac{1}{12}\). Find the probability of each of the following events:
Suppose that \(A\) and \(B\) are events in an experiment with \(\P(A) = \frac{2}{5}\), \(\P(A \cup B) = \frac{7}{10}\), and \(\P(A \cap B) = \frac{1}{6}\). Find the probability of each of the following events:
Suppose that \(A\), \(B\), and \(C\) are events in an experiment with \(\P(A) = \frac{1}{3}\), \(\P(B) = \frac{1}{4}\), \(\P(C) = \frac{1}{5}\).
Open the simple probability experiment.
Suppose that \( A \), \( B \), and \( C \) are events in a random experiment with \( \P(A) = 1/4 \), \( \P(B) = 1/3 \), \( \P(C) = 1/6 \), \( \P(A \cap B) = 1/18 \), \( \P(A \cap C) = 1/16 \), \( \P(B \cap C) = 1/12 \), and \( \P(A \cap B \cap C) = 1/24 \). Find the probabilities of the various unions:
Suppose that \( A \), \( B \), and \( C \) are events in a random experiment with \( \P(A) = 1/4 \), \( \P(B) = 1/4 \), \( \P(C) = 5/16 \), \( \P(A \cup B) = 7/16 \), \( \P(A \cup C) = 23/48 \), \( \P(B \cup C) = 11/24 \), and \( \P(A \cup B \cup C) = 7/12 \). Find the probabilities of the various intersections:
Suppose that \( A \), \( B \), and \( C \) are events in a random experiment. Explicitly give all of the Bonferroni inequalities for \( \P(A \cup B \cup C) \)
Consider the random experiment of tossing a coin \(n\) times and recording the sequence of scores \(\bs{X} = (X_1, X_2, \ldots, X_n)\) (where 1 denotes heads and 0 denotes tails). This experiment is a generic example of \(n\) Bernoulli trials, named for Jacob Bernoulli. Note that the sample space is \(\{0, 1\}^n\), the set of bit strings of length \(n\). If the coin is fair, then presumably, by the very meaning of fair, we have no reason to prefer one point in \(S\) over another. Thus, as a modeling assumption, it seems reasonable to give \(S\) the uniform probability distribution in which all outcomes are equally likely.
Suppose that a fair coin is tossed 3 times and the sequence of coin scores is recorded. Let \(A\) be the event that the first coin is heads and \(B\) the event that there are exactly 2 heads. Give each of the following events in list form, and then compute the probability of the event:
In the Coin experiment, select 3 coins. Run the experiment 1000 times, updating after every run, and compute the empirical probability of each event in the previous exercise.
Suppose that a fair coin is tossed 4 times and the sequence of scores is recorded. Let \(Y\) denote the number of heads. Give the event \(\{Y = k\}\) (as a subset of the sample space) in list form, for each \(k \in \{0, 1, 2, 3, 4\}\), and then give the probability of the event.
Suppose that a fair coin is tossed \(n\) times and the sequence of scores is recorded. Let \(Y\) denote the number of heads.
\[\P(Y = k) = \binom{n}{k} \left( \frac{1}{2} \right)^n, \quad k \in \{0, 1, \ldots, n\}\]The number of bit strings of length \(n\) is \(2^n\), and since the coin is fair, these are equally likely. The number of bit strings of length \(n\) with exactly \(k\) 1's is \(\binom{n}{k}\). Hence the probability of exactly \(k\) 1s is \(\binom{n}{k} / 2^n\).
The distribution of \(Y\) in the last exercise is a special case of the binomial distribution. The binomial distribution is studied in more detail in the chapter on Bernoulli Trials.
Consider the experiment of throwing \(n\) distinct, \(k\)-sided dice (with sides numbered from 1 to \(k\)) and recording the sequence of scores \(\bs{X} = (X_1, X_2, \ldots, X_n)\). We can record the outcome as a sequence because of the assumption that the dice are distinct; you can think of the dice as somehow labeled from 1 to \(n\). The special case \(k = 6\) corresponds to standard dice. In general, note that the sample space is \(S = \{1, 2, \ldots, k\}^n\). If the dice are fair, then again, by the very meaning of fair, we have no reason to prefer one point in the sample space over another, so as a modeling assumption it seems reasonable to give \(S\) the uniform probability distribution.
Suppose that two fair, standard dice are thrown and the sequence of scores recorded. Let \(A\) denote the event that the first die score is less than 3 and \(B\) the event that the sum of the dice scores is 6. Give each of the following events in list form and then find the probability of the event.
In the dice experiment, set \(n = 2\). Run the experiment 100 times and compute the empirical probability of each event in the previous exercise.
Consider again the dice experiment with \(n = 2\) fair dice. Let \(Y\) denote the sum of the scores, \(U\) the minimum score, and \(V\) the maximum score.
\(y\) | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
\(\P(Y = y)\) | \(\frac{1}{36}\) | \(\frac{2}{36}\) | \(\frac{3}{36}\) | \(\frac{4}{36}\) | \(\frac{5}{36}\) | \(\frac{6}{36}\) | \(\frac{5}{36}\) | \(\frac{4}{36}\) | \(\frac{3}{36}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) |
\(u\) | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
\(\P(U = u)\) | \(\frac{11}{36}\) | \(\frac{9}{36}\) | \(\frac{7}{36}\) | \(\frac{5}{36}\) | \(\frac{3}{36}\) | \(\frac{1}{36}\) |
\(v\) | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
\(\P(V = v)\) | \(\frac{1}{36}\) | \(\frac{3}{36}\) | \(\frac{5}{36}\) | \(\frac{6}{36}\) | \(\frac{7}{36}\) | \(\frac{11}{36}\) |
In the previous exercise, note that \((U, V)\) could serve as the outcome vector for the experiment of rolling two standard, fair dice if we do not bother to distinguish the dice (so that we might as well record the smaller score first and then the larger score). Note that this random vector does not have a uniform distribution. On the other hand, we might have chosen at the beginning to just record the unordered set of scores and, as a modeling assumption, imposed the uniform distribution on the corresponding sample space. Both models cannot be right, so which model (if either) describes real dice in the real world? It turns out that for real (fair) dice, the ordered sequence of scores is uniformly distributed, so real dice behave as distinct objects, whether you can tell them apart or not. In the early history of probability, gamblers sometimes got the wrong answers for events involving dice because they mistakenly applied the uniform distribution to the sample space of unordered scores. It's an important moral. If we are to impose the uniform distribution on a sample space, we need to make sure that it's the right sample space.
A pair of fair, standard dice are thrown repeatedly until the sum of the scores is either 5 or 7. Let \(A\) denote the event that the sum of the scores on the last throw is 5 rather than 7. Events of this type are important in the game of craps.
Let \(D_5 = \{(1,4), (2,3), (3,2), (4,1)\}\), \(D_7 = \{(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)\}\), \(D = D_5 \cup D_7\), \(C = \{1, 2, 3, 4, 5, 6\}^2 \setminus D\)
The previous problem shows the importance of defining the sample space appropriately. Sometimes a clever choice of the sample space can turn a difficult problem into an easy one.
Recall that many random experiments can be thought of as sampling experiments. For the general finite sampling model, we start with a population \(D\) with \(m\) (distinct) objects. We select a sample of \(n\) objects from the population, so that the sample space \(S\) is the set of possible samples. If we select a sample at random then the outcome \(\bs{X}\) (the random sample) is uniformly distributed on \(S\): \[\P(\bs{X} \in A) = \frac{\#(A)}{\#(S)}, \quad A \subseteq S\] Recall from the section on Combinatorial Structures that there are four common types of sampling from a finite population, based on the criteria of order and replacement.
If we sample with replacement, the sample size \(n\) can be any positive integer. If we sample without replacement, the sample size cannot exceed the population size, so we must have \(n \in \{1, 2, \ldots, m\}\).
The basic coin and dice experiments are examples of sampling with replacement. If we toss a fair coin \(n\) times and record the sequence of scores \(\bs{X}\) (where as usual, 0 denotes tails and 1 denotes heads), then \(\bs{X}\) is a random sample of size \(n\) chosen with order and with replacement from the population \(\{0, 1\}\). Thus, \(\bs{X}\) is uniformly distributed on \(\{0, 1\}^n\). If we throw \(n\) (distinct) standard fair dice and record the sequence of scores, then we generate a random sample \(\bs{X}\) of size \(n\) with order and with replacement from the population \(\{1, 2, 3, 4, 5, 6\}\). Thus, \(\bs{X}\) is uniformly distributed on \(\{1, 2, 3, 4, 5, 6\}^n\). Of an analogous result would hold for fair, \(k\)-sided dice.
Suppose that the sampling is without replacement (the most common case). If we record the ordered sample \(\bs{X} = (X_1, X_2, \ldots, X_n)\), then the unordered sample \(\bs{W} = \{X_1, X_2, \ldots\}\) is a random variable (that is, a function of \(\bs{X}\)). On the other hand, if we just record the unordered sample \(\bs{W}\) in the first place, then we cannot recover the ordered sample.
Suppose that \(\bs{X}\) is a random sample of size \(n\) chosen with order and without replacement from \(D\), so that \(\bs{X}\) is uniformly distributed on the space of permutations of size \(n\) from \(D\). Then \(\bs{W}\), the unordered sample, is uniformly distributed on the space of combinations of size \(n\) from \(D\). Thus, \(\bs{W}\) is also a random sample.
Let \(\bs{w}\) be a combination of size \(n\) from \(D\). Then there are \(n!\) permutations of the elements in \(\bs{w}\). If \(A\) denotes this set of permutations, then \(\P(\bs{W} = \bs{w}) = \P(\bs{X} \in A) = n! / m^{(n)} = 1 / \binom{m}{n}\).
The result in the last exercise does not hold if the sampling is with replacement (recall the exercise with two dice above and the discussion that follows). When sampling with replacement, there is no simple relationship between the number of ordered samples and the number of unordered samples.
Suppose again that we have a population \(D\) with \(m\) (distinct) objects, but suppose now that each object is one of two types--either type 1 or type 0. Such populations are said to be dichotomous. Here are some specific examples:
Suppose that the population \(D\) has \(r\) type 1 objects and hence \(m - r\) type 0 objects. Of course, we must have \(r \in \{0, 1, \ldots, m\}\). Now suppose that we select a sample of size \(n\) at random from the population. Note that this model has three parameters: the population size \(m\), the number of type 1 objects in the population \(r\), and the sample size \(n\). Let \(Y\) denote the number of type 1 objects in the sample.
If the sampling is without replacement then \[\P(Y = y) = \frac{\binom{r}{y} \binom{m - r}{n - y}}{\binom{m}{n}}, \quad y \in \{0, 1, \ldots, n\}\]
Recall that the unordered sample is uniformly distributed over the set of combinations of size \(n\) chosen from the population. There are \(\binom{m}{n}\) such samples. By the multiplication principle, the number of samples with exactly \(y\) type 1 objects and \(n - y\) type 0 objects is \(\binom{r}{y} \binom{m - r}{n - y}\).
In the previous exercise, random variable \(Y\) has the hypergeometric distribution with parameters \(m\), \(r\), and \(n\). The hypergeometric distribution is studied in more detail in the chapter on Findite Sampling Models.
If the sampling is with replacement then \[\P(Y = y) = \binom{n}{y} \frac{r^y (m - r)^{n-y}}{m^n} = \binom{n}{y} \left( \frac{r}{m}\right)^y \left( 1 - \frac{r}{m} \right)^{n - y}, \quad y \in \{0, 1, \ldots, n\}\]
Recall that the ordered sample is uniformly distributed over the set \(D^n\) and there are \(m^n\) elements in this set. To count the number of samples with exactly \(y\) type 1 objects, we use a three-step procedure: first, select the coordinates for the type 1 objects; there are \(\binom{n}{y}\) choices. Next select the \(y\) type 1 objects for these coordinates; there are \(r^y\) choices. Finally select the \(n - y\) type 0 objects for the remaining coordinates of the sample; there are \((m - r)^{n - y}\) choices. The result now follows from the multiplication principle.
In the last exercise, random variable \(Y\) has the binomial distribution with parameters \(n\) and \(p = \frac{r}{m}\). The binomial distribution is studied in more detail in the chapter on Bernoulli Trials.
Suppose that a group of voters consists of 40 democrats and 30 republicans. A sample 10 voters is chosen at random. Find the probability that the sample contains at least 4 democrats and at least 4 republicans, each of the following cases:
Look for other specialized sampling situations in the exercises below.
Drawing balls from an urn is a standard metaphor in probability for sampling from a finite population.
Consider an urn with 30 balls; 10 are red and 20 are green. A sample of 5 balls is chosen at random, without replacement. Let \(Y\) denote the number of red balls in the sample. Explicitly compute \(\P(Y = y)\) for each \(y \in \{0, 1, 2, 3, 4, 5\}\).
\(y\) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
\(\P(Y = y)\) | \(\frac{2584}{23751}\) | \(\frac{8075}{23751}\) | \(\frac{8550}{23751}\) | \(\frac{3800}{23751}\) | \(\frac{700}{23751/}\) | \(\frac{42}{23751}\) |
In the simulation of the ball and urn experiment, select 30 balls with 10 red and 20 green, sample size 5, and sampling without replacement. Run the experiment 1000 times and compare the empirical probabilities with the true probabilities that you computed in the previous exercise.
Consider again an urn with 30 balls; 10 are red and 20 are green. A sample of 5 balls is chosen at random, with replacement. Let \(Y\) denote the number of red balls in the sample. Explicitly compute \(\P(Y = y)\) for each \(y \in \{0, 1, 2, 3, 4, 5\}\).
\(y\) | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
\(\P(Y = y)\) | \(\frac{32}{243}\) | \(\frac{80}{243}\) | \(\frac{80}{243}\) | \(\frac{40}{243}\) | \(\frac{10}{243}\) | \(\frac{1}{243}\) |
In the simulation of the ball and urn experiment, select 30 balls with 10 red and 20 green, sample size 5, and sampling with replacement. Run the experiment 1000 times and compare the empirical probabilities with the true probabilities that you computed in the previous exercise.
An urn contains 15 balls: 6 are red, 5 are green, and 4 are blue. Three balls are chosen at random, without replacement.
Suppose again that an urn contains 15 balls: 6 are red, 5 are green, and 4 are blue. Three balls are chosen at random, with replacement.
Recall that a standard card deck can be modeled by the product set \[D = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k\} \times \{\clubsuit, \diamondsuit, \heartsuit, \spadesuit\}\] where the first coordinate encodes the denomination or kind (ace, 2-10, jack, queen, king) and where the second coordinate encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for example \(q \heartsuit\)).
Card games involve choosing a random sample without replacement from the deck \(D\), which plays the role of the population. Thus, the basic card experiment consists of dealing \(n\) cards from a standard deck without replacement; in this special context, the sample of cards is often referred to as a hand. Just as in the general sampling model, if we record the ordered hand \(\bs{X} = (X_1, X_2, \ldots, X_n)\), then the unordered hand \(\bs{W} = \{X_1, X_2, \ldots, X_n\}\) is a random variable (that is, a function of \(\bs{X}\)). On the other hand, if we just record the unordered hand \(\bs{W}\) in the first place, then we cannot recover the ordered hand. Finally, recall that \(n = 5\) is the poker experiment and \(n = 13\) is the bridge experiment. The game of poker is treated in more detail in the chapter on Games of Chance. By the way, it takes about 7 standard riffle shuffles to randomize a deck of cards.
Suppose that 2 cards are dealt from a well-shuffled deck and the sequence of cards is recorded. For \(i \in \{1, 2\}\), let \(H_i\) denote the event that card \(i\) is a heart. Find the probability of each of the following events.
Think about the results in the previous exercise, and suppose that we continue dealing cards. Note that in computing the probability of \(H_i\), you could conceptually reduce the experiment to dealing a single card. Note also that the probabilities do not depend on the order in which the cards are dealt. For example, the probability of an event involving the 1st, 2nd and 3rd cards is the same as the probability of the corresponding event involving the 25th, 17th, and 40th cards. Technically, the cards are exchangeable. Here's another way to think of this concept: Suppose that the cards are dealt onto a table in some pattern, but you are not allowed to view the process. Then no experiment that you can devise will give you any information about the order in which the cards were dealt.
In the card experiment, set \(n = 2\). Run the experiment 100 times and compute the empirical probability of each event in the previous exercise
In the poker experiment, find the probability of each of the following events:
Run the poker experiment 10000 times, updating every 10 runs. Compute the empirical probability of each event in the previous problem.
Find the probability that a bridge hand will contain no honor cards that is, no cards of denomination 10, jack, queen, king, or ace. Such a hand is called a Yarborough, in honor of the second Earl of Yarborough.
\(\frac{347\,373\,600}{635\,013\,559\,600} \approx 0.000547\)
Find the probability that a bridge hand will contain
A card hand that contains no cards in a particular suit is said to be void in that suit. Use the inclusion-exclusion rule to find the probability of each of the following events:
The following problem is known as the birthday problem, and is famous because the results are rather surprising at first.
Suppose that \(n\) persons are selected and their birthdays recorded (we will ignore leap years). Let \(A\) denote the event that the birthdays are distinct, so that \(A^c\) is the event that there is at least one duplication in the birthdays.
\(n\) | \(\P(A)\) | \(\P(A^c)\) |
---|---|---|
10 | 0.883 | 0.117 |
20 | 0.589 | 0.411 |
30 | 0.294 | 0.706 |
40 | 0.109 | 0.891 |
50 | 0.006 | 0.994 |
The small value of \(\P(A)\) for relatively small sample sizes \(n\) is striking, but is due mathematically to the fact that \(365^n\) grows much faster than \(365^{(n)}\) as \(n\) increases. The birthday problem is treated in much more generality in the chapter on Finite Sampling Models.
Suppose that 4 persons are selected and their birth months recorded. Assuming an appropriate uniform distribution, find the probability that the months are distinct.
\(\frac{11880}{20736} \approx 0.573\)
Recall that in Buffon's coin experiment, a coin with radius \(r \le \frac{1}{2}\) is tossed randomly
on a floor with square tiles of side length 1, and the coordinates \((X, Y)\) of the center of the coin are recorded, relative to axes through the center of the square in which the coin lands (with the axes parallel to the sides of the square, of course). Let \(A\) denote the event that the coin does not touch the sides of the square.
randomly, no region of \(S\) should be preferred over any other.
In Buffon's coin experiment, set \(r = 0.2\). Run the experiment 100 times and compute the empirical probability of each event in the previous exercise.
A point \((X, Y)\) is chosen at random in the circular region \(S \subset \R^2\) of radius 1, centered at the origin. Let \(A\) denote the event that the point is in the inscribed square region centered at the origin, with sides parallel to the coordinate axes, and let \(B\) denote the event that the point is in the inscribed square with vertices \((\pm 1, 0)\), \((0, \pm 1)\). Sketch each of the following events as a subset of \(S\), and find the probability of the event.
Suppose a point \((X, Y)\) is chosen at random in the circular region \(S \subseteq \R^2\) of radius 12, centered at the origin. Let \(R\) denote the distance from the origin to the point. Sketch each of the following events as a subset of \(S\), and compute the probability of the event. Is \(R\) uniformly distributed on the interval \([0, 12]\)?
No, \(R\) is not uniformly distributed on \([0, 12]\).
In the simple probability experiment, points are generated according to the uniform distribution on a rectangle. Move and resize the events \( A \) and \( B \) and note how the probabilities of the various events change. Create each of the following configurations. In each case, run the experiment 1000 times and compare the relative frequencies of the events to the probabilities of the events.
Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this subsection.
Recall first that the ABO blood type in humans is determined by three alleles: \(a\), \(b\), and \(o\). Furthermore, \(a\) and \(b\) are co-dominant and \(o\) is recessive. Suppose that the probability distribution for the set of blood genotypes in a certain population is given in the following table:
Genotype | \(aa\) | \(ab\) | \(ao\) | \(bb\) | \(bo\) | \(oo\) |
---|---|---|---|---|---|---|
Probability | 0.050 | 0.038 | 0.310 | 0.007 | 0.116 | 0.479 |
A person is chosen at random from the population. Let \(A\), \(B\), \(AB\), and \(O\) be the events that the person is type \(A\), type \(B\), type \(AB\), and type \(O\) respectively. Let \(H\) be the event that the person is homozygous and \(D\) the event that the person has an \(o\) allele. Find the probability of the following events:
Suppose next that pod color in certain type of pea plant is determined by a gene with two alleles: \(g\) for green and \(y\) for yellow, and that \(g\) is dominant.
Let \(G\) be the event that a child plant has green pods. Find \(\P(G)\) in each of the following cases:
Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let \(h\) denote the healthy allele and \(d\) the defective allele for the gene linked to the disorder. Recall that \(d\) is recessive for women.
Let \(B\) be the event that a son has the disorder, \(C\) the event that a daughter is a healthy carrier, and \(D\) the event that a daughter has the disease. Find \(\P(B)\), \(\P(C)\) and \(\P(D)\) in each of the following cases:
From this exercise, note that transmission of the disorder to a daughter can only occur if the mother is at least a carrier and the father has the disorder. In ordinary large populations, this is a unusual intersection of events, and thus sex-linked hereditary disorders are typically much less common in women than in men. In brief, women are protected by the extra X chromosome.
Suppose that \(T\) denotes the time between emissions (in milliseconds) for a certain type of radioactive material, and that \(T\) has the following probability distribution: \[\P(T \in A) = \int_A e^{-t} \, dt, \quad A \subseteq (0, \infty)\]
Suppose that \(N\) denotes the number of emissions in a one millisecond interval for a certain type of radioactive material, and that \(N\) has the following probability distribution: \[\P(N \in A) = \sum_{n \in A} \frac{e^{-1}}{n!}, \quad A \subseteq \N\]
The probability distribution defined in the (61) is a special case of the exponential distribution, while the probability distribution defined in (62) is a special case of the Poisson distribution, named for Simeon Poisson. The exponential distribution and the Poisson distribution are studied in more detail in the chapter on the Poisson process.
Suppose that at an absented-minded secretary prepares 4 letters and matching envelopes to send to 4 different persons, but then stuffs the letters into the envelopes randomly. Find the probability of the event \(A\) that at least one letter is in the proper envelope.
Note first that the sample space can be taken to be the set of permutations of \(\{1, 2, 3, 4\}\). For \(\bs{x} \in S\), \(x_i\) is the number of the envelope containing the \(i\)th letter. Clearly \(S\) should be given the uniform probability distribution. Next note that \(A = A_1 \cup A_2 \cup A_3 \cup A_4\) where \(A_i\) is the event that the \(i\)th letter is inserted into the \(i\)th envelope. Using the inclusion-exclusion rule gives \(\P(A) = \frac{5}{8}\).
This exercise is an example of the matching problem, originally formulated and studied by Pierre Remond Montmort. A complete analysis of the matching problem is given in the chapter on Finite Sampling Models.
In the simulation of the matching experiment select \(n = 4\). Run the experiment 1000 times and compute the relative frequency of the event that at least one match occurs.
For the M&M data set, let \(R\) denote the event that a bag has at least 10 red candies, \(T\) the event that a bag has at least 57 candies total, and \(W\) the event that a bag weighs at least 50 grams. Find the empirical probability the following events:
For the cicada data, let \(W\) denote the event that a cicada weighs at least 0.20 grams, \(F\) the event that a cicada is female, and \(T\) the event that a cicada is type tredecula. Find the empirical probability of each of the following: