\(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\) \(\newcommand{\bs}{\boldsymbol}\)
  1. Virtual Laboratories
  2. 1. Probability Spaces
  3. 1
  4. 2
  5. 3
  6. 4
  7. 5
  8. 6
  9. 7
  10. 8
  11. Answers

3. Probability Measure

Definitions and Interpretations

Suppose that we have a random experiment with sample space \(S\). Intuitively, the probability of an event is a measure of how likely the event is to occur when we run the experiment.

Axioms

Mathematically, a probability measure (or distribution) \(\P\) for a random experiment is a real-valued function, defined on the collection of events, and satisfying the following axioms:

  1. \(\P(A) \ge 0\) for every event \(A\).
  2. \(\P(S) = 1\).
  3. If \(\{A_i: i \in I\}\) is a countable, pairwise disjoint collection of events then \[\P\left(\bigcup_{i \in I} A_i\right) = \sum_{i \in I}\P(A_i)\]

Axiom 3 is known as countable additivity, and states that the probability of a union of a finite or countably infinite collection of disjoint events is the sum of the corresponding probabilities. The axioms are known as the Kolmogorov axioms, in honor of Andrei Kolmogorov.

Axioms 1 and 2 are really just a matter of convention; we choose to measure the probability of an event with a number between 0 and 1 (as opposed, say, to a number between \(-5\) and \(7\)). Axiom 3 however, is fundamental and inescapable. It is required for probability for precisely the same reason that it is required for other measures of the size of a set, such as

In all these cases, the size of a set that is composed of countably many disjoint pieces is the sum of the sizes of the pieces. For more on general measures, see the section on Measure Theory.

Union of disjoint events

On the other hand, uncountable additivity (the extension of axiom 3 to an uncountable index set \(I\)) is unreasonable for probability, just as it is for other measures. For example, an interval of positive length in \(\R\) is a union of uncountably many points, each of which has length 0.

We now have defined the three essential ingredients for the model a random experiment:

  1. A sample space \(S\)
  2. A collection of events \(\mathscr{S}\)
  3. A probability measure \(\P\)

Together these define a probability space \((S, \mathscr{S}, \P)\).

The Law of Large Numbers

Intuitively, the probability of an event is supposed to measure the long-term relative frequency of the event--in fact, this concept was taken as the definition of probability by Richard Von Mises. Specifically, suppose that we repeat the experiment indefinitely. (Note that this actually creates a new, compound experiment.) For an event \(A\) in the basic experiment, let \(N_n(A)\) denote the number of times \(A\) occurred (the frequency of \(A\)) in the first \(n\) runs. (Note that this is a random variable in the compound experiment.) Thus,

\[P_n(A) = \frac{N_n(A)}{n}\]

is the relative frequency of \(A\) in the first \(n\) runs (it is also a random variable in the compound experiment). If we have chosen the correct probability measure for the experiment, then in some sense we expect that the relative frequency of each event should converge to the probability of the event:

\[P_n(A) \to \P(A) \text{ as } n \to \infty\]

The precise statement of this is the law of large numbers or law of averages, one of the fundamental theorems in probability. To emphasize the point, note that in general there will be lots of possible probability measures for an experiment, in the sense of the axioms. However, only the true probability measure will satisfy the law of large numbers.

It follows that if we have the data from \(n\) runs of the experiment, the observed relative frequency \(P_n(A)\) can be used as an approximation for \(\P(A)\); this approximation is called the empirical probability of \(A\).

The function \(P_n\) satisfies the axioms of a probability measure (given the data from \(n\) runs of the experiment).

The Distribution of a Random Variable

Suppose that \(X\) is a random variable for the experiment, taking values in a set \(T\).

The function \(B \mapsto \P(X \in B)\) defines a probability measure on \(T\).

Proof:

Recall that the inverse image preserves all set operations.

Random variable

The probability measure in the previous exercise is called the probability distribution of \(X\). Thus, any random variable \(X\) for an experiment defines a new probability space:

  1. A set of outcomes \(T\) (a set that includes the possible values of \(X\))
  2. A collection of events \(\mathscr{T}\) (the admissible subsets of \(T\))
  3. A probability measure on these events (the probability distribution of \(X\))

Moreover, recall that the outcome of the experiment itself can be thought of as a random variable. Specifically, if we take \(X\) to be the identity function on \(S\), then \(X\) is a random variable and

\[\P(X \in A) = \P(A)\]

Thus, any probability measure can be thought of as the distribution of a random variable.

Constructions

Measures

How can we construct probability measures? As noted briefly above, there are other measures of the size of sets; in many cases, these can be converted into probability measures. First, a (nonnegative) measure \(\mu\) on \(S\) is a real-valued function defined on the collection of events \(\mathscr{S}\) that satisfies axioms 1 and 3 above. In general, \(\mu(A)\) is allowed to be infinite for a subset \(A\). However, if \(\mu(S)\) is positive and finite, then \(\mu\) can easily be re-scaled into a probability measure.

If \(\mu\) is a measure on \(S\) with \(0 \lt \mu(S) \lt \infty\) then \(\P\) defined below is a probability measure on \(S\).

\[\P(A) = \frac{\mu(A)}{\mu(S)}, \; A \subseteq S\]

In the context of Exercise 3, \(\mu(S)\) is called the normalizing constant. In the next two subsections, we consider some very important special cases.

Discrete Distributions

Suppose that \(S\) is a finite, nonempty set. Clearly, counting measure \(\#\) is a finite measure on \(S\):

\[\#(A) = \text{ the number of elements in } A, \; A \subseteq S \]

The corresponding probability measure is called the discrete uniform distribution on \(S\), and is particularly important in combinatorial and sampling experiments:

\[\P(A) = \frac{\#(A)}{\#(S)}, \; A \subseteq S\]

We can give a more general construction for countable sample spaces that can be used to define many probability measures.

Suppose that \(S\) is nonempty and countable, and that \( g: S \to [0, \infty) \). Then \(\mu\) defined below is a measure on \(S\):

\[\mu(A) = \sum_{x \in A} g(x), \; A \subseteq S\]

Thus, if \(0 \lt \mu(S) \lt \infty\) then \(\P(A) = \mu(A) / \mu(S)\) defines a probability measure by Exercise 3. Distributions of this type are said to be discrete. Discrete distributions are studied in detail in the chapter on Distributions.

In the setting of previous exercise, if \(S\) is finite and \(g\) is a constant function, then the corresponding probability measure \(\P\) is the discrete uniform distribution on \(S\).

Continuous Distributions

We define the standard \(n\)-dimensional measure \(\lambda_n\) on \(\R^n\) (called Lebesgue measure, in honor of Henri Lebesgue) by

\[\lambda_n(A) = \int_A 1 d \bs{x}\]

Technically, the integral is more general than the one defined in calculus. However, the standard calculus integral will suffice for most of this project. In particular, we assume that set \(A\) is nice enough that the integral exists; see the section on Measurability for more details. Note that if \(n \gt 1\), the integral above is a multiple integral; \(\bs{x} = (x_1, x_2, \ldots, x_n)\) and \(d \bs{x} = dx_1 dx_2 \cdots dx_n\). The countable additivity axiom holds because of a basic property of integrals. In particular, note from calculus that

  1. \(\lambda_1(A)\) is the length of \(A\) for \(A \subseteq \R\).
  2. \(\lambda_2(A)\) is the area of \(A\) for \(A \subseteq \R^2\).
  3. \(\lambda_3(A)\) is the volume of \(A\) for \(A \subseteq \R^3\).

Now, if \(S \subseteq \R^n\) with \(0 \lt \lambda_n(S) \lt \infty\), then

\[\P(A) = \frac{\lambda_n(A)}{\lambda_n(S)}, \; A \subseteq S\]

is a probability measure on \(S\) by Exercise 3, called the continuous uniform distribution on \(S\).

We can generalize this construction to produce many other distributions. Suppose that \( g: S \to [0, \infty) \). Define

\[\mu(A) = \int_A g(x) dx, \; A \subseteq S\]

Then \(\mu\) is a measure on \(S\). Thus if \(0 \lt \mu(S) \lt \infty\), then \(\P(A) = \mu(A) / \mu(S)\) defines a probability measure as in Exercise 3. Distributions of this type are said to be continuous. Continuous distributions are studied in detail in the chapter on Distributions.

It is important to note again that, unlike many other areas of mathematics, the low-dimensional spaces (\(n = 1, 2, 3\)) do not play a special role, except for exposition. For example in the Cicada data, some of the variables recorded are body weight, body length, wing width, and wing length. A probability model for these variables would specify a distribution on a subset of \(\R^4\).

Rules of Probability

Basic Rules

Suppose that we have a random experiment with sample space \(S\) and probability measure \(\P\). In the following exercises, \(A\) and \(B\) are events. The following results follow from the axioms of probability.

\(\P(A^c) = 1 - \P(A)\). This is known as the complement rule.

Proof:

Note that \(A\) and \(A^c\) are disjoint and their union is \(S\).

Event A and its complement

\(\P(\emptyset) = 0\).

Proof:

This follows from the the complement rule applied to \(A = S\).

\(\P(B \setminus A) = \P(B) - \P(A \cap B)\). This is known as the difference rule.

Proof:

Note that \(A \cap B\) and \(B \setminus A\) are disjoint and their union is \(B\).

Events A and B

If \(A \subseteq B\) then \(\P(B \setminus A) = \P(B) - \P(A)\).

Proof:

This result is a corollary of the difference rule. Note that \(A \cap B = A\).

If \(A \subseteq B\) then \(\P(A) \le \P(B)\).

Proof:

This result is a corollary of the previous result. Note that \( \P(B \setminus A) \ge 0 \) and hence \( \P(B) - \P(A) \ge 0 \).

Thus, \(\P\) is an increasing function, relative to the subset partial order on the collection of events, and the ordinary order on \(\R\). In particular, it follows that \(\P(A) \le 1\) for any event \(A\).

Event A is a subset of event B

Suppose that \(A \subseteq B\).

  1. If \(\P(B) = 0\) then \(\P(A) = 0\).
  2. If \(\P(A) = 1\) then \(\P(B) = 1\).

Boole's Inequality

The inequality in the next exercise is known as Boole's inequality, named after George Boole. The inequality gives a simple upper bound on the probability of a union.

If \(\{A_i: i \in I\}\) is a countable collection of events then

\[\P\left( \bigcup_{i \in I} A_i \right) \le \sum_{i \in I} \P(A_i)\]
Proof:

Without loss of generality, we can assume that \(I = \{1, 2, \ldots\}\). Define \(B_1 = A_1\) and \(B_n = A_n \setminus \left(\bigcup_{i=1}^{n-1} A_i\right) \) for \(n \in \{2, 3, \ldots\}\). Note that \(\{B_1, B_2, \ldots\}\) is a pairwise disjoint collection and has the same union as \(\{A_1, A_2, \ldots\}\). Note also that \( B_i \subseteq A_i \) for each \( i \). Thus, from the additivity axiom and the increasing property,

\[ \P\left(\bigcup_{i=1}^\infty A_i\right) = \P\left(\bigcup_{i=1}^\infty B_i\right) = \sum_{i=1}^\infty \P(B_i) \le \sum_{i=1}^\infty \P(A_i) \]
Boole's inequality

Intuitively, Boole's inequality holds because parts of the union have been measured more than once in the sum of the probabilities on the right. Of course, the sum of the probabilities may be greater than 1, in which case Boole's inequality is not helpful. The following result is a simple consequence of Boole's inequality.

If \(\{A_i: i \in I\}\) is a countable collection of events with \(\P(A_i) = 0\) for each \(i \in I\), then

\[\P\left( \bigcup_{i \in I} A_i \right) = 0\]

An event \(A\) with \(\P(A) = 0\) is said to be null. Thus, a countable union of null events is still a null event.

Bonferroni's Inequality

The inequality in the following exercise is known as Bonferroni's inequality, named after Carlo Bonferroni. The inequality gives a simple lower bound for the probability of an intersection.

If \(\{A_i: i \in I\}\) is a countable collection of events then

\[\P\left( \bigcap_{i \in I} A_i \right) \ge 1 - \sum_{i \in I}[1 - \P(A_i)]\]
Proof:

By De Morgan's law, \( \left(\bigcap_{i \in I} A_i\right)^c = \bigcup_{i \in I} A_i^c \). Hence by Boole's inequality,

\[ \P\left(\bigcap_{i \in I} A_i\right)^c \le \sum_{i \in I} \P(A_i^c) = \sum_{i \in I} [1 - \P(A_i)] \]

Using the complement rule again gives Bonferroni's inequality.

Of course, the lower bound in Bonferroni's inequality may be less than or equal to 0, in which case it's not helpful. The following result is a simple consequence of Bonferroni's inequality.

If \(\{A_i: i \in I\}\) is a countable collection of events with \(\P(A_i) = 1\) for each \(i \in I\), then

\[\P\left( \bigcap_{i \in I} A_i \right) = 1\]

An event \(A\) with \(\P(A) = 1\) is sometimes called almost sure or almost certain. Thus, a countable intersection of almost sure events is still almost sure.

Suppose that \(A\) and \(B\) are events in an experiment.

  1. If \(\P(A) = 0\), then \(\P(A \cup B) = \P(B)\).
  2. If \(\P(A) = 1\), then \(\P(A \cap B) = \P(B)\).

Computing Probabilities via a Partition

Suppose that \(\{A_i: i \in I\}\) is a countable collection of events that partition the sample space \(S\). For any event \(B\),

\[\P(B) = \sum_{i \in I} \P(A_i \cap B)\]
Image: Total.png

Naturally, this result is useful when the probabilities of the intersections are known. Partitions usually arise in connection with a random variable. Suppose that \(X\) is a random variable taking values in a countable set \(T\), and that \(B\) is an event. Then

\[\P(B) = \sum_{x \in T} \P(X = x, B)\]

In this formula, note that the comma acts like the intersection symbol in the previous formula.

The Inclusion-Exclusion Formula

The inclusion-exclusion formulas provide a method for computing the probability of a union of events in terms of the probabilities of the intersections of the events. The formula is useful because often the probabilities of the intersections are easier to compute.

\(\P(A \cup B) = \P(A) + \P(B) - \P(A \cap B)\) for any events \(A\) and \(B\).

Proof:

Note first that \(A \cup B = A \cup (B \setminus A)\) and the latter two sets are disjoint. Now use the additivity axiom and the difference rule.

Events A and B

\(\P(A \cup B \cup C) = \P(A) + \P(B) + \P(C) - \P(A \cap B) - \P(A \cap C) - \P(B \cap C) + \P(A \cap B \cap C)\) for any events \(A\), \(B\), and \(C\).

Proof:

First note that \( A \cup B \cup C = (A \cup B) \cup [C \setminus (A \cup B)] \). The event in parentheses and the event in square brackets are disjoint. Thus, using the additivity axiom and the difference rule,

\[ \P(A \cup B \cup C) = \P(A \cup B) + \P(C) - \P[C \cap (A \cup B)] = \P(A \cup B) + \P(C) - \P[(C \cap A) \cup (C \cap B)] \]

Using the inclusion-exclusion rule for two events (twice) we have

\[ \P(A \cup B \cup C) = \P(A) + \P(B) - \P(A \cap B) + \P(C) - [\P(C \cap A) + \P(C \cap B) - \P(A \cap B \cap C)] \]
Events A, B, and C

The last two exercises can be generalized to a union of \(n\) events; the generalization is known as the inclusion-exclusion formula

Suppose that \(A_i\) is an event for each \(i \in I\) where \(\#(I) = n\). Then

\[\P\left( \bigcup_{i \in I} A_i \right) = \sum_{k = 1}^n (-1)^{k - 1} \sum_{J \subseteq I, \; \#(J) = k} \P\left( \bigcap_{j \in J} A_j \right)\]
Proof:

The proof is by induction on \(n\). We have already established the formula for \( n = 2 \) and \( n = 3 \). Thus, suppose that the inclusion-exclusion formula holds for a given \( n \), and suppose that \( (A_1, A_2, \ldots, A_{n+1}) \) is a sequence of \( n + 1 \) events. Then

\[ \bigcup_{i=1}^{n + 1} A_i = \left(\bigcup_{i=1}^n A_i \right) \cup \left[ A_{n+1} \setminus \left(\bigcup_{i=1}^n A_i\right) \right] \]

As before, the event in parentheses and the event in square brackets are disjoint. Thus using the additivity axiom, the difference rule, and the distributive rule we have

\[ \P\left(\bigcup_{i=1}^{n+1} A_i\right) = \P\left(\bigcup_{i=1}^n A_i\right) + \P(A_{n+1}) - \P\left(\bigcup_{i=1}^n (A_{n+1} \cap A_i) \right) \]

By the induction hypothesis, the inclusion-exclusion formula holds for each union of \( n \) events on the right. Applying the formula and simplifying gives the inclusion-exclusion formula for \( n + 1 \) events.

The general Bonferroni inequalities state that if sum on the right in Exercise 20 is truncated after \(k\) terms (\(k \lt n\)) then the truncated sum is an upper bound for the probability of the union if \(k\) is odd (so that the last term has a positive sign) and is a lower bound for the probability of the union if \(k\) is even (so that the last terms has a negative sign).

If you go back and look at your proofs of the basic properties in Exercises 6-20, you will see that they hold for any finite measure \(\mu\), not just probability. The only change is that the number 1 is replaced by \(\mu(S)\). In particular, the inclusion-exclusion rule is as important in combinatorics (the study of counting measure) as it is in probability.

Examples and Applications

Basic Rules

Suppose that \(A\) and \(B\) are events in an experiment with \(\P(A) = \frac{1}{3}\), \(\P(B) = \frac{1}{4}\), \(\P(A \cap B) = \frac{1}{10}\). Express each of the following events in the language of the experiment and find its probability:

  1. \(A \setminus B\)
  2. \(A \cup B\)
  3. \(A^c \cup B^c\)
  4. \(A^c \cap B^c\)
  5. \(A \cup B^c\)

Suppose that \(A\), \(B\), and \(C\) are events in an experiment with \(\P(A) = 0.3\), \(\P(B) = 0.2\), \(\P(C) = 0.4\), \(\P(A \cap B) = 0.04\), \(\P(A \cap C) = 0.1\), \(\P(B \cap C) = 0.1\), \(\P(A \cap B \cap C) = 0.01\). Express each of the following events in set notation and find its probability:

  1. At least one of the three events occurs.
  2. None of the three events occurs.
  3. Exactly one of the three events occurs.
  4. Exactly two of the three events occur.

Suppose that \(A\) and \(B\) are events in an experiment with \(\P(A \setminus B) = \frac{1}{6}\), \(\P(B \setminus A) = \frac{1}{4}\), and \(\P(A \cap B) = \frac{1}{12}\). Find the probability of each of the following events:

  1. \(A\)
  2. \(B\)
  3. \(A \cup B\)
  4. \(A^c \cup B^c\)
  5. \(A^c \cap B^c\)

Suppose that \(A\) and \(B\) are events in an experiment with \(\P(A) = \frac{2}{5}\), \(\P(A \cup B) = \frac{7}{10}\), and \(\P(A \cap B) = \frac{1}{6}\). Find the probability of each of the following events:

  1. \(B\)
  2. \(A \setminus B\)
  3. \(B \setminus A\)
  4. \(A^c \cup B^c\)
  5. \(A^c \cap B^c\)

Suppose that \(A\), \(B\), and \(C\) are events in an experiment with \(\P(A) = \frac{1}{3}\), \(\P(B) = \frac{1}{4}\), \(\P(C) = \frac{1}{5}\).

  1. Use Boole's inequality to find an upper bound for \(\P(A \cup B \cup C)\).
  2. Use Bonferronis's inequality to find a lower bound for \(\P(A \cap B \cap C)\).

Coins

Consider the random experiment of tossing a coin \(n\) times and recording the sequence of scores \(\bs{X} = (X_1, X_2, \ldots, X_n)\) (where 1 denotes heads and 0 denotes tails). This experiment is a generic example of \(n\) Bernoulli trials, named for Jacob Bernoulli. Note that the sample space is \(\{0, 1\}^n\), the set of bit strings of length \(n\). If the coin is fair, then presumably, by the very meaning of fair, we have no reason to prefer one point in \(S\) over another. Thus, as a modeling assumption, it seems reasonable to give \(S\) the uniform probability distribution in which all outcomes are equally likely.

Suppose that a fair coin is tossed 3 times and the sequence of coin scores is recorded. Let \(A\) be the event that the first coin is heads and \(B\) the event that there are exactly 2 heads. Give each of the following events in list form, and then compute the probability of the event:

  1. \(A\)
  2. \(B\)
  3. \(A \cap B\)
  4. \(A \cup B\)
  5. \(A^c \cup B^c\)
  6. \(A^c \cap B^c\)
  7. \(A \cup B^c\)

In the Coin experiment, select 3 coins. Run the experiment 1000 times, updating after every run, and compute the empirical probability of each event in the previous exercise.

Suppose that a fair coin is tossed 4 times and the sequence of scores is recorded. Let \(Y\) denote the number of heads. Give the event \(\{Y = k\}\) (as a subset of the sample space) in list form, for each \(k \in \{0, 1, 2, 3, 4\}\), and then give the probability of the event.

Suppose that a fair coin is tossed \(n\) times and the sequence of scores is recorded. Let \(Y\) denote the number of heads.

\[\P(Y = k) = \binom{n}{k} \left( \frac{1}{2} \right)^n, \quad k \in \{0, 1, \ldots, n\}\]
Proof:

Recall that the number of bit strings of length \(n\) with exactly \(k\) 1's is \(\binom{n}{k}\).

The distribution of \(Y\) in the last exercise is a special case of the binomial distribution. The binomial distribution is studied in more detail in the chapter on Bernoulli Trials.

Dice

Consider the experiment of throwing \(n\) distinct, \(k\)-sided dice (with sides numbered from 1 to \(k\)) and recording the sequence of scores \(\bs{X} = (X_1, X_2, \ldots, X_n)\). We can record the outcome as a sequence because of the assumption that the dice are distinct; you can think of the dice as somehow labeled from 1 to \(n\). The special case \(k = 6\) corresponds to standard dice. In general, note that the sample space is \(S = \{1, 2, \ldots, k\}^n\). If the dice are fair, then again, by the very meaning of fair, we have no reason to prefer one point in the sample space over another, so as a modeling assumption it seems reasonable to give \(S\) the uniform probability distribution.

Suppose that two fair, standard dice are thrown and the sequence of scores recorded. Let \(A\) denote the event that the first die score is less than 3 and \(B\) the event that the sum of the dice scores is 6. Give each of the following events in list form and then find the probability of the event.

  1. \(A\)
  2. \(B\)
  3. \(A \cap B\)
  4. \(A \cup B\)
  5. \(B \setminus A\)

In the dice experiment, set \(n = 2\). Run the experiment 100 times and compute the empirical probability of each event in the previous exercise.

Consider again the dice experiment with \(n = 2\) fair dice. Let \(Y\) denote the sum of the scores, \(U\) the minimum score, and \(V\) the maximum score.

  1. Express \(Y\) as a function on the sample space and give the set of values.
  2. Find \(\P(Y = y)\) for each \(y\) in the set in part (a).
  3. Express \(U\) as a function on the sample space and give the set of values.
  4. Find \(\P(U = u)\) for each \(u\) in the set in part (c).
  5. Express \(V\) as a function on the sample space and give the set of values.
  6. Find \(\P(V = v)\) for each \(v\) in the set in part (e).
  7. Find the set of values of \((U, V)\).
  8. Find \(\P(U = u, V = v)\) for each \((u, v)\) in the set in part (g).

In the previous exercise, note that \((U, V)\) could serve as the outcome vector for the experiment of rolling two standard, fair dice if we do not bother to distinguish the dice (so that we might as well record the smaller score first and then the larger score). Note that this random vector does not have a uniform distribution. On the other hand, we might have chosen at the beginning to just record the unordered set of scores and, as a modeling assumption, imposed the uniform distribution on the corresponding sample space. Both models cannot be right, so which model (if either) describes real dice in the real world? It turns out that for real (fair) dice, the ordered sequence of scores is uniformly distributed, so real dice behave as distinct objects, whether you can tell them apart or not. In the early history of probability, gamblers sometimes got the wrong answers for events involving dice because they mistakenly applied the uniform distribution to the sample space of unordered scores. It's an important moral. If we are to impose the uniform distribution on a sample space, we need to make sure that it's the right sample space.

A pair of fair, standard dice are thrown repeatedly until the sum of the scores is either 5 or 7. Let \(A\) denote the event that the sum of the scores on the last throw is 5 rather than 7. Events of this type are important in the game of craps.

  1. Suppose that we record the pair of scores on each throw. Define an appropriate sample space.
  2. Represent \(A\) in terms of the sample space in (a).
  3. Compute the probability of \(A\) in the setting of parts (a) and (b).
  4. Now suppose that we just record the pair of scores on the last throw. Define an appropriate sample space.
  5. Argue that the sample space in part (d) should be given the uniform distribution.
  6. Compute the probability of \(A\) in the setting of parts (d) and (e).

The previous problem shows the importance of defining the sample space appropriately. Sometimes a clever choice of the sample space can turn a difficult problem into an easy one.

Sampling Models

Recall that many random experiments can be thought of as sampling experiments. For the general finite sampling model, we start with a population \(D\) with \(m\) (distinct) objects. We select a sample of \(n\) objects from the population, so that the sample space \(S\) is the set of possible samples. If we select a sample at random then the outcome \(\bs{X}\) (the random sample) is uniformly distributed on \(S\):

\[\P(\bs{X} \in A) = \frac{\#(A)}{\#(S)}, \quad A \subseteq S\]

Recall that there are four common types of sampling from a finite population, based on the criteria of order and replacement.

If we sample with replacement, the sample size \(n\) can be any positive integer. If we sample without replacement, the sample size cannot exceed the population size, so we must have \(n \in \{1, 2, \ldots, m\}\).

The basic coin and dice experiments are examples of sampling with replacement. If we toss a fair coin \(n\) times and record the sequence of scores \(\bs{X}\) (where as usual, 0 denotes tails and 1 denotes heads), then \(\bs{X}\) is a random sample of size \(n\) chosen with order and with replacement from the population \(\{0, 1\}\). Thus, \(\bs{X}\) is uniformly distributed on \(\{0, 1\}^n\). If we throw \(n\) (distinct) standard fair dice and record the sequence of scores, then we generate a random sample \(\bs{X}\) of size \(n\) with order and with replacement from the population \(\{1, 2, 3, 4, 5, 6\}\). Thus, \(\bs{X}\) is uniformly distributed on \(\{1, 2, 3, 4, 5, 6\}^n\). Of an analogous result would hold for fair, \(k\)-sided dice.

Suppose that the sampling is without replacement (the most common case). If we record the ordered sample \(\bs{X} = (X_1, X_2, \ldots, X_n)\), then the unordered sample \(\bs{W} = \{X_1, X_2, \ldots\}\) is a random variable (that is, a function of \(\bs{X}\)). On the other hand, if we just record the unordered sample \(\bs{W}\) in the first place, then we cannot recover the ordered sample.

Suppose that \(\bs{X}\) is a random sample of size \(n\) chosen with order and without replacement from \(D\), so that \(\bs{X}\) is uniformly distributed on the space of permutations of size \(n\) from \(D\). Then \(\bs{W}\), the unordered sample, is uniformly distributed on the space of combinations of size \(n\) from \(D\). Thus, \(\bs{W}\) is also a random sample.

Proof:

Note that for every unordered sample of size \(n\), there are \(n!\) ordered samples.

The result in the last exercise does not hold if the sampling is with replacement (recall Exercise 32 and the discussion that follows). When sampling with replacement, there is no simple relationship between the number of ordered samples and the number of unordered samples.

Sampling From a Dichotomous Population

Suppose again that we have a population \(D\) with \(m\) (distinct) objects, but suppose now that each object is one of two types--either type 1 or type 0. Such populations are said to be dichotomous. Here are some specific examples:

Suppose that the population \(D\) has \(r\) type 1 objects and hence \(m - r\) type 0 objects. Of course, we must have \(r \in \{0, 1, \ldots, m\}\). Now suppose that we select a sample of size \(n\) at random from the population. Note that this model has three parameters: the population size \(m\), the number of type 1 objects in the population \(r\), and the sample size \(n\). Let \(Y\) denote the number of type 1 objects in the sample.

If the sampling is without replacement then

\[\P(Y = y) = \frac{\binom{r}{y} \binom{m - r}{n - y}}{\binom{m}{n}}, \quad y \in \{0, 1, \ldots, n\}\]

In the previous exercise, random variable \(Y\) has the hypergeometric distribution with parameters \(m\), \(r\), and \(n\). The hypergeometric distribution is studied in more detail in the chapter on Findite Sampling Models.

If the sampling is with replacement then

\[\P(Y = y) = \binom{n}{y} \frac{r^y (m - r)^{n-y}}{m^n} = \binom{n}{y} \left( \frac{r}{m}\right)^y \left( 1 - \frac{r}{m} \right)^{n - y}, \quad y \in \{0, 1, \ldots, n\}\]

In the last exercise, random variable \(Y\) has the binomial distribution with parameters \(n\) and \(p = \frac{r}{m}\). The binomial distribution is studied in more detail in the chapter on Bernoulli Trials.

Suppose that a group of voters consists of 40 democrats and 30 republicans. A sample 10 voters is chosen at random. Find the probability that the sample contains at least 4 democrats and at least 4 republicans, each of the following cases:

  1. The sampling is without replacement.
  2. The sampling is with replacement.

Look for other specialized sampling situations in the exercises below.

Urn Models

Drawing balls from an urn is a standard metaphor in probability for sampling from a finite population.

Consider an urn with 30 balls; 10 are red and 20 are green. A sample of 5 balls is chosen at random, without replacement. Let \(Y\) denote the number of red balls in the sample. Explicitly compute \(\P(Y = y)\) for each \(y \in \{0, 1, 2, 3, 4, 5\}\).

In the simulation of the ball and urn experiment, select 30 balls with 10 red and 20 green, sample size 5, and sampling without replacement. Run the experiment 1000 times and compare the empirical probabilities with the true probabilities that you computed in the previous exercise.

Consider again an urn with 30 balls; 10 are red and 20 are green. A sample of 5 balls is chosen at random, with replacement. Let \(Y\) denote the number of red balls in the sample. Explicitly compute \(\P(Y = y)\) for each \(y \in \{0, 1, 2, 3, 4, 5\}\).

In the simulation of the ball and urn experiment, select 30 balls with 10 red and 20 green, sample size 5, and sampling with replacement. Run the experiment 1000 times and compare the empirical probabilities with the true probabilities that you computed in the previous exercise.

An urn contains 15 balls: 6 are red, 5 are green, and 4 are blue. Three balls are chosen at random, without replacement.

  1. Find the probability that the chosen balls are all the same color.
  2. Find the probability that the chosen balls are all different colors.

Suppose again that an urn contains 15 balls: 6 are red, 5 are green, and 4 are blue. Three balls are chosen at random, with replacement.

  1. Find the probability that the chosen balls are all the same color.
  2. Find the probability that the chosen balls are all different colors.

Cards

Recall that a standard card deck can be modeled by the product set

\[D = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k\} \times \{\clubsuit, \diamondsuit, \heartsuit, \spadesuit\}\]

where the first coordinate encodes the denomination or kind (ace, 2-10, jack, queen, king) and where the second coordinate encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for example \(q \heartsuit\)).

Card games involve choosing a random sample without replacement from the deck \(D\), which plays the role of the population. Thus, the basic card experiment consists of dealing \(n\) cards from a standard deck without replacement; in this special context, the sample of cards is often referred to as a hand. Just as in the general sampling model, if we record the ordered hand \(\bs{X} = (X_1, X_2, \ldots, X_n)\), then the unordered hand \(\bs{W} = \{X_1, X_2, \ldots, X_n\}\) is a random variable (that is, a function of \(\bs{X}\)). On the other hand, if we just record the unordered hand \(\bs{W}\) in the first place, then we cannot recover the ordered hand. Finally, recall that \(n = 5\) is the poker experiment and \(n = 13\) is the bridge experiment. The game of poker is treated in more detail in the chapter on Games of Chance. By the way, it takes about 7 standard riffle shuffles to randomize a deck of cards.

Suppose that 2 cards are dealt from a well-shuffled deck and the sequence of cards is recorded. For \(i \in \{1, 2\}\), let \(H_i\) denote the event that card \(i\) is a heart. Find the probability of each of the following events.

  1. \(H_1\)
  2. \(H_1 \cap H_2\)
  3. \(H_2 \setminus H_1\)
  4. \(H_2\)
  5. \(H_1 \setminus H_2\)
  6. \(H_1 \cup H_2\)

Think about the results in the previous exercise, and suppose that we continue dealing cards. Note that in computing the probability of \(H_i\), you could conceptually reduce the experiment to dealing a single card. Note also that the probabilities do not depend on the order in which the cards are dealt. For example, the probability of an event involving the 1st, 2nd and 3rd cards is the same as the probability of the corresponding event involving the 25th, 17th, and 40th cards. Technically, the cards are exchangeable. Here's another way to think of this concept: Suppose that the cards are dealt onto a table in some pattern, but you are not allowed to view the process. Then no experiment that you can devise will give you any information about the order in which the cards were dealt.

In the card experiment, set \(n = 2\). Run the experiment 100 times and compute the empirical probability of each event in the previous exercise

In the poker experiment, find the probability of each of the following events:

  1. The hand is a full house (3 cards of one kind and 2 cards of another kind).
  2. The hand has four of a kind (4 cards of one kind and 1 of another kind).
  3. The cards are all in the same suit (thus, the hand is either a flush or a straight flush).

Run the poker experiment 10000 times, updating every 10 runs. Compute the empirical probability of each event in the previous problem.

Find the probability that a bridge hand will contain no honor cards that is, no cards of denomination 10, jack, queen, king, or ace. Such a hand is called a Yarborough, in honor of the second Earl of Yarborough.

Find the probability that a bridge hand will contain

  1. Exactly 4 hearts.
  2. Exactly 4 hearts and 3 spades.
  3. Exactly 4 hearts, 3 spades, and 2 clubs.

A card hand that contains no cards in a particular suit is said to be void in that suit. Use the inclusion-exclusion rule to find the probability of each of the following events:

  1. A poker hand is void in at least one suit.
  2. A bridge hand is void in at least one suit.

Birthdays

The following problem is known as the birthday problem, and is famous because the results are rather surprising at first.

Suppose that \(n\) persons are selected and their birthdays recorded (we will ignore leap years). Let \(A\) denote the event that the birthdays are distinct, so that \(A^c\) is the event that there is at least one duplication in the birthdays.

  1. Note that the sample space is \(D^n\) where \(D\) is the set of days of the year.
  2. Note that \(\#(S) = 365^n\).
  3. Note that \(\#(A) = 365^{(n)}\).
  4. If we assume that \(S\) should be given the uniform distribution then \(\P(A) = 365^{(n)} / 365^n\),
  5. Compute the probability in (c) for each \(n \in \{10, 20, 30, 40, 50\}\).
  6. Comment on the assumption that \(S\) should be given the uniform distribution.

The small value of \(\P(A)\) for relatively small sample sizes \(n\) is striking, but is due mathematically to the fact that \(365^n\) grows much faster than \(365^{(n)}\) as \(n\) increases. The birthday problem is treated in much more generality in the chapter on Finite Sampling Models.

Suppose that 4 persons are selected and their birth months recorded. Assuming an appropriate uniform distribution, find the probability that the months are distinct.

Continuous Uniform Distributions

Recall that in Buffon's coin experiment, a coin with radius \(r \le \frac{1}{2}\) is tossed randomly on a floor with square tiles of side length 1, and the coordinates \((X, Y)\) of the center of the coin are recorded, relative to axes through the center of the square in which the coin lands (with the axes parallel to the sides of the square, of course). Let \(A\) denote the event that the coin does not touch the sides of the square.

  1. Define the sample space \(S\) mathematically and sketch \(S\).
  2. Argue that \((X, Y)\) is uniformly distributed on \(S\).
  3. Express \(A\) in terms of the outcome variables \((X, Y)\) and sketch \(A\).
  4. Find \(\P(A)\).
  5. Find \(\P(A^c)\).

In Buffon's coin experiment, set \(r = 0.2\). Run the experiment 100 times and compute the empirical probability of each event in the previous exercise.

Suppose that a dart board is a circular region \(S\) of radius 12 inches, and our experiment consists of throwing a dart at the target and recording the point \((X, Y)\) where the dart lands, relative to axes through the center of the board. Let's assume (unrealistically) that our outcome vector is uniformly distributed on \(S\). Let \(R\) denote the distance from the center of the board to the point where the dart lands. Sketch each of the following events and compute the probability of the event.

  1. \(\{R \le 3\}\)
  2. \(\{3 \lt R \le 6\}\)
  3. \(\{6 \lt R \le 9\}\)
  4. \(\{9 \lt R\le 12\}\)

Is \(R\) uniformly distributed on the interval \([0, 12]\)?

Genetics

First, let's consider an overly simplified model of an inherited trait that has two possible states, for example a pea plant whose pods are either green or yellow. A plant has two genes for the trait (one from each parent), so the possible genotypes are

The genotypes \(gg\) and \(yy\) are called homozygotes, while the genotype \(gy\) is called heterozygote. Typically, one of the states of the inherited trait is dominant and the other recessive. Thus, for example, if green is the dominant state for pod color, then a plant with genotype \(gg\) or \(gy\) has green pods, while a plant with genotype \(yy\) has yellow pods. Finally, the gene passed from a parent to a child is randomly selected from the parent's two genes. The inheritance of pea pod color was studied by Gregor Mendel, the father of modern genetics.

Let \(G\) be the event that a child plant has green pods. Find \(\P(G)\) in each of the following cases:

  1. At least one parent is type \(gg\).
  2. Both parents are type \(yy\).
  3. Both parents are type \(gy\).
  4. One parent is type \(yy\) and the other is type \(gy\).

A sex-linked hereditary disorder is a disorder due to a defect on the X chromosome (one of the two chromosomes that determine gender). Suppose that \(h\) denotes the healthy gene and \(d\) the defective gene linked to the disorder. Women have two X chromosomes, and \(d\) is recessive. Thus, a woman with genotype \(hh\) is completely normal with respect to the condition; a woman with genotype \(hd\) does not have the disorder, but is a carrier, since she can pass the defective gene to her children; and a woman with genotype \(dd\) has the disorder. A man has only one X chromosome (his other sex chromosome, the Y chromosome, typically plays no role in the disorder). A man with genotype \(h\) is normal and a man with genotype \(d\) has the disorder. Examples of sex-linked hereditary disorders are dichromatism, the most common form of color-blindness, and the most common form of hemophilia, a bleeding disorder. The following exercise explore the transmission of a sex-linked hereditary disorder.

Let \(B\) be the event that a son has the disorder, \(C\) the event that a daughter is a healthy carrier, and \(D\) the event that a daughter has the disease. Find \(\P(B)\), \(\P(C)\) and \(\P(D)\) in each of the following cases:

  1. The mother and father are normal.
  2. The mother is a healthy carrier and the father is normal.
  3. The mother is normal and the father has the disorder.
  4. The mother is a healthy carrier and the father has the disorder.
  5. The mother has the disorder and the father is normal.
  6. The mother and father both have the disorder.

From this exercise, note that transmission of the disorder to a daughter can only occur if the mother is at least a carrier and the father has the disorder. In ordinary large populations, this is a unusual intersection of events, and thus sex-linked hereditary disorders are typically much less common in women than in men. In brief, women are protected by the extra X chromosome.

Radioactive Emissions

Suppose that \(T\) denotes the time between emissions (in milliseconds) for a certain type of radioactive material, and that \(T\) has the following probability distribution:

\[\P(T \in A) = \int_A e^{-t} dt, \quad A \subseteq (0, \infty)\]
  1. Show that this really does define a probability distribution.
  2. Find \(\P(T \gt 3)\) .
  3. Find \(\P(2 \lt T \lt 4)\).

Suppose that \(N\) denotes the number of emissions in a one millisecond interval for a certain type of radioactive material, and that \(N\) has the following probability distribution:

\[\P(N \in A) = \sum_{n \in A} \frac{e^{-1}}{n!}, \quad A \subseteq \N\]
  1. Show that this really does define a probability distribution.
  2. Find \(\P(N \ge 3)\).
  3. Find \(\P(2 \le N \le 4)\).

The probability distribution defined in the Exercise 58 is a special case of the exponential distribution, while the probability distribution defined in Exercise 59 is a special case of the Poisson distribution, named for Simeon Poisson. The exponential distribution and the Poisson distribution are studied in more detail in the chapter on the Poisson process.

Matching

Suppose that at an absented-minded secretary prepares 4 letters and matching envelopes to send to 4 different persons, but then stuffs the letters into the envelopes randomly. Find the probability of the event \(A\) that at least one letter is in the proper envelope.

  1. Note first that the sample space can be taken to be the set of permutations of \(\{1, 2, 3, 4\}\). For \(\bs{x} \in S\), \(x_i\) is the number of the envelope containing the \(i\)th letter.
  2. Argue that \(S\) should be given the uniform probability distribution.
  3. Next note that \(A = A_1 \cup A_2 \cup A_3 \cup A_4\) where \(A_i\) is the event that the \(i\)th letter is inserted into the \(i\)th envelope.
  4. Now use the inclusion-exclusion rule.

Exercise 60 is an example of the matching problem, originally formulated and studied by Pierre Remond Montmort. A complete analysis of the matching problem is given in the chapter on Finite Sampling Models.

In the simulation of the matching experiment select \(n = 4\). Run the experiment 1000 times and compute the relative frequency of the event that at least one match occurs.

Data Analysis Exercises

For the M&M data set, let \(R\) denote the event that a bag has at least 10 red candies, \(T\) the event that a bag has at least 57 candies total, and \(W\) the event that a bag weighs at least 50 grams. Find the empirical probability the following events:

  1. \(R\)
  2. \(T\)
  3. \(W\)
  4. \(R \cap T\)
  5. \(T \setminus W\)

For the cicada data, let \(W\) denote the event that a cicada weighs at least 0.20 grams, \(F\) the event that a cicada is female, and \(T\) the event that a cicada is type tredecula. Find the empirical probability of each of the following:

  1. \(W\)
  2. \(F\)
  3. \(T\)
  4. \(W \cap F\)
  5. \(F \cup T \cup W\)

Equivalence

Our last topic on equivalent events and random variables is a bit more theoretical and can be omitted if you are a new student of probability. Intuitively, equivalent events or random variables are those that are indistinguishable from a probabilistic point of view. The purpose of this subsection is to make this idea precise.

Recall that the symmetric difference between events \( A \) and \( B \) in a random experiment is \( A \bigtriangleup B = (A \setminus B) \cup (B \setminus A) \). Events \(A\) and \(B\) are said to be equivalent if the probability of the symmetric difference is 0:

\[\P[(A \bigtriangleup A)] = \P(A \setminus B) + \P(B \setminus A) = 0\]

This definition really does define an equivalence relation on the collection of events of a random experiment. Thus, the collection of events is partitioned into disjoint classes of mutually equivalent events.

  1. \(A\) is equivalent to \(A\) for any event \(A\) (the reflexive property).
  2. If \(A\) is equivalent to \(B\) then \(B\) is equivalent to \(A\) (the symmetric property).
  3. If \(A\) is equivalent to \(B\) and \(B\) is equivalent to \(C\) then \(A\) is equivalent to \(C\) (the transitive property).
Proof:

The reflexive and symmetric properties are trivial. For the transitive property, suppose that \( A \) is equivalent to \( B \) and \( B \) is equivalent to \( C \). Note that \( A \setminus C \subseteq (A \setminus B) \cup (B \setminus C) \), and hence \( \P(A \setminus C) = 0 \). By a symmetric argument, \( \P(C \setminus A) = 0 \).

If \( A \) and \( B \) are equivalent then \( A^c \) and \( B^c \) are equivalent.

Proof:

Note that \( A^c \setminus B^c = B \setminus A \) and \( B^c \setminus A^c = A \setminus B \), so \( A^c \bigtriangleup B^c = A \bigtriangleup B \).

Equivalent events have the same probability: if \(A\) and \(B\) are equivalent then \(\P(A) = \P(B)\).

Proof:

Note again that \( A = (A \cap B) \cup (A \setminus B) \). If \( A \) and \( B \) are equivalent, then \( \P(A) = \P(A \cap B) \). By a symmetric argument, \( \P(B) = \P(A \cap B) \).

The converse fails with a passion. Consider the simple experiment of tossing a fair coin. The event that the coin lands heads and the event that the coin lands tails have the same probability, but are not equivalent.

However, the null and almost sure events do form equivalence classes.

  1. Suppose that \(A\) is an event with \(\P(A) = 0\). Then \(B\) is equivalent to \(A\) if and only if \(\P(B) = 0\).
  2. Suppose that \(A\) is an event with \(\P(A) = 1\). Then \(B\) is equivalent to \(A\) if and only if \(\P(B) = 1\).
Proof:

For part (a), Suppose that \( A \) is an event with \( \P(A) = 0 \) and \( B \) is equivalent to \( A \). Then \( \P(B) = 0 \) by Theorem 66. Conversely, note that \( A \setminus B \subseteq A \) and \( B \setminus A \subseteq A \) so if \( \P(A) = \P(B) = 0 \) then \( \P(A \bigtriangleup B) = 0 \) so \( A \) and \( B \) are equivalent. Part (b) follows from part (a) and Theorem 65.

Now suppose that \(X\) and \(Y\) are random variables for an experiment, each taking values in a set \(T\). Then \(X\) and \(Y\) are said to be equivalent if

\[\P(X = Y) = 1\]

Again, this definition really does define an equivalence relation on the collection of random variables that take values in \(T\). Thus, the collection of such random variables is partitioned into disjoint classes of mutually equivalent variables.

  1. \(X\) is equivalent to \(X\) for any random variable \(X\) (the reflexive property).
  2. If \(X\) is equivalent to \(Y\) then \(Y\) is equivalent to \(X\) (the symmetric property).
  3. If \(X\) is equivalent to \(Y\) and \(Y\) is equivalent to \(Z\) then \(X\) is equivalent to \(Z\) (the transitive property).

Suppose that \(X\) and \(Y\) are equivalent random variables, taking values in a set \(T\). Then for any \(B \subseteq T\), the events \(\{X \in B\}\) and \(\{Y \in B\}\) are equivalent.

Proof:

Note that \( \{X \in B\} \bigtriangleup \{Y \in B\} \subseteq \{X \ne Y\} \).

Thus if \( X \) and \( Y \) are equivalent, then \(X\) and \(Y\) have the same probability distribution. Again, the converse fails with a passion.

Suppose that \(A\) and \(B\) are events for a random experiment. Then \(A\) and \(B\) are equivalent if and only if the indicator random variables \(\bs{1}_A\) and \(\bs{1}_B\) are equivalent.

Suppose that \(X\) and \(Y\) are equivalent random variables, taking values in a set \(T\), and that \(g\) is a function from \(T\) into a set \(U\). Then \(g(X)\) and \(g(Y)\) are equivalent.