\(\newcommand{\R}{\mathbb{R}}\)
\(\newcommand{\N}{\mathbb{N}}\)
\(\newcommand{\Z}{\mathbb{Z}}\)
\(\newcommand{\E}{\mathbb{E}}\)
\(\newcommand{\P}{\mathbb{P}}\)
\(\newcommand{\var}{\text{var}}\)
\(\newcommand{\sd}{\text{sd}}\)
\(\newcommand{\cov}{\text{cov}}\)
\(\newcommand{\cor}{\text{cor}}\)
\(\newcommand{\bias}{\text{bias}}\)
\(\newcommand{\mse}{\text{mse}}\)
\(\newcommand{\bs}{\boldsymbol}\)

Suppose again that we have an observable random variable \(\bs{X}\) for an experiment, that takes values in a set \(S\). Suppose also that the distribution of \(\bs{X}\) depends on an unknown parameter \(\theta\), taking values in a parameter space \(\Theta\). Specifically, we will denote the probability density function of \(\bs{X}\) on \(S\) by \(f_\theta\) for \(\theta \in \Theta\). Of course, our data variable \(\bs{X}\) will almost always be vector-valued. The parameter \(\theta\) may also be vector-valued.

The likelihood function \(L\) is the function obtained by reversing the roles of \(\bs{x}\) and \(\theta\) in the probability density function; that is, we view \(\theta\) as the variable and \(\bs{x}\) as the given information (which is precisely the point of view in estimation): \[ L_\bs{x}(\theta) = f_\theta(\bs{x}); \quad \theta \in \Theta, \; \bs{x} \in S \] In the method of maximum likelihood, we try to find a value \(u(\bs{x})\) of the parameter \(\theta\) that maximizes \(L_\bs{x}(\theta)\) for each \(\bs{x} \in S\). If we can do this, then the statistic \(u(\bs{X})\) is called a maximum likelihood estimator of \(\theta\). The method is intuitively appealing—we try to find the values of the parameters that would have most likely produced the data we in fact observed.

Since the natural logarithm function is strictly increasing on \( (0, \infty) \), the maximum value of \(L_\bs{x}(\theta)\), if it exists, will occur at the same points as the maximum value of \(\ln\left[L_\bs{x}(\theta)\right]\). This latter function is called the log likelihood function and in many cases is easier to work with than the likelihood function (typically because the probability density function \(f_\theta(\bs{x})\) has a product structure).

An important special case is when \(\bs{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)\) is a vector of \(k\) real parameters, so that \(\Theta \subseteq \R^k\). In this case, the maximum likelihood problem is to maximize a function of several variables. If \(\Theta\) is a continuous set, the methods of calculus can be used. If the maximum value of \(L_\bs{x}\) occurs at a point \(\bs{\theta}\) in the interior of \(\Theta\), then \(L_\bs{x}\) has a local maximum at \(\bs{\theta}\). Therefore, assuming that the likelihood function is differentiable, we can find this point by solving \[ \frac{\partial}{\partial \theta_i} L_\bs{x}(\bs{\theta}) = 0, \quad i \in \{1, 2, \ldots, k\} \] or equivalently \[ \frac{\partial}{\partial \theta_i} \ln\left[L_\bs{x}(\bs{\theta})\right] = 0, \quad i \in \{1, 2, \ldots, k\} \] On the other hand, the maximum value may occur at a boundary point of \(\Theta\), or may not exist at all.

Consider next the case where our outcome variable \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the distribution of a random variable \(X\) taking values in \(R\), with probability density function \(g_\theta\) for \(\theta \in \Theta\). Then \(\bs{X}\) takes values in \(S = R^n\), and the joint probability density function of \(\bs{X}\) is the product of the marginal probability density functions. Thus, the likelihood function in this special case becomes \[ L_\bs{x}(\theta) = \prod_{i=1}^n g_\theta(x_i); \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in S, \; \theta \in \Theta \] and hence the log likelihood function becomes \[ \ln\left[L_\bs{x}(\theta)\right] = \sum_{i=1}^n \ln\left[g_\theta(x_i)\right]; \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in S, \; \theta \in \Theta \]
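The identity behind the last display — the log of a product likelihood is the sum of the log terms — can be checked numerically. The following sketch uses a Bernoulli density and a small hypothetical sample purely for illustration; neither is part of the text.

```python
import math

# Hypothetical Bernoulli density g_p(x) = p^x (1-p)^(1-x) and sample.
def g(x, p):
    return p ** x * (1 - p) ** (1 - x)

sample = [1, 0, 1, 1, 0]
p = 0.4

L = math.prod(g(x, p) for x in sample)       # likelihood: product form
ll = sum(math.log(g(x, p)) for x in sample)  # log-likelihood: sum form
assert abs(math.log(L) - ll) < 1e-12
```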

Returning to the general setting, suppose now that \(h\) is a one-to-one function from the parameter space \(\Theta\) onto a set \(\Lambda\). We can view \(\lambda = h(\theta)\) as a new parameter taking values in the space \(\Lambda\), and it is easy to re-parameterize the probability density function with the new parameter. Thus, let \[ \bar{f}_\lambda(\bs{x}) = f_{h^{-1}(\lambda)}(\bs{x}), \quad \bs{x} \in S, \; \lambda \in \Lambda \] The corresponding likelihood function is \[ \bar{L}_\bs{x}(\lambda) = L_\bs{x}\left[h^{-1}(\lambda)\right], \quad \lambda \in \Lambda, \; \bs{x} \in S \]

Suppose that \(u(\bs{x}) \in \Theta\) maximizes \(L_\bs{x}\) for \(\bs{x} \in S\). Then \(h\left[u(\bs{x})\right] \in \Lambda\) maximizes \(\bar{L}_\bs{x}\) for \(\bs{x} \in S\).

It follows from the previous result that if \(U\) is a maximum likelihood estimator for \(\theta\), then \(V = h(U)\) is a maximum likelihood estimator for \( \lambda = h(\theta) \). This result is known as the invariance property.

If the function \(h\) is not one-to-one, the maximum likelihood problem for the new parameter \(\lambda = h(\theta)\) is not well-defined, because we cannot parameterize the probability density function in terms of \(\lambda\). However, there is a natural generalization of the maximum likelihood problem in this case. Define \[ \bar{L}_\bs{x}(\lambda) = \max\left\{L_\bs{x}(\theta): \theta \in \Theta, \; h(\theta) = \lambda\right\}; \quad \lambda \in \Lambda, \; \bs{x} \in S \]

Suppose again that \(u(\bs{x}) \in \Theta\) maximizes \(L_\bs{x}\) for \(\bs{x} \in S\). Then \(h\left[u(\bs{x})\right] \in \Lambda\) maximizes \(\bar{L}_\bs{x}\) for \(\bs{x} \in S\).

The result in the last theorem extends the invariance property to many-to-one transformations of the parameter: if \(U\) is a maximum likelihood estimator for \(\theta\), then \(V = h(U)\) is a maximum likelihood estimator for \(\lambda = h(\theta)\).

In the following subsections, we will study maximum likelihood estimation for a number of special parametric families of distributions. Recall that if \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from a distribution with mean \(\mu\) and variance \(\sigma^2\), then the method of moments estimators of \(\mu\) and \(\sigma^2\) are, respectively, \begin{align} M & = \frac{1}{n} \sum_{i=1}^n X_i \\ T^2 & = \frac{1}{n} \sum_{i=1}^n (X_i - M)^2 \end{align} Of course, \(M\) is the sample mean, and \(T^2 = \frac{n - 1}{n} S^2\) where \(S^2\) is the standard sample variance. These statistics will also sometimes occur as maximum likelihood estimators.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the Bernoulli distribution with success parameter \(p\). Recall that the Bernoulli probability density function is \[ g(x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\} \] Thus, \(\bs{X}\) is a sequence of independent indicator variables with \(\P(X_i = 1) = p\) for each \(i\). In the usual language of reliability, \(X_i\) is the outcome of trial \(i\), where 1 means success and 0 means failure. Let \(Y = \sum_{i=1}^n X_i\) denote the number of successes, so that the proportion of successes (the sample mean) is \(M = Y / n\). Recall that \(Y\) has the binomial distribution with parameters \(n\) and \(p\).

Suppose that \(p\) varies in the interval \((0, 1)\). Then \(M\) is the maximum likelihood estimator of \(p\).

Note that \(\ln[g(x)] = x \ln(p) + (1 - x) \ln(1 - p)\). Hence the log-likelihood function is \[ \ln[L_{\bs{x}}(p)] = \sum_{i=1}^n [x_i \ln(p) + (1 - x_i) \ln(1 - p)] \] Differentiating with respect to \(p\) and simplifying gives \[ \frac{d}{dp} \ln[L_{\bs{x}}(p)] = \frac{y}{p} - \frac{n - y}{1 - p} \] where \(y = \sum_{i=1}^n x_i\). Thus, there is a single critical point at \(p = y / n = m\). The second derivative is \[ \frac{d^2}{d p^2} \ln[L_{\bs{x}}(p)] = -\frac{y}{p^2} - \frac{n - y}{(1 - p)^2} \lt 0 \] Hence the log-likelihood function is concave downward, and so the maximum occurs at the unique critical point \(m\).
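As a quick sanity check, the log-likelihood above can be maximized numerically over a grid and compared with the closed form \(m = y/n\). The sample below is hypothetical.

```python
import numpy as np

# Hypothetical Bernoulli sample; y successes in n trials.
x = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y, n = int(x.sum()), x.size

def log_lik(p):
    # log-likelihood: y log p + (n - y) log(1 - p)
    return y * np.log(p) + (n - y) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 9999)
p_hat = grid[np.argmax(log_lik(grid))]
assert abs(p_hat - y / n) < 1e-3   # grid maximizer agrees with m = y/n
```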

Recall that \(M\) is also the method of moments estimator of \(p\). It's always nice when two different estimation procedures yield the same result.

Suppose now that \(p\) takes values in \(\left\{\frac{1}{2}, 1\right\}\). Then the maximum likelihood estimator of \(p\) is the statistic \[ U = \begin{cases} 1, & Y = n\\ \frac{1}{2}, & Y \lt n \end{cases} \]

Note that the likelihood function is \(L_{\bs{x}}(p) = p^y (1 - p)^{n-y}\), where as usual, \(y = \sum_{i=1}^n x_i\). Thus \(L_{\bs{x}}\left(\frac{1}{2}\right) = \left(\frac{1}{2}\right)^n\). On the other hand, \(L_{\bs{x}}(1) = 0\) if \(y \lt n\) while \(L_{\bs{x}}(1) = 1\) if \(y = n\). Thus, if \(y = n\) the maximum occurs when \(p = 1\) while if \(y \lt n\) the maximum occurs when \(p = \frac{1}{2}\).
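The restricted-parameter estimator \(U\) amounts to a two-way comparison of likelihood values, which is easy to sketch directly. The function name below is a hypothetical label, not from the text.

```python
# MLE over the restricted parameter space {1/2, 1}: compare
# L(1/2) = (1/2)^n with L(1), which is 1 if y = n and 0 otherwise.
def mle_restricted(xs):
    y, n = sum(xs), len(xs)
    L_half = 0.5 ** n
    L_one = 1.0 if y == n else 0.0
    return 1.0 if L_one > L_half else 0.5

assert mle_restricted([1, 1, 1]) == 1.0   # all successes: estimate p = 1
assert mle_restricted([1, 0, 1]) == 0.5   # any failure: estimate p = 1/2
```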

Note that the Bernoulli distribution in the last exercise would model a coin that is either fair or two-headed. The last two exercises show that the maximum likelihood estimator of a parameter, like the solution to any maximization problem, depends critically on the domain.

The estimator \(U\) satisfies the following bias properties:

- \(\E(U) = \begin{cases} 1, & p = 1 \\ \frac{1}{2} + \left(\frac{1}{2}\right)^{n+1}, & p = \frac{1}{2} \end{cases}\)
- \(U\) is positively biased, but is asymptotically unbiased.

If \(p = 1\) then \(\P(U = 1) = \P(Y = n) = 1\), so trivially \(\E(U) = 1\). If \(p = \frac{1}{2}\), \[ \E(U) = 1 \P(Y = n) + \frac{1}{2} \P(Y \lt n) = 1 \left(\frac{1}{2}\right)^n + \frac{1}{2}\left[1 - \left(\frac{1}{2}\right)^n\right] = \frac{1}{2} + \left(\frac{1}{2}\right)^{n+1} \] For part (b), note that \(\E(U) \to p\) as \(n \to \infty\) both in the case that \(p = 1\) and \(p = \frac{1}{2}\).

The estimator \(U\) satisfies the following mean square error properties:

- \(\mse(U) = \begin{cases} 0, & p = 1 \\ \left(\frac{1}{2}\right)^{n+2}, & p = \frac{1}{2} \end{cases}\)
- \(U\) is consistent.

If \( p = 1 \) then \( U = 1 \) with probability 1, so trivially \( \mse(U) = 0 \). If \( p = \frac{1}{2} \), \[ \mse(U) = \left(1 - \frac{1}{2}\right)^2 \P(Y = n) + \left(\frac{1}{2} - \frac{1}{2}\right)^2 \P(Y \lt n) = \left(\frac{1}{2}\right)^2 \left(\frac{1}{2}\right)^n = \left(\frac{1}{2}\right)^{n+2} \] Part (b) follows from part (a).
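The closed forms for \(\E(U)\) and \(\mse(U)\) at \(p = \frac{1}{2}\) follow from the two-point distribution of \(U\), and the algebra can be checked mechanically across a range of sample sizes:

```python
# Check the closed forms for E(U) and mse(U) when p = 1/2, directly
# from the two-point distribution of U (U = 1 iff Y = n).
for n in range(1, 25):
    p_all = 0.5 ** n                      # P(Y = n) when p = 1/2
    EU = 1.0 * p_all + 0.5 * (1 - p_all)
    assert abs(EU - (0.5 + 0.5 ** (n + 1))) < 1e-12
    mse = (1.0 - 0.5) ** 2 * p_all        # only the event Y = n contributes
    assert abs(mse - 0.5 ** (n + 2)) < 1e-12
```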

\(U\) is uniformly better than \(M\) on the parameter space \(\left\{\frac{1}{2}, 1\right\}\).

Recall that \( \mse(M) = \var(M) = p (1 - p) / n \). If \( p = 1 \) then \( \mse(M) = \mse(U) = 0 \) so that both estimators give the correct answer. If \( p = \frac{1}{2} \), \( \mse(U) = \left(\frac{1}{2}\right)^{n+2} \lt \frac{1}{4 n} = \mse(M) \).

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the Bernoulli distribution with unknown success parameter \(p \in (0, 1)\). Find the maximum likelihood estimator of \(p (1 - p)\), which is the variance of the sampling distribution.

\(M (1 - M)\) where \(M\) is the sample mean.

Recall that the geometric distribution on \(\N_+\) with success parameter \(p \in (0, 1)\) has probability density function \[ g(x) = p (1 - p)^{x-1}, \quad x \in \N_+ \] The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the geometric distribution with unknown parameter \(p \in (0, 1)\). The maximum likelihood estimator of \(p\) is \(U = 1 / M\).
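The claim \(U = 1/M\) can be verified numerically by grid-maximizing the geometric log-likelihood; the sample below is hypothetical.

```python
import numpy as np

# Hypothetical geometric sample on {1, 2, ...}.
x = np.array([3, 1, 5, 2, 4])
n, s = x.size, int(x.sum())

def log_lik(p):
    # sum of log g(x_i) = n log p + (s - n) log(1 - p)
    return n * np.log(p) + (s - n) * np.log(1 - p)

grid = np.linspace(1e-4, 1 - 1e-4, 100_000)
p_hat = grid[np.argmax(log_lik(grid))]
assert abs(p_hat - n / s) < 1e-3   # n/s = 1/M
```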

Recall that \(U\) is also the method of moments estimator of \(p\). It's always reassuring when two different estimation procedures produce the same estimator.

Recall that the Poisson distribution with parameter \(a \gt 0\) has probability density function
\[ g(x) = e^{-a} \frac{a^x}{x!}, \quad x \in \N \]
The Poisson distribution is named for Simeon Poisson and is widely used to model the number of random points in a region of time or space. The Poisson distribution is studied in more detail in the chapter on the Poisson process.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the Poisson distribution with unknown parameter \(a \in (0, \infty)\). The maximum likelihood estimator of \(a\) is the sample mean \(M\).

Recall that for the Poisson distribution, the parameter \(a\) is both the mean and the variance. Thus \(M\) is also the method of moments estimator of \(a\). We showed in the introductory section that \(M\) has smaller mean square error than \(S^2\), although both are unbiased.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the Poisson distribution with parameter \(a \gt 0\), and let \(p = \P(X = 0) = e^{-a}\). Find the maximum likelihood estimator of \(p\) in two ways:

- Directly, by finding the likelihood function corresponding to the parameter \(p\).
- By using the result of the last exercise and the invariance property.

\(e^{-M}\) where \(M\) is the sample mean.

Recall that the normal distribution with mean \(\mu\) and variance \(\sigma^2\) has probability density function \[ g(x) = \frac{1}{\sqrt{2 \, \pi} \sigma} \exp \left[-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2\right], \quad x \in \R \] The normal distribution is often used to model physical quantities subject to small, random errors, and is studied in more detail in the chapter on Special Distributions.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the normal distribution with unknown mean \(\mu \in \R\) and variance \(\sigma^2 \in (0, \infty)\). The maximum likelihood estimators of \(\mu\) and \(\sigma^2\) are \(M\) and \(T^2\), respectively.
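A numeric sketch of this result: at \((M, T^2)\) the normal log-likelihood is a strict maximum, so perturbing either coordinate must decrease it. The sample is simulated with hypothetical parameter values.

```python
import numpy as np

# Simulated normal sample; true parameters are hypothetical.
rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=200)
M = x.mean()
T2 = ((x - M) ** 2).mean()   # biased (MLE) version of the sample variance

def log_lik(mu, s2):
    return -0.5 * x.size * np.log(2 * np.pi * s2) - ((x - mu) ** 2).sum() / (2 * s2)

best = log_lik(M, T2)
for d in (-0.01, 0.01):
    assert log_lik(M + d, T2) < best   # perturbing mu lowers the log-likelihood
    assert log_lik(M, T2 + d) < best   # perturbing sigma^2 does too
```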

Of course, \(M\) and \(T^2\) are also the method of moments estimators of \(\mu\) and \(\sigma^2\), respectively.

Run the Normal estimation experiment 1000 times for several values of the sample size \(n\), the mean \(\mu\), and the variance \(\sigma^2\). For the parameter \(\sigma^2\), compare the maximum likelihood estimator \(T^2\) with the standard sample variance \(S^2\). Which estimator seems to work better in terms of mean square error?

Suppose again that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the normal distribution with unknown mean \(\mu \in \R\) and unknown variance \(\sigma^2 \in (0, \infty)\). Find the maximum likelihood estimator of \(\mu^2 + \sigma^2\), which is the second moment about 0 for the sampling distribution.

\(M^2 + T^2\) where \(M\) is the sample mean and \(T^2\) is the (biased version of the) sample variance.

Recall that the gamma distribution with shape parameter \(k \gt 0\) and scale parameter \(b \gt 0\) has probability density function \[ g(x) = \frac{1}{\Gamma(k) \, b^k} x^{k-1} e^{-x / b}, \quad 0 \lt x \lt \infty \] The gamma distribution is often used to model random times and certain other types of positive random variables, and is studied in more detail in the chapter on Special Distributions.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the gamma distribution with known shape parameter \(k\) and unknown scale parameter \(b \in (0, \infty)\). The maximum likelihood estimator of \(b\) is \(V_k = \frac{1}{k} M\).

Recall that \(V_k\) is also the method of moments estimator of \(b\) when \(k\) is known. But when \(k\) is unknown, the method of moments estimator of \(b\) is \(V = \frac{T^2}{M}\).

Run the gamma estimation experiment 1000 times for several values of the sample size \(n\), shape parameter \(k\), and scale parameter \(b\). In each case, compare the method of moments estimator \(V\) of \(b\) when \(k\) is unknown with the method of moments and maximum likelihood estimator \(V_k\) of \(b\) when \(k\) is known. Which estimator seems to work better in terms of mean square error?

Recall that the beta distribution with left parameter \(a \in (0, \infty)\) and right parameter \(b = 1\) has probability density function \[ g(x) = a \, x^{a-1}, \quad 0 \le x \le 1 \] The beta distribution is often used to model random proportions and other random variables that take values in bounded intervals. It is studied in more detail in the chapter on Special Distributions.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the beta distribution with unknown left parameter \(a \in (0, \infty)\) and right parameter \(b = 1\). The maximum likelihood estimator of \(a\) is \[ W = - \frac{n}{\sum_{i=1}^n \ln(X_i)} \]
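Since the log-likelihood here is \(n \ln(a) + (a - 1) \sum_{i=1}^n \ln(x_i)\), the closed form \(W\) can be checked against a grid maximization. The sample values are hypothetical.

```python
import numpy as np

# Hypothetical sample of values in (0, 1).
x = np.array([0.9, 0.5, 0.7, 0.8, 0.6])
n, s = x.size, np.log(x).sum()   # s < 0 since all x_i < 1

def log_lik(a):
    # sum of log(a x_i^{a-1}) = n log a + (a - 1) s
    return n * np.log(a) + (a - 1) * s

grid = np.linspace(0.01, 20, 200_000)
a_hat = grid[np.argmax(log_lik(grid))]
assert abs(a_hat - (-n / s)) < 1e-3   # agrees with W = -n / sum(ln x_i)
```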

Recall that when \(b = 1\), the method of moments estimator of \(a\) is \(U_1 = M \big/ (1 - M)\), but when \(b \in (0, \infty)\) is also unknown, the method of moments estimator of \(a\) is \(U = M (M - M_2) \big/ (M_2 - M^2)\). When \(b = 1\), which estimator is better, the method of moments estimator or the maximum likelihood estimator?

In the beta estimation experiment, set \(b = 1\). Run the experiment 1000 times for several values of the sample size \(n\) and the parameter \(a\). In each case, compare the estimators \(U\), \(U_1\) and \(W\). Which estimator seems to work better in terms of mean square error?

Recall that the Pareto distribution with shape parameter \(a \gt 0\) and scale parameter \(b = 1\) has probability density function \[ g(x) = \frac{a}{x^{a+1}}, \quad 1 \le x \lt \infty \] The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution often used to model income and certain other types of random variables. It is studied in more detail in the chapter on Special Distributions.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the Pareto distribution with unknown shape parameter \(a \in (0, \infty)\) and \(b = 1\). The maximum likelihood estimator of \(a\) is \[ V = \frac{n}{\sum_{i=1}^n \ln(X_i)} \]

Recall that if \(a \gt 1\) and \(b = 1\), the method of moments estimator of \(a\) is \(U_1 = \frac{M}{M-1}\). On the other hand, when \(b\) is unknown, the method of moments estimator of \(a\) is \(U = 1 + \sqrt{\frac{M_2}{M_2 - M^2}}\).

In the Pareto estimation experiment, set \(b = 1\). Run the experiment 1000 times for several values of the sample size \(n\) and the parameter \(a\). In each case, compare the estimators \(U\), \(U_1\), and \(V\). Which estimator seems to work better in terms of mean square error?

In this section we will study two estimation problems that are a good source of insight and counterexamples. In a sense, our first estimation problem is the continuous analogue of an estimation problem studied in the section on Order Statistics in the chapter Finite Sampling Models. Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the uniform distribution on the interval \([0, a]\), where \(a \in (0, \infty)\) is an unknown parameter. Thus, the sampling distribution has probability density function \[ g(x) = \frac{1}{a}, \quad 0 \le x \le a \]

The method of moments estimator of \(a\) is \(U = 2 \, M\).

The estimator \(U\) satisfies the following properties:

- \(U\) is unbiased.
- \(\var(U) = \frac{a^2}{3 \, n}\) so \(U\) is consistent.

The maximum likelihood estimator of \(a\) is \(X_{(n)} = \max\{X_1, X_2, \ldots, X_n\}\), the \(n\)th order statistic.
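The reason is that the likelihood \(L_\bs{x}(a) = a^{-n}\) for \(a \ge x_{(n)}\) (and 0 otherwise) is strictly decreasing in \(a\), so its maximum occurs exactly at the sample maximum. A grid sketch, with a hypothetical sample:

```python
import numpy as np

# Hypothetical sample from a uniform [0, a] distribution.
x = np.array([1.2, 3.4, 0.7, 2.9, 2.1])

def lik(a):
    # likelihood: a^{-n} if a >= max(x), else 0 (some x_i would be impossible)
    return a ** (-x.size) if a >= x.max() else 0.0

grid = np.linspace(0.1, 10.0, 100_000)
a_hat = grid[int(np.argmax([lik(a) for a in grid]))]
assert abs(a_hat - x.max()) < 1e-3   # maximizer is the sample maximum
```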

The estimator \(X_{(n)}\) satisfies the following bias properties:

- \(\E\left(X_{(n)}\right) = \frac{n}{n + 1} a\)
- \(\bias\left(X_{(n)}\right) = -\frac{a}{n+1}\) so that \(X_{(n)}\) is negatively biased but asymptotically unbiased.

The estimator \(X_{(n)}\) satisfies the following mean square error properties:

- \(\var\left(X_{(n)}\right) = \frac{n}{(n+2)(n+1)^2} a^2\)
- \(\mse\left(X_{(n)}\right) = \frac{2}{(n+1)(n+2)}a^2\) so that \(X_{(n)}\) is consistent.

Since the expected value of \(X_{(n)}\) is a known multiple of the parameter \(a\), we can easily construct an unbiased estimator. Let \(V = \frac{n+1}{n} X_{(n)}\).

The estimator \(V\) satisfies the following properties:

- \(V\) is unbiased.
- \(\var(V) = \frac{a^2}{n(n + 2)}\) so that \(V\) is consistent.

The asymptotic relative efficiency of \(V\) to \(U\) is infinite.

The last exercise shows that \(V\) is a much better estimator than \(U\). In fact, an estimator such as \(V\), whose mean square error decreases on the order of \(\frac{1}{n^2}\), is called super efficient. Now, having found a really good estimator, let's see if we can find a really bad one. A natural candidate is an estimator based on \(X_{(1)} = \min\{X_1, X_2, \ldots, X_n\}\), the first order statistic. The result in the next exercise will make the computations very easy.
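The gap between \(U\) and \(V\) shows up clearly in simulation. This Monte Carlo sketch uses hypothetical values of \(a\), \(n\), and the replication count.

```python
import numpy as np

# Monte Carlo comparison of U = 2M and V = (n+1)/n * X_(n)
# for the uniform [0, a] model; parameter values are hypothetical.
rng = np.random.default_rng(1)
a, n, reps = 5.0, 50, 20_000
x = rng.uniform(0, a, size=(reps, n))

U = 2 * x.mean(axis=1)
V = (n + 1) / n * x.max(axis=1)
mse_U = ((U - a) ** 2).mean()   # theory: a^2 / (3n), about 0.167 here
mse_V = ((V - a) ** 2).mean()   # theory: a^2 / (n(n+2)), about 0.0096 here
assert mse_V < mse_U            # the super-efficient estimator wins
```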

\(\bs{X}\) satisfies the following properties:

- \(a - X_i\) is uniformly distributed on \([0, a]\) for each \(i\).
- \((a - X_1, a - X_2, \ldots, a - X_n)\) is also a random sample from the uniform distribution on \([0, a]\).
- \(X_{(1)}\) has the same distribution as \(a - X_{(n)}\).

\(\E\left(X_{(1)}\right) = \frac{a}{n+1}\), and hence \(W = (n + 1)X_{(1)}\) is unbiased.

\(\var(W) = \frac{n}{n+2} a^2\), so \(W\) is not even consistent.

Run the uniform estimation experiment 1000 times for several values of the sample size \(n\) and the parameter \(a\). In each case, compare the empirical bias and mean square error of the estimators with their theoretical values. Rank the estimators in terms of empirical mean square error.

Our next series of exercises will show that the maximum likelihood estimator is not necessarily unique. Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the uniform distribution on the interval \([a, a + 1]\), where \(a \in \R\) is an unknown parameter. Thus, the sampling distribution has probability density function \[ g(x) = 1, \quad a \le x \le a + 1 \]

The method of moments estimator of \(a\) is \(U = M - \frac{1}{2}\).

The estimator \(U\) satisfies the following properties:

- \(U\) is unbiased.
- \(\var(U) = \frac{1}{12 \, n}\) so \(U\) is consistent.

Any statistic \(V \in \left[X_{(n)} - 1, X_{(1)}\right]\) is a maximum likelihood estimator of \(a\).
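Indeed, the likelihood here is 1 whenever \(a \le x_{(1)}\) and \(x_{(n)} \le a + 1\), that is, on the entire interval \(\left[x_{(n)} - 1, x_{(1)}\right]\), and 0 outside it, so every point of that interval is a maximizer. A sketch with a hypothetical sample:

```python
import numpy as np

# Hypothetical sample from a uniform [a, a+1] distribution.
x = np.array([2.3, 2.8, 2.5, 3.1, 2.6])

def lik(a):
    # likelihood is 1 iff a <= min(x) and max(x) <= a + 1, else 0
    return 1.0 if (a <= x.min() and x.max() <= a + 1) else 0.0

lo, hi = x.max() - 1, x.min()              # maximizing interval, here [2.1, 2.3]
for a in np.linspace(lo + 1e-9, hi - 1e-9, 5):
    assert lik(a) == 1.0                   # every point is a maximizer
assert lik(lo - 0.01) == 0.0 and lik(hi + 0.01) == 0.0
```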