\(\newcommand{\R}{\mathbb{R}}\)
\(\newcommand{\N}{\mathbb{N}}\)
\(\newcommand{\Z}{\mathbb{Z}}\)
\(\newcommand{\E}{\mathbb{E}}\)
\(\newcommand{\P}{\mathbb{P}}\)
\(\newcommand{\var}{\text{var}}\)
\(\newcommand{\sd}{\text{sd}}\)
\(\newcommand{\cov}{\text{cov}}\)
\(\newcommand{\cor}{\text{cor}}\)
\(\newcommand{\bias}{\text{bias}}\)
\(\newcommand{\mse}{\text{mse}}\)
\(\newcommand{\bs}{\boldsymbol}\)

Suppose again that we have an observable random variable \(\bs{X}\) for an experiment, that takes values in a set \(S\). Suppose also that distribution of \(\bs{X}\) depends on an unknown parameter \(\theta\), taking values in a parameter space \(\Theta\). Specifically, we will denote the probability density function of \(\bs{X}\) on \(S\) by \(f_\theta\) for \(\theta \in \Theta\). Of course, our data variable \(\bs{X}\) will almost always be vector-valued. The parameter \(\theta\) may also be vector-valued.

The likelihood function \(L\) is the function obtained by reversing the roles of \(\bs{x}\) and \(\theta\) in the probability density function; that is, we view \(\theta\) as the variable and \(\bs{x}\) as the given information (which is precisely the point of view in estimation): \[ L_\bs{x}(\theta) = f_\theta(\bs{x}); \quad \theta \in \Theta, \; \bs{x} \in S \] In the method of maximum likelihood, we try to find a value \(u(\bs{x})\) of the parameter \(\theta\) that maximizes \(L_\bs{x}(\theta)\) for each \(\bs{x} \in S\). If we can do this, then the statistic \(u(\bs{X})\) is called a maximum likelihood estimator of \(\theta\). The method is intuitively appealing—we try to find the values of the parameters that would have most likely produced the data we in fact observed.

Since the natural logarithm function is strictly increasing on \( (0, \infty) \), the maximum value of \(L_\bs{x}(\theta)\), if it exists, will occur at the same points as the maximum value of \(\ln\left[L_\bs{x}(\theta)\right]\). This latter function is called the log likelihood function and in many cases is easier to work with than the likelihood function (typically because the probability density function \(f_\theta(\bs{x})\) has a product structure).

An important special case is when \(\bs{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)\) is a vector of \(k\) real parameters, so that \(\Theta \subseteq \R^k\). In this case, the maximum likelihood problem is to maximize a function of several variables. If \(\Theta\) is a continuous set, the methods of calculus can be used. If the maximum value of \(L_\bs{x}\) occurs at a point \(\bs{\theta}\) in the interior of \(\Theta\), then \(L_\bs{x}\) has a local maximum at \(\bs{\theta}\). Therefore, assuming that the likelihood function is differentiable, we can find this point by solving \[ \frac{\partial}{\partial \theta_i} L_\bs{x}(\bs{\theta}) = 0, \quad i \in \{1, 2, \ldots, k\} \] or equivalently \[ \frac{\partial}{\partial \theta_i} \ln\left[L_\bs{x}(\bs{\theta})\right] = 0, \quad i \in \{1, 2, \ldots, k\} \] On the other hand, the maximum value may occur at a boundary point of \(\Theta\), or may not exist at all.

Consider next the case where our outcome variable \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the distribution of a random variable \(X\) taking values in \(R\), with probability density function \(g_\theta\) for \(\theta \in \Theta\). Then \(\bs{X}\) takes values in \(S = R^n\), and the joint probability density function of \(\bs{X}\) is the product of the marginal probability density functions. Thus, the likelihood function in this special case becomes \[ L_\bs{x}(\theta) = \prod_{i=1}^n g_\theta(x_i); \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in S, \; \theta \in \Theta \] and hence the log likelihood function becomes \[ \ln\left[L_\bs{x}(\theta)\right] = \sum_{i=1}^n \ln\left[g_\theta(x_i)\right]; \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in S, \; \theta \in \Theta \]

Returning to the general setting, suppose now that \(h\) is a one-to-one function from the parameter space \(\Theta\) onto a set \(\Lambda\). We can view \(\lambda = h(\theta)\) as a new parameter taking values in the space \(\Lambda\), and it is easy to re-parameterize the probability density function with the new parameter. Thus, let \[ \bar{f}_\lambda(\bs{x}) = f_{h^{-1}(\lambda)}(\bs{x}), \quad \bs{x} \in S, \; \lambda \in \Lambda \] The corresponding likelihood function is \[ \bar{L}_\bs{x}(\lambda) = L_\bs{x}\left[h^{-1}(\lambda)\right], \quad \lambda \in \Lambda, \; \bs{x} \in S \]

Suppose that \(u(\bs{x}) \in \Theta\) maximizes \(L_\bs{x}\) for \(\bs{x} \in S\). Then \(h\left[u(\bs{x})\right] \in \Lambda\) maximizes \(\bar{L}_\bs{x}\) for \(\bs{x} \in S\).

It follows from that if \(U\) is a maximum likelihood estimator for \(\theta\), then \(V = h(U)\) is a maximum likelihood estimator for \( \lambda = h(\theta) \). This result is known as the invariance property.

If the function \(h\) is not one-to-one, the maximum likelihood problem for the new parameter \(\lambda = h(\theta)\) is not well-defined, because we cannot parameterize the probability density function in terms of \(\lambda\). However, there is a natural generalization of the maximum likelihood problem in this case. Define \[ \bar{L}_\bs{x}(\lambda) = \max\left\{L_\bs{x}(\theta): \theta \in \Theta, \; h(\theta) = \lambda\right\}; \quad \lambda \in \Lambda, \; \bs{x} \in S \]

Suppose again that \(u(\bs{x}) \in \Theta\) maximizes \(L_\bs{x}\) for \(\bs{x} \in S\). Then \(h\left[u(\bs{x})\right] \in \Lambda\) maximizes \(\bar{L}_\bs{x}\) for \(\bs{x} \in S\).

The result in the last theorem extends the invariance property to many-to-one transformations of the parameter: if \(U\) is a maximum likelihood estimator for \(\theta\), then \(V = h(U)\) is a maximum likelihood estimator for \(\lambda = h(\theta)\).

In the following subsections, we will study maximum likelihood estimation for a number of special parametric families of distributions. Recall that if \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from a distribution with mean \(\mu\) and variance \(\sigma^2\), then the method of moments estimators of \(\mu\) and \(\sigma^2\) are, respectively, \begin{align} M & = \frac{1}{n} \sum_{i=1}^n X_i \\ T^2 & = \frac{1}{n} \sum_{i=1}^n (X_i - M)^2 \end{align} Of course, \(M\) is the sample mean, and \(T^2 = \frac{n - 1}{n} S^2\) where \(S^2\) is the standard sample variance. These statistics will also sometimes occur as maximum likelihood estimators. As always, be sure to try the derivations yourself before looking at the solutions.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the Bernoulli distribution with success parameter \(p\). Recall that the Bernoulli probability density function is \[ g(x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\} \] Thus, \(\bs{X}\) is a sequence of independent indicator variables with \(\P(X_i = 1) = p\) for each \(i\). In the usual language of reliability, \(X_i\) is the outcome of trial \(i\), where 1 means success and 0 means failure. Let \(Y = \sum_{i=1}^n X_i\) denote the number of successes, so that the proportion of successes (the sample mean) is \(M = Y / n\). Recall that \(Y\) has the binomial distribution with parameters \(n\) and \(p\).

Suppose that \(p\) varies in the interval \((0, 1)\). Then \(M\) is the maximum likelihood estimator of \(p\).

Note that \(\ln[g(x)] = x \ln(p) + (1 - x) \ln(1 - p)\). Hence the log-likelihood function is \[ \ln[L_{\bs{x}}(p)] = \sum_{i=1}^n [x_i \ln(p) - (1 - x_i) \ln(1 - p)] \] Differentiating with respect to \(p\) and simplifying gives \[ \frac{d}{dp} \ln[L_{\bs{x}}(p)] = \frac{y}{p} - \frac{n - y}{1 - p} \] where \(y = \sum_{i=1}^n x_i\). Thus, there is a single critical point at \(p = y / n = m\). The second deriviative is \[ \frac{d^2}{d p^2} \ln[L_{\bs{x}}(p)] = -\frac{y}{p^2} - \frac{n - 1}{(1 - p)^2} \lt 0 \] Hence the log-likelihood function is concave downward and so the maximum occurs at the unique critical point \(m\).

Recall that \(M\) is also the method of moments estimator of \(p\). It's always nice when two different estimation procedures yield the same result. Next let's look at the same problem, but with a much restricted parameter space.

Suppose now that \(p\) takes values in \(\left\{\frac{1}{2}, 1\right\}\). Then the maximum likelihood estimator of \(p\) is the statistic \[ U = \begin{cases} 1, & Y = n\\ \frac{1}{2}, & Y \lt n \end{cases} \]

- \(\E(U) = \begin{cases} 1, & p = 1 \\ \frac{1}{2} + \left(\frac{1}{2}\right)^{n+1}, & p = \frac{1}{2} \end{cases}\)
- \(U\) is positively biased, but is asymptotically unbiased.
- \(\mse(U) = \begin{cases} 0 & p = 1 \\ \left(\frac{1}{2}\right)^{n+2}, & p = \frac{1}{2} \end{cases}\)
- \(U\) is consistent.

Note that the likelihood function is \(L_{\bs{x}}(p) = p^y (1 - p)^{n-y}\), where as usual, \(y = \sum_{i=1}^n x_i\). Thus \(L_{\bs{x}}\left(\frac{1}{2}\right) = \left(\frac{1}{2}\right)^y\). On the other hand, \(L_{\bs{x}}(1) = 0\) if \(y \lt n\) while \(L_{\bs{x}}(1) = 1\) if \(y = n\). Thus, if \(y = n\) the maximum occurs when \(p = 1\) while if \(y \lt n\) the maximum occurs when \(p = \frac{1}{2}\).

- If \(p = 1\) then \(\P(U = 1) = \P(Y = n) = 1\), so trivially \(\E(U) = 1\). If \(p = \frac{1}{2}\), \[ \E(U) = 1 \P(Y = n) + \frac{1}{2} \P(Y \lt n) = 1 \left(\frac{1}{2}\right)^n + \frac{1}{2}\left[1 - \left(\frac{1}{2}\right)^n\right] = \frac{1}{2} + \left(\frac{1}{2}\right)^{n+1} \]
- Note that \(\E(U) \to p\) as \(n \to \infty\) both in the case that \(p = 1\) and \(p = \frac{1}{2}\).
- If \( p = 1 \) then \( U = 1 \) with probability 1, so trivially \( \mse(U) = 0 \). If \( p = \frac{1}{2} \), \[ \mse(U) = \left(1 - \frac{1}{2}\right)^2 \P(Y = n) + \left(\frac{1}{2} - \frac{1}{2}\right)^2 \P(Y \lt n) = \left(\frac{1}{2}\right)^2 \left(\frac{1}{2}\right)^n = \left(\frac{1}{2}\right)^{n+2}\]
- This follows from part (c).

Note that the Bernoulli distribution in the last exercise would model a coin that is either fair or two-headed. The last two exercises show that the maximum likelihood estimator of a parameter, like the solution to any maximization problem, depends critically on the domain.

\(U\) is uniformly better than \(M\) on the parameter space \(\left\{\frac{1}{2}, 1\right\}\).

Recall that \( \mse(M) = \var(M) = p (1 - p) / n \). If \( p = 1 \) then \( \mse(M) = \mse(U) = 0 \) so that both estimators give the correct answer. If \( p = \frac{1}{2} \), \( \mse(U) = \left(\frac{1}{2}\right)^{n+2} \lt \frac{1}{4 n} = \mse(M) \).

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the Bernoulli distribution with unknown success parameter \(p \in (0, 1)\). Find the maximum likelihood estimator of \(p (1 - p)\), which is the variance of the sampling distribution.

By the invariance principle, the estimator is \(M (1 - M)\) where \(M\) is the sample mean.

Recall that the geometric distribution on \(\N_+\) with success parameter \(p \in (0, 1)\) has probability density function \[ g(x) = p (1 - p)^{x-1}, \quad x \in \N_+ \] The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the geometric distribution with unknown parameter \(p \in (0, 1)\). The maximum likelihood estimator of \(p\) is \(U = 1 / M\).

Note that \( \ln[g(x)] = \ln(p) + (x - 1) \ln(1 - p) \). Hence the log-likelihood function corresponding to the data \( \bs{x} = (x_1, x_2, \ldots, x_n) \) is \( \ln[L_\bs{x}(p)] = n \ln(p) + (y - n) \ln(1 - p) \) where \( y = x_1 + x_2 + \cdots + x_n \). So \( \frac{d}{dp} \ln[L(p)] = n / p - (y - n)/(1 - p) \). Setting the derivative equal to 0 and solving gives \( p = n / y = 1 / m \). Finally, \( \frac{d^2}{dp^2} \ln[L_\bs{x}(p)] = -n / p^2 - (y - n) / (1 - p)^2 \lt 0 \) so the maximum occurs at the critical point.

Recall that \(U\) is also the method of moments estimator of \(p\). It's always reassuring when two different estimation procedures produce the same estimator.

More generally, the negative binomial distribution on \( \N \) with shape parameter \( k \in (0, \infty) \) and success parameter \( p \in (0, 1) \) has probability density function \[ g(x) = \binom{x + k - 1}{k - 1} p^k (1 - p)^x, \quad x \in \N \] If \( k \) is a positive integer, then this distribution governs the number of failures before the \( k \)th success in a sequence of Bernoulli trials with success parameter \( p \). However, the distribution makes sense for general \( k \in (0, \infty) \). The negative binomial distribution is studied in more detail in the chapter on Bernoulli Trials.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the negative binomial distribution on \( \N \) with known shape parameter \( k \) and unknown success parameter \( p \in (0, 1) \). The maximum likelihood estimator of \( p \) is \[ U = \frac{k}{k + M} \]

Note that \( \ln[g(x)] = \ln \binom{x + k - 1}{k - 1} + k \ln(p) + x \ln(1 - p) \). Hence the log-likelihood function corresponding to the data \( \bs{x} = (x_1, x_2, \ldots, x_n) \) is \( \ln[L_\bs{x}(p)] = n k \ln(p) + y \ln(1 - p) + C \) where \( y = \sum_{i=1}^n x_i \) and \( C = \sum_{i=1}^n \ln \binom{x_i + k - 1}{k - 1} \). Hence \( \frac{d}{dp} \ln[L_\bs{x}(p)] = n k / p - y / (1 - p) \). Setting the derivative to 0 and solving gives a critical point at \( p = n k / (n k + y) = k / (k + m) \) where as usual, \( m = y / n \). Finally, \( \frac{d^2}{dp^2} L_\bs{x}(p) = - n k / p^2 - y / (1 - p)^2 \lt 0 \), so the maximum occurs at the critical point.

Once again, this is the same as the method of moments estimator of \( p \) with \( k \) known.

Recall that the Poisson distribution with parameter \(r \gt 0\) has probability density function
\[ g(x) = e^{-r} \frac{r^x}{x!}, \quad x \in \N \]
The Poisson distribution is named for Simeon Poisson and is widely used to model the number of random points

in a region of time or space. The parameter \( r\) is proportional to the size of the region. The Poisson distribution is studied in more detail in the chapter on the Poisson process.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the Poisson distribution with unknown parameter \(r \in (0, \infty)\). The maximum likelihood estimator of \(r\) is the sample mean \(M\).

Note that \( \ln[g(x)] = -r + x \ln(r) - \ln(x!) \). Hence the log-likelihood function corresponding to the data \( \bs{x} = (x_1, x_2, \ldots, x_n) \) is \( \ln[L_\bs{x}(r)] = -n r + y \ln(r) - C \) where \( y = \sum_{i=1}^n x_i \) and \( C = \sum_{i=1}^n \ln(x_i!) \). Hence \( \frac{d}{dr} L_\bs{x}(r) = -n + y / r \). Setting the derivative to 0 and solving gives \( r = y / n = m \). Finally, \( \frac{d^2}{dr^2} \ln[L_\bs{x}(r)] = -y / r^2 \lt 0 \), so the maximum occurs at the critical point.

Recall that for the Poisson distribution, the parameter \(r\) is both the mean and the variance. Thus \(M\) is also the method of moments estimator of \(r\). We showed in the introductory section that \(M\) has smaller mean square error than \(S^2\), although both are unbiased.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the Poisson distribution with parameter \(r \gt 0\), and let \(p = \P(X = 0) = e^{-r}\). Find the maximum likelihood estimator of \(p\) in two ways:

- Directly, by finding the likelihood function corresponding to the parameter \(p\).
- By using the result of the last exercise and the invariance property.

\(e^{-M}\) where \(M\) is the sample mean.

Recall that the normal distribution with mean \(\mu\) and variance \(\sigma^2\) has probability density function \[ g(x) = \frac{1}{\sqrt{2 \, \pi} \sigma} \exp \left[-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2\right], \quad x \in \R \] The normal distribution is often used to model physical quantities subject to small, random errors, and is studied in more detail in the chapter on Special Distributions

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the normal distribution with unknown mean \(\mu \in \R\) and variance \(\sigma^2 \in (0, \infty)\). The maximum likelihood estimators of \(\mu\) and \(\sigma^2\) are \(M\) and \(T^2\), respectively.

Note that \[ \ln[g(x)] = -\frac{1}{2} \ln(2 \pi) - \frac{1}{2} \ln(\sigma^2) - \frac{1}{2 \sigma^2} (x - \mu)^2 \] Hence the log-likelihood function corresponding to the data \( \bs{x} = (x_1, x_2, \ldots, x_n) \) is \[ \ln[L_\bs{x}(\mu, \sigma^2)] = -\frac{n}{2} \ln(2 \pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \] Taking partial derivatives gives \begin{align*} \frac{\partial}{\partial \mu} \ln[L_\bs{x}(\mu, \sigma^2)] &= \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = \frac{1}{\sigma^2}\left(\sum_{i=1}^n x_i - n \mu\right) \\ \frac{\partial}{\partial \sigma^2} \ln[L_\bs{x}(\mu, \sigma^2)] &= -\frac{n}{2 \sigma^2} + \frac{1}{2 \sigma^4} \sum_{i=1}^n (x_i - \mu)^2 \end{align*} Setting the partial derivatives equal to 0 and solving gives \( \mu = \frac{1}{n} \sum_{i=1}^n x_i\) and \( \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2 \). Hence the unique critical point is \( (m, t^2) \). Finally, with a bit more calculus, the second partial derivatives evaluated at the critical point are \[ \frac{\partial^2}{\partial \mu^2} L_\bs{x}(m, t^2) = -n / t^2, \; \frac{\partial^2}{\partial \mu \partial \sigma^2} L_\bs{x}(m, t^2) = 0, \; \frac{\partial^2}{\partial (\sigma^2)^2} L_\bs{x}(m, t^2) = -n / t^4\] Hence the second derivative matrix at the critical point is negative definite and so the maximum occurs at the critical point.

Of course, \(M\) and \(T^2\) are also the method of moments estimators of \(\mu\) and \(\sigma^2\), respectively.

Run the Normal estimation experiment 1000 times for several values of the sample size \(n\), the mean \(\mu\), and the variance \(\sigma^2\). For the parameter \(\sigma^2\), compare the maximum likelihood estimator \(T^2\) with the standard sample variance \(S^2\). Which estimator seems to work better in terms of mean square error?

Suppose again that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the normal distribution with unknown mean \(\mu \in \R\) and unknown variance \(\sigma^2 \in (0, \infty)\). Find the maximum likelihood estimator of \(\mu^2 + \sigma^2\), which is the second moment about 0 for the sampling distribution.

By the invariance principle, the estimator is \(M^2 + T^2\) where \(M\) is the sample mean and \(T^2\) is the (biased version of the) sample variance.

Recall that the gamma distribution with shape parameter \(k \gt 0\) and scale parameter \(b \gt 0\) has probability density function \[ g(x) = \frac{1}{\Gamma(k) \, b^k} x^{k-1} e^{-x / b}, \quad 0 \lt x \lt \infty \] The gamma distribution is often used to model random times and certain other types of positive random variables, and is studied in more detail in the chapter on Special Distributions

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the gamma distribution with known shape parameter \(k\) and unknown scale parameter \(b \in (0, \infty)\). The maximum likelihood estimator of \(b\) is \(V_k = \frac{1}{k} M\).

Note that \[ \ln[g(x)] = -\ln[\Gamma(k)] - k \ln(b) + (k - 1) \ln(x) - \frac{x}{b} \] and hence the log-likelihood function corresponding to the data \( \bs{x} = (x_1, x_2, \ldots, x_n) \) is \( \ln[L_\bs{x}(b)] = - n k \ln(b) - y / b + C\) where \( y = \sum_{i=1}^n x_i \) and \( C = -n \ln[\Gamma(k)] + (k - 1) \sum_{i=1}^n \ln(x_i) \). It follows that \( \frac{d}{d b} \ln[L_\bs{x}(b)] = -n k / b + y / b^2 \). Setting the derivative to 0 and solving gives \( b = y / n k = 1 / k m \). Finally, \( \frac{d^2}{db^2} \ln[L_\bs{x}(b)] = n k / b^2 - 2 y / b^3 \). At the critical point \( b = y / n k \), the second derivative is \(-(n k)^3 / y^2 \lt 0\) so the maximum occurs at the critical point.

Recall that \(V_k\) is also the method of moments estimator of \(b\) when \(k\) is known. But when \(k\) is unknown, the method of moments estimator of \(b\) is \(V = \frac{T^2}{M}\).

Run the gamma estimation experiment 1000 times for several values of the sample size \(n\), shape parameter \(k\), and scale parameter \(b\). In each case, compare the method of moments estimator \(V\) of \(b\) when \(k\) is unknown with the method of moments and maximum likelihood estimator \(V_k\) of \(b\) when \(k\) is known. Which estimator seems to work better in terms of mean square error?

Recall that the beta distribution with left parameter \(a \in (0, \infty)\) and right parameter \(b = 1\) has probability density function \[ g(x) = a \, x^{a-1}, \quad 0 \le x \le 1 \] The beta distribution is often used to model random proportions and other random variables that take values in bounded intervals. It is studied in more detail in the chapter on Special Distribution

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the beta distribution with unknown left parameter \(a \in (0, \infty)\) and right parameter \(b = 1\). The maximum likelihood estimator of \(a\) is \[ W = - \frac{n}{\sum_{i=1}^n \ln(X_i)} \]

Note that \( \ln[g(x)] = \ln(a) + (a - 1) \ln(x) \). Hence the log-likelihood function corresponding to the data \( \bs{x} = (x_1, x_2, \ldots, x_n) \) is \( \ln[L_\bs{x}(a)] = n \ln(a) + (a - 1) \sum_{i=1}^n \ln(x_i) \). Therefore \( \frac{d}{da} \ln[L_\bs{x}(a)] = n / a + \sum_{i=1}^n \ln(x_i) \). Setting the derivative to 0 and solving gives \( a = -n \big/ \sum_{i=1}^n \ln(x_i) \). Finally, \( \frac{d^2}{da^2} \ln[L_\bs{x}(a)] = -n / a^2 \lt 0 \), so the maximum occurs at the critical point.

Recall that when \(b = 1\), the method of moments estimator of \(a\) is \(U_1 = M \big/ (1 - M)\), but when \(b \in (0, \infty)\) is also unknown, the method of moments estimator of \(a\) is \(U = M (M - M_2) \big/ (M_2 - M^2)\). When \(b = 1\), which estimator is better, the method of moments estimator or the maximum likelihood estimator?

In the beta estimation experiment, set \(b = 1\). Run the experiment 1000 times for several values of the sample size \(n\) and the parameter \(a\). In each case, compare the estimators \(U\), \(U_1\) and \(W\). Which estimator seems to work better in terms of mean square error?

Recall that the Pareto distribution with shape parameter \(a \gt 0\) and scale parameter \(b = 1\) has probability density function \[ g(x) = \frac{a}{x^{a+1}}, \quad 1 \le x \lt \infty \] The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution often used to model income and certain other types of random variables. It is studied in more detail in the chapter on Special Distribution.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the Pareto distribution with unknown shape parameter \(a \in (0, \infty)\) and \(b = 1\). The maximum likelihood estimator of \(a\) is \[ V = \frac{n}{\sum_{i=1}^n \ln(X_i)} \]

Note that \( \ln[g(x)] = \ln(a) - (a + 1) \ln(x) \). Hence the log-likelihood function corresponding to the data \( \bs{x} = (x_1, x_2, \ldots, x_n) \) is \( \ln[L_\bs{x}(a)] = n \ln(a) - (a + 1) \sum_{i=1}^n \ln(x_i) \). Therefore \( \frac{d}{da} \ln[L_\bs{x}(a)] = n / a - \sum_{i=1}^n \ln(x_i) \). Setting the derivative to 0 and solving gives \( a = n \big/ \sum_{i=1}^n \ln(x_i) \). Finally, \( \frac{d^2}{da^2} \ln[L_\bs{x}(a)] = -n / a^2 \lt 0 \), so the maximum occurs at the critical point.

Recall that if \(a \gt 1\) and \(b = 1\), the method of moments estimator of \(a\) is \(U_1 = \frac{M}{M-1}\). On the other hand, when \(b\) is unknown, the method of moments estimator of \(a\) is \(U = 1 + \sqrt{\frac{M_2}{M_2 - M^2}}\).

In the the Pareto estimation experiment, set \(b = 1\). Run the experiment 1000 times for several values of the sample size \(n\) and the parameter \(a\). In each case, compare the estimators \(U\), \(U_1\), and \(W\). Which estimator seems to work better in terms of mean square error?

In this section we will study estimation problems related to the uniform distribution that are a good source of insight and counterexamples. In a sense, our first estimation problem is the continuous analogue of an estimation problem studied in the section on Order Statistics in the chapter Finite Sampling Models. Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the uniform distribution on the interval \([0, h]\), where \(h \in (0, \infty)\) is an unknown parameter. Thus, the sampling distribution has probability density function \[ g(x) = \frac{1}{h}, \quad 0 \le x \le h \] First let's review results from the last section.

The method of moments estimator of \(h\) is \(U = 2 M\). The estimator \(U\) satisfies the following properties:

- \(U\) is unbiased.
- \(\var(U) = \frac{h^2}{3 n}\) so \(U\) is consistent.

Now let's find the maximum likelihood estimator

The maximum likelihood estimator of \(h\) is \(X_{(n)} = \max\{X_1, X_2, \ldots, X_n\}\), the \(n\)th order statistic. The estimator \(X_{(n)}\) satisfies the following properties:

- \(\E\left(X_{(n)}\right) = \frac{n}{n + 1} h\)
- \(\bias\left(X_{(n)}\right) = -\frac{h}{n+1}\) so that \(X_{(n)}\) is negatively biased but asymptotically unbiased.
- \(\var\left(X_{(n)}\right) = \frac{n}{(n+2)(n+1)^2} h^2\)
- \(\mse\left(X_{(n)}\right) = \frac{2}{(n+1)(n+2)}h^2\) so that \(X_{(n)}\) is consistent.

The likelihood function corresponding to the data \( \bs{x} = (x_1, x_2, \ldots, x_n) \) is \( L_\bs{x}(h) = 1 / h^n \) for \( 0 \le x_i \le h \) and \( i \in \{1, 2, \ldots n\} \). The domain is equivalent to \( 0 \le x_{(n)} \le h\). The function \( h \mapsto 1 / h^n \) is decreasing, and so the maximum occurs at the smallest value, namely \( x_{(n)} \). Parts (a) and (c) are restatements of results from the section on order statistics. Parts (b) and (d) follow from (a) and (c).

Since the expected value of \(X_{(n)}\) is a known multiple of the parameter \(h\), we can easily construct an unbiased estimator.

Let \(V = \frac{n+1}{n} X_{(n)}\). The estimator \(V\) satisfies the following properties:

- \(V\) is unbiased.
- \(\var(V) = \frac{h^2}{n(n + 2)}\) so that \(V\) is consistent.
- The asymptotic relative efficiency of \(V\) to \(U\) is infinite.

Parts (a) and (b) follow from the previous result and basic properties of the expected value and variance. For part (c), \[ \frac{\var(U)}{\var(V)} = \frac{h^2 / 3 n}{h^2 / n (n + 2)} = \frac{n + 2}{3} \to \infty \text{ as } n \to \infty \]

The last part shows that unbiased version \(V\) of the maximum likelihood estimator is a much better estimator than the method of moments estimator \(U\). In fact, an estimator such as \(V\), whose mean square error decreases on the order of \(\frac{1}{n^2}\), is called super efficient. Now, having found a really good estimator, let's see if we can find a really bad one. A natural candidate is an estimator based on \(X_{(1)} = \min\{X_1, X_2, \ldots, X_n\}\), the *first* order statistic. The next result will make the computations very easy.

The sample \(\bs{X} = (X_1, X_2, \ldots, X_n)\) satisfies the following properties:

- \(h - X_i\) is uniformly distributed on \([0, h]\) for each \(i\).
- \((h - X_1, h - X_2, \ldots, h - X_n)\) is also a random sample from the uniform distribution on \([0, h]\).
- \(X_{(1)}\) has the same distribution as \(h - X_{(n)}\).

- This is a simple consequence of the fact that uniform distributions are preserved under linear transformations on the random variable.
- This follows from (a) and that the fact that if \( \bs{X} \) is a sequence of independent variables, then so is \( (h - X_1, h - X_2, \ldots, h - X_n) \).
- From part (b), \( X_{(1)} = \min\{X_1, X_2, \ldots, X_n\} \) has the same distribution as \( \min\{h - X_1, h - X_2, \ldots, h - X_n\} = h - \max\{X_1, X_2, \ldots, X_n\} = h - X_{(n)} \).

Now we can construct our really bad estimator.

Let \(W = (n + 1)X_{(1)}\). Then

- \( W \) is an unbiased estimator of \( h \).
- \(\var(W) = \frac{n}{n+2} h^2\), so \(W\) is not even consistent.

These results follow from the ones above:

- \( \E(X_{(1)}) = h - \E(X_{(n)}) = h - \frac{n}{n + 1} h = \frac{1}{n + 1} h \) and hence \( \E(W) = h \).
- \( \var(W) = (n + 1)^2 \var(X_{(1)}) = (n + 1)^2 \var(h - X_{(n)}) = (n + 1)^2 \frac{n}{(n + 1)^2 (n + 2)} = \frac{n}{n + 2} h^2\).

Run the uniform estimation experiment 1000 times for several values of the sample size \(n\) and the parameter \(a\). In each case, compare the empirical bias and mean square error of the estimators with their theoretical values. Rank the estimators in terms of empirical mean square error.

Our next series of exercises will show that the maximum likelihood estimator is not necessarily unique. Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the uniform distribution on the interval \([a, a + 1]\), where \(a \in \R\) is an unknown parameter. Thus, the sampling distribution has probability density function \[ g(x) = 1, \quad a \le x \le a + 1 \] As usual, let's first review the method of moments estimator.

The method of moments estimator of \(a\) is \(U = M - \frac{1}{2}\). The estimator \(U\) satisfies the following properties:

- \(U\) is unbiased.
- \(\var(U) = \frac{1}{12 n}\) so \(U\) is consistent.

However, as promised, there is not a unique maximum likelihood estimatr.

Any statistic \(V \in \left[X_{(n)} - 1, X_{(1)}\right]\) is a maximum likelihood estimator of \(a\).

The likelihood function corresponding to the data \( \bs{x} = (x_1, x_2, \ldots, x_n\} \) is \( L_\bs{x}(a) = 1 \) for \( a \le x_i \le a + 1 \) and \( i \in \{1, 2, \ldots, n\} \). The domain is equivalent to \( a \le x_{(1)} \) and \( a \ge x_{(n)} - 1 \). Since the likelihood function is constant on this domain, the result follows.

For completeness, let's consider the full estimation problem. Suppose that \( \bs{X} = (X_1, X_2, \ldots, X_n) \) is a random sample of size \( n \) from the uniform distribution on \( [a, a + h] \) where \( a \in \R \) and \( h \in (0, \infty) \) are both unknown. Here's the result from the last section:

Let \( U \) and \( V \) denote the method of moments estimators of \( a \) and \( h \), respectively. Then \[ U = 2 M - \sqrt{3} T, \quad V = 2 \sqrt{3} T \] where \( M = \frac{1}{n} \sum_{i=1}^n X_i \) is the sample mean, and \( T = \frac{1}{n} \sum_{i=1}^n (X_i - M)^2 \) is the biased version of the sample variance.

It should come as no surprise at this point that the maximum likelihood estimators are functions of the largest and smallest order statistics.

The maximum likelihood estimators or \( a \) and \( h \) are \( U = X_{(1)} \) and \( V = X_{(n)} - X_{(1)} \), respectively.

- \( E(U) = a + \frac{h}{n + 1} \) so \( U \) is positively biased and asymptotically unbiased.
- \( E(V) = h \frac{n - 1}{n + 1} \) so \( V \) is negatively biased and asymptotically unbiased.
- \( \var(U) = h^2 \frac{n}{(n + 1)^2 (n + 2)} \) so \( U \) is consistent.
- \( \var(V) = h^2 \frac{2(n - 1)}{(n + 1)^2(n + 2)} \) so \( V \) is consistent.

The likelihood function corresponding to the data \( \bs{x} = (x_1, x_2, \ldots, x_n) \) is \( L_\bs{x}(a, h) = \frac{1}{h^n} \) for \( a \le x_i \le a + h \) and \( i \in \{1, 2, \ldots, n\} \). The domain is equivalent to \( a \le x_{(1)} \) and \( a + h \ge x_{(n)} \). Since the likelihood function depends only on \( h \) in this domain and is decreasing, the maximum occurs when \( a = x_{(1)} \) and \( h = x_{(n)} - x_{(1)} \). Parts (a)–(d) follow from standard results for the order statistics from the uniform distribution.