I’ve finally updated and uploaded a detailed note on maximum likelihood estimation, based in part on material I taught in Gov 2001. It is available in full here.
To summarize the note without getting into too much math, let’s first define the likelihood as proportional to the joint probability of the data $y$ conditional on the parameter of interest ($\theta$):

$$L(\theta \mid y) \propto p(y \mid \theta)$$

The maximum likelihood estimate (MLE) of $\theta$ is the value of $\theta$ in the parameter space $\Omega$ that maximizes the likelihood function:

$$\hat{\theta}_{MLE} = \underset{\theta \in \Omega}{\operatorname{arg\,max}} \; L(\theta \mid y)$$

This turns out to be equivalent to maximizing the loglikelihood function, which is often simpler to work with:

$$\ell(\theta \mid y) = \log L(\theta \mid y)$$
One can find the MLE either analytically (using calculus) or numerically (by using R or another program).
A Simple Example
Suppose that we want to visualize the loglikelihood curve for data drawn from a Poisson distribution with an unknown parameter $\lambda$. The data we observe is {2,1,1,4,4,2,1,2,1,2}. In R, we can do this quite simply as:
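A minimal sketch (the grid of candidate $\lambda$ values and the names `y` and `loglik` are illustrative choices, not fixed by the problem):

```r
# Observed data, assumed drawn from a Poisson distribution
y <- c(2, 1, 1, 4, 4, 2, 1, 2, 1, 2)

# Poisson loglikelihood as a function of lambda
loglik <- function(lambda) sum(dpois(y, lambda, log = TRUE))

# Evaluate the loglikelihood over a grid and plot the curve
lambda.grid <- seq(0.1, 6, by = 0.01)
plot(lambda.grid, sapply(lambda.grid, loglik), type = "l",
     xlab = expression(lambda), ylab = "Loglikelihood")
```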
We already know (based on analytic solutions) that the MLE for $\lambda$ in a Poisson distribution is just the sample mean, which comes out to 2 in this case. Thus, we can mark it on the loglikelihood curve to produce the following graph:
If we wanted to maximize the loglikelihood in R (on the parameter space [0,100], chosen because it’s sufficiently wide to encompass the MLE), we could have done:
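One sketch using R’s built-in `optimize()` function (the name `loglik` is an illustrative choice):

```r
# Data and Poisson loglikelihood, as before
y <- c(2, 1, 1, 4, 4, 2, 1, 2, 1, 2)
loglik <- function(lambda) sum(dpois(y, lambda, log = TRUE))

# optimize() searches the interval [0, 100] for the maximum
optimize(loglik, interval = c(0, 100), maximum = TRUE)
```

The returned `$maximum` component is the numerical MLE, which matches the sample mean of 2.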
R confirms our analytic solution.
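The analytic solution itself follows from a short calculation. For i.i.d. Poisson observations $y_1, \ldots, y_n$, the loglikelihood is

$$\ell(\lambda \mid y) = \sum_{i=1}^{n} \left( y_i \log \lambda - \lambda - \log y_i! \right),$$

and setting the score (first derivative) to zero gives

$$\frac{\partial \ell}{\partial \lambda} = \frac{\sum_{i=1}^{n} y_i}{\lambda} - n = 0 \quad \Longrightarrow \quad \hat{\lambda}_{MLE} = \bar{y}.$$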
Theory of Maximum Likelihood Estimation
Why do we use maximum likelihood estimation? It turns out that, subject to regularity conditions, the following properties hold for the MLE:

Consistency: As sample size ($n$) increases, the MLE ($\hat{\theta}_{MLE}$) converges in probability to the true parameter, $\theta_0$.

Normality: As sample size ($n$) increases, the MLE is approximately normally distributed, with mean equal to the true parameter ($\theta_0$) and variance equal to the inverse of the expected Fisher information at the true parameter. However, using the consistency property of the MLE, we can approximate this variance with the inverse of the observed Fisher information evaluated at the MLE, denoted $\mathcal{J}_n(\hat{\theta}_{MLE})$. The observed Fisher information is the negative of the second derivative of the loglikelihood curve, evaluated at the MLE.

Efficiency: As sample size increases, the MLE achieves the lowest variance attainable by any consistent estimator (the Cramér–Rao lower bound), so no such estimator is asymptotically more precise.
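Returning to the Poisson example, the variance approximation from the observed Fisher information can be sketched in R (for the Poisson, the second derivative of the loglikelihood is available analytically; the variable names here are my own):

```r
# Poisson example: variance of the MLE via observed Fisher information
y <- c(2, 1, 1, 4, 4, 2, 1, 2, 1, 2)
lambda.hat <- mean(y)  # the MLE, here equal to 2

# The second derivative of the Poisson loglikelihood is -sum(y) / lambda^2,
# so the observed Fisher information at the MLE is sum(y) / lambda.hat^2
obs.info <- sum(y) / lambda.hat^2

# Approximate variance and standard error of the MLE
var.hat <- 1 / obs.info
se.hat <- sqrt(var.hat)  # about 0.447 for these data
```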