On tossing a coin 10 times, heads land 7 times while tails land 3 times. Since the coin may be either fair or biased, we do not know the probability distribution of the outcomes. Let us suppose that the probability of landing heads is p; then the probability of tails is (1-p). What is the likelihood that this probability distribution generated the observed outcome, i.e. the probability of 7 heads and 3 tails? Since the tossing of a coin follows the Binomial distribution, we have

L(p) = 10! / (7! 3!) p^{7}(1-p)^{3} = 120 p^{7}(1-p)^{3}

According to the principle of Maximum Likelihood Estimation, the value of p should be chosen so as to maximize the log-likelihood of the outcome. Computing the log-likelihood:

Log L(p) = log 120 + 7 log p + 3 log (1-p)

To maximize the log-likelihood, we differentiate the above w.r.t. p and set the derivative to zero, i.e. (7/p) − 3/(1-p) = 0, which gives p = 0.7.
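The derivation above can be checked numerically. Here is a minimal sketch (the function name `log_likelihood` and the grid resolution are my choices, not from the text) that grid-searches p and recovers the analytic maximum:

```python
import math

# Log-likelihood of p for 7 heads and 3 tails in 10 tosses:
# log 120 + 7 log p + 3 log(1 - p)
def log_likelihood(p):
    return math.log(120) + 7 * math.log(p) + 3 * math.log(1 - p)

# Grid search over p in (0, 1); the maximum lands at p = 0.7,
# matching the value obtained by setting the derivative to zero.
best_p = max((i / 1000 for i in range(1, 1000)), key=log_likelihood)
print(best_p)  # 0.7
```

Because 0.7 is itself a grid point and the log-likelihood is strictly concave on (0, 1), the grid search and the calculus agree exactly here.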

MLE is used to estimate the parameters of a probability distribution from a set of independent and identically distributed observations. In the case above, it was the distribution of a coin toss. The same logic extends to estimating the parameters of more complex probability distributions.

Consider a sequence of numbers x_{1}, x_{2}, …, x_{N} drawn from a Normal distribution N(μ, σ) with parameters μ and σ. Before we estimate μ and σ, we need to find the likelihood that N(μ, σ) generated the sequence x_{1}, x_{2}, …, x_{N}:

L(μ, σ | x_{i}) = ∏_{i} (1/√2π) (1/σ) exp(−(x_{i} − μ)^{2}/2σ^{2})

L(μ, σ | x_{i}) = (1/√2π)^{N} (1/σ)^{N} exp(−(1/2σ^{2}) ∑_{i} (x_{i} − μ)^{2})

Log L(μ, σ | x_{i}) = N log (1/√2π) − N log σ − (1/2σ^{2}) ∑_{i} (x_{i} − μ)^{2}

Taking the derivative of the above log-likelihood w.r.t. μ and setting it to zero, we get

μ = (x_{1} + x_{2} + …+ x_{N})/N,

which is the mean of the sequence of numbers. Next, taking the derivative w.r.t. σ and setting it to zero, we get

σ^{2} = (1/N) ∑_{i} (x_{i} − μ)^{2}

Thus σ^{2} is the variance of the sequence of numbers, and σ is the standard deviation. Given random observations and the family of probability distributions that generated them, we can use MLE to derive the parameters of the particular distribution that generated the observations.
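As a sketch of the Normal-distribution result, the closed-form MLE estimates can be verified against a simulated sample (the true parameters μ = 5, σ = 2, the sample size, and the seed below are arbitrary choices for illustration):

```python
import math
import random

random.seed(0)
# Draw a sample from N(mu=5, sigma=2); the MLE estimates should come out close.
xs = [random.gauss(5, 2) for _ in range(10000)]

N = len(xs)
mu_hat = sum(xs) / N                              # MLE mean: (x_1 + ... + x_N)/N
var_hat = sum((x - mu_hat) ** 2 for x in xs) / N  # MLE variance: divides by N, not N-1
sigma_hat = math.sqrt(var_hat)
print(mu_hat, sigma_hat)
```

Note that the MLE variance divides by N rather than N−1, so it is a slightly biased estimator of the population variance; the bias vanishes as N grows.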