Machine Learning Fundamentals: Maximum Likelihood Estimation

In 10 tosses of a coin, heads comes up 7 times and tails 3 times. The coin may be fair or biased, so we do not know the probability of each outcome. Let us suppose that the probability of landing heads is p; the probability of tails is then (1-p). What is the likelihood that this probability distribution generated the outcome, i.e. the probability of observing 7 heads and 3 tails? Since the number of heads in a fixed number of independent tosses follows the Binomial distribution, we have

$$L(p) = \frac{10!}{7!\,3!}\, p^7 (1-p)^3 = 120\, p^7 (1-p)^3$$

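According to the principle of Maximum Likelihood Estimation, p should be chosen to maximize the likelihood of the observed outcome. Because the logarithm is monotonically increasing, maximizing the log-likelihood gives the same p and is easier to work with. Computing the log-likelihood: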

Log L(p) = log 120 + 7 log p + 3 log (1-p)

To maximize the log-likelihood, we differentiate the above with respect to p and set the derivative to zero, i.e. (7/p) – 3/(1-p) = 0. Rearranging gives 7(1-p) = 3p, so p = 0.7.
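The result can also be checked numerically. The following is a minimal sketch, not part of the original derivation: it evaluates the log-likelihood above on a grid of candidate values of p and picks the largest. The names `log_likelihood`, `grid` and `p_hat`, and the grid step of 0.001, are choices made for this illustration.

```python
import math

def log_likelihood(p, heads=7, tails=3):
    """Binomial log-likelihood: log C(n, heads) + heads*log(p) + tails*log(1-p)."""
    n = heads + tails
    return (math.log(math.comb(n, heads))
            + heads * math.log(p)
            + tails * math.log(1 - p))

# Candidate values of p in the open interval (0, 1); the endpoints are
# excluded because log(0) is undefined.
grid = [i / 1000 for i in range(1, 1000)]

# The MLE on this grid is the candidate with the highest log-likelihood.
p_hat = max(grid, key=log_likelihood)
print(p_hat)
```

A grid search is used here only because it mirrors the definition directly; for this problem the closed-form answer (heads divided by total tosses) is of course simpler.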

MLE is used to estimate the parameters of a probability distribution given a set of independent and identically distributed observations. In the above case it was the distribution of coin-toss outcomes. The same logic can be extended to estimate the parameters of more complex probability distributions.

Suppose we are given a sequence of numbers x_1, x_2, …, x_N drawn from a Normal distribution N(μ, σ) with parameters μ and σ. Before we estimate μ and σ, we need the likelihood that N(μ, σ) generated the sequence x_1, x_2, …, x_N:

$$L(\mu, \sigma \mid x_i) = \prod_i \frac{1}{\sqrt{2\pi}} \, \frac{1}{\sigma} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

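$$L(\mu, \sigma \mid x_i) = \left(\frac{1}{\sqrt{2\pi}}\right)^N \left(\frac{1}{\sigma}\right)^N \exp\left(-\frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2\right)$$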

$$\log L(\mu, \sigma \mid x_i) = N \log \frac{1}{\sqrt{2\pi}} - N \log \sigma - \frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2$$

Taking the derivative of the above log-likelihood with respect to μ and setting it to zero, we get

$$\mu = \frac{x_1 + x_2 + \dots + x_N}{N},$$

which is the mean of the sequence of numbers. Next, taking the derivative with respect to σ and setting it to zero, we get

$$\sigma^2 = \frac{1}{N} \sum_i (x_i - \mu)^2$$
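For readers who want the intermediate steps, the two derivatives can be written out as follows. This is a sketch of the standard calculation, starting from the log-likelihood of the Normal distribution given earlier and using the same symbols:

```latex
\begin{align*}
\frac{\partial}{\partial \mu} \log L
  &= \frac{1}{\sigma^2} \sum_i (x_i - \mu) = 0
  &&\Longrightarrow\quad \mu = \frac{1}{N} \sum_i x_i \\
\frac{\partial}{\partial \sigma} \log L
  &= -\frac{N}{\sigma} + \frac{1}{\sigma^3} \sum_i (x_i - \mu)^2 = 0
  &&\Longrightarrow\quad \sigma^2 = \frac{1}{N} \sum_i (x_i - \mu)^2
\end{align*}
```

The constant term N log(1/√(2π)) depends on neither μ nor σ, so it drops out of both derivatives.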

Thus σ² is the variance of the sequence of numbers (computed with a 1/N factor), and σ is its standard deviation. Given a set of observations and an assumed form for the probability distribution that generated them, we can use MLE to derive the parameters of that distribution.
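The closed-form estimates for the Normal distribution are easy to try out on simulated data. The sketch below is an illustration only: the function names `normal_mle` and `normal_log_likelihood`, the true parameters (mean 5, standard deviation 2), the sample size and the random seed are all assumptions made for this example.

```python
import math
import random

def normal_mle(xs):
    """Return the MLE (mu, sigma) of a Normal distribution for the sample xs."""
    n = len(xs)
    mu = sum(xs) / n                           # sample mean
    var = sum((x - mu) ** 2 for x in xs) / n   # variance with a 1/N factor
    return mu, math.sqrt(var)

def normal_log_likelihood(xs, mu, sigma):
    """Log-likelihood of xs under N(mu, sigma), as in the formula in the text."""
    n = len(xs)
    return (n * math.log(1 / math.sqrt(2 * math.pi))
            - n * math.log(sigma)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

# Simulate observations from a Normal distribution with known parameters.
random.seed(0)
sample = [random.gauss(5.0, 2.0) for _ in range(10_000)]

mu_hat, sigma_hat = normal_mle(sample)
print(mu_hat, sigma_hat)
```

With a sample this large the estimates should land close to the true values of 5 and 2, and moving either estimate away from its MLE value should lower `normal_log_likelihood`.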
