Earlier we saw that, given a set of independent and identically distributed observations drawn from a probability distribution with unknown parameters, we can estimate those parameters by maximizing the (log-)likelihood of the distribution generating the observations. We took the example of a series of coin tosses drawn from a binomial distribution. But it may not always be analytically tractable to maximize the likelihood directly.

Consider again the coin-toss example, but this time, instead of one coin, there are two coins A and B. We observe the outcomes of N consecutive tosses. Let's say that the probability of coin A coming up heads is p, while the probability of coin B coming up heads is q. If we also know which coin was used for which toss, then, using maximum likelihood as before, the estimate of p would be (# heads with coin A)/(# tosses with coin A), and similarly the estimate of q would be (# heads with coin B)/(# tosses with coin B). This is straightforward because we have all the information needed to estimate p and q directly.
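With complete data, these estimates are just per-coin averages. A minimal sketch in Python (the outcomes and coin labels below are made-up illustrative data):

```python
import numpy as np

# Hypothetical complete data: for each toss we record the outcome
# (1 = heads, 0 = tails) and which coin was used ('A' or 'B').
outcomes = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
coins    = np.array(['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'])

# Complete-data MLE:
# p = (# heads with coin A) / (# tosses with coin A), and likewise for q.
p_hat = outcomes[coins == 'A'].mean()
q_hat = outcomes[coins == 'B'].mean()
```

Here coin A was tossed 5 times with 4 heads, so `p_hat` is 0.8, and coin B was tossed 5 times with 2 heads, so `q_hat` is 0.4.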

But what if we have incomplete information and still need to estimate the unknown parameters? In the example above we knew which coin was used for which toss; if we do not know which coin generated which outcome, we cannot directly use MLE to estimate the unknown parameters p and q. However, if we had some way of completing the data (in our case, guessing correctly which coin was used for each toss), then we could reduce parameter estimation with incomplete data to maximum likelihood estimation with complete data.

This is where the EM algorithm comes into play. It is a two-step iterative algorithm: in the ‘E’ step, using initial parameters (p^{(0)}, q^{(0)}), we estimate which of the coins A and B could have generated each outcome. Then, using these estimates, in the ‘M’ step we compute a new set of parameters (p^{(1)}, q^{(1)}) using our old maximum likelihood estimation method. We repeat these steps, i.e. using (p^{(t)}, q^{(t)}) in the ‘E’ step and estimating (p^{(t+1)}, q^{(t+1)}) in the ‘M’ step, until the likelihood attains a local (or global) maximum.

Before starting, we assume that we have the complete data. In the ‘E’ step of the (t+1)^{th} iteration, we find the probability distribution over which of the coins generated each outcome, by computing the expectation of the log-likelihood of the complete data under the parameters found in the t^{th} iteration; the ‘M’ step then maximizes this expectation. For the more general case of M coins and N tosses, denote the observed outcomes by X_{i}, the probability of heads of the k^{th} coin by Θ_{k}, and the hidden data by Y_{i}, i.e. Y_{i} = k means that the i^{th} outcome was generated by the k^{th} coin. Let us also assume that we know the prior probabilities of choosing each of the coins, say P(Y_{i} = k) = α_{k}.

Then the log likelihood of the complete data (assuming that Y is available) is:

log(L(Θ | X, Y)) = log(P(X, Y | Θ)), by the definition of the likelihood

P(X_{i}, Y_{i} | Θ) = P(X_{i} | Y_{i}, Θ_{Y_{i}}) P(Y_{i} | Θ) = α_{Y_{i}} P(X_{i} | Y_{i}, Θ_{Y_{i}})

because P(Y_{i} | Θ) = P(Y_{i}) = α_{Y_{i}} are the prior probabilities. Hence:

log(L(Θ | X, Y)) = ∑_{i} log(α_{Y_{i}} P(X_{i} | Y_{i}, Θ_{Y_{i}}))

The posterior probability of Y_{i} given the observed outcome X_{i} is

P(Y_{i} | X_{i}, Θ) = P(X_{i}, Y_{i} | Θ)/P(X_{i} | Θ) according to Bayes' Theorem. But

P(X_{i} | Θ) = ∑_{k} P(X_{i} | Y_{i} = k, Θ_{k}) P(Y_{i} = k) = ∑_{k} α_{k} P(X_{i} | Y_{i} = k, Θ_{k})

Thus, P(Y_{i} = k | X_{i}, Θ) = (α_{k} P(X_{i} | Y_{i} = k, Θ_{k}))/(∑_{j} α_{j} P(X_{i} | Y_{i} = j, Θ_{j}))
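As a concrete sketch, this posterior can be computed directly for a single toss. Here we assume two coins with equal priors α_A = α_B = 0.5; the function name and the equal priors are assumptions for illustration:

```python
# Posterior P(Y_i = A | X_i, p, q) for a single toss x (1 = heads, 0 = tails),
# where p and q are the heads probabilities of coins A and B.
# alpha_A = 0.5 assumes both coins are equally likely a priori.
def responsibility_A(x, p, q, alpha_A=0.5):
    like_A = p if x == 1 else 1 - p   # P(X_i | Y_i = A, p)
    like_B = q if x == 1 else 1 - q   # P(X_i | Y_i = B, q)
    num = alpha_A * like_A
    return num / (num + (1 - alpha_A) * like_B)
```

For example, with p = 0.8 and q = 0.4, a heads outcome gives `responsibility_A(1, 0.8, 0.4)` = 0.4/0.6 ≈ 0.667: heads is more likely to have come from the coin with the higher heads probability.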

In the ‘E’ step of the algorithm, we find the quantity

Q(Θ, Θ^{(t)}) = ∑_{Y} log(L(Θ | X, Y)) P(Y | X , Θ^{(t)} )

where Θ^{(t)} is the value of the unknown parameters (the probabilities of heads for each coin) at the t^{th} iteration. For t = 0, we assume some initial values Θ^{(0)}, and at each iteration we improve upon them. Q(Θ, Θ^{(t)}) is the expectation of log(L(Θ | X, Y)) under the posterior distribution of Y: for each assignment Y = k, we multiply the log-likelihood by the probability of Y = k and sum over k to obtain the expectation:

‘E’ Step : Q(Θ, Θ^{(t)}) = ∑_{k} log(L(Θ | X, Y=k)) P(Y=k | X, Θ^{(t)})

Maximizing the above quantity with respect to α_{k} using Lagrange multipliers (omitting the derivation), we get the new values

α^{(t+1)}_{k} = (1/N) ∑_{i} P(k | X_{i}, Θ^{(t)} )

In the ‘M’ step, differentiating Q(Θ, Θ^{(t)}) w.r.t. Θ, setting the derivatives to zero, and using α^{(t+1)}_{k}, yields Θ^{(t+1)}. For the case of two coins with probabilities p and q, we get:

p^{(t+1)} = ∑_{i} X_{i} P(A | X_{i}, p^{(t)}, q^{(t)}) / ∑_{i} P(A | X_{i}, p^{(t)}, q^{(t)}), and

q^{(t+1)} = ∑_{i} X_{i} (1 − P(A | X_{i}, p^{(t)}, q^{(t)})) / ∑_{i} (1 − P(A | X_{i}, p^{(t)}, q^{(t)}))

where P(A | X_{i}, p^{(t)}, q^{(t)}) is the posterior probability that coin A generated the i^{th} outcome.
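Putting the ‘E’ and ‘M’ steps together, the whole iteration for two coins can be sketched as follows. This assumes per-toss 0/1 outcomes and a known, fixed prior α of picking coin A; the function name and defaults are illustrative:

```python
import numpy as np

def em_two_coins(X, p, q, alpha=0.5, n_iter=50):
    """EM for tosses of two coins A and B with hidden coin assignments.
    X is an array of 0/1 outcomes; p, q are initial guesses for P(heads)
    of coins A and B; alpha is the (assumed known) prior of picking coin A."""
    X = np.asarray(X, dtype=float)
    for _ in range(n_iter):
        # 'E' step: posterior responsibility of coin A for each toss
        like_A = np.where(X == 1, p, 1 - p)   # P(X_i | A, p)
        like_B = np.where(X == 1, q, 1 - q)   # P(X_i | B, q)
        r = alpha * like_A / (alpha * like_A + (1 - alpha) * like_B)
        # 'M' step: responsibility-weighted maximum-likelihood re-estimates
        p = (r * X).sum() / r.sum()
        q = ((1 - r) * X).sum() / (1 - r).sum()
    return p, q
```

Note that if the initial guesses satisfy p = q, every toss gets responsibility α for coin A, and both estimates collapse to the overall fraction of heads; a symmetry-breaking initialization is needed for the two coins to separate.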

The values of P(A | X_{i}, p^{(t)}, q^{(t)}) for the i^{th} outcome are computed in the ‘E’ step using the posterior formula above. When Q(Θ, Θ^{(t)}) reaches a local maximum, we stop iterating and report the final values of p and q. The expectation maximization algorithm is a natural generalization of maximum likelihood estimation to the incomplete-data case. It finds use in estimating the hidden parameters of Hidden Markov Models, which we will explore next.