Machine Learning Fundamentals : EM Algorithm

Earlier we had seen, that given a set of independent and identically distributed observations drawn from a probability distribution with unknown parameters, we can estimate the parameters of the distribution by maximizing the (log-)likelihood of the distribution generating the observation. We took an example of a series of coin tosses which is drawn from the binomial distribution. But it may not always be analytically tractable to maximize the expression for the distribution.

Consider again the example of the coin toss. But this time instead of one coin, there are two coins A and B. We observe the outcome for N consecutive tosses. Lets say that the probability of coin A coming up heads is p, while the probability of coin B coming up heads is q. Also if we know, which coin was used for which toss, then using Maximum likelihood as earlier, the parameter p would be equal to (# heads with coin A)/(# tosses with coin A) and similarly for q, (# heads with coin B)/(# tosses with coin B). This was straightforward as we had all the information to estimate p and q directly.

But what if we have incomplete information and using incomplete information we need to estimate the unknown parameters. In the example above we already knew which coin was used for which toss, but what if we do not know which coin generated which outcome, then we cannot directly use MLE to estimate the unknown parameters p and q. However, if we had some way of completing the data (in our case, guessing correctly which coin was used in each of the tosses), then we could reduce parameter estimation for this problem with incomplete data to maximum likelihood estimation with complete data.

This is where EM algorithm comes into play. Its a two step iterative algorithm, in the ‘E’ step, using initial parameters (p(0), q(0)), we estimate which of the coins A and B could have generated each outcome. Then using this estimations, in the ‘M’ step, we estimate the new set of parameters (p(1), q(1)) using our old maximum likelihood estimation method. We repeat the above steps, i.e. using (p(t), q(t)) in the ‘E’ step and estimating (p(t+1), q(t+1)) in the ‘M’ step, until the maximum likelihood in attains a local or a global maxima.

Before starting we assume that we have the complete data. In the ‘E’ step for the (t+1)th iteration, we find the probability distributions of the coin A and B generating each outcome, by first finding the expectation of the log likelihood of the complete data with the parameters found in the tth iteration, and then maximizing this expectation. For a more general case of M coins and N tosses, denoting the observed outcomes by Xi , the probability of heads by Θi and the incomplete data by Yi , i.e. if Yi = k, it means that the ith outcome is generated by the kth coin. Also lets assume that we know the prior probabilities of choosing each of the coins, lets assume it is αi i.e. P(Yi = k) = αk .

Then the log likelihood of the complete data (assuming that Y is available) is:

log(L(Θ | X, Y)) = log(P(X, Y | Θ)), by Bayes Theorem

P(Xi , Yi | ΘYi ) = P(Xi | YiYi ) P(Yi | ΘYi ) = αYi P(Xi | YiYi )

because P(Yi | ΘYi ) = P(Yi ) =  αYi are the prior probabilities. Hence:

log(L(Θ | X, Y)) = ∑i log (αYi P(Xi | YiYi ))

The posterior probabilities of Yi given the observed outcomes Xi , is

P(Yi | Xi , ΘYi ) = P(Xi , Yi | ΘYi )/P(Xi | ΘYi ) according to Bayes Theorem. But

P(Xi | ΘYi ) = ∑Yi  P(Xi | YiYi ) P(Yi | ΘYi ) = ∑Yi  αYi P(Xi | YiYi )

Thus, P(Yi | Xi , ΘYi ) = (αYi P(Xi | YiYi ))/(∑Yi  αYi P(Xi | YiYi ))

In the ‘E’ step of the algorithm, we find the quantity

Q(Θ, Θ(t)) = ∑Y log(L(Θ | X, Y)) P(Y | X , Θ(t) )

where Θ(t) is the value of the unknown parameters (probabilities of heads for each coin) in the tth iteration. If t=0, we assume some initial values for the set Θ(t) and at each iteration, we improve upon the values. Q(Θ, Θ(t)) is the expectation of the log(L(Θ | X, Y)). Thus for each Y=k, we multiply the log likelihood with the probability of Y=k, and sum over to find expectation :

‘E’ Step : Q(Θ, Θ(t)) = ∑Y log(L(Θ | X, Y=k)) P(Y=k | X , Θ(t) )

Maximizing the above quantity using Lagrangian multipliers (omitting the derivation), we get the new values of

α(t+1)k = (1/N) ∑i P(k | Xi, Θ(t) )

In the ‘M’ step, differentiating Q(Θ, Θ(t)) w.r.t Θ, and setting them to zero, and using α(t+1)k , yields Θ(t+1). For the case of two coins with probabilities p and q, we get :

p(t+1) = ∑i Xi P(k | Xi, p(t), q(t))/ ∑i P(k | Xi, p(t), q(t) ), and

q(t+1) = ∑i Xi (1 – P(k | Xi, p(t), q(t) ))/ ∑i (1 – P(k | Xi, p(t), q(t) ))

The values of P(k | Xi, p(t), q(t)) for the ith outcome could be inferred from the training samples available. Whenever Q(Θ, Θ(t)) hits local maxima, we stop the iteration and finally report the values of p and q. The expectation maximization algorithm is
a natural generalization of maximum likelihood estimation to the incomplete data case. It finds use while estimating hidden parameters in the Hidden Markov Models, which we will explore next.