Machine Learning Fundamentals : Maximum Likelihood Estimation

On tossing a coin 10 times, heads lands 7 times while tails lands 3 times. The coin may be fair or biased, so we do not know the probability distribution of the outcomes. Let's suppose the probability of landing heads is p; then the probability of tails is (1-p). What is the likelihood that this probability distribution generated the observed outcome, i.e. the probability of 7 heads and 3 tails? Since coin tosses follow the Binomial distribution, we have

L(p) = (10! / (7! 3!)) p^7 (1-p)^3 = 120 p^7 (1-p)^3

According to the principle of Maximum Likelihood Estimation, p should take the value that maximizes the log-likelihood of the outcome. Computing the log-likelihood:

Log L(p) = log 120 + 7 log p + 3 log (1-p)

To maximize the log-likelihood, we differentiate the above w.r.t. p and set the derivative to zero, i.e. (7/p) – 3/(1-p) = 0, which gives p = 0.7.
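As a quick numerical sanity check (a minimal sketch, not part of the derivation), a grid search over candidate values of p recovers the same maximizer:

```python
import numpy as np

# Log-likelihood of 7 heads and 3 tails: log 120 + 7 log p + 3 log(1-p)
p = np.linspace(0.01, 0.99, 99)          # candidate values of p
log_lik = np.log(120) + 7 * np.log(p) + 3 * np.log(1 - p)

best = p[np.argmax(log_lik)]             # grid value maximizing the log-likelihood
```

The grid maximizer agrees with the closed-form answer p = 0.7.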

MLE is used to estimate the parameters of a probability distribution given a set of independent and identically distributed observations. In the above case it was the distribution of a coin. The same logic could be extended to estimate parameters of complex probability distributions.

Suppose a sequence of numbers x1, x2, …, xN follows a Normal distribution N(μ, σ) with parameters μ and σ. Before we estimate the parameters μ and σ, we need to find the likelihood that N(μ, σ) generated the sequence of numbers x1, x2, …, xN:

L(μ, σ | xi) = ∏i (1/√(2π)) (1/σ) exp(-(xi – μ)^2 / (2σ^2))

L(μ, σ | xi) = (1/√(2π))^N (1/σ)^N exp(-(1/(2σ^2)) ∑i (xi – μ)^2)

Log L(μ, σ | xi) = N log (1/√(2π)) – N log σ – (1/(2σ^2)) ∑i (xi – μ)^2

Taking derivative of the above Log likelihood w.r.t. μ, and setting it to zero, we get

μ = (x1 + x2 + …+ xN)/N,

which is the mean of the sequence of numbers. Next taking derivative w.r.t. σ and setting it to zero we get

σ^2 = (1/N) ∑i (xi – μ)^2

Thus σ^2 is the variance of the sequence of numbers, or σ is the standard deviation. Given a random observation and the probability distribution generating the observation, we can use MLE to derive the parameters for the distribution that generated the observation.
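These closed-form estimates are easy to verify on a small sample; the sketch below (with made-up numbers) computes the MLE mean and variance directly from the formulas above:

```python
import numpy as np

# A made-up sample, assumed drawn from some Normal distribution
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(x)

mu_hat = x.sum() / N                         # MLE of μ: the sample mean
sigma2_hat = ((x - mu_hat) ** 2).sum() / N   # MLE of σ^2: divides by N, not N-1
```

Note that the MLE of σ^2 divides by N; the unbiased sample variance divides by N-1 instead.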

Document Sequencing using Dynamic Programming

Sectors dealing with large volumes of documents usually deploy batch scanning software to store the documents electronically as image files. Such software scans thousands of paper documents as a single batch and stores them in a designated folder structure; as a result, multi-page documents lose their sequence information. In the financial industry, for example, scanning a bunch of 6 pages of a Mortgage application and 4 pages of an IRS Form with a batch scanner could end up numbering the pages such that the first 4 images are from the Mortgage application, the next 2 from the IRS Form, the next 2 from the remaining Mortgage application pages, and the last 2 from the IRS Form.

There are techniques like putting 1-D or 2-D barcodes on the first page of each document for separation. We will instead use an algorithmic technique to automatically sequence and separate the documents from the scanned image files. First we run OCR on the image files to convert them to raw text, and then use supervised learning algorithms for text classification.

The problem of classifying a page as a Loan Application, a Mortgage, or a Credit Report can be solved with good accuracy using a supervised learner such as a Support Vector Machine, a Random Forest, or an ensemble of learners. We would usually train the model with classes such as Mortgage, Credit Report, and so on. But when we need to identify sequences and sequence boundaries, we have to modify the model to consider the page type along with the document type, i.e. instead of Mortgage, we would model the classes Mortgage First Page, Mortgage Middle Page and Mortgage Last Page.

Given our trained model and a bunch of pages to classify and sequence, the model predicts the document type and the page type for each page in the bunch, along with confidence scores for the predictions. Let's denote the pages as Ai, the predicted document types as di, the predicted page types as pi, and the confidence scores as si, for i = 1 to N. Assuming that the confidence scores si follow a normal distribution, any sequence of length k formed from the pages Aj, Aj+1, …, Aj+k-1 has a most likely score equal to the mean of the scores of its pages, i.e.

Sj, j+k-1 = (sj + sj+1 + … + sj+k-1)/k,

where Sj, j+k-1 denotes the score of the sequence starting at j of length k. Then the probability that page Aj, with score sj, belongs to a sequence with score Sj, j+k-1 is proportional to

exp(-(Sj, j+k-1 – sj)^2)

Thus the probability that all of the pages Aj, Aj+1, …, Aj+k-1 belong to a sequence with score Sj, j+k-1 is the product of the above probabilities over each score si, i.e.

Pj, j+k-1 = ∏i exp(-(Sj, j+k-1 – si)^2) = exp(-∑i (Sj, j+k-1 – si)^2), for i = j to j+k-1
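This sequence probability is straightforward to compute; a minimal sketch (the function name is illustrative):

```python
import math

def seq_prob(scores):
    """Probability of one sequence: exp(-sum of squared deviations of the
    page confidence scores from the sequence's mean score)."""
    S = sum(scores) / len(scores)            # the sequence score (mean)
    return math.exp(-sum((S - s) ** 2 for s in scores))
```

Pages with similar confidence scores give a probability close to 1, while widely scattered scores push it toward 0.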

For pages Ak and Ak-1, if the predicted document types dk ≠ dk-1, then we assume that page Ak cannot be in the same sequence as Ak-1; hence the probability of page Ak belonging to the previous sequence is 0, but Ak can start another sequence or can itself be a standalone sequence. In cases where page Ak has a very low score, we can safely ignore the condition dk ≠ dk-1, consider it to be of the same document type as Ak-1, i.e. dk = dk-1, and continue forming the sequence without breaking.

Since we have modelled and predicted the page type for each page, i.e. “First”, “Middle” and “Last”, we can use it for sequence formation. For example, we can set rules that a sequence only starts at a “First” page, ends at a “Last” page, and has a “First” page followed by “Middle” pages and then a “Last” page. In multi-page documents the “First” page is discriminative from the “Middle” and “Last” pages, but since a “Middle” page of one sample could be similar to the “Last” page of another sample (and vice versa) in the training set, the “Middle” and “Last” pages are not very discriminative from each other in the predicted bunch. Thus we will only use the rule that a sequence starts at a “First” page.

First we manually sequence the bunch of pages, breaking at “First” pages and wherever the predicted document type changes, i.e. dk ≠ dk-1. After this we have multiple sequences from the original bunch of pages. Is it possible to further sub-sequence each sequence? Let's say that for the sequence {Aj, Aj+1, …, Aj+k-1}, starting at the j-th index and of length k, the probability of the sequence is Pj, j+k-1. Is there an index v, with j ≤ v < j+k-1, such that Pj, v * Pv+1, j+k-1 > Pj, j+k-1? If such an index v exists, then we further sub-sequence the sequence starting at j of length k, because doing so increases the probability of the sequence. If there are multiple such v's, select the one for which Pj, v * Pv+1, j+k-1 is maximum, i.e.

v* = arg maxv {Pj, v * Pv+1, j+k-1}, provided maxv {Pj, v * Pv+1, j+k-1} > Pj, j+k-1

Again, for each sub-sequence formed we can recursively sub-sequence further whenever we can find an index at which separating yields a higher probability for each part. This procedure produces the maximum-probability sequencing. One way to do this is to recurse on each sub-sequence and find the appropriate partitions, but naive recursion would repeat redundant calculations, and we know the tool to solve that problem: yes, Dynamic Programming.

For the sequence {Aj, Aj+1, …, Aj+k-1}, first consider all 1-length sub-sequences. Each 1-length sub-sequence has probability equal to 1, since the mean of a single score is the score itself, i.e. Pj, j = 1. Then consider all 2-length sub-sequences {Aj, Aj+1}. For a 2-length sub-sequence, if Pj, j * Pj+1, j+1 > Pj, j+1, then we have separate sub-sequences {Aj} {Aj+1} and update Pj, j+1 = Pj, j * Pj+1, j+1; else we have an “unbroken” sequence {Aj, Aj+1}. Extending this calculation to a q-length (q < k) sub-sequence, find

Q = maxv {Pj, v * Pv+1, j+q-1} for j ≤ v < j+q-1

If Q > Pj, j+q-1, then split the sub-sequence {Aj, Aj+1, …, Aj+q-1} at v into {Aj, Aj+1, …, Av} {Av+1, Av+2, …, Aj+q-1} and assign the new, updated probability Pj, j+q-1 = Q. This works because we have already computed the probabilities Pj, v and Pv+1, j+q-1 in earlier steps, and those probabilities are optimal for the sub-sequences starting at j and ending at v, and starting at v+1 and ending at j+q-1, respectively. Finally we do the calculation for the k-length sequence, find the v for which Pj, v * Pv+1, j+k-1 is maximum, and partition the sequence into {Aj, Aj+1, …, Av} {Av+1, Av+2, …, Aj+k-1}.

After finding the partition for the k-length sequence, how do we recover all the partitions of the sub-sequences within it? We do it recursively. We use an array to track the partition point chosen for each q-length sub-sequence in the earlier steps. After finding the partition v for the k-length sequence, we recursively find the partitions for the (v-j+1)-length sub-sequence on the left and the (j+k-1-v)-length sub-sequence on the right using the array of tracked partitions.
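The bottom-up procedure can be sketched as follows. One caveat: since every 1-length sequence has probability 1, splitting everything into singletons always attains probability 1, so a practical implementation restricts the candidate cut points, e.g. to positions where the next page is predicted to be a “First” page. The sketch below takes such a candidate set as input; the function names and the candidate-set parameter are illustrative assumptions, not from the original text.

```python
import math

def seq_prob(scores):
    # exp(-sum of squared deviations of the page scores from the mean score)
    S = sum(scores) / len(scores)
    return math.exp(-sum((S - s) ** 2 for s in scores))

def best_partition(scores, candidates):
    """Bottom-up DP over sub-sequence lengths.
    candidates: indices v at which a cut between pages v and v+1 is allowed
    (e.g. page v+1 is predicted to be a "First" page)."""
    n = len(scores)
    P = [[0.0] * n for _ in range(n)]        # P[j][e]: best probability for pages j..e
    split = [[None] * n for _ in range(n)]   # tracked partition point (None = unbroken)
    for j in range(n):
        P[j][j] = 1.0                        # 1-length sequences: mean equals score
    for length in range(2, n + 1):
        for j in range(n - length + 1):
            e = j + length - 1
            P[j][e] = seq_prob(scores[j:e + 1])   # keep the run unbroken
            for v in range(j, e):
                if v not in candidates:
                    continue
                q = P[j][v] * P[v + 1][e]    # both factors computed in earlier steps
                if q > P[j][e]:
                    P[j][e], split[j][e] = q, v

    def cuts(j, e):                          # recover all cut points recursively
        v = split[j][e]
        return [] if v is None else cuts(j, v) + [v] + cuts(v + 1, e)

    return P[0][n - 1], cuts(0, n - 1)
```

For example, best_partition([0.9, 0.8, 0.2, 0.1], {1}) cuts after index 1, separating the two high-scoring pages from the two low-scoring ones.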

Machine Learning Fundamentals Part 2 : Bayesian Classification

Given the task of classifying a product review as “positive” or “negative”, first we need to train a model using already labelled samples. For example, let's say that 1000 reviews are available, each tagged “positive” or “negative”. To train a model using the distribution of words in the sample reviews, compute the probabilities P(w1), P(w2), …, where P(wi) denotes the fraction of reviews containing the word wi, i.e. if the word w1 occurs in 30 out of 1000 reviews, then P(w1) = 0.03. Now let's say that 800 out of the 1000 reviews are labelled “positive” while the remaining 200 are labelled “negative”. Then P(+) = 0.8 and P(-) = 0.2.

What we need to estimate is: given an unlabelled review and the set of words w’1, w’2, …. in it (a subset of the words w1, w2, …. from our training samples), what is the probability of the review being “positive” (the probability of it being “negative” is 1 minus that), i.e. P(+|w’1, w’2, ….). From the earlier tutorial we know that, by Bayes’ Theorem:

P(+|w’1, w’2, ….) = (P(w’1, w’2, …. | +) * P(+))/P(w’1, w’2, ….)

Since the value P(w’1, w’2, ….) is just a scaling factor and is constant for both the categories “positive” and “negative”, we can omit that. P(+) is already known to be 0.8. We need to estimate the probability P(w’1, w’2, …. | +) i.e. given that a review is “positive” what is the likelihood of the review “generating” the words w’1, w’2, ….. Using the chain rule of probability :

P(w’1, w’2, …. | +) = P(w’1 | w’2, w’3, …., + ) * P(w’2 | w’3, w’4, …, +) * …* P(w’M|+)

From our training samples we can estimate each of the above product terms, but given a large number of terms it would take a lot of time to estimate this distribution for each review, and it would not be feasible for any real-time prediction. Instead we assume that each word is independent of the other words and depends only on the category that generated it, i.e. P(w’1 | w’2, w’3, w’4, …., + ) = P(w’1 | +), P(w’2 | w’3, w’4, w’5, …., +) = P(w’2 | +), and so on for the other words. This is the Naive Bayes assumption. With this assumption, we have:

P(w’1, w’2, …. | +) = P(w’1 | +) * P(w’2 | +) * …. P(w’M | +)

From the training samples we can easily estimate the above terms beforehand, i.e. P(w’1 | +) denotes the fraction of positive reviews that contain the word w’1. In fact, to account for words that are not present in any “positive” review, and so avoid multiplying by zero, the values P(w’i | +) are computed with smoothing as

(1 + Number of positive reviews w’i occurs in)/(Number of distinct words + Number of positive reviews)

After computing the quantity P(w’1, w’2, …. | +) we can compute the quantity P(+|w’1, w’2, ….); if it is greater than 0.6 (or any other value greater than 0.5), we say that the review is positive. For the negative case, we might say that if P(+|w’1, w’2, ….) is less than 0.25 then the review is negative, and if it is between 0.25 and 0.6 then we say that we are not sure.
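The whole pipeline (priors, smoothed word likelihoods, posterior) fits in a few lines. Below is a minimal sketch on a made-up toy corpus, using the smoothing formula given above; the corpus and function names are illustrative assumptions:

```python
from collections import Counter

# Toy training data: each review is a set of words plus a label
reviews = [
    ({"great", "quality"}, "+"), ({"works", "great"}, "+"),
    ({"broken", "bad"}, "-"), ({"broken", "refund"}, "-"),
]
vocab = set().union(*(words for words, _ in reviews))
labels = ["+", "-"]
n = {c: sum(1 for _, y in reviews if y == c) for c in labels}
count = {c: Counter() for c in labels}
for words, y in reviews:
    count[y].update(words)

def p_word(w, c):
    # Smoothed estimate, as in the text:
    # (1 + reviews of class c containing w) / (vocab size + reviews of class c)
    return (1 + count[c][w]) / (len(vocab) + n[c])

def p_positive(words):
    # Unnormalised score P(c) * prod P(w|c) per class, then normalise
    score = {}
    for c in labels:
        s = n[c] / len(reviews)
        for w in words & vocab:
            s *= p_word(w, c)
        score[c] = s
    return score["+"] / (score["+"] + score["-"])
```

The returned posterior can then be thresholded as described above, e.g. positive above 0.6, negative below 0.25, unsure in between.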

The above model can easily be extended to multiple classes instead of just two (positive and negative); for example, in medical diagnosis the classes can be a list of diseases, while the counterparts of words are the observed symptoms. Also, instead of just considering single words, we can consider N-grams, i.e. sequences of words. Experimentally, 2-grams and 3-grams are found to perform better than single words (1-grams) in the overall accuracy of the classifier.

Machine Learning Fundamentals Part 1 : Bayesian Probability

Let's start the discussion with a simple problem: you toss two identical fair coins simultaneously 100 times. What is the expected number of times at least one heads comes up? Since heads and tails are the only two outcomes of a toss and are mutually exclusive, their probabilities sum to 1, and for a fair coin each equals 1/2. There are 4 equally likely ways two coins can land, i.e. {H, H}, {H, T}, {T, H} and {T, T}, and 3 of these 4 possibilities have at least one heads. Hence it is expected that 75 out of 100 tosses of the two coins will give at least one heads.
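The 3-out-of-4 count can be checked by enumerating the outcomes directly; a minimal sketch:

```python
from itertools import product

# All 4 equally likely ways two fair coins can land
outcomes = list(product("HT", repeat=2))

p_at_least_one_head = sum("H" in o for o in outcomes) / len(outcomes)
expected = 100 * p_at_least_one_head     # expected count over 100 tosses
```

This gives 3/4 of the outcomes, i.e. an expected 75 out of 100 tosses.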

What if I now say that a particular one of the coins comes up tails — what is the expected number of times at least one heads comes up? With this new information, only {T, H} and {T, T} remain possible, and 1 of these 2 equally likely possibilities has at least one heads. Thus 50 out of 100 tosses of the two coins will give at least one heads along with the one tails.

Given a murder scene and a list of 5 possible suspects (3 males, 2 females), the chance that any one of them committed the murder is 1/5; this is the prior belief for each suspect. But then the detective discovers long hair beside the corpse. This becomes evidence that the killer was a female (assuming that the deceased was a male). So what should the beliefs about each suspect now be? The updated beliefs are 1/2 for each female suspect and 0 for each male suspect. Bayes' rule helps us update prior beliefs given a set of evidence. Expressing Bayes’ Theorem:

Probability of event A given evidence B = (Probability or Likelihood of observing B given event A occurred * Probability of A)/(Probability of B)

Coming to solving a machine learning problem with Bayes’ Theorem: given a set of reviews on Amazon for a household product, labelled with “positive” and “negative” tags, what are the chances that a new incoming review is “negative”? Without any information about the word distribution in the labelled reviews, if there are 75 positive and 25 negative reviews, then the probability of the incoming review being negative would be about 1/4. Now suppose our new review contains the word “broken”, and it is reported that 60% of all the reviews contain the word “broken” while 80% of the negative reviews contain it. What will be the updated probability of the new review being negative? According to Bayes’ Theorem it is 0.8 * 0.25 / 0.6 = 1/3.
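The update is a one-line application of Bayes' Theorem:

```python
p_neg = 0.25                  # prior: 25 of 100 reviews are negative
p_broken = 0.60               # 60% of all reviews contain "broken"
p_broken_given_neg = 0.80     # 80% of negative reviews contain "broken"

# P(negative | "broken") = P("broken" | negative) * P(negative) / P("broken")
posterior = p_broken_given_neg * p_neg / p_broken
```

The posterior comes out to 1/3, up from the prior of 1/4.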

We see that, given the information about the word “broken” in the existing reviews and in the new review, the chance of the new review being negative increases. This is the backbone of the Naive Bayes algorithm, where instead of information about one word, we gather prior information about a bag of words and then use Bayes’ Theorem to update the belief about the new event falling into one of the categories.