Document Sequencing using Dynamic Programming

Sectors dealing with large volumes of documents usually deploy batch scanning software for storing the documents in electronic format as image files. This software scans thousands of paper documents as a single batch and stores them in designated folder structure, and as a result multi-page documents lose their sequence information. With reference to the financial industry, scanning a bunch of 6 pages of a Mortgage application and 4 pages of an IRS Form using a batch scanner, could possibly end up numbering the pages such that the first 4 images are from the Mortgage application, the next two images from the IRS Form, the next 3 from the remaining Mortgage Application pages and the last remaining from the IRS Form.

There are techniques like putting 1-D or 2-D bar codes on the first pages of each document for separation. We will use an algorithmic technique to automatically sequence and separate the documents from the scanned image files. Firstly we will do OCR on the image files to convert them to raw text files and then use Supervised learning algorithms for textual classification.

The problem of classifying a page into a Loan Application, or a Mortgage or a Credit Report can be achieved with good enough accuracy using a supervised learner like a Support Vector Machine, or a Random Forest or an Ensemble of learners. We would usually train the model with classes such as Mortgage or Credit Report and so on. But when we need to identify sequences and sequence boundaries, we have to modify the model to consider the page type of a document along with the document type i.e. now instead of Mortgage, we would model with classes Mortgage First Page, Mortgage Middle Page and Mortgage Last Page.

Given our trained model and a bunch of pages to classify and sequence, the model predicts document type as well as the page type for each page in the bunch along with the confidence scores for the prediction. Lets denote the pages as Ai, the predicted document types as di, the predicted page types as pi and the confidence scores as si for i 1 to N. Assuming that the probability distribution of the confidence scores si follow a normal distribution, then any sequence of length k formed with the pages Aj, Aj+1, …, Aj+k-1, would have the most likely score which is the mean of the scores for each page in the sequence i.e.

Sj, j+k-1 = (sj + sj+1 + … + sj+k-1)/k,

where Sj, j+k-1 denotes the score of sequence starting at j of length k. Then the probability that the page Aj, with score sj, belongs to a sequence with score Sj, j+k-1 is proportional to

exp(-(Sj, j+k-1 – sj)2)

Thus the probability that each of the pages Aj, Aj+1, …, Aj+k-1, belongs to a sequence with score Sj, j+k-1 is the product of the above probabilities for each score si i.e.

Pj, j+k-1 = ∏i exp(-(Sj, j+k-1 – si)2) = exp(-∑i (Sj, j+k-1 – si)2) for i = j to j+k-1

For pages Ak, and Ak-1, if the predicted document types dk ≠ dk-1, then we assume that the page Ak cannot be in the same sequence as Ak-1, hence the probability of page Ak belonging to the previous sequence is 0, but Ak can start another sequence or it can be itself be a standalone sequence. In cases where the page Ak has a very low score then we can safely ignore the condition that dk ≠ dk-1,and consider it to be of the same document type as Ak-1 i.e. dk = dk-1 and continue forming a sequence without breaking.

Since we have modelled and predicted the page type for each page i.e. “First”, “Middle” and “Last”, we can use it for sequence formation. For example we can set rules that a sequence will only start at “First” page, or ends in a “Last” Page or a “Middle” page followed by “Middle” Pages and “Last” Pages in a sequence. In multi-page documents, the “First” page is discriminative from “Middle” and “Last” pages, but since a “Middle” page of one sample could be similar to the “Last” page of another sample and vice-verse in the training set, hence the “Middle” and “Last” pages are not very discriminative from each other in the predicted bunch. Thus we will only use the rule that a sequence starts at a “First” page.

Firstly we manually sequence a bunch of pages based on the “First” pages and if the predicted document type dk ≠ dk-1. After this we have multiple sequences from the original bunch of pages. Is it possible to further sub-sequence each sequence ?  Lets say that for the sequence {Aj, Aj+1, …, Aj+k-1} starting at the j-th index and of length k, the probability of the sequence is Pj, j+k-1 . Is there an index ‘v’, such that j ≤ v < j+k-1 and Pj, v * Pv+1, j+k-1 > Pj, j+k-1 ? If there is an index ‘v’ which satisfies the mentioned criteria then we further sub-sequence the sequence starting at j of length k, because it maximizes the probability of the sequence. If there are multiple such ‘v’ s then select the one for which Pj, v * Pv+1, j+k-1 is maximum, i.e.

arg maxv  {Pj, v * Pv+1, j+k-1 }> Pj, j+k-1

Again for each sub-sequence formed we can recursively sub-sequence each if we can find an index, at which if we separate we get a higher probability of each sub-sequence. This procedure would produce the maximum probable sequence. One way to do this is to use recursion on each sub-sequence and find the appropriate partitions. But recursion would do redundant calculations and we know the tool to solve the problem, yes Dynamic Programming.

For the sequence {Aj, Aj+1, …, Aj+k-1} find all 1-length sub-sequences. Each 1-length sub-sequences would have the probability equal to 1 since the mean of the score is the score itself i.e. Pj, j = 1. Then for all 2-length sub-sequences, {Aj, Aj+1}. For a 2-length sub-sequence if Pj, j * Pj+1, j+1 > Pj, j+1 then we have separate sub-sequences {Aj} {Aj+1} and updated Pj, j+1 = Pj, j * Pj+1, j+1 , else we have an “unbroken” sequence {Aj, Aj+1}. Extending this calculation to some q-length (q < k) sub-sequence, find

Q = arg maxv {Pj, v * Pv+1, j+q-1} for some j ≤ v < j+q-1

If Q > Pj, j+q-1 then split the the sub-sequence {Aj, Aj+1, …, Aj+q-1} at ‘v’, {Aj, Aj+1, …, Av} {Av+1, Aj+1, …, Aj+q-1} and assign the new and updated probability Pj, j+q-1 = Q. This works because we have already found out the probabilities Pj,v and Pv+1, j+q-1 in earlier steps and also the probabilities Pj, v and Pv+1, j+q-1 are optimal for the sub-sequences starting at j and ending at v and starting at v+1 and ending at j+q-1 respectively. Finally we do the calculations for the k-length sub-sequence and find ‘v’ for which Pj, v * Pv+1, j+k-1 is maximum and then partition the sequence into {Aj, Aj+1, …, Av} {Av+1, Aj+1, …, Aj+k-1}.

After finding the partition for the k-length sequence, how do we recover all the partitions for sub-sequences within the k-length sequence. We do it recursively. We use an array to track the partitions for each q-length sub-sequences in the earlier steps. After finding the partition ‘v’ for the k-length sequence, we recursively find the partitions for the v-j+1 length sub-sequence on the left and j+k-v length sub-sequence on the right using the array of tracked partitions.