Sectors that deal with large volumes of documents usually deploy batch scanning software to store the documents electronically as image files. Such software scans thousands of paper documents as a single batch and stores them in a designated folder structure, and as a result multi-page documents lose their sequence information. In the financial industry, for example, scanning a bunch consisting of a 6-page Mortgage application and a 4-page IRS Form with a batch scanner could end up numbering the pages such that the first 4 images come from the Mortgage application, the next 2 from the IRS Form, the next 2 from the remaining Mortgage application pages, and the last 2 from the IRS Form.

There are techniques such as putting 1-D or 2-D bar codes on the first page of each document to mark separation. Here we will use an algorithmic technique to automatically sequence and separate the documents from the scanned image files: first we run OCR on the image files to convert them to raw text files, and then we use supervised learning algorithms for textual classification.

The problem of classifying a page as a Loan Application, a Mortgage, or a Credit Report can be solved with good enough accuracy using a supervised learner like a Support Vector Machine, a Random Forest, or an ensemble of learners. We would usually train the model with classes such as Mortgage, Credit Report, and so on. But when we need to identify sequences and sequence boundaries, we have to modify the model to consider the page type of a document along with the document type, i.e. instead of Mortgage, we would model with the classes Mortgage First Page, Mortgage Middle Page and Mortgage Last Page.
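As a minimal sketch, the augmented label space can be built by crossing document types with page types. The class names below are illustrative, not from any real training set:

```python
# Sketch: augment document-type labels with page-type labels so one
# classifier predicts both at once. Names here are hypothetical.

def augment_label(doc_type, page_type):
    """Combine a document type and a page type into one class label."""
    return f"{doc_type}_{page_type}"

# Original classes: just document types.
doc_types = ["Mortgage", "CreditReport", "IRSForm"]

# Augmented classes: document type x page position.
page_types = ["First", "Middle", "Last"]
classes = [augment_label(d, p) for d in doc_types for p in page_types]
# e.g. "Mortgage_First", "Mortgage_Middle", "Mortgage_Last", ...
```

Each training page would then be labelled with one of these combined classes before fitting the SVM, Random Forest, or ensemble.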

Given our trained model and a bunch of pages to classify and sequence, the model predicts the document type as well as the page type for each page in the bunch, along with confidence scores for the predictions. Let's denote the pages as A_{i}, the predicted document types as d_{i}, the predicted page types as p_{i}, and the confidence scores as s_{i}, for i = 1 to N. Assuming that the confidence scores s_{i} follow a normal distribution, any sequence of length k formed from the pages A_{j}, A_{j+1}, …, A_{j+k-1} has its most likely score at the mean of the scores of the pages in the sequence, i.e.

S_{j, j+k-1} = (s_{j} + s_{j+1} + … + s_{j+k-1})/k,

where S_{j, j+k-1} denotes the score of the sequence starting at j of length k. Then the probability that page A_{j}, with score s_{j}, belongs to a sequence with score S_{j, j+k-1} is proportional to

exp(-(S_{j, j+k-1} – s_{j})^{2})

Thus the probability that each of the pages A_{j}, A_{j+1}, …, A_{j+k-1}, belongs to a sequence with score S_{j, j+k-1} is the product of the above probabilities for each score s_{i} i.e.

P_{j, j+k-1} = ∏_{i} exp(-(S_{j, j+k-1} – s_{i})^{2}) = exp(-∑_{i }(S_{j, j+k-1} – s_{i})^{2}) for i = j to j+k-1
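The sequence score and probability above can be sketched in a few lines of Python; the scores below are hypothetical per-page confidences:

```python
import math

def sequence_score(scores, j, k):
    """S_{j, j+k-1}: mean confidence score of the k pages starting at j."""
    window = scores[j:j + k]
    return sum(window) / len(window)

def sequence_probability(scores, j, k):
    """P_{j, j+k-1} = exp(-sum_i (S_{j, j+k-1} - s_i)^2), i = j .. j+k-1."""
    s_mean = sequence_score(scores, j, k)
    return math.exp(-sum((s_mean - s) ** 2 for s in scores[j:j + k]))

scores = [0.9, 0.88, 0.91, 0.4]   # hypothetical per-page confidences
```

Note that a 1-length sequence always has probability exp(0) = 1, and adding an outlier page (like the 0.4 above) to a tight group of scores lowers the sequence probability.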

For pages A_{k} and A_{k-1}, if the predicted document types d_{k} ≠ d_{k-1}, then we assume that page A_{k} cannot be in the same sequence as A_{k-1}; hence the probability of page A_{k} belonging to the previous sequence is 0, but A_{k} can start another sequence or can itself be a standalone sequence. In cases where page A_{k} has a very low score, we can safely ignore the condition d_{k} ≠ d_{k-1}, consider it to be of the same document type as A_{k-1}, i.e. d_{k} = d_{k-1}, and continue forming the sequence without breaking.

Since we have modelled and predicted the page type of each page, i.e. “First”, “Middle” or “Last”, we can use it for sequence formation. For example, we can set rules that a sequence only starts at a “First” page, or ends at a “Last” page, or that a “First” page is followed by “Middle” pages and then a “Last” page within a sequence. In multi-page documents the “First” page is discriminative from the “Middle” and “Last” pages, but since a “Middle” page of one sample could be similar to the “Last” page of another sample in the training set, and vice versa, the “Middle” and “Last” pages are not very discriminative from each other in the predicted bunch. Thus we will only use the rule that a sequence starts at a “First” page.

First we sequence the bunch of pages based on the “First” pages and on document-type changes, i.e. where the predicted document type d_{k} ≠ d_{k-1}. After this we have multiple sequences from the original bunch of pages. Is it possible to further sub-sequence each sequence? Let's say that for the sequence {A_{j}, A_{j+1}, …, A_{j+k-1}} starting at the j-th index and of length k, the probability of the sequence is P_{j, j+k-1}. Is there an index ‘v’, with j ≤ v < j+k-1, such that P_{j, v} * P_{v+1, j+k-1} > P_{j, j+k-1}? If there is such an index ‘v’, then we further sub-sequence the sequence starting at j of length k, because doing so increases the probability of the sequence. If there are multiple such ‘v’s, we select the one for which P_{j, v} * P_{v+1, j+k-1} is maximum, i.e.
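A sketch of this initial sequencing pass, assuming a low-score threshold (here 0.5, an illustrative choice) below which a document-type disagreement is ignored:

```python
def initial_sequences(doc_types, page_types, scores, low_score=0.5):
    """Split a run of pages into initial sequences.

    A new sequence starts when a page is predicted as a "First" page,
    or when its predicted document type differs from the previous
    page's -- unless its confidence is below `low_score`, in which
    case the disagreement is ignored and the page is assumed to
    continue the current document. The threshold is an assumption
    made for illustration.
    """
    sequences = [[0]]
    for i in range(1, len(doc_types)):
        doc_changed = doc_types[i] != doc_types[i - 1] and scores[i] >= low_score
        if page_types[i] == "First" or doc_changed:
            sequences.append([i])      # break: start a new sequence here
        else:
            sequences[-1].append(i)    # continue the current sequence
    return sequences
```

For example, a low-confidence page whose predicted document type disagrees with its neighbour stays in the neighbour's sequence, while a confidently predicted “First” page always opens a new one.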

v* = arg max_{v} {P_{j, v} * P_{v+1, j+k-1}}, and split at v* if P_{j, v*} * P_{v*+1, j+k-1} > P_{j, j+k-1}
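The split criterion can be sketched as a search over candidate indices ‘v’, assuming a precomputed table P of sequence probabilities (the toy table below is made up for illustration):

```python
def best_split(P, j, end):
    """Find v in [j, end) maximizing P[j][v] * P[v+1][end].

    P is assumed to be a precomputed table of sequence probabilities,
    with P[a][b] the probability of the sequence A_a .. A_b.
    Returns (v, product); the caller splits only if product > P[j][end].
    """
    best_v, best_prod = None, float("-inf")
    for v in range(j, end):
        prod = P[j][v] * P[v + 1][end]
        if prod > best_prod:
            best_v, best_prod = v, prod
    return best_v, best_prod

# Toy table of precomputed probabilities for three pages A_0..A_2.
P = {0: {0: 1.0, 1: 0.5, 2: 0.2},
     1: {1: 1.0, 2: 0.9},
     2: {2: 1.0}}
```

Here `best_split(P, 0, 2)` picks v = 0 with product 1.0 * 0.9 = 0.9, which beats P[0][2] = 0.2, so the sequence {A_0, A_1, A_2} would be split into {A_0} {A_1, A_2}.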

Again, for each sub-sequence formed, we can recursively sub-sequence it if we can find an index at which separating gives a higher probability for each resulting sub-sequence. This procedure produces the maximum-probability sequencing. One way to do this is to recurse on each sub-sequence and find the appropriate partitions, but recursion would do redundant calculations, and we know the tool to solve that problem: yes, Dynamic Programming.

For the sequence {A_{j}, A_{j+1}, …, A_{j+k-1}}, first consider all 1-length sub-sequences. Each 1-length sub-sequence has probability equal to 1, since the mean of the scores is the score itself, i.e. P_{j, j} = 1. Then consider all 2-length sub-sequences {A_{j}, A_{j+1}}. For a 2-length sub-sequence, if P_{j, j} * P_{j+1, j+1} > P_{j, j+1}, then we have the separate sub-sequences {A_{j}} {A_{j+1}} and the updated P_{j, j+1} = P_{j, j} * P_{j+1, j+1}; else we have an “unbroken” sequence {A_{j}, A_{j+1}}. Extending this calculation to some q-length (q < k) sub-sequence, find

Q = max_{v} {P_{j, v} * P_{v+1, j+q-1}} for j ≤ v < j+q-1, attained at some index ‘v’

If Q > P_{j, j+q-1}, then split the sub-sequence {A_{j}, A_{j+1}, …, A_{j+q-1}} at ‘v’ into {A_{j}, A_{j+1}, …, A_{v}} {A_{v+1}, A_{v+2}, …, A_{j+q-1}} and assign the new, updated probability P_{j, j+q-1} = Q. This works because we have already found the probabilities P_{j, v} and P_{v+1, j+q-1} in earlier steps, and these probabilities are already optimal for the sub-sequences starting at j and ending at v, and starting at v+1 and ending at j+q-1, respectively. Finally we do the calculation for the k-length sequence itself: find the ‘v’ for which P_{j, v} * P_{v+1, j+k-1} is maximum and partition the sequence into {A_{j}, A_{j+1}, …, A_{v}} {A_{v+1}, A_{v+2}, …, A_{j+k-1}}.

After finding the partition for the k-length sequence, how do we recover all the partitions for the sub-sequences within it? We do it recursively. We use an array to track the chosen partition index for each q-length sub-sequence in the earlier steps. After finding the partition ‘v’ for the k-length sequence, we recursively recover the partitions for the (v-j+1)-length sub-sequence on the left and the (j+k-1-v)-length sub-sequence on the right using the array of tracked partitions.
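Putting the pieces together, here is a sketch of the dynamic program with split tracking and recursive recovery. It works on raw confidence scores only; the document-type and “First”-page constraints from the earlier passes are omitted for brevity:

```python
import math

def partition_sequence(scores):
    """Dynamic-programming partition of pages 0..n-1 into the
    maximum-probability set of sub-sequences, as described above.
    Returns (best probability, list of (start, end) segments).
    """
    n = len(scores)

    # raw[a][b]: unsplit probability of the sequence A_a .. A_b,
    # exp(-sum of squared deviations from the window mean).
    raw = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a, n):
            window = scores[a:b + 1]
            mean = sum(window) / len(window)
            raw[a][b] = math.exp(-sum((mean - s) ** 2 for s in window))

    # best[a][b]: optimal probability for A_a .. A_b;
    # cut[a][b]: tracked split index, or None if left unbroken.
    best = [[0.0] * n for _ in range(n)]
    cut = [[None] * n for _ in range(n)]
    for length in range(1, n + 1):          # shorter spans first
        for a in range(n - length + 1):
            b = a + length - 1
            best[a][b] = raw[a][b]
            for v in range(a, b):
                prod = best[a][v] * best[v + 1][b]
                if prod > best[a][b]:       # split only on strict gain
                    best[a][b], cut[a][b] = prod, v

    def recover(a, b):
        """Recursively read back the tracked partitions."""
        v = cut[a][b]
        if v is None:
            return [(a, b)]
        return recover(a, v) + recover(v + 1, b)

    return best[0][n - 1], recover(0, n - 1)
```

The shorter spans are filled in first, so by the time a q-length span is considered, `best` already holds the optimal probabilities of both sides of every candidate split, and `recover` reads the tracked cuts back out recursively.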