IBM Almaden Research Center

Sandeep Gopisetty

Members

Moidin Mohiuddin

Raymond Lorie

Richard Casey

Hiroyasu Takahashi

Sandeep Gopisetty

Jianchang Mao

Sean McCauliff

SriGanesh Madhavnath

Tom Griffin

Thomas Truong

Document Analysis and Recognition (1991-2000)

Field Extraction and Recognition

OCR Example

Field Extraction involves creating the image that contains the data from each field that was specified by the user. The extracted image contains no background text or form lines. This image can then be used to recognize the data in each field without recognizing data that is a part of the form. During the copy, any form text, extraneous lines (such as the tail of a signature) and form lines are removed. This leaves only the data that is to be captured and converted to text in the extracted_image.

Character Recognition involves extracting all the connected components from a field and performing either splitting or joining individual characters. This allows us to return a different number of characters for each of several split/join selections. For example, if there was a connected component Me in which the M and e were touching, the system might return 'M'-'e' , 'N'-'I'-'e' and 'I'-'V'-'I'-'e' as possible results of splitting the component. Another example, if there were two connected components '5','-' were the '-' is the top of 5, the system might return '5' and '5','-' as choices.

Once the characters are isolated, the characters are normalized, features extracted and classification is performed using neural networks depending upon the field data type.

Cursive Recognition


Cursive Example

Cursive word recognition is a very challenging problem because of ambiguous character boundaries, ligatures, and a large variation of writing styles in cursive handwriting.

We have developed a cursive word recognition system. The system is based on a modified Hidden Markov Model (HMM) and Dynamic Programming techniques. A cursive word is first segmented into a sequence of pieces called graphemes. A letter can have one, two or three graphemes (thus, a word is usually over-segmented). The neighboring graphemes are then assembled in such a way that a probability measure under the HMM framework is optimized.

Contextual Processing

Basic OCR recognition engines recognize digits and letters with a reasonable accuracy. Still, the recognition of words or phrases remains a challenge. The longer the word or phrase, the less the probability of recognizing the value without error. Fortunately, the use of contextual knowledge enables us to improve the recognition quite substantially. The context is defined as a set of constraints that the recognized information must satisfy: the more constraints, the less room for errors. For example, the sequence (city, state, zip) must satisfy the following constraints: the zip is composed of 5 digits; its value must belong to the list of possible zips, and the city must exist and correspond to the zip.

In many of today's systems, the exploitation of context is left to the application program. On the contrary, we are building a general context analyzer that can be customized for a particular application. A language called Document Specification Language (DSL) is provided to specify the syntax of the information to be recognized as well as constraints stating that a recognized value must exist in a given dictionary or that a particular routine applied to recognized value(s) must return TRUE.

The context analyzer interprets the tokens returned by the recognition engine (with all its multiple choices) according to the constraints imposed by the DSL "program." It looks at the multiple combinations, picks up the ones that satisfy the constraints and chooses the most probable one.

Example:

Contextual Example


The ICR returns the following choices:

      K e a c A e v       M O        6 4 0 6 0        r u r   y y                      2   8    
CheckMate finds the following solution,
      K e a r n e y       M O        6 4 0 6 0      (with confidence 70/100)    
over the second choice (with a much lesser confidence):
      F   a r l e y       M O        6 4 0 2 8      (with confidence 54/100)