ABBYY FineReader Engine 7.1 Functions and Features

The FineReader Engine API provides functions that fall into the following groups:

Image manipulation and preprocessing

FineReader Engine can receive images from three types of sources: scanning via the TWAIN interface (including ADF support and manual feeding), reading directly from memory, or opening from files. It supports major imaging formats, including multi-page TIFF and JPEG 2000 (Part 1), and works with black-and-white, grayscale and color images. It can also open PDF files, converting them into images using Adobe® PDF Library Technology.

ABBYY FineReader Engine 7.1 lets you manage scanning parameters: brightness, color mode, resolution, image size, duplex scanning, the pause between pages, etc.

FineReader Engine can also save original and modified images in various formats. The full list of input and output image formats is given in the ABBYY FineReader Engine 7.1 Specification section.

Upon receiving images, FineReader Engine can perform the following preprocessing functions to improve the recognition:

  • Automatically de-skew images.
    This feature is essential when images come from scanners and need compensation for skew. It does not require leading-edge borders or lines. For form processing software, it also provides a skew calculation based on information from reference blocks.
  • Split dual pages.
    Intended for books scanned as double-page spreads, with both the left and right pages in one image. Recognition quality is higher if, after scanning, the image is split in two so that each part corresponds to a single book page. Recognition and layout analysis are then performed separately for each page, along with de-skewing if required.
  • Despeckle image (or image clean-up).
    Designed to eliminate random noise (speckles). The image may have a large amount of "dust" on it, i.e. a large number of excess dots. The dots appear on documents of medium-to-low print quality, and dots located close to character outlines may have an adverse effect on recognition quality. In these cases, despeckling helps to improve recognition quality.
  • Texture Filtering and Adaptive binarization.
    Texture filtering technology helps to filter out background "noise" such as color and texture, increasing accuracy for difficult-to-read documents such as newsprint, color documents, faxes, and copies. Adaptive Binarization technology dynamically adjusts the brightness threshold for each image fragment during recognition. Because each fragment is processed with its own parameters, recognition results are significantly more accurate for documents with textures or with gray or color backgrounds of variable contrast. A detailed description of how it works can be found in the ABBYY FineReader Engine 7.1 Technology Background section.
  • Auto-detection of page orientation (90, 180, 270 degrees).
    This feature is very important when, in a bulk input system, it is not known in which orientation each image was scanned. FineReader Engine automatically detects the orientation of each page and corrects it if needed.
  • Manipulating text color and background color inside rectangles.
    This is an important feature for customers working with document management systems (DMS). A typical archiving scenario is the following. A recognized document is stored in an archive both as an image and as plain text. The archive's text index also contains the coordinates of each character on the image. When a user searches the archive, the result is returned as an image of the source document. On that image, this FineReader Engine function highlights the text that was searched for by changing the text color and background color within a rectangle that completely encloses the found text.
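The adaptive binarization idea from the list above can be illustrated with a small sketch. This is not ABBYY's algorithm, which is not described on this page; it is a generic local-mean threshold, and the function name `adaptive_binarize`, the `window` radius and the `bias` offset are assumptions made purely for illustration.

```python
def adaptive_binarize(gray, window=1, bias=5):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    Each pixel is compared with the mean brightness of its own
    neighbourhood, so the threshold follows the local background
    instead of being a single global value.
    Returns rows of 1 (ink) and 0 (background).
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighbourhood around (x, y), clipped at the image borders.
            y0, y1 = max(0, y - window), min(height, y + window + 1)
            x0, x1 = max(0, x - window), min(width, x + window + 1)
            values = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(values) / len(values)
            # Ink is a pixel clearly darker than its surroundings.
            row.append(1 if gray[y][x] < local_mean - bias else 0)
        result.append(row)
    return result


# Three dark dots on a background that gets darker from left to right.
# The right-hand background (100) is darker than the left-hand dot (120),
# so no single global threshold separates ink from background here.
image = [
    [200, 200, 150, 150, 100, 100],
    [200, 120, 150,  70, 100,  20],
    [200, 200, 150, 150, 100, 100],
]
binary = adaptive_binarize(image)
for row in binary:
    print(row)
```

With these sample values, only the three dots in the middle row are marked as ink, although the background brightness varies across the image.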

FineReader Engine also offers a number of useful preprocessing functions for manipulating images, such as "image scaling", "image clipping", "creating previews", "rotating (90, 180, 270 degrees)", "mirroring" and "inverting".

Document Analysis

The document analysis functions of the FineReader Engine API solve such tasks as automatic document conversion with full-page layout retention, zonal OCR with manually located blocks, form processing with template matching, etc. They include:

  • auto-detection of page orientation - 90, 180, 270 degrees (see above in Image manipulation and preprocessing);
  • auto-detection of text blocks, tables, barcodes and pictures;
  • auto-detection of vertical text in table cells;
  • manual block zoning (adding, removing and editing blocks);
  • template auto-identification and matching for form processing. A more detailed description is given under Form processing below.

One of the unique FineReader Engine features is:

  • Document Analysis for Invoices.

    A special document analysis function designed as a preprocessing engine for converting semi-structured documents, such as invoices, payment drafts, checks, transfers, business cards, agreements, health claim forms, resumes, etc. In this preprocessing role, the function is designed to find as much text on these documents as possible, including characters and numbers, even if this information is located within stamps, pictures, logos or small-text areas.

    Unlike in standard full-page document analysis, this specialized document analysis assumes all printed information on the documents is text. It also ensures that important text information is not identified as graphic elements and that words or numerical values are not separated into multiple characters. As a result, maximum information about the text, including its coordinates, is available for analysis, field-by-field processing and parsing at subsequent processing stages by other systems.

    Document Analysis for Invoices is used in FlexiCapture Studio as the first step of semi-structured document analysis, helping to extract data from unstructured forms and from documents with similar data but different layouts. For more information, see the description of how FlexiCapture works.
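As a rough illustration of the "subsequent processing stages by other systems" mentioned above, the sketch below shows how a downstream parser might use text fragments together with their coordinates. The `TextFragment` structure, the `value_right_of` helper and the sample coordinates are all hypothetical; they do not represent the FineReader Engine output format.

```python
from dataclasses import dataclass


@dataclass
class TextFragment:
    """One recognized piece of text with its bounding box (hypothetical format)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def value_right_of(fragments, label, max_row_offset=10):
    """Return the text of the nearest fragment to the right of `label`
    on roughly the same line, or None if nothing suitable is found."""
    anchors = [f for f in fragments if f.text.strip(":").lower() == label.lower()]
    if not anchors:
        return None
    anchor = anchors[0]
    candidates = [
        f for f in fragments
        if f.left >= anchor.right and abs(f.top - anchor.top) <= max_row_offset
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda f: f.left - anchor.right).text


fragments = [
    TextFragment("ACME Ltd", 40, 30, 200, 70),       # text found inside a logo
    TextFragment("Invoice No:", 400, 40, 520, 60),
    TextFragment("INV-0042", 540, 42, 640, 62),
    TextFragment("Total:", 400, 900, 460, 920),
    TextFragment("1,250.00", 480, 902, 560, 922),
]
print(value_right_of(fragments, "Invoice No"))
print(value_right_of(fragments, "Total"))
```

Because every fragment keeps its coordinates, the parser can rely on position ("to the right of the label, on the same line") rather than on a fixed layout.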

Recognition

OCR

  • Recognizes a total of up to 186 machine-print languages.
  • 177 languages for OCR with Latin, Cyrillic, Greek and Armenian characters.
  • 34 languages have dictionary/morphology support.
  • Recognition of multilingual documents.
  • Recognition of dot-matrix documents.
    FineReader Engine recognizes dot matrix texts of many types. Tested with several thousand samples on a variety of printers including dot matrix, daisy wheel, chain and band printers, draft and Near Letter Quality (NLQ).
  • Recognition of typewritten documents.
  • Chinese, Japanese, Korean (CJK) character recognition (for more detailed information about CJK OCR, see the “Licensing Policy, add-on modules” section).
  • Fast mode recognition.
    Designed for high-volume document processing applications where speed is more important than accuracy. This mode increases processing speed by 200-250%, making it particularly useful with document management and archiving systems.
  • Recognition of OCR-A, OCR-B, and MICR (E13B).
  • FineReader XIX.
    Many old documents, books, and newspapers published between the 17th and 20th centuries survive all over the world. Most of them are very rare and many are unique. Stored in the archives of libraries and government organizations, they are national heritage that must be preserved. The best solution is to digitize them. The set of functions called “FineReader XIX” in ABBYY FineReader Engine 7.1 provides a unique capability to recognize texts published from 1600 to 1937 in English, French, German, Italian, and Spanish. FineReader XIX supports special fonts such as Fraktur, Schwabacher and the majority of Gothic fonts.
For the entire list of supported OCR/ICR languages, see the “Specifications” section.

ICR

  • Recognizes a total of up to 90 handprint languages.
  • 17 languages with Latin characters, Greek, and 3 languages with Cyrillic characters (Cyrillic available only upon request) with morphology/dictionary support.
  • 69 languages with Latin characters without dictionaries.
  • Recognition of hand-printed characters in various field borders and frames - underlined fields, boxes, comb-style fields, etc.
  • Fast mode recognition.
    It is designed for high-volume document processing applications where speed is most important. This mode increases processing speed by 200-250%.
  • Multilingual ICR.
    One of the main advantages of ABBYY ICR is that it delivers almost the same high accuracy on digits, on digits combined with letters of one language, and on digits combined with letters of several languages, even if fields contain both upper- and lower-case letters.
  • Supports 22 handwriting styles from different countries and regions: European, American, Canadian, Russian, Japanese, Arabic and Thai.
  • Supports ICR of the Indian digits used in Arab states.

OMR, Barcode

  • Recognition of 1D barcodes.
    FineReader Engine supports the most popular 1D barcodes: Code 39, Checked Code 39, Interleaved 25, Checked Interleaved 25, EAN 8, EAN 13, Code 128, CODABAR (without checksum), UCC Code 128, Code 2 of 5 (Industrial, IATA, Matrix), Code 93, UPC-A, UPC-E, and Postnet barcodes.
  • 2D Barcode Recognition (PDF417).
    2D barcode recognition supports PDF417, the industry standard for 2D barcodes. PDF417 encodes up to 1.1 kilobytes of data, including text and graphics information.
  • Fast barcode extraction.
    This feature automatically finds and recognizes barcodes at any angle on a document. It works for both 1D and 2D barcodes.
  • OMR (Optical Mark Recognition).
    Optical Mark Recognition handles simple checkmarks, grouped checkmarks, model checkmarks and checkmarks with “corrections” made by hand. It delivers an accuracy rate of 99.995%.
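Several of the barcode types listed above carry a check digit, which an application can verify after recognition. As a minimal sketch, the code below applies the standard EAN 13 check-digit rule to a recognized value. The function names are chosen for this example, and whether the engine performs the same check internally is not stated on this page.

```python
def ean13_check_digit(first12):
    """Compute the EAN-13 check digit for the first 12 digits."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 digits")
    # Digits are weighted 1, 3, 1, 3, ... from left to right.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code):
    """Return True when a 13-digit string has a consistent check digit."""
    return (
        len(code) == 13
        and code.isascii()
        and code.isdigit()
        and ean13_check_digit(code[:12]) == int(code[12])
    )


print(is_valid_ean13("4006381333931"))  # a correctly read code
print(is_valid_ean13("4006381333937"))  # last digit misread as 7
```

A failed check is a cheap signal that a barcode should be re-read or sent for manual verification.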
User languages

Here are two examples where user languages will help you to improve recognition quality.

In documents filled out by hand, the values in the form fields usually belong to a specific set such as city names, countries, zip codes, product codes, sums, etc. To improve the quality of ICR recognition, you can use user languages to describe the information which may be entered in each field.

If a document contains a lot of "unnatural" structures such as product codes, telephone numbers, passport numbers etc., recognition errors may occur. This happens because the program reads such structures letter by letter. To improve the recognition of product codes and the like, you can create a new recognition language which will help the program to read specific types of data correctly.

The FineReader Engine provides an API for creating and editing recognition languages, creating copies of system recognition languages and adjusting them, and adding new words to user languages.
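The effect of a user language on "unnatural" structures can be sketched without the Engine itself. In the example below, a recognizer is assumed to return several candidate characters for each position, and a regular expression plays the role of the user language that restricts what a product code may look like. The function `best_reading`, the pattern and the candidate lists are all hypothetical and are not part of the FineReader Engine API.

```python
import itertools
import re


def best_reading(alternatives, pattern):
    """Return the first combination of per-character alternatives that
    matches `pattern`, or None when no combination fits.

    `alternatives` holds one string per character position; each string
    lists the candidate characters for that position, most likely first.
    Combinations are tried in the order produced by itertools.product.
    """
    compiled = re.compile(pattern)
    for combo in itertools.product(*alternatives):
        candidate = "".join(combo)
        if compiled.fullmatch(candidate):
            return candidate
    return None


# Hypothetical product-code rule: two capital letters, a dash, four digits.
PRODUCT_CODE = r"[A-Z]{2}-\d{4}"

# Read letter by letter, the top choices alone would give "A8-1O0S".
alternatives = ["A", "8B", "-", "1lI", "O0", "0O", "S5"]
code = best_reading(alternatives, PRODUCT_CODE)
print(code)
```

The constraint does not make the recognizer see better; it only rules out readings that cannot be valid product codes, which is why describing the expected data helps most on fields with a rigid structure.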

Pattern Training

In the vast majority of cases FineReader Engine can successfully read texts without prior training. However, in such cases as recognition of decorative or outlined fonts or bulk input of low print quality documents, preliminary pattern training will prove useful.

The FineReader Engine allows you to create and use user patterns or import them from the ABBYY FineReader desktop application (Professional or Corporate Edition). FineReader Engine is flexible enough to serve applications of any architecture, whether a client workstation application designed from scratch (FineReader Engine has no ready-made user interface) or a server-based solution.

Form processing

The FineReader Engine 7.1 API provides new form processing functionality with added support for ABBYY FormReader and ABBYY FlexiCapture Studio. It is now possible to process both fixed forms and semi-structured forms and documents. Fixed-form processing capability allows a system to process forms that have the same layout and in which data is at the same exact location, such as tax return forms, questionnaires and registration cards. With ABBYY FlexiCapture Studio, systems can process and extract information from documents and forms with similar data but different layouts and data location, such as invoices, claim forms, EOBs, resumes, and contracts.

ABBYY FormReader is an integral part of ABBYY FineReader Engine 7.1, while FlexiCapture Studio is available under an additional license.

A typical form processing scenario with FineReader Engine consists of the following steps. First, a template is created using FormReader's template designer, or a FlexiLayout is created using FlexiCapture Studio. A template identifies the location of the data to be captured by marking the fields and the reference marks (such as corners, crosses, static texts, etc.). A FlexiLayout is a formalized description of semi-structured forms; it tells FineReader Engine how to look for a particular field. The template or FlexiLayout contains information about the layout of forms or documents, the absolute or relative positions of reference marks and fields, the relations between marks and fields, and the data type of each field on a form, including user-defined data types. The template or FlexiLayout is then exported into FineReader Engine through its API for subsequent form matching, data capture, export, or any other post-recognition manipulation.
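To make the contents of a fixed-form template more concrete, here is a minimal sketch of a template that stores field positions relative to one reference mark. The `Template` and `Field` classes and all their attributes are hypothetical illustrations; the actual FormReader template format is not described on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str
    data_type: str   # e.g. "text", "date", or a user-defined type
    dx: int          # offset of the field's left edge from the reference mark
    dy: int          # offset of the field's top edge from the reference mark
    width: int
    height: int


@dataclass
class Template:
    name: str
    reference_mark: str   # e.g. "top-left black square"
    fields: list = field(default_factory=list)

    def locate_fields(self, mark_x, mark_y):
        """Turn relative field positions into absolute rectangles
        (left, top, right, bottom) once the reference mark has been
        found at (mark_x, mark_y) on a scanned page."""
        return {
            f.name: (mark_x + f.dx, mark_y + f.dy,
                     mark_x + f.dx + f.width, mark_y + f.dy + f.height)
            for f in self.fields
        }


template = Template(
    name="registration card",
    reference_mark="top-left black square",
    fields=[
        Field("last_name", "text", dx=100, dy=50, width=400, height=40),
        Field("birth_date", "date", dx=100, dy=110, width=200, height=40),
    ],
)
# The form was scanned with a small shift: the mark is found at (37, 22).
regions = template.locate_fields(37, 22)
print(regions["last_name"])   # (137, 72, 537, 112)
```

Storing positions relative to a reference mark is what lets the same template be matched against pages that were shifted during scanning.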

To learn more about FlexiCapture Studio, see the FlexiCapture Studio description section.

The FineReader Engine API, combined with the included FormReader and FlexiCapture Studio programs, gives developers the following capabilities for processing fixed and semi-structured forms and documents:

Features and functions for processing fixed forms

  • Built-in Template Designer.
    FormReader includes a "Template Editor", which allows developers to create a template for template matching and data capture. This tool is designed to create a template for already printed (or even already filled out) forms.
  • FormDesigner.
    The FormDesigner allows developers to create a new form from scratch. The FormDesigner, which is seamlessly integrated with the Template Editor, lets developers create forms and the corresponding templates by drawing them only once.
  • Various reference elements: black squares, text, lines, images.
  • Multi-line text in fields.
    During template creation, it is possible to mark several text lines as one text field. After recognition, all information in these text lines can be exported into a single database field.
  • User-defined data types.
    It is possible to set up a user-defined data type for a field, such as a regular expression, a restricted character set, a dictionary from a text file, or any combination of them.
  • Specialized data types.
    FormReader includes specialized data types for 21 languages, which help to recognize fields like First Name, Last Name, City, E-mail, Address, Phone, Country, etc. These dictionaries ensure that the OCR/ICR Engine will use all of the available information to achieve the best possible accuracy.
  • Template auto-identification.
    Many fixed-form templates can be exported into FineReader Engine. The matching procedure will automatically choose the right template for each form page.
  • Compensation of linear distortions using "cornerstones" (for faxed forms etc.).
    FineReader Engine includes new algorithms to compensate for linear distortions for better recognition of faxes or distorted documents.
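Template auto-identification from the list above can be pictured as scoring each template against the reference elements detected on a page. The sketch below is only an assumed, simplified scheme (the share of a template's marks found within a pixel tolerance); the function names, the tolerance and the `min_score` cut-off are invented for illustration and do not describe the Engine's actual matching procedure.

```python
def match_score(template_marks, page_marks, tolerance=10):
    """Fraction of a template's reference marks found on the page.

    Marks are (kind, x, y) tuples; a template mark counts as found when
    the page has a mark of the same kind within `tolerance` pixels on
    both axes.
    """
    if not template_marks:
        return 0.0
    found = 0
    for kind, x, y in template_marks:
        if any(k == kind and abs(px - x) <= tolerance and abs(py - y) <= tolerance
               for k, px, py in page_marks):
            found += 1
    return found / len(template_marks)


def identify_template(templates, page_marks, min_score=0.75):
    """Return the name of the best-matching template, or None when no
    template reaches `min_score`."""
    best_name, best_score = None, 0.0
    for name, marks in templates.items():
        score = match_score(marks, page_marks)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_score else None


templates = {
    "tax return": [("square", 50, 50), ("square", 1150, 50),
                   ("text", 600, 80), ("line", 100, 1500)],
    "questionnaire": [("square", 50, 50), ("cross", 600, 40),
                      ("square", 1150, 1650)],
}
page_marks = [("square", 53, 48), ("cross", 604, 44), ("square", 1147, 1655)]
print(identify_template(templates, page_marks))
```

In this sample, the page shares one corner square with the tax return template but matches all three marks of the questionnaire, so the questionnaire template is selected.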

Features and functions for processing semi-structured forms and documents

  • Create FlexiLayout using FlexiCapture Studio

    • Developers can use FlexiCapture Studio to create a FlexiLayout, which describes the logic of data extraction from semi-structured forms and documents. It contains two types of elements: simple elements (e.g. static text, separator, white gap, barcode, character string, text fragment, object collection and date) and compound elements, i.e. simple (and compound) elements joined by AND.
    • It is possible to run a full-page pre-recognition process before creating a FlexiLayout. This process automatically detects objects such as text, separators, barcodes, inverted text, square checkmarks and pictures. It can be run in fast mode.
    • Each FlexiLayout element has properties that describe its geometrical position and its relationships with other elements on the form. This description is visual, so the user can set properties in the dialogs and controls of the FlexiCapture Studio user interface. The description can also be written in the advanced FlexiLayout language, which resembles a programming script. For example, you can tell the program to look for object A to the left or right of object B, tell it that object A will be closest to object B so that it can use object A as a reference, or tell it not to look for object A if it finds object B.
    • When the program uses FlexiLayout to map the image, it formulates a series of hypotheses (i.e. assumptions that the detected objects correspond to the elements) for all the created elements. All hypotheses are ranked by their quality (the program estimates how well the detected object matches the description contained in the FlexiLayout and penalizes poor matches).
      Hypotheses are arranged in a tree-like structure that maps out the relationships between FlexiLayout elements and possible hypotheses. When matching the FlexiLayout with images, the program selects the best branch in the tree of hypotheses that includes all the described elements, from first to last. The tree is shown to users, allowing them to see how the program selects the best match and to adjust the FlexiLayout to achieve a match.
    • Using the concept of a null hypothesis, some objects can be specified as optional. For example, if the program does not find an optional object, instead of assuming that the image and FlexiLayout cannot be matched, the program will advance a null hypothesis and assume that there is no such object on the form.
    • If an element is not optional, it can be used as an identifier of the type of document. This allows FineReader Engine to choose the best template from the variety of templates available in the batch in case of multi-type document processing.
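The hypothesis tree and the null hypothesis described above can be sketched as a small search. The code below is a simplified model, not the FlexiCapture implementation: it assumes that the quality of a branch is the product of the qualities of its hypotheses and that skipping an optional element costs a fixed factor, `NULL_QUALITY`. Those two rules, the function name and the sample data are all assumptions.

```python
NULL_QUALITY = 0.9  # assumed penalty for skipping an optional element


def best_branch(elements, hypotheses, chosen=(), quality=1.0):
    """Depth-first search over the tree of hypotheses.

    `elements` is an ordered list of (name, optional) pairs and
    `hypotheses` maps an element name to a list of (object_id, quality)
    pairs. Returns (quality, assignments) for the best complete branch,
    or (0.0, None) when a required element cannot be matched.
    """
    if not elements:
        return quality, list(chosen)
    (name, optional), rest = elements[0], elements[1:]
    options = list(hypotheses.get(name, []))
    if optional:
        options.append((None, NULL_QUALITY))  # null hypothesis: object absent
    used = {obj for _, obj in chosen if obj is not None}
    best = (0.0, None)
    for obj, q in options:
        if obj in used:
            continue  # one detected object cannot fill two elements
        result = best_branch(rest, hypotheses, chosen + ((name, obj),), quality * q)
        if result[0] > best[0]:
            best = result
    return best


elements = [("InvoiceLabel", False), ("InvoiceNumber", False), ("Stamp", True)]
hypotheses = {
    "InvoiceLabel": [("text#3", 0.95), ("text#7", 0.60)],
    "InvoiceNumber": [("text#3", 0.70), ("text#4", 0.90)],
    "Stamp": [("picture#1", 0.40)],
}
quality, branch = best_branch(elements, hypotheses)
print(round(quality, 4), branch)
```

With these sample numbers, the best branch assigns `text#3` to the label and `text#4` to the number, and prefers the null hypothesis for the optional stamp because the only stamp candidate is a poor match.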

  • Testing FlexiLayout within FlexiCapture Studio

    • FlexiLayouts can be tested on a test batch to check how well the FlexiLayout matches the images.
    • Step-by-step improvement of FlexiLayouts: new form samples can be gradually added so that the FlexiLayout can be tested on even more variations of a form of a particular type.

All other features, such as image preprocessing, recognition and export, are the same as in FineReader Engine.

Receiving and exporting recognized text

The FineReader Engine API provides a wide range of options for export of recognition results, including different levels of document reconstruction:

  • A set of different levels of text format retention during export to external formats (from simple text with no formatting to complete page layout retention, including columns, tables, frames, fonts, font size, paragraph styles, borders, etc.).
  • Providing access to detailed information about each recognized character.
  • A set of functions for post-editing and post-formatting the recognized text before export.
  • Exporting recognized text into various formats (for the full list of formats, see the ABBYY FineReader Engine 7.1 Specification).
  • Retaining full page layout of documents.
  • Replacing uncertain characters with their corresponding images when saving in PDF format.
  • Retaining picture and text color in full.
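Two items in the list above, access to per-character information and the handling of uncertain characters, can be illustrated with a small data model. The `RecognizedChar` structure and both helper functions are hypothetical; they only show the kind of processing that per-character results make possible and are not the Engine's export API.

```python
from dataclasses import dataclass


@dataclass
class RecognizedChar:
    """Per-character recognition result (hypothetical structure)."""
    char: str
    rect: tuple              # (left, top, right, bottom) on the page image
    uncertain: bool = False


def export_plain_text(chars, marker=None):
    """Simplest export level: characters only, no formatting.
    When `marker` is given, it is placed before every uncertain character."""
    return "".join(
        (marker + c.char) if (marker and c.uncertain) else c.char
        for c in chars
    )


def uncertain_regions(chars):
    """Rectangles of uncertain characters, i.e. the areas an exporter could
    fill with snippets of the original image instead of recognized text."""
    return [c.rect for c in chars if c.uncertain]


chars = [
    RecognizedChar("2", (10, 10, 20, 30)),
    RecognizedChar("0", (22, 10, 32, 30), uncertain=True),
    RecognizedChar("0", (34, 10, 44, 30)),
    RecognizedChar("7", (46, 10, 56, 30)),
]
print(export_plain_text(chars))              # 2007
print(export_plain_text(chars, marker="^"))  # 2^007
print(uncertain_regions(chars))              # [(22, 10, 32, 30)]
```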

Read more about ABBYY FineReader Engine 7.1 technology background...