Monday, November 06, 2006

Updated OCR Engines

We currently ship with Optical Character Recognition technology developed by two different companies: Abbyy, which makes FineReader, and Nuance (formerly ScanSoft), which makes OmniPage. In general, the commercial versions of OCR products from those companies are released first. Later, sometimes much later, they package those technologies as toolkits that other companies, such as ours, can license and use. Both companies provided a new version of their OCR technology in time for Version 11 of Kurzweil 1000: FineReader Version 8 and ScanSoft Version 15.

We've taken advantage of some new features in both products, giving us some new functionality for analyzing forms (see an earlier post) and for opening PDF files. Mostly, though, we were interested in recognition improvements. We do see some, though we haven't yet done a large-scale analysis of how recognition accuracy differs between earlier versions and this one.

One important point - the latest version of ScanSoft OCR no longer supports Windows 95, Windows 98, or Windows ME. If you use any of those operating systems, we will install an older version of ScanSoft OCR - version 12.6 - rather than the current one. As a consequence, form recognition will not be available on those systems.
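
For readers who think in code, the fallback described above can be sketched roughly as follows. This is only an illustration of the rule, not our installer: the names `LEGACY_SYSTEMS`, `EngineChoice`, and `choose_scansoft_engine` are made up for the example, and the operating system is assumed to arrive as a plain string.

```python
from dataclasses import dataclass

# Hypothetical names for illustration; this is not the actual installer code.
LEGACY_SYSTEMS = {"Windows 95", "Windows 98", "Windows ME"}


@dataclass(frozen=True)
class EngineChoice:
    version: str             # ScanSoft OCR version to install
    form_recognition: bool   # whether form recognition is available


def choose_scansoft_engine(os_name: str) -> EngineChoice:
    """Pick the ScanSoft OCR version for a given operating system name."""
    if os_name in LEGACY_SYSTEMS:
        # The current engine does not support these systems, so the older
        # version is installed and form recognition is unavailable.
        return EngineChoice(version="12.6", form_recognition=False)
    return EngineChoice(version="15", form_recognition=True)
```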

Both engines now provide some speed control - that is, you can ask the engine to recognize something quickly, perhaps at the expense of the accuracy of the recognition. Some of you may remember that we listed ScanSoft's recognition engine twice in our previous release - one listing for speed, the other for accuracy. Now you will find it listed only once, because a separate setting applies to both recognition engines. That setting is labeled "Recognition Approach" - the setting options are "Accuracy" or "Speed". You'll find it immediately before the Engine setting in the Recognition Settings dialog. You'll also find that we removed two settings: Recognition of Light Text on a Dark Background and Questionable Character Markup. The former is not controllable in some engines, and is now always enabled. The latter wasn't possible with one of the new engines, and seemed to be used rarely, if at all.

While I'm on the subject of recognition settings, let me mention an important one. I'll talk about it more in a post on conversion settings, but it really has to do with character recognition. In the Conversion Settings dialog, you'll find one setting that has to do with opening PDF documents. The setting is labeled "Emphasis", and your choices are "Recognition of Images" and "Extraction of Text".

PDF files are unusual in that they can contain both text and images. Sometimes they have no text, but they do have images of text - this happens most often when the person who created the PDF file used a scanner to make it. Sometimes they have text for portions of a page, but not for all of it. Sometimes the text for the full page is available, but it is not clear how it should be ordered.

Both OCR engines can extract the text, if it is there, and use it. If you indicate that Extraction of Text should be emphasized, they will use the images only to establish the position of the text and its reading order. If you indicate that Recognition of Images should be emphasized, the extracted text will be used only on a word-by-word basis to correct simple recognition errors. Recognition of Images is the default, although, if your PDF file has all of its text, it will be slower and less accurate than Extraction of Text. Unfortunately, if your PDF file contains images of text for which there is no text that can be extracted, Extraction of Text can cause entire sections of a page to be skipped.
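
If it helps, here is one way to think about choosing between the two Emphasis options, written as a small sketch. It assumes you somehow know, for each page, roughly what fraction of the visible text is also present as extractable text. Kurzweil 1000 does not report such a number; the `page_text_coverage` input, the `suggest_emphasis` function, and the 0.99 threshold are all invented for this example.

```python
def suggest_emphasis(page_text_coverage, threshold=0.99):
    """Suggest an Emphasis setting for a PDF file.

    page_text_coverage: one float per page (0.0 to 1.0), an assumed estimate
    of how much of the page's visible text exists as extractable text.
    """
    # Extraction of Text is only safe when every page has essentially all of
    # its text available; otherwise whole sections of a page could be skipped.
    if page_text_coverage and all(c >= threshold for c in page_text_coverage):
        return "Extraction of Text"
    # Otherwise stay with the default, which recognizes the page images and
    # uses any extracted text only to correct simple errors.
    return "Recognition of Images"
```

The sketch deliberately falls back to the default whenever any page is in doubt, since a slower result is usually preferable to missing text.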

Although we haven't independently verified them, I thought you might be interested in the vendors' claims about improvements in their new OCR engines.

For Abbyy, see http://www.abbyy.com/sdk/?param=35469
For Nuance, see http://www.nuance.com/omnipage/capturesdk/whatsnew.asp

I have reproduced some of the more pertinent claims here.

Abbyy has added a "Fast Mode", performing recognition up to two times faster.

Intelligent image analysis in FineReader Engine 8.0 delivers higher recognition accuracy. FineReader technology automatically adjusts its algorithms to account for image condition, resulting in increased accuracy by up to 30% on low resolution documents (scanned at under 200 dpi or faxes).

Abbyy also claims that their PDF processing is up to two times faster, and often more accurate, as they do a better job of analyzing the internal information within source PDF files, including annotations, metadata, text objects, font dictionaries, and content streams.

Nuance suggests that its newly developed 3-way voting system provides a 36% increase in accuracy over previous versions.
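
Nuance does not spell out here how its voting works, so the following is only a generic illustration of the idea of three recognizers voting, not their algorithm. It assumes the three outputs are already aligned character by character and have equal length, which real engines cannot take for granted; the function name `vote` and the tie-breaking rule are my own choices for the example.

```python
from collections import Counter


def vote(outputs, fallback=0):
    """Combine three aligned recognizer outputs by per-character majority.

    outputs: three strings of equal length (an assumption of this sketch).
    fallback: index of the recognizer to trust when all three disagree.
    """
    if len(outputs) != 3 or len({len(o) for o in outputs}) != 1:
        raise ValueError("expected three outputs of equal length")
    result = []
    for chars in zip(*outputs):
        char, count = Counter(chars).most_common(1)[0]
        # Accept the character if at least two recognizers agree on it;
        # otherwise fall back to the designated recognizer.
        result.append(char if count >= 2 else chars[fallback])
    return "".join(result)
```

For instance, `vote(["c1ear", "clear", "cleor"])` returns `"clear"`, because each single-engine mistake is outvoted by the other two.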

Your mileage, as they say, may vary. We'd be interested in what you think once you've taken version 11 out for a spin.

2 Comments:

Blogger Pranav Lal said...

Do you have any plans to add math recognition using Infty reader? (I hope I am spelling that right. It's either Infty or Infti reader)

2:05 AM  
Blogger K1000Engineer said...

I'd certainly like to. There are, however, a lot of hurdles before it could be done.

First, although the Infty project has made great progress, it isn't very robust. It works with some material, not others, and the scanning step has to be done very carefully.

Second, it's not enough to do OCR. You need some mechanisms for editing, for speaking, and for interacting with complex formulae. There are some good efforts here too - some proprietary, some public - but it's a lot of work getting it all together. There are likely to be some big efforts at the U.S. Government level to improve math literacy in the United States, and I expect we will see some better technical tools to make that possible within a few years.

Stephen

9:12 AM  
