Built-in OCR

OCR Flow

Optical Character Recognition, commonly referred to as OCR, is the process of converting scanned images of letters and words into machine-readable electronic text.

ImageMagick

The first step is ImageMagick processing. ImageMagick splits the document into pages and converts each page into an image using the provided configuration. The results are saved as .jpg files.

ImageMagick Scripts

A range of image processing scripts based on ImageMagick's functions is also available. Scripts can be executed sequentially, with each script using the output file of the previous script as its input. This lets you apply multiple scripts in a specific order to improve image quality.

If a script fails to produce an output file, its input file is passed on to the subsequent steps and a corresponding log message is generated.

To run the same script more than once in succession, create separate copies of the script under different names within the same directory. Each copy must have a unique name to avoid conflicts. Then update the settings to reference the copies by name in the order they are intended to run.
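
The chaining and fallback behaviour described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the container's actual implementation: the `Script` type, the `run_script_chain` function and the logger name are hypothetical, and each script is modelled as a callable that returns its output path or `None`.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable, Optional, Tuple

logger = logging.getLogger("ocr.scripts")  # hypothetical logger name

# Hypothetical script type: takes the input image path and returns the
# output path, or None when the script produced no output file.
Script = Callable[[Path], Optional[Path]]


def run_script_chain(image: Path, scripts: Iterable[Tuple[str, Script]]) -> Path:
    """Run the scripts in order, feeding each one the previous output."""
    current = image
    for name, script in scripts:
        result = script(current)
        if result is None or not result.exists():
            # No output file: log it and pass the same input to the next step.
            logger.warning("Script %s produced no output; reusing %s", name, current)
            continue
        current = result
    return current
```

In this model, running the same script twice would mean two entries with different names in `scripts`, which mirrors the requirement to keep uniquely named copies.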

Tesseract

In the next step, Tesseract converts the document images into text and hOCR formats and saves the results.
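
As a rough illustration of this step, the sketch below assembles a Tesseract command line for one page image. How the container actually invokes Tesseract is not documented here, so the function name `build_tesseract_command` and the choice of output base name are assumptions; the `txt` and `hocr` arguments are the standard Tesseract config names for text and hOCR output.

```python
from pathlib import Path
from typing import List, Sequence


def build_tesseract_command(image: Path, options: Sequence[str]) -> List[str]:
    """Build a Tesseract CLI call: tesseract IMAGE OUTPUTBASE [options] [configs]."""
    # Assumption: outputs are written next to the image, named after it.
    output_base = str(image.with_suffix(""))
    # "txt" and "hocr" ask Tesseract to write both a text and an hOCR file.
    return ["tesseract", str(image), output_base, *options, "txt", "hocr"]


# Options as they might appear under "tesseractOptions" in the input message.
command = build_tesseract_command(
    Path("page-1.jpg"),
    ["-l", "deu", "--psm", "3", "--oem", "3", "--dpi", "800"],
)
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where Tesseract is installed.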

PostProcessor

After the Tesseract conversion finishes, the results are handled by HocrPostProcessor.
The process developer may optionally provide a configuration that specifies pairs of recognized and correct words.
HocrPostProcessor goes over the hOCR words and replaces every word that matches a recognized pattern with the corresponding correct word.
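
The replacement step can be approximated with the short Python sketch below. It is not the HocrPostProcessor source: the function `fix_hocr_words` is hypothetical, and it assumes exact whole-word matching on elements with the `ocrx_word` class, whereas the real component may match patterns differently.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def fix_hocr_words(hocr_xml: str, fix_words: Dict[str, str]) -> str:
    """Replace recognized words in hOCR markup with their corrected forms."""
    root = ET.fromstring(hocr_xml)
    for element in root.iter():
        classes = element.get("class", "").split()
        # Assumption: a word is fixed only when its whole text matches a key.
        if "ocrx_word" in classes and element.text in fix_words:
            element.text = fix_words[element.text]
    return ET.tostring(root, encoding="unicode")


# A small hOCR-like fragment, using the word pair from the input example.
sample = (
    "<div class='ocr_line'>"
    "<span class='ocrx_word' title='bbox 10 10 90 30'>INVOICE</span> "
    "<span class='ocrx_word'>No</span>"
    "</div>"
)
fixed = fix_hocr_words(sample, {"INVOICE": "InvoiceFixed"})
```

Only the text of matching word elements changes; attributes such as the bounding box in `title` are left untouched.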

OCR Container Details

Input

The OCR container expects a JSON message as input. Example:

{
	"taskUuid": "41125e8d-8b9e-4e42-b0ef-739594a21c50",
	"runUuid": "5f425d18-06ac-48f8-ad2b-5af4a24daca7",
	"documentLocation": "template1.pdf",
	"formats": [
		"hocr",
		"text",
		"json",
		"image"
	],
	"configuration": {
		"bucket": "ocr",
		"tesseractOptions": [
			"-l",
			"deu",
			"--psm",
			"3",
			"--oem",
			"3",
			"--dpi",
			"800"
		],
		"imageMagickOptions": [
			"-resample",
			"450",
			"-density",
			"350",
			"-quality",
			"100",
			"-background",
			"white",
			"-alpha",
			"flatten",
			"-colorspace",
			"RGB"
		],
		"hocrFixWords": {
			"INVOICE": "InvoiceFixed"
		}
	}
}
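
A message like the one above could be assembled programmatically. The following Python sketch is an assumption-laden example: `build_ocr_request` and `SUPPORTED_FORMATS` are hypothetical names, the format list is taken from the Output section of this page, and how the message is delivered to the container is outside the scope of the sketch.

```python
import json
from typing import Any, Dict, Sequence

# Format names listed in the Output section of this page.
SUPPORTED_FORMATS = {"image", "text", "hocr", "json", "tainput"}


def build_ocr_request(
    task_uuid: str,
    run_uuid: str,
    document_location: str,
    formats: Sequence[str],
    configuration: Dict[str, Any],
) -> str:
    """Return the input message as a JSON string, rejecting unknown formats."""
    unknown = sorted(set(formats) - SUPPORTED_FORMATS)
    if unknown:
        raise ValueError(f"Unsupported formats: {unknown}")
    message = {
        "taskUuid": task_uuid,
        "runUuid": run_uuid,
        "documentLocation": document_location,
        "formats": list(formats),
        "configuration": configuration,
    }
    return json.dumps(message, indent=2)


request_json = build_ocr_request(
    "41125e8d-8b9e-4e42-b0ef-739594a21c50",
    "5f425d18-06ac-48f8-ad2b-5af4a24daca7",
    "template1.pdf",
    ["hocr", "text"],
    {"bucket": "ocr", "tesseractOptions": ["-l", "deu"]},
)
```

Validating the format names before sending is a design choice of this sketch; the container itself may handle unknown formats differently.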

Output

The OCR container returns a list of OcrOutput objects (one per page), which are filled depending on the "formats" input parameter.

  • image - the image path and image dimensions
  • text - the path to the Tesseract text file
  • hocr - the path to the postprocessed Tesseract hOCR file
  • json - a JSON structure obtained from the hOCR file using HocrJsonConverter
  • tainput - optional debug information that contains a link to the image supplied to Tesseract

To explore the OCR image processing pipeline and test the available options, you can use the Optical Character Recognition Sample Process (OCR Sample).

To read more about techniques for OCR quality analysis and improvement, see the OCR Tuning Guide.