Optical Character Recognition Sample Process (OCR Sample)

Overview

OCR Sample is designed to test and refine image preprocessing techniques that prepare documents for OCR. Systematically exploring and fine-tuning techniques such as deskewing, rotating, resizing, removing perspective distortion, reducing noise, and increasing contrast, with the help of image processing scripts or ImageMagick options, can improve OCR accuracy and efficiency. This preprocessing is crucial for subsequent data extraction because it ensures that text within images is accurately recognized. The provided settings let you explore ImageMagick and Tesseract options as well as the sequential execution of ImageMagick scripts.

Prerequisites

In order to successfully set up and run OCR Sample:

  1. Ensure that you have a running node with the "AP_RUN" and "SELENIUM" capabilities.
  2. Upload the OCR Sample package to the Control Server. The package can be found at the following URL: http://<CS host>/nexus/repository/rpaplatform/eu/ibagroup/samples/ap/easy-rpa-ocr-ap/<EasyRPA version>/easy-rpa-ocr-ap-<EasyRPA version>-bin.zip
    The source code can be found here: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/easy-rpa-ml-aps/easy-rpa-ocr-ap
  3. To run the demo, launch the "Preprocess" option on a selected document set of the package.

OCR Sample Package Structure

Folder | Type | Description
IE Document Processor | Automation Process | Standard information extraction automation process
OCR_SAMPLE_TEXTCLEANER | Document Set | Information Extraction Document Set. Contains documents and document set settings to demonstrate textcleaner script execution results.
OCR_SAMPLE_UNPERSPECTIVE | Document Set | Information Extraction Document Set. Contains documents and document set settings to demonstrate otsuthresh and unperspective script execution results.
OCR_SAMPLE_TEXTCLEANER | Document Type | Information Extraction Document Type
storage/data | Storage | Provides image preprocessing scripts.

OCR_SAMPLE_TEXTCLEANER Document Set 

In this example, the textcleaner script is run on a set of text document images to remove noise and clean the text background.

JSON Settings for TEXTCLEANER script

Below you can find an example of JSON settings for the TEXTCLEANER script, as provided in the Document Set Details.

TEXTCLEANER script JSON settings example
"imagePostprocessScriptsBucket": "data/ocr_sample/scripts",
"imagePostprocessScripts": {
	"textcleaner": [
		"-g",
		"-e", "stretch",
		"-f", "25",
		"-o", "15",
		"-t", "30",
		"-u",
		"-s", "1",
		"-T",
		"-p", "20"
	]
}

TEXTCLEANER script JSON settings contain:

  • "imagePostprocessScriptsBucket": "data/ocr_sample/scripts"

    • This specifies the directory where the image preprocessing scripts are stored. 
  • "imagePostprocessScripts":

    • This section defines the specific image preprocessing scripts and their parameters.
  • "textcleaner":

    • The name of the preprocessing script being utilized. 
  • Parameters for "textcleaner" script:

    • "-g": converts the image to grayscale.
    • "-e stretch": enhances image brightness before cleaning, using the stretch method.
    • "-f 25": sets the size of the background-cleaning filter to 25.
    • "-o 15": sets the filter offset used to reduce noise to 15 percent.
    • "-t 30": sets the text smoothing threshold to 30.
    • "-u": deskews (unrotates) the image.
    • "-s 1": sets the sharpening amount to 1.
    • "-T": trims the background around the outer part of the image.
    • "-p 20": sets the border pad amount to 20.
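
To make the mapping from these settings to a script call more concrete, here is a minimal Python sketch. It assumes that each array under "imagePostprocessScripts" is passed to the script unchanged as command-line arguments, followed by an input file and an output file. How the platform actually launches the scripts is not described here, so the helper `build_commands` and the file names `page.png` and `page_clean.png` are hypothetical.

```python
import json

# The settings fragment from the example above, wrapped in braces so it parses as JSON.
SETTINGS = json.loads("""
{
  "imagePostprocessScriptsBucket": "data/ocr_sample/scripts",
  "imagePostprocessScripts": {
    "textcleaner": ["-g", "-e", "stretch", "-f", "25", "-o", "15", "-t", "30",
                    "-u", "-s", "1", "-T", "-p", "20"]
  }
}
""")


def build_commands(settings, infile, outfile):
    """Return one argument list per configured script (hypothetical helper)."""
    bucket = settings["imagePostprocessScriptsBucket"]
    commands = []
    for name, args in settings["imagePostprocessScripts"].items():
        # Assumed calling convention: <script> <options...> <infile> <outfile>
        commands.append([f"{bucket}/{name}", *args, infile, outfile])
    return commands


commands = build_commands(SETTINGS, "page.png", "page_clean.png")
print(" ".join(commands[0]))
```

Keeping every flag and value as a separate array element, as the settings do, means no shell quoting is needed when the list is handed to a process launcher.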

TEXTCLEANER script results example

Example of a document image before TEXTCLEANER script execution:

Example of a document image after TEXTCLEANER script execution:

OCR_SAMPLE_UNPERSPECTIVE Document Set 

In this example, a sequence of the otsuthresh and unperspective scripts is run on a set of document images to remove perspective distortion.

JSON Settings for OTSUTHRESH and UNPERSPECTIVE scripts

Below you can find an example of JSON settings for the OTSUTHRESH and UNPERSPECTIVE scripts, as provided in the Document Set Details.

OTSUTHRESH and UNPERSPECTIVE scripts JSON settings example
"imagePostprocessScriptsBucket": "data/ocr_sample/scripts",
"imagePostprocessScripts": {
	"otsuthresh": [],
	"unperspective": [
		"-C", "black",
		"-i", "save",
		"-A", "12",
		"-s", "5",
		"-t", "10",
		"-B", "1",
		"-d", "h",
		"-a", "0.79"
	]
}

OTSUTHRESH and UNPERSPECTIVE scripts JSON settings contain:

  • "imagePostprocessScriptsBucket": "data/ocr_sample/scripts"

    • This specifies the directory where the image preprocessing scripts are stored. 
  • "imagePostprocessScripts":

    • This section defines the specific image preprocessing scripts and their parameters.
  • "otsuthresh":

    • The name of the first preprocessing script in the sequence.
  • Parameters for "otsuthresh" script:

    • This script is invoked without any additional parameters.
  • "unperspective":

    • The name of the second preprocessing script in the sequence.
  • Parameters for "unperspective" script:

    • "-C black": sets the background color to black.
    • "-i save": saves intermediate images during processing.
    • "-A 12": sets the area threshold for connected components filtering to 12.
    • "-s 5": sets the smoothing amount to remove false peaks to 5.
    • "-t 10": sets the threshold value for removing false peaks to 10.
    • "-B 1": sets the blurring amount to 1.
    • "-d h": sets the output dimensions to input image height.
    • "-a 0.79": sets the desired width/height aspect ratio to 0.79.
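
The sequential execution described above can be sketched in Python as follows. This is only an illustration under stated assumptions: scripts are taken to run in the order in which they appear under "imagePostprocessScripts", each script is taken to read the previous script's output, and the `script [options] infile outfile` calling convention is assumed. The helper `plan_pipeline`, the `tmp` working directory, and the file names are hypothetical.

```python
import json

# The settings fragment from the example above, wrapped in braces so it parses as JSON.
SETTINGS = json.loads("""
{
  "imagePostprocessScriptsBucket": "data/ocr_sample/scripts",
  "imagePostprocessScripts": {
    "otsuthresh": [],
    "unperspective": ["-C", "black", "-i", "save", "-A", "12", "-s", "5",
                      "-t", "10", "-B", "1", "-d", "h", "-a", "0.79"]
  }
}
""")


def plan_pipeline(settings, source, workdir="tmp"):
    """Plan a chain in which each script reads the previous script's output."""
    bucket = settings["imagePostprocessScriptsBucket"]
    steps, current = [], source
    # json.loads keeps the key order of the document, so declaration order is preserved.
    scripts = settings["imagePostprocessScripts"]
    for index, (name, args) in enumerate(scripts.items(), start=1):
        output = f"{workdir}/step{index}_{name}.png"
        steps.append([f"{bucket}/{name}", *args, current, output])
        current = output  # the next script continues from this file
    return steps, current


steps, final = plan_pipeline(SETTINGS, "scan.png")
for step in steps:
    print(" ".join(step))
```

On a machine where ImageMagick and the scripts are available, each planned step could then be executed with `subprocess.run(step, check=True)`.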

OTSUTHRESH and UNPERSPECTIVE scripts results example

Example of a document image before a sequence of OTSUTHRESH and UNPERSPECTIVE scripts execution:

Example of a document image after a sequence of OTSUTHRESH and UNPERSPECTIVE scripts execution:

OCR_SAMPLE_SMARTTRIM Document Set 

In this example, the SMARTTRIM script is run on a set of scanned document images to trim each image around the text area.

JSON Settings for SMARTTRIM script

The JSON settings for the SMARTTRIM script, provided in the Document Set Details, contain:

  • "imagePostprocessScriptsBucket": "data/ocr_sample/scripts"

    • This specifies the directory where the image preprocessing scripts are stored. 
  • "imagePostprocessScripts":

    • This section defines the specific image preprocessing scripts and their parameters.
  • "smarttrim":

    • The name of the preprocessing script being utilized. 
  • Parameters for "smarttrim" script:

    • "-m corners": enables morphologic corner detection.
    • "-g grayscale": sets the mode of conversion to grayscale.
    • "-r no": disables restricting the result to only the largest thresholded region found by connected components.
    • "-f 9": sets the fuzz value to 9 percent.
    • "-p 250": sets the padding around the cropped area to 250.
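
The settings block itself is not reproduced in this section, so the following Python sketch rebuilds a plausible one from the parameter list above. It assumes the same layout as the other document sets (a scripts bucket plus one flat argument array per script). The helper `to_settings` is hypothetical, and the printed JSON is a reconstruction, not a copy of the sample's actual settings.

```python
import json

# Option/value pairs taken from the parameter list above.
SMARTTRIM_OPTIONS = [
    ("-m", "corners"),
    ("-g", "grayscale"),
    ("-r", "no"),
    ("-f", "9"),
    ("-p", "250"),
]


def to_settings(script, options, bucket="data/ocr_sample/scripts"):
    """Flatten (flag, value) pairs into the argument array used in the settings."""
    args = [token for pair in options for token in pair]
    return {
        "imagePostprocessScriptsBucket": bucket,
        "imagePostprocessScripts": {script: args},
    }


settings = to_settings("smarttrim", SMARTTRIM_OPTIONS)
print(json.dumps(settings, indent=2))
```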





To read more about the OCR image processing pipeline, see Digitizing Documents (OCR).

To explore techniques for OCR quality analysis and improvement, see OCR Tuning Guide.