EasyRPA Text Extraction Capabilities
ImageMagick
ImageMagick is a software suite used within EasyRPA for editing images. In the context of text extraction, it serves as the primary image preprocessing engine. Its role is to prepare and clean document images before they are sent to an OCR engine like Tesseract or PaddleOCR, significantly improving the accuracy and reliability of text recognition.
ImageMagick Capabilities
The key capabilities of ImageMagick for document preprocessing include:
- Applying a wide range of image enhancement and cleanup operations to raster images (JPEG, PNG, TIFF) and PDFs converted to images.
- Correcting common document image issues such as skew, perspective distortion, uneven lighting, and noise.
- Controlling image resolution and text size through resampling and resizing, which is critical for meeting OCR engine requirements.
- Improving text contrast and clarity through conversion to grayscale, binarization (black and white), contrast stretching, and sharpening.
- Removing unwanted elements such as dark borders, transparency, and background artifacts that can confuse OCR engines.
- Supporting scripting and batch processing.
ImageMagick Limitations
ImageMagick has specific limitations that must be considered:
- ImageMagick is purely a preprocessing tool and must be paired with an OCR engine like Tesseract or PaddleOCR for text recognition.
- Achieving optimal results requires expertise and experimentation with its extensive command-line options. Incorrect settings can degrade image quality and harm OCR accuracy.
- Heavy preprocessing pipelines (e.g., multiple filter passes, high-resolution resampling) can be resource-intensive and slow down processing.
- There is no universal setting. The optimal preprocessing chain is highly dependent on the initial quality and characteristics of the source documents.
ImageMagick Key Parameters and Recommended Settings
In EasyRPA, ImageMagick is configured within the "imageMagickOptions" array in the JSON Document Set Settings. It is used as a preprocessing step for OCR engines like Tesseract and PaddleOCR.
| Parameter | Purpose & Use Case | Recommended Setting |
|---|---|---|
| -density | Sets the image's resolution (DPI). Crucial for ensuring text is at the correct size for OCR. Must be specified before the ${source} in the JSON configuration to be applied to the input image. | 300 (meets Tesseract's recommended DPI). |
| -resample | Changes the image's resolution (DPI) after the image has been loaded into memory. Useful for resizing an image for better OCR accuracy. Must be specified after ${source} in the JSON configuration. | 300 (matches the density setting for consistency). |
| -deskew | Automatically corrects slight page skew. | 40% (works for most images with minor skew). |
| -modulate | Adjusts brightness, saturation, and hue. Effective for cleaning colored/uneven backgrounds. | "160,0,100" (increases brightness and removes color). |
| -contrast-stretch | Enhances the contrast between text and background. | "1x1%" (a gentle stretch to improve clarity). |
| -unsharp | Sharpens the image edges, making character boundaries clearer. | "0x1" (a light sharpening filter). |
| -threshold | Converts a grayscale image to pure black and white (binarization). | 50% (a starting point; adaptive methods are often better). |
| -trim / -shave | Removes borders or unwanted white space around the core content. | -trim +repage (automatic), -shave 50x50 (manual). |
| -alpha remove | Removes transparency, filling it with a background color (usually white). | Essential for PDF and PNG with transparent backgrounds. |
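Because tuning these options takes experimentation, it can help to try an option list locally before putting it into the Document Set Settings. The sketch below is illustrative only: it assumes the array is passed to the ImageMagick command line in order, that ${source} stands for the input file, and that an output path follows the options. The helper name, the file names, and the `magick` binary name (ImageMagick 7; older installations use `convert`) are assumptions, not part of EasyRPA.

```python
import shlex


def to_command(options, source_path, output_path, binary="magick"):
    """Expand the ${source} placeholder and return an argument list.

    Hypothetical helper for local experiments; EasyRPA's own invocation
    may differ.
    """
    args = [source_path if opt == "${source}" else opt for opt in options]
    return [binary, *args, output_path]


# Options taken from the parameter table above.
options = [
    "-units", "PixelsPerInch",
    "-density", "300",
    "${source}",
    "-background", "white",
    "-alpha", "remove",
    "-deskew", "40%",
]

command = to_command(options, "page-001.png", "page-001-clean.png")

# Print a shell-safe command line that can be pasted into a terminal.
print(shlex.join(command))
```

Running the printed command on a few representative pages and comparing the OCR output is one way to judge whether an added option helps or hurts.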
Sample ImageMagick Settings (as part of a Tesseract pipeline):
{
"bucket": "data",
"ocrType": "tesseract",
"imageMagickOptions": [
"-units",
"PixelsPerInch",
"-density",
"300",
"${source}",
"-background",
"white",
"-alpha",
"remove",
"-deskew",
"40%",
"-modulate",
"160,0,100",
"-unsharp",
"0x1",
"-resample",
"300"
],
"tesseractOptions": [
"-l",
"eng",
"-psm",
"3"
]
}
The -density parameter sets the image resolution (dots per inch) before the input image is loaded into memory, so the image is interpreted at the specified resolution when it is first read. To take effect on the input image, it must be declared before ${source} in the JSON configuration.
To change the image resolution after the image has been loaded into memory, use -resample instead. In this case, it must be declared after ${source} in the JSON configuration.
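The ordering rule for -density and -resample is easy to get wrong when editing a long options array. As a minimal sketch, a configuration could be checked like this before it is saved. The function name and the wording of the messages are hypothetical and not part of EasyRPA.

```python
SOURCE = "${source}"


def check_resolution_order(options):
    """Return a list of problems with -density / -resample placement."""
    if SOURCE not in options:
        return [f"{SOURCE} placeholder is missing"]

    problems = []
    source_index = options.index(SOURCE)
    for index, option in enumerate(options):
        # -density only affects how the input is read, so it must come first.
        if option == "-density" and index > source_index:
            problems.append("-density is declared after ${source}")
        # -resample works on a loaded image, so it must come after the input.
        if option == "-resample" and index < source_index:
            problems.append("-resample is declared before ${source}")
    return problems


good = ["-density", "300", SOURCE, "-resample", "300"]
bad = ["-resample", "300", SOURCE, "-density", "300"]

print(check_resolution_order(good))
print(check_resolution_order(bad))
```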
Tesseract
Tesseract is an open-source Optical Character Recognition (OCR) engine sponsored by Google. In EasyRPA, it is the core component for "reading" text from images that have been preprocessed by tools like ImageMagick. It converts raster images containing text into machine-encoded strings.
Tesseract Capabilities
The key capabilities of Tesseract OCR include:
- It extracts text from image-based sources such as scanned documents, PDFs converted to images, and photographs.
- It supports over 100 languages out-of-the-box, with the ability to recognize multiple languages in a single document.
- It can output results in various formats, including plain text, hOCR (HTML with positional information), and PDF.
- The engine includes advanced features like automatic page segmentation and script recognition.
- It offers multiple OCR Engine Modes (OEM), including a neural network-based LSTM engine (OEM 1) and a default mode (OEM 3) that selects the engine based on what is available.
- It provides configurable Page Segmentation Modes (PSM) to optimize recognition for different document layouts, from single text blocks to sparse text.
Tesseract Limitations
Tesseract has specific limitations that must be considered:
- Tesseract's accuracy is highly dependent on the quality of the input image. Poor resolution, skew, noise, or low contrast will lead to significantly worse results.
- It can struggle with text embedded in complex images, heavy background patterns, or colored backgrounds without proper preprocessing to binarize the image.
- While excellent with common fonts, accuracy can drop with decorative or very stylized fonts. It is primarily trained for printed text.
- Highly complex, non-standard, or multi-oriented layouts (e.g., text in multiple directions on the same page) can cause errors.
Tesseract Key Parameters and Recommended Settings
In EasyRPA, Tesseract is activated by specifying "ocrType": "tesseract" and configured within the "tesseractOptions" array in the JSON Document Set Settings.
| Parameter | Purpose & Use Case | Recommended Setting |
|---|---|---|
| -l | Specifies the language(s) of the text to be recognized. | "eng" (English). For multiple languages: "eng+fra" (English and French). |
| -psm | Page Segmentation Mode. Tells Tesseract how to analyze the page layout. | 3: Fully automatic page segmentation (default). |
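| --oem | OCR Engine Mode. Selects the underlying engine. | 3: Default, selects the engine based on what the installed language data supports, typically LSTM (Long Short-Term Memory), which is recommended for best accuracy. Use 1 to select the LSTM engine explicitly. |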
| --dpi | Informs Tesseract of the input image's resolution. Should match the preprocessed image's DPI. | 300 (if the image was resampled to 300 DPI in ImageMagick). |
| -c | Sets a Tesseract configuration variable for fine-grained control. | Example: preserve_interword_spaces=1 (ensures spaces between words are preserved). |
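The parameters in the table combine into a flat array of strings, and the --dpi value should agree with the resolution produced by ImageMagick. The following is a minimal sketch of both ideas. The function names are hypothetical, and only flags already listed in the table are used.

```python
def build_tesseract_options(languages, psm=3, oem=3, dpi=300, variables=None):
    """Assemble a tesseractOptions array from the parameters in the table."""
    options = [
        "-l", "+".join(languages),  # ["eng", "fra"] becomes "eng+fra"
        "-psm", str(psm),
        "--oem", str(oem),
        "--dpi", str(dpi),
    ]
    # Each configuration variable is passed as its own "-c name=value" pair.
    for name, value in (variables or {}).items():
        options += ["-c", f"{name}={value}"]
    return options


def option_value(options, flag):
    """Return the value that follows a flag, or None if the flag is absent."""
    if flag in options:
        position = options.index(flag)
        if position + 1 < len(options):
            return options[position + 1]
    return None


def dpi_is_consistent(image_magick_options, tesseract_options):
    """True when --dpi equals the -resample value used in preprocessing."""
    resample = option_value(image_magick_options, "-resample")
    return resample is not None and resample == option_value(tesseract_options, "--dpi")


image_magick_options = ["-density", "300", "${source}", "-resample", "300"]
tesseract_options = build_tesseract_options(
    ["eng", "fra"], variables={"preserve_interword_spaces": 1}
)

print(tesseract_options)
print(dpi_is_consistent(image_magick_options, tesseract_options))
```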
Sample Tesseract settings (including ImageMagick preprocessing):
{
"bucket": "data",
"ocrType": "tesseract",
"imageMagickOptions": [
"-units",
"PixelsPerInch",
"-density",
"300",
"${source}",
"-background",
"white",
"-alpha",
"remove",
"-quality",
"100",
"-resample",
"300"
],
"tesseractOptions": [
"-l",
"eng",
"-psm",
"12",
"--oem",
"3",
"--dpi",
"300"
]
}
PaddleOCR
PaddleOCR is an open-source OCR system developed by Baidu. Integrated into EasyRPA, it extracts text from images and scanned documents. PaddleOCR uses deep learning models to "read" text from raster images and is highly efficient for processing scanned PDFs, photographs, and documents where the text is not digitally accessible.
PaddleOCR Capabilities
The key capabilities of PaddleOCR include:
- It extracts text from image-based sources.
- It recognizes text with high accuracy in multiple languages (39 languages are covered), including English, Russian, German, and French.
- It performs a complete OCR pipeline: text detection (finding text regions), text recognition (reading the text in those regions), and optional textline orientation correction.
- The engine includes advanced preprocessing capabilities, such as document image unwarping and textline orientation classification, which can correct for skewed or curved pages automatically.
- It supports multiple OCR versions (PP-OCRv3, v4, v5) and allows for the use of custom-trained models for specific use cases.
- It supports GPU acceleration, TensorRT, and MKL-DNN for fast inference, making it suitable for processing large volumes of documents.
Official Documentation: PaddleOCR GitHub Repository
PaddleOCR Limitations
PaddleOCR has specific limitations that must be considered:
- PaddleOCR can typically only recognize one language per run when using the standard --lang parameter. For example, it can be set to recognize English (en) or German (ge), but not both simultaneously in a single operation.
- It has significant computational resource requirements: achieving high speed requires a GPU. CPU-only processing, especially with high-resolution images, can be significantly slower.
- It may require image preprocessing (ImageMagick): like most OCR engines, it can struggle with extremely poor image quality, heavy background patterns, or severe text artifacts where human reading is also difficult.
PaddleOCR Key Parameters and Recommended Settings
In EasyRPA, PaddleOCR is activated by specifying "ocrType": "paddleocr" in the JSON Document Set Settings and configured within the "paddleOcrOptions" array. Some useful parameters for tuning performance and accuracy within EasyRPA are:
| Parameter | Purpose & Use Case | Recommended Setting |
|---|---|---|
| --lang | Specifies the language of the text to be recognized. Only one language can be specified per run. | en (English), ge (German), etc. Must match the primary language in the document. |
| --ocr_version | Selects the version of the pre-trained PP-OCR model. Newer versions generally offer better accuracy and speed. | PP-OCRv5 (latest), PP-OCRv4 (if compatibility is an issue). |
| --use_doc_unwarping | Corrects curvature and warping in images of pages, useful for photos of book pages or folded documents. | True for images taken with a camera; False for flatbed scans. |
| --text_det_unclip_ratio | Controls the size of the bounding box around detected text. A larger value can help with recognizing characters with long ascenders/descenders (e.g., 'p', 'q', 'l'). | Start with 0.8. Increase to 1.8 or 2.0 if text is being cut off; decrease if boxes are merging. |
| --text_det_box_thresh | The confidence threshold for a detected region to be considered a text box. A higher value returns fewer but more confident detections. | 0.5 (default). Lower to 0.3 to detect faint text; increase to 0.7 to reduce false positives on complex backgrounds. |
| --text_det_thresh | The pixel-level threshold for the text detection model. Filters out low-confidence pixels before forming text boxes. | 0.3 (default). Adjust in conjunction with text_det_box_thresh. |
| --cpu_threads | Number of CPU threads to use for inference. Crucial for performance on CPU-only systems. | Set to the number of available physical CPU cores (e.g., 4 or 10). Do not exceed the total core count. |
| --device | Selects the hardware for inference. | gpu:0 (if GPU is available), cpu (otherwise). |
You can also include "paddleocr" in the "debug" array to generate the PaddleOCR JSON output (pocr) and debug tainput.
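The recommendations in the table depend mostly on where the images come from and what hardware is available. As a rough sketch, the choice could be expressed as a small helper. The function name and its arguments are hypothetical, and omitting --cpu_threads on GPU systems is a simplification made here rather than a documented rule. Note that os.cpu_count() reports logical cores, so pass the physical core count explicitly when it is known.

```python
import os


def build_paddle_options(lang, camera_photo, gpu_available, cpu_cores=None):
    """Assemble a paddleOcrOptions array from the recommendations above."""
    options = [
        "--lang", lang,
        # Unwarping helps photos of pages; flatbed scans do not need it.
        "--use_doc_unwarping", "true" if camera_photo else "false",
        "--device", "gpu:0" if gpu_available else "cpu",
    ]
    if not gpu_available:
        # Fall back to the logical core count if no value is supplied.
        cores = cpu_cores or os.cpu_count() or 1
        options += ["--cpu_threads", str(cores)]
    return options


scan_on_cpu = build_paddle_options("en", camera_photo=False,
                                   gpu_available=False, cpu_cores=4)
photo_on_gpu = build_paddle_options("en", camera_photo=True,
                                    gpu_available=True)

print(scan_on_cpu)
print(photo_on_gpu)
```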
Sample PaddleOCR Settings (including ImageMagick preprocessing):
{
"bucket": "data",
"ocrType": "paddleocr",
"imageMagickOptions": [
"-units",
"PixelsPerInch",
"-density",
"300",
"${source}",
"-background",
"white",
"-alpha",
"remove",
"-quality",
"100",
"-resample",
"300"
],
"paddleOcrOptions": [
"--lang",
"en",
"--text_det_unclip_ratio",
"0.9",
"--use_doc_unwarping",
"false",
"--cpu_threads",
"4"
],
"debug": [
"images",
"keepFiles",
"paddleocr"
]
}
For specialized tasks, you can use models fine-tuned for specific scenarios. When using --text_detection_model_name and --text_recognition_model_name, the --lang and --ocr_version parameters are ignored.
{
"ocrType": "paddleocr",
"paddleOcrOptions": [
"--text_recognition_model_name",
"eslav_PP-OCRv5_mobile_rec",
"--text_detection_model_name",
"PP-OCRv5_server_det"
]
}
PaddleOCR has robust internal preprocessing. Heavy external preprocessing (like sharpening, contrast adjustment) is often unnecessary and can sometimes degrade performance. The primary preprocessing steps should be:
- Split PDF to Pages: Convert the document into individual page images.
- Set Appropriate DPI: Use ImageMagick (or Ghostscript) to ensure input images have a sufficient resolution (e.g., 300 DPI). Low DPI will result in poor recognition, while very high DPI will slow down processing without significant accuracy gains.
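One way to apply this guidance is to review a PaddleOCR configuration for ImageMagick options that go beyond setting the resolution. The sketch below is only an illustration: the set of options treated as "heavy" is an assumption chosen from the sharpening and contrast options described earlier on this page, and the function name is hypothetical.

```python
# Assumed list of sharpening and contrast-related options; adjust as needed.
HEAVY_OPTIONS = {"-unsharp", "-contrast-stretch", "-modulate", "-threshold"}


def heavy_preprocessing(settings):
    """Return heavy ImageMagick options found in a PaddleOCR configuration."""
    if settings.get("ocrType") != "paddleocr":
        return []
    return [opt for opt in settings.get("imageMagickOptions", [])
            if opt in HEAVY_OPTIONS]


settings = {
    "ocrType": "paddleocr",
    "imageMagickOptions": [
        "-density", "300", "${source}",
        "-unsharp", "0x1",
        "-resample", "300",
    ],
}

# Options worth re-testing with and without, since PaddleOCR may not need them.
print(heavy_preprocessing(settings))
```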
PDFBox
Apache PDFBox is an open-source Java library integrated into EasyRPA for extracting digitally-created text from readable PDF documents. It is the preferred and most efficient tool for processing PDFs that contain selectable text, as it directly accesses the text characters and their coordinates within the PDF file.
PDFBox Capabilities
PDFBox extracts text and structural information from PDFs that are natively digital ("searchable" PDFs). Its primary functions for information extraction include:
- PDFBox does not perform Optical Character Recognition (OCR). Instead, it parses the PDF's internal structure to extract text that was digitally created by software (Microsoft Word, Adobe InDesign, etc.). This makes the extraction process very accurate, preserving the original text without OCR-related errors.
- The library automatically splits the PDF document into individual pages. For each page, it extracts the text along with precise bounding boxes (bbox) for each character, word, or line of text. These coordinates are essential for locating the extracted information within the context of the original page layout.
PDFBox Limitations
PDFBox has specific limitations that must be considered when selecting an extraction tool:
- The most significant limitation is that PDFBox cannot extract text from images or raster graphics embedded within a PDF. If a document is a scanned image of a text page or contains pictures, screenshots, or diagrams with critical text, PDFBox will ignore that content entirely.
- PDFBox does not interpret text that is part of vector graphics (e.g., diagrams, charts, or logos created in SVG or similar formats). Text within these elements is not recognized as extractable content.
PDFBox Key Parameters and Recommended Settings
In EasyRPA, PDFBox is activated by specifying ocrType: "pdfbox" in JSON Document Set Settings. Other PDFBox parameters include:
- pdfBoxWordSeparationGapPt (default: 6.0): In searchable PDFs, a line of text is often stored as one continuous run of symbols. This parameter defines the gap (in points) between adjacent symbols above which PDFBox splits the line into individual words. Adjusting this value can improve word segmentation accuracy (optional).
- pdfBoxDpi (default: 320): This setting controls the resolution for generating page images for the internal viewer and does not impact the core text extraction process (required).
- debug: Including "pdfbox" in the debug array will generate debug images to aid in troubleshooting the extraction process (optional).
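To make the effect of pdfBoxWordSeparationGapPt more concrete, the sketch below splits a line of positioned symbols into words using a gap threshold. It is a simplified illustration of the idea and not the algorithm used by PDFBox or EasyRPA. The Glyph type, the coordinates, and the strict "greater than" comparison are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x_start: float  # left edge, in points
    x_end: float    # right edge, in points


def split_words(glyphs, gap_pt=6.0):
    """Split a left-to-right run of glyphs wherever the gap exceeds gap_pt."""
    words, current = [], []
    previous_end = None
    for glyph in glyphs:
        if previous_end is not None and glyph.x_start - previous_end > gap_pt:
            words.append("".join(current))
            current = []
        current.append(glyph.char)
        previous_end = glyph.x_end
    if current:
        words.append("".join(current))
    return words


# "ID" and "42" are separated by an 8 pt gap (from x=12 to x=20).
line = [Glyph("I", 0, 4), Glyph("D", 5, 12), Glyph("4", 20, 26), Glyph("2", 27, 33)]

default_split = split_words(line)       # threshold of 6 pt
wide_split = split_words(line, 10.0)    # threshold of 10 pt

print(default_split)
print(wide_split)
```

With the smaller threshold the 8 pt gap separates the two words, while the larger threshold keeps them together, which is the kind of trade-off to watch for when words in the extracted text are merged or broken apart.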
Sample PDFBox Settings:
{
"bucket": "data",
"ocrType": "pdfbox",
"pdfBoxWordSeparationGapPt": 6,
"pdfBoxDpi": 180,
"debug": [
"pdfbox"
]
}
For more information about OCR settings and configurable pipelines, please refer to Built-in OCR.