OCR Task
OCR Overview
OCR (Optical Character Recognition) is the electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document or from subtitle text superimposed on an image (for example, from a television broadcast).
OCR of input documents is a separate step of the development process; usually it is the conversion of PDF documents into text documents. The quality of the original documents strongly affects the quality of the output of this step.
EasyRPA uses Tesseract OCR, which is integrated as a separate task and provides an OOTB (out-of-the-box) solution: there is no need to write or maintain any custom code.
OCR Task Usage
The following steps are necessary to implement this approach.
Define OcrTaskData as output in your Task class:
@Output(OcrTask.OCR_TASK_DATA_KEY)
private OcrTaskData ocrTaskData;
OcrTask.OCR_TASK_DATA_KEY - input/output data key. OcrTask uses this key to get input data for OCR.
Prepare the OCR configuration, providing the S3 bucket name, Tesseract options and ImageMagick options:
Map configuration = new HashMap<String, Object>();
configuration.put("bucket", "data");
configuration.put("ocrImageType", "png");
configuration.put("tesseractOptions", Arrays.asList("-l", "eng", "--psm", "3", "--oem", "3", "--dpi", "800"));
configuration.put("imageMagickOptions", Arrays.asList("-resample", "450", "-density", "350", "-quality", "100", "-background", "white", "-alpha", "flatten"));
configuration.put("hocrFixWords", Map.of("S(?=[0-9]+)", "\\$"));
configuration.put("imagePostprocessScriptsBucket", "data/ap/scripts");
configuration.put("imagePostprocessScripts", Map.of(
    "textcleaner", Arrays.asList("-g", "-e", "stretch", "-f", "25", "-o", "20", "-t", "30", "-u", "-s", "1", "-T", "-p", "20")));
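To make the tesseractOptions list easier to read, the sketch below joins it into the kind of Tesseract command line it corresponds to. This is an illustration only: how OcrTask actually invokes Tesseract is not described in this document, and the class name, method name and file names (page-0.png, page-0) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: shows the tesseractOptions list as a Tesseract command line.
// This is NOT how OcrTask is known to call Tesseract; file names are hypothetical.
final class TesseractCommandSketch {

    // Tesseract CLI shape: tesseract <image> <outputbase> [options...] [configfile...]
    static List<String> buildCommand(String image, String outputBase,
                                     List<String> options, String outputConfig) {
        List<String> command = new ArrayList<>();
        command.add("tesseract");
        command.add(image);
        command.add(outputBase);
        command.addAll(options);   // flags and their values, in the order given
        command.add(outputConfig); // e.g. "hocr" requests hOCR output
        return command;
    }

    public static void main(String[] args) {
        List<String> options = Arrays.asList(
                "-l", "eng", "--psm", "3", "--oem", "3", "--dpi", "800");
        System.out.println(String.join(" ",
                buildCommand("page-0.png", "page-0", options, "hocr")));
    }
}
```

Each flag and its value are separate list elements, which is why the configuration uses "-l", "eng" rather than a single "-l eng" string.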
bucket - the S3 bucket name which OCR will use to save results.
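tesseractOptions - Tesseract command line options, passed as a flat list of flags and values. For example, "-l", "eng" sets the recognition language to English, "--psm", "3" selects fully automatic page segmentation, "--oem", "3" selects the default OCR engine mode, and "--dpi", "800" declares the resolution of the input image. Please see the external documentation Tesseract Command Line Usage.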
imageMagickOptions - OcrTask uses the ImageMagick tool to split a PDF into pages and render them as images. You can provide command line options to change some of the arguments it uses. Please follow the ImageMagick Command Line documentation.
hocrFixWords - HocrPostProcessor uses this config to fix OCR mistakes. The postprocessor accepts a regular expression and replaces all matches with the value. Note that the dollar sign ($) and the backslash (\) must be escaped in the value string.
imagePostprocessScripts - optional image post-processing scripts. If specified, they are run against the ImageMagick split results. The optional imagePostprocessScriptsBucket parameter allows providing the scripts from an AP-managed S3 bucket.
ocrImageType - optional image type to be used in the ImageMagick and postprocessing pipeline. The default is 'jpg'. Note: only 'jpg' and 'png' are supported OOTB; formats like 'tiff' would require extra setup of the browser for the Human Task view to display them correctly.
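The escaping rule for hocrFixWords is easier to see in a small standalone sketch. It assumes that HocrPostProcessor follows the standard java.util.regex replacement rules (the same ones String.replaceAll uses); the document does not confirm this, and the class and method names below are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;

// Sketch: applies "regex -> replacement" pairs with String.replaceAll.
// Assumption: HocrPostProcessor uses standard java.util.regex replacement rules.
final class FixWordsSketch {

    static String applyFixes(String text, Map<String, String> fixWords) {
        String result = text;
        for (Map.Entry<String, String> fix : fixWords.entrySet()) {
            result = result.replaceAll(fix.getKey(), fix.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> fixWords = new LinkedHashMap<>();
        // An "S" directly followed by digits is treated as a misread "$".
        // In a replacement string "$" refers to a capture group,
        // so a literal dollar sign is written as "\\$" in Java source.
        fixWords.put("S(?=[0-9]+)", "\\$");
        System.out.println(applyFixes("Total: S125.00", fixWords)); // prints "Total: $125.00"

        // Matcher.quoteReplacement escapes "$" and "\" in a replacement string.
        System.out.println(Matcher.quoteReplacement("$")); // prints "\$"
    }
}
```

The lookahead (?=[0-9]+) keeps the digits out of the match, so only the misread character is replaced.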
To initialize OcrTaskData, pass the S3 path to the particular document, add the formats in which the OCR results should be returned, and provide the OCR configuration:
this.ocrTaskData = new OcrTaskData();
ocrTaskData.setDocumentLocation("ie_demo_invoice/invoice101.pdf");
ocrTaskData.getFormats().addAll(Arrays.asList(OcrFormats.TEXT, OcrFormats.HOCR, OcrFormats.JSON, OcrFormats.IMAGE));
ocrTaskData.setConfiguration(configuration);
document location - the document path on S3 Storage.
OcrFormats - the list of OCR formats which OCR Task currently supports.
configuration - the configuration map prepared in the previous step.
Call OcrTask:
TaskOutput ocrInput = execute(getInput(), PrepareOcrTask.class).get();
TaskOutput ocrOutput = execute(ocrInput, OcrTask.class).get();
To get the OCR results, use the following example:
@ApTaskEntry(name = "Store OCR Result")
@Slf4j
@InputToOutput
public class StoreOcrResult extends ApTask {

    @Input(OcrTask.OCR_TASK_DATA_KEY)
    private OcrTaskData ocrTaskData;

    @Override
    public void execute() {
        log.info("Received response OCR Task {} ", ocrTaskData.getTaskUuid());
        List<OcrResult> ocrResults = this.ocrTaskData.getOcrResults();
        //...
    }
}
OcrTask.OCR_TASK_DATA_KEY - input/output data key. OcrTask uses this key to get input data for OCR.
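As a possible next step after receiving the results, the sketch below pulls words, bounding boxes and confidences out of hOCR markup. It assumes that the OcrFormats.HOCR result is the standard hOCR that Tesseract emits (word spans with class ocrx_word and a title holding bbox and x_wconf). How OcrResult exposes that markup is not shown in this document, so the sketch works on a plain String, and all class and method names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extracts words with bounding boxes and confidences from hOCR markup.
// Assumes Tesseract-style word spans with plain-text content; a real HTML
// parser would be more robust than this regular expression.
final class HocrWordsSketch {

    static final class Word {
        final String text;
        final int x0, y0, x1, y1, confidence;

        Word(String text, int x0, int y0, int x1, int y1, int confidence) {
            this.text = text;
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
            this.confidence = confidence;
        }
    }

    // Groups 1-4: bbox corners, group 5: word confidence, group 6: word text.
    private static final Pattern WORD = Pattern.compile(
            "<span class=['\"]ocrx_word['\"][^>]*title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+);"
            + " x_wconf (\\d+)['\"][^>]*>([^<]*)</span>");

    static List<Word> parseWords(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            words.add(new Word(m.group(6).trim(),
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)),
                    Integer.parseInt(m.group(5))));
        }
        return words;
    }

    public static void main(String[] args) {
        // Hand-written sample in the shape of Tesseract hOCR output.
        String hocr =
                "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Invoice</span>"
              + "<span class='ocrx_word' id='word_1_2' title='bbox 110 92 180 116; x_wconf 91'>#101</span>";
        for (Word w : parseWords(hocr)) {
            System.out.println(w.text + " @ (" + w.x0 + "," + w.y0 + ") conf=" + w.confidence);
        }
    }
}
```

Words whose markup contains nested tags (for example emphasis) are skipped by this pattern, which is one reason to prefer an HTML parser for production use.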