Preface
Step 1. OOTB Document Processing
- Retrieve documents
- Reuse OOTB code with no flow modification
Step 2. OOTB Modification

Preface

For the ML Automation Process, we are going to use the platform Document Processors as a base.

In our step-by-step example we're going to implement EasyRPA Storage scanning for new incoming invoices.

The complete project is available in the samples Git repository: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/confluence-samples/ie-sample-dp

First, download the archetype and generate a project from it. The following link provides more information on how to do this: Generate project from archetype. Then add the following dependency to the pom.xml of the generated project:

pom.xml

	 <dependencies>
		 . . . .
		<dependency>
			<groupId>eu.ibagroup</groupId>
			<artifactId>easy-rpa-aps</artifactId>
			<version>${rpaplatform.version}</version>
		 </dependency>
		 . . . .
	</dependencies>

Next, define the configuration for the AP run:

set datastore name for documents
set storage path for files
set run model
set OCR options

The platform properties file:

apm_run.properties

# RpaPlatform Client Configuration File
# node configuration settings
SELENIUM_HUB_URL=http://localhost:4444/wd/hub
# EasyRPA Client Configuration File
inputFolder=ie_sample_dp/input
fileFilter=.*\\.pdf
removeImported=false
configuration={"DEFAULT": {"dataStore": "IE_SAMPLE_DP_DOCUMENTS","documentType": "Products invoice","model": "idp_sample_invoice","runModel": "idp_sample_invoice","storagePath": "idp_sample","exportDocumentSet": "IE_SAMPLE_DP_DOCUMENTS","bucket": "data","tesseractOptions": ["-l", "eng", "--psm", "12", "--oem", "3", "--dpi", "150"],"imageMagickOptions": ["-units", "PixelsPerInch", "-resample", "150", "-density", "150", "-quality", "100", "-background", "white", "-deskew", "40%", "-contrast", "-alpha", "flatten"]}}
remoteExecutionService.enableLocalExecution=true
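
The fileFilter value above is written with a doubled backslash because a .properties file treats the backslash as an escape character. The sketch below is only an illustration using the standard java.util.Properties loader: the FileFilterSketch class and its methods are hypothetical helpers, not platform API, and matching the whole file name against the pattern is an assumption about how the filter is applied.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of EasyRPA: shows how the fileFilter value is unescaped and used.
class FileFilterSketch {

    // Parses properties text the same way java.util.Properties reads a .properties file.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Keeps only the names that fully match the regular expression (assumed matching mode).
    static List<String> filter(List<String> names, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return names.stream()
                .filter(name -> pattern.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Java source doubles each backslash once more; the file itself contains .*\\.pdf
        Properties props = load("fileFilter=.*\\\\.pdf\nremoveImported=false\n");
        String regex = props.getProperty("fileFilter");
        System.out.println(regex); // the loader collapses \\ into a single backslash
        System.out.println(filter(List.of("invoice_1.pdf", "notes.txt", "scan.PDF"), regex));
    }
}
```

Under these assumptions the pattern is case-sensitive, so a file named scan.PDF would not be picked up unless the expression is adjusted.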

Step 1. OOTB Document Processing

Retrieve documents

The Information Extraction process begins with the document retrieval step. Below is a list of data source examples and suggestions on how to work with them.

#	Data Source	Suggestions
1	Emails in some mailbox	The library you need to scan the mailbox depends on the protocol used to access it. The two most popular protocols are IMAP and Exchange. EasyRPA provides an Email Client utility that covers both protocols and can be used for scanning and sending emails. You can scan emails using search terms such as a keyword in the subject, a specific sender, or a date range.
2	Files from Shared Network Folder	Sometimes you need to watch a specific folder on a network drive for new files. There are two ways to access network folders: (1) use a Samba protocol client for Java (e.g. https://github.com/AgNO3/jcifs-ng); (2) map a network drive using the Windows feature (https://support.microsoft.com/en-us/windows/map-a-network-drive-in-windows-10-29ce55d1-34e3-a7e2-4801-131475f9557d), after which you can access the network folder in the same way as the local file system.
3	Files from EasyRPA file storage	This is a good option if you can agree with the business process operators to put target documents for processing into EasyRPA file storage. You can then scan it using the existing Storage Manager.
4	Files from FTP	You may also need to scan an FTP server to retrieve incoming documents. In that case, use an FTP client library for Java (e.g. Apache Commons Net). Note that some FTP servers also have a simple web UI, so you may be able to access the files using ordinary HTTP GET requests.

Currently, the platform provides the following OOTB tasks for obtaining documents:

ImportDocumentFromStorageTask - for #3, without any code
ImportDocumentFromFileTask - for #1, #2 and #4, but with coding: because ImportDocumentFromFileTask operates on files only, you have to use third-party libraries to obtain the file and then pass it to the task.
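
For source #2 with a mapped network drive, locating candidate files needs nothing beyond the standard java.nio.file API. The sketch below is an assumption-laden illustration: FolderScanSketch and its methods are hypothetical names, a temporary folder stands in for the network share, and the hand-off to ImportDocumentFromFileTask is deliberately left out because its API is not described here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of EasyRPA.
class FolderScanSketch {

    // Returns the sorted names of regular files in the folder whose names fully match the regex.
    static List<String> findFileNames(Path folder, String fileNameRegex) {
        Pattern pattern = Pattern.compile(fileNameRegex);
        try (Stream<Path> entries = Files.list(folder)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(path -> path.getFileName().toString())
                    .filter(name -> pattern.matcher(name).matches())
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a throwaway folder so the sketch runs without a real network share.
    static List<String> demo() {
        try {
            Path folder = Files.createTempDirectory("incoming");
            Files.createFile(folder.resolve("b.pdf"));
            Files.createFile(folder.resolve("a.pdf"));
            Files.createFile(folder.resolve("readme.txt"));
            return findFileNames(folder, ".*\\.pdf");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a.pdf, b.pdf]
    }
}
```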

Reuse OOTB code with no flow modification

The Automation Process (AP) class will look like the following:

InvoiceProcessingSample_1.java

package eu.ibagroup.samples.iedp;

import eu.ibagroup.easyrpa.ap.dp.DocumentProcessor;
import eu.ibagroup.easyrpa.ap.iedp.IeDocumentProcessorBase;
import eu.ibagroup.easyrpa.engine.annotation.ApModuleEntry;
import eu.ibagroup.samples.iedp.entity.IeDocument;
import lombok.extern.slf4j.Slf4j;
import org.slf4j.Logger;

@ApModuleEntry(name = "InvoiceProcessingSample")
@Slf4j
public class InvoiceProcessingSample_1 extends DocumentProcessor<IeDocument> implements IeDocumentProcessorBase {

	@Override
	public Logger log() {
		return log;
	}

}

The entity defines a new 'invoiceNumber' field to be filled with the extracted invoice number:

IeDocument.java

package eu.ibagroup.samples.iedp.entity;

import eu.ibagroup.easyrpa.ap.dp.entity.DpDocument;
import eu.ibagroup.easyrpa.persistence.annotation.Column;
import eu.ibagroup.easyrpa.persistence.annotation.Entity;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.ToString;

@Data
@NoArgsConstructor
@AllArgsConstructor
@Entity(value = "IE_SAMPLE_DP_DOCUMENTS")
@ToString
public class IeDocument extends DpDocument {

	@Column("invoice_number")
	private String invoiceNumber;

}

The repository:

IeDocumentRepository.java

package eu.ibagroup.samples.iedp.repository;

import eu.ibagroup.easyrpa.ap.dp.repository.DpDocumentRepository;
import eu.ibagroup.samples.iedp.entity.IeDocument;

public interface IeDocumentRepository extends DpDocumentRepository<IeDocument> {
}

The custom postprocessor fills the entity's 'invoiceNumber' field with the invoice number extracted by the ML model:

InvoicePostprocessor.java

package eu.ibagroup.samples.iedp.postprocessors;

import javax.inject.Inject;

import eu.ibagroup.easyrpa.ap.dp.annotation.PostProcessorMethod;
import eu.ibagroup.easyrpa.ap.dp.annotation.PostProcessorStrategies;
import eu.ibagroup.easyrpa.ap.dp.postprocessing.BasePostProcessor;
import eu.ibagroup.easyrpa.ap.iedp.postprocessing.IePostProcessorBase;
import eu.ibagroup.easyrpa.persistence.documentset.DocumentSet;
import eu.ibagroup.samples.iedp.entity.IeDocument;
import eu.ibagroup.samples.iedp.repository.IeDocumentRepository;
import eu.ibagroup.samples.iedp.transformation.IeDocumentTransformation;
import lombok.Getter;

@PostProcessorStrategies("ie")
public class InvoicePostprocessor extends BasePostProcessor<IeDocument> implements IePostProcessorBase<IeDocument> {

	@Inject
	@Getter
	private IeDocumentRepository documentRepository;

	@PostProcessorMethod("ieSampleStoreIeMlTaskResults")
	public void idpSampleStoreClMlTaskResults() {

		log().debug("Storing response from IE ML Task for document {} ", documentContext().getDocumentId());
		IeDocument document = documentContext().getDocument();

		String invoiceNumber = getExtractedEntities().getValue(IeDocumentTransformation.INVOICE_NUMBER);
		document.setInvoiceNumber(invoiceNumber);

		documentContext().updateDocument(DocumentSet.Status.TAGGED_BY_MODEL);
	}

}

The custom validator checks the extracted invoice amounts after ML:

InvoiceValidator.java

package eu.ibagroup.samples.iedp.validators;

import java.math.BigDecimal;
import java.math.RoundingMode;

import javax.inject.Inject;

import eu.ibagroup.easyrpa.ap.dp.annotation.PostProcessorMethod;
import eu.ibagroup.easyrpa.ap.dp.annotation.PostProcessorStrategies;
import eu.ibagroup.easyrpa.ap.dp.postprocessing.BasePostProcessor;
import eu.ibagroup.easyrpa.ap.dp.validation.ValidationMessage;
import eu.ibagroup.samples.iedp.entity.IeDocument;
import eu.ibagroup.samples.iedp.repository.IeDocumentRepository;
import eu.ibagroup.samples.iedp.to.InvoiceItemTO;
import eu.ibagroup.samples.iedp.to.InvoiceTO;
import eu.ibagroup.samples.iedp.to.TaxRateTO;
import eu.ibagroup.samples.iedp.transformation.IeDocumentTransformation;
import lombok.Getter;

@PostProcessorStrategies("ie")
public class InvoiceValidator extends BasePostProcessor<IeDocument> implements IeDocumentTransformation {

	@Inject
	@Getter
	private IeDocumentRepository documentRepository;

	@PostProcessorMethod("ieSampleValidateDocumentAmounts")
	public void idpSampleValidateInvoiceAmounts() {
		InvoiceTO invoice = getInvoice();

		BigDecimal total = new BigDecimal(0);
		for (int i = 0; i < invoice.getItems().size(); i++) {
			InvoiceItemTO item = invoice.getItems().get(i);
			if (isEmpty(item.getPrice())) {
				addMessages(ValidationMessage.error(PRICE + " '" + i + "' should not be empty "));
			} else {
				total = total.add(item.getPrice().multiply(optional(item.getQuantity()).orElse(BigDecimal.valueOf(1))));
			}
		}

		BigDecimal invoiceTax = optional(invoice.getInvoiceTax()).orElse(new TaxRateTO(BigDecimal.valueOf(0))).getPercent();
		total = total.add(total.multiply(invoiceTax).divide(BigDecimal.valueOf(100)).setScale(2, RoundingMode.HALF_UP));

		BigDecimal finalTotal = total;
		BigDecimal invoiceDiscountAmount = optional(invoice.getInvoiceDiscountTax())
				.map(t -> finalTotal.multiply(t).divide(BigDecimal.valueOf(100)).setScale(2, RoundingMode.HALF_UP))
				.orElse(optional(invoice.getInvoiceDiscount()).orElse(BigDecimal.valueOf(0)));
		total = total.subtract(invoiceDiscountAmount);

		if (total.compareTo(invoice.getAmount()) != 0) {
			addMessages(ValidationMessage.error(TOTAL_AMOUNT + " should be " + total + " instead of " + invoice.getAmount()));
		}
	}

}
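
To make the validator's arithmetic concrete, here is a standalone worked example using only java.math.BigDecimal. It mirrors the order of operations in the listing above (sum of price times quantity, then tax added, then a percentage discount subtracted), but InvoiceTotalSketch, Item and the sample figures are invented for illustration and do not use the platform classes.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.List;

// Hypothetical stand-alone illustration of the amount check; not platform code.
class InvoiceTotalSketch {

    // Minimal stand-in for an invoice line: price and quantity.
    static class Item {
        final BigDecimal price;
        final BigDecimal quantity;

        Item(String price, String quantity) {
            this.price = new BigDecimal(price);
            this.quantity = new BigDecimal(quantity);
        }
    }

    // percent% of base, rounded to 2 decimals with HALF_UP, as in the validator.
    static BigDecimal percentOf(BigDecimal base, BigDecimal percent) {
        return base.multiply(percent).divide(BigDecimal.valueOf(100)).setScale(2, RoundingMode.HALF_UP);
    }

    // Sum of price * quantity, plus tax, minus a percentage discount taken from the taxed total.
    static BigDecimal expectedTotal(List<Item> items, BigDecimal taxPercent, BigDecimal discountPercent) {
        BigDecimal total = BigDecimal.ZERO;
        for (Item item : items) {
            total = total.add(item.price.multiply(item.quantity));
        }
        total = total.add(percentOf(total, taxPercent));
        return total.subtract(percentOf(total, discountPercent));
    }

    public static void main(String[] args) {
        List<Item> items = List.of(new Item("10.00", "2"), new Item("5.50", "1"));
        // 25.50 net, plus 20% tax (5.10) gives 30.60, minus 10% discount (3.06) gives 27.54
        BigDecimal total = expectedTotal(items, new BigDecimal("20"), new BigDecimal("10"));
        System.out.println(total);
        // compareTo ignores scale, so 27.54 and 27.540 are treated as the same amount
        System.out.println(total.compareTo(new BigDecimal("27.540")) == 0);
    }
}
```

The comparison uses compareTo rather than equals, as the validator does, because BigDecimal.equals also compares scale and would report 27.54 and 27.540 as different.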

The document type JSON, extended with the custom postprocessor and validator definitions:

Products invoice.json

{
	"importStrategy": "OVERRIDE",
	"name": "Products invoice",
	"description": "IE Sample Invoice Document",
	"humanTaskTypeName": "Information Extraction Task",
	"settings": {
	"appLanguage": "en",
	"taskTypeLabel": "Products Invoice Document Information Extraction",
	"taskInstructionText": "Please extract fields from provided document.",
	"allowCustomValue": true,
	"excludeUndefinedEntities": true,
	"categories": [
		{
		"name": "Invoice Number",
		"multiple": false,
		"required": true,
		"hotkey": [
			"n"
		]
		},
		......
		{
		"name": "Total Amount",
		"multiple": false,
		"required": false,
		"hotkey": [
			"a"
		]
		}
	],
	"mlPostProcessors": [
		{
		"name": "ieSampleStoreIeMlTaskResults"
		}
	],
	"validators": [
		{
		"name": "ieSampleValidateDocumentAmounts"
		}
	]
	}
}

With just these 5 classes and the change to the document type, full-featured document processing is ready to be executed.

The OOTB document processing includes the following steps:

obtain files from storage
create a document record for every file in the IE_SAMPLE_DP_DOCUMENTS datastore
OCR the document
call the ML task for the document
do postprocessing
do validation
call HT for the document (if validation failed)
save all data into the document record

If amounts validation fails (the total amount does not match the calculated result), a Human Task is created so that a human can fix the extraction errors manually.

In this particular case, the Human Task can easily be avoided with the OOTB "regexReplacement" postprocessors, which automatically fix the most typical OCR errors:

Products invoice.json

	......
	"mlPostProcessors": [
		{
		"entityName": "Price",
		"name": "regexReplacement",
		"rules": {
			"o|O|e|c|C|Q|p|P": "0",
			"I|i|j|l": "1",
			"b|G": "6",
			"B": "8",
			"q": "9"
		}
		},
		{
		"entityName": "Total Amount",
		"name": "regexReplacement",
		"rules": {
			"o|O|e|c|C|Q|p|P": "0",
			"I|i|j|l": "1",
			"b|G": "6",
			"B": "8",
			"q": "9"
		}
		},
		{
		"entityName": "Total Discount",
		"name": "regexReplacement",
		"rules": {
			"o|O|e|c|C|Q|p|P": "0",
			"I|i|j|l": "1",
			"b|G": "6",
			"B": "8",
			"q": "9"
		}
		},		
		 {
		"name": "ieSampleStoreIeMlTaskResults"
		}
	],
	"validators": [
		{
		"name": "ieSampleValidateDocumentAmounts"
		}
	]
	}
}
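
The effect of such rules can be approximated with plain String.replaceAll. This is a sketch under assumptions, not the platform's implementation: it assumes each rule key is a regular expression replaced everywhere in the value, and that rules are applied in the order they are listed. RegexReplacementSketch is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical approximation of the "regexReplacement" rules; not the platform implementation.
class RegexReplacementSketch {

    // LinkedHashMap keeps insertion order, so rules run in the order listed in the JSON (assumed).
    static final Map<String, String> RULES = new LinkedHashMap<>();

    static {
        RULES.put("o|O|e|c|C|Q|p|P", "0");
        RULES.put("I|i|j|l", "1");
        RULES.put("b|G", "6");
        RULES.put("B", "8");
        RULES.put("q", "9");
    }

    // Applies every rule to the value, replacing all matches of the pattern.
    static String apply(String value) {
        String result = value;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            result = result.replaceAll(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(apply("I2O.5o")); // prints 120.50
        System.out.println(apply("B9.q5"));  // prints 89.95
    }
}
```

Because the rules rewrite letters unconditionally, they only make sense for entities that are expected to be purely numeric, such as the price and amount fields configured above.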

The AP run result in the IE_SAMPLE_DP_DOCUMENTS datastore:

Step 2. OOTB Modification

Let's customize the default OOTB Document Processing flow by adding an extra 'GetInvoiceResult' step after ML that persists the extracted invoice number and the validation result into the data store:

InvoiceProcessingSample_2.java

package eu.ibagroup.samples.iedp;

import eu.ibagroup.easyrpa.ap.dp.DocumentProcessor;
import eu.ibagroup.easyrpa.ap.iedp.IeDocumentProcessorBase;
import eu.ibagroup.easyrpa.engine.annotation.ApModuleEntry;
import eu.ibagroup.easyrpa.engine.apflow.TaskInput;
import eu.ibagroup.easyrpa.engine.apflow.TaskOutput;
import eu.ibagroup.samples.iedp.entity.IeDocument;
import eu.ibagroup.samples.iedp.task.GetInvoiceResult;
import lombok.extern.slf4j.Slf4j;
import org.slf4j.Logger;

import java.util.concurrent.CompletableFuture;

@ApModuleEntry(name = "IE DP Sample (Step 2)")
@Slf4j
public class InvoiceProcessingSample_2 extends DocumentProcessor<IeDocument> implements IeDocumentProcessorBase {

	@Override
	public Logger log() {
		return log;
	}

	@Override
	public CompletableFuture<TaskOutput> processDocument(TaskInput docInput) {
		return super.processDocument(docInput).thenCompose(execute(GetInvoiceResult.class));
	}

}
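
The override above relies on CompletableFuture.thenCompose to append one more asynchronous step to the flow returned by the parent class. The following sketch shows the same chaining pattern with the JDK alone; strings stand in for TaskInput and TaskOutput, and processDocument, execute and run here are simplified stand-ins, not the platform methods.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical JDK-only illustration of the chaining pattern; not platform code.
class ThenComposeSketch {

    // Stand-in for super.processDocument(...): the default flow, completed asynchronously.
    static CompletableFuture<String> processDocument(String input) {
        return CompletableFuture.supplyAsync(() -> input + " -> default flow");
    }

    // Stand-in for execute(SomeTask.class): a function that starts the next asynchronous step
    // from the previous step's output.
    static Function<String, CompletableFuture<String>> execute(String stepName) {
        return previous -> CompletableFuture.supplyAsync(() -> previous + " -> " + stepName);
    }

    // Runs the default flow, then the extra step, and waits for the final result.
    static String run(String input) {
        return processDocument(input)
                .thenCompose(execute("GetInvoiceResult"))
                .join();
    }

    public static void main(String[] args) {
        System.out.println(run("invoice.pdf")); // prints invoice.pdf -> default flow -> GetInvoiceResult
    }
}
```

thenCompose is used instead of thenApply because the appended step itself returns a future; thenCompose flattens the result so the chain stays a single CompletableFuture.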

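The task obtains the model output JSON from the document, checks for validation errors, and reads the "Invoice Number" entity. Note that the call to setIsValidInvoice presumes that IeDocument has an additional 'isValidInvoice' field, which the entity listing in Step 1 does not show. Here is the GetInvoiceResult code: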

GetInvoiceResult.java

package eu.ibagroup.samples.iedp.task;

import eu.ibagroup.easyrpa.ap.dp.tasks.to.ContextId;
import eu.ibagroup.easyrpa.ap.dp.tasks.to.DocumentId;
import eu.ibagroup.easyrpa.ap.iedp.validation.IeValidatorBase;
import eu.ibagroup.easyrpa.engine.annotation.ApTaskEntry;
import eu.ibagroup.easyrpa.engine.annotation.InputToOutput;
import eu.ibagroup.easyrpa.persistence.documentset.DocumentSet;
import eu.ibagroup.samples.iedp.entity.IeDocument;
import lombok.extern.slf4j.Slf4j;
import org.slf4j.Logger;

@ApTaskEntry(name = "Get Invoice Result")
@Slf4j
@InputToOutput(value = { ContextId.KEY, DocumentId.KEY })
public class GetInvoiceResult extends IeDocumentTask implements IeValidatorBase<IeDocument> {

	@Override
	public Logger log() {
		return log;
	}

	@Override
	public void execute() {
		IeDocument document = documentContext().getDocument();
		String invoiceNumber = getExtractedEntities().getValue("Invoice Number");
		boolean isValid = isValidDocument();
		document.setInvoiceNumber(invoiceNumber);
		document.setIsValidInvoice(isValid);

		documentContext().updateDocument(DocumentSet.Status.TAGGED_BY_MODEL);
	}

}

The run history shows our task in the document processing flow: