Develop Automation Process (IE) using Document Processors
Preface
For the ML Automation Process we are going to use the platform Document Processors as a base.
In our step-by-step example we're going to implement EasyRPA Storage scanning for new incoming invoices.
The complete project can be obtained from the samples Git repository: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/confluence-samples/ie-sample-dp
First, generate a project from the archetype; see Generate project from archetype for details. Then add the following dependency to the pom.xml of the generated project:
<dependencies>
    . . . .
    <dependency>
        <groupId>eu.ibagroup</groupId>
        <artifactId>easy-rpa-aps</artifactId>
        <version>${rpaplatform.version}</version>
    </dependency>
    . . . .
</dependencies>
Second, define the configuration for the AP run:
- set datastore name for documents
- set storage path for files
- set run model
- set OCR options
The platform properties file:
Step 1. OOTB Document Processing
Retrieve documents
The Information Extraction process begins with document retrieval. The table below lists example data sources and suggestions on how to work with them.
# | Data Source | Suggestions |
---|---|---|
1 | Emails in some mailbox | The library you need for scanning a mailbox depends on the protocol through which the mailbox can be accessed. EasyRPA provides an Email Client utility that covers the two most popular protocols and can be used for scanning and sending emails. You can scan emails using search terms such as a keyword in the subject, a specific sender, or a date range. |
2 | Files from Shared Network Folder | Sometimes you need to scan a specific folder on a network drive for new files. There are 2 ways to access network folders: |
3 | Files from EasyRPA file storage | This is a good option if you can agree with the business process operators that they put the target documents into EasyRPA file storage. You can then scan it using the existing Storage Manager. |
4 | Files from FTP | You may also need to scan an FTP server to retrieve incoming documents. In that case, use an FTP client library for Java (e.g. Apache Commons Net). Note that some FTP servers also have a simple Web UI, so the files can be accessed with ordinary HTTP GET requests. |
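For row #2, once the shared folder is reachable as an ordinary path (for example, a mounted drive), it can be scanned with the standard java.nio.file API before the files are handed to the platform task. The sketch below is only an illustration: the class name IncomingFolderScanner, the method findFiles and the ".pdf" filter are assumptions for this example and are not part of EasyRPA.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not an EasyRPA class.
public class IncomingFolderScanner {

    /** Returns the regular files in the folder whose names end with the given extension, sorted by path. */
    public static List<Path> findFiles(Path folder, String extension) throws IOException {
        String suffix = extension.toLowerCase(Locale.ROOT);
        // Files.list keeps a directory handle open, so the stream is closed via try-with-resources.
        try (Stream<Path> entries = Files.list(folder)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(suffix))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for the shared network folder.
        Path folder = Files.createTempDirectory("incoming-invoices");
        Files.createFile(folder.resolve("invoice-001.pdf"));
        Files.createFile(folder.resolve("notes.txt"));
        for (Path file : findFiles(folder, ".pdf")) {
            System.out.println("Found: " + file.getFileName());
        }
    }
}
```

Each returned path could then be passed on to the file-based import task; how that hand-over is done is not shown here.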
Currently the platform OOTB provides the following tasks for obtaining documents:
- ImportDocumentFromStorageTask - for #3 without any code
- ImportDocumentFromFileTask - for #1, #2 and #4, but with coding: because ImportDocumentFromFileTask operates on files only, you have to use third-party libraries to obtain the file and then pass it to the task.
Reuse OOTB code with no flow modification
The Automation Process (AP) class will look like the following:
package eu.ibagroup.samples.iedp;

import eu.ibagroup.easyrpa.ap.dp.DocumentProcessor;
import eu.ibagroup.easyrpa.ap.iedp.IeDocumentProcessorBase;
import eu.ibagroup.easyrpa.engine.annotation.ApModuleEntry;
import eu.ibagroup.samples.iedp.entity.IeDocument;
import lombok.extern.slf4j.Slf4j;
import org.slf4j.Logger;

@ApModuleEntry(name = "InvoiceProcessingSample")
@Slf4j
public class InvoiceProcessingSample_1 extends DocumentProcessor<IeDocument> implements IeDocumentProcessorBase {

    @Override
    public Logger log() {
        return log;
    }
}
The entity defines a new 'invoiceNumber' field to be filled with the extracted invoice number:
package eu.ibagroup.samples.iedp.entity;

import eu.ibagroup.easyrpa.ap.dp.entity.DpDocument;
import eu.ibagroup.easyrpa.persistence.annotation.Column;
import eu.ibagroup.easyrpa.persistence.annotation.Entity;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.ToString;

@Data
@NoArgsConstructor
@AllArgsConstructor
@Entity(value = "IE_SAMPLE_DP_DOCUMENTS")
@ToString
public class IeDocument extends DpDocument {

    @Column("invoice_number")
    private String invoiceNumber;
}
The repository:
package eu.ibagroup.samples.iedp.repository;

import eu.ibagroup.easyrpa.ap.dp.repository.DpDocumentRepository;
import eu.ibagroup.samples.iedp.entity.IeDocument;

public interface IeDocumentRepository extends DpDocumentRepository<IeDocument> {
}
The custom postprocessor fills the entity 'invoiceNumber' field with the invoice number extracted by ML:
package eu.ibagroup.samples.iedp.postprocessors;

import javax.inject.Inject;

import eu.ibagroup.easyrpa.ap.dp.annotation.PostProcessorMethod;
import eu.ibagroup.easyrpa.ap.dp.annotation.PostProcessorStrategies;
import eu.ibagroup.easyrpa.ap.dp.postprocessing.BasePostProcessor;
import eu.ibagroup.easyrpa.ap.iedp.postprocessing.IePostProcessorBase;
import eu.ibagroup.easyrpa.persistence.documentset.DocumentSet;
import eu.ibagroup.samples.iedp.entity.IeDocument;
import eu.ibagroup.samples.iedp.repository.IeDocumentRepository;
import eu.ibagroup.samples.iedp.transformation.IeDocumentTransformation;
import lombok.Getter;

@PostProcessorStrategies("ie")
public class InvoicePostprocessor extends BasePostProcessor<IeDocument> implements IePostProcessorBase<IeDocument> {

    @Inject
    @Getter
    private IeDocumentRepository documentRepository;

    @PostProcessorMethod("ieSampleStoreIeMlTaskResults")
    public void idpSampleStoreClMlTaskResults() {
        log().debug("Storing response from IE ML Task for document {} ", documentContext().getDocumentId());
        IeDocument document = documentContext().getDocument();
        String invoiceNumber = getExtractedEntities().getValue(IeDocumentTransformation.INVOICE_NUMBER);
        document.setInvoiceNumber(invoiceNumber);
        documentContext().updateDocument(DocumentSet.Status.TAGGED_BY_MODEL);
    }
}
The custom validator validates the extracted invoice amounts after ML:
package eu.ibagroup.samples.iedp.validators;

import java.math.BigDecimal;
import java.math.RoundingMode;

import javax.inject.Inject;

import eu.ibagroup.easyrpa.ap.dp.annotation.PostProcessorMethod;
import eu.ibagroup.easyrpa.ap.dp.annotation.PostProcessorStrategies;
import eu.ibagroup.easyrpa.ap.dp.postprocessing.BasePostProcessor;
import eu.ibagroup.easyrpa.ap.dp.validation.ValidationMessage;
import eu.ibagroup.samples.iedp.entity.IeDocument;
import eu.ibagroup.samples.iedp.repository.IeDocumentRepository;
import eu.ibagroup.samples.iedp.to.InvoiceItemTO;
import eu.ibagroup.samples.iedp.to.InvoiceTO;
import eu.ibagroup.samples.iedp.to.TaxRateTO;
import eu.ibagroup.samples.iedp.transformation.IeDocumentTransformation;
import lombok.Getter;

@PostProcessorStrategies("ie")
public class InvoiceValidator extends BasePostProcessor<IeDocument> implements IeDocumentTransformation {

    @Inject
    @Getter
    private IeDocumentRepository documentRepository;

    @PostProcessorMethod("ieSampleValidateDocumentAmounts")
    public void idpSampleValidateInvoiceAmounts() {
        InvoiceTO invoice = getInvoice();

        // Sum of price * quantity over all items; a missing quantity counts as 1.
        BigDecimal total = new BigDecimal(0);
        for (int i = 0; i < invoice.getItems().size(); i++) {
            InvoiceItemTO item = invoice.getItems().get(i);
            if (isEmpty(item.getPrice())) {
                addMessages(ValidationMessage.error(PRICE + " '" + i + "' should not be empty "));
            } else {
                total = total.add(item.getPrice().multiply(optional(item.getQuantity()).orElse(BigDecimal.valueOf(1))));
            }
        }

        // Add the tax, given as a percentage of the items total.
        BigDecimal invoiceTax = optional(invoice.getInvoiceTax()).orElse(new TaxRateTO(BigDecimal.valueOf(0))).getPercent();
        total = total.add(total.multiply(invoiceTax).divide(BigDecimal.valueOf(100)).setScale(2, RoundingMode.HALF_UP));

        // Subtract the discount: a percentage of the taxed total if present, otherwise a fixed amount (or 0).
        BigDecimal finalTotal = total;
        BigDecimal invoiceDiscountAmount = optional(invoice.getInvoiceDiscountTax())
                .map(t -> finalTotal.multiply(t).divide(BigDecimal.valueOf(100)).setScale(2, RoundingMode.HALF_UP))
                .orElse(optional(invoice.getInvoiceDiscount()).orElse(BigDecimal.valueOf(0)));
        total = total.subtract(invoiceDiscountAmount);

        // compareTo ignores scale differences (e.g. 27.5 vs 27.50).
        if (total.compareTo(invoice.getAmount()) != 0) {
            addMessages(ValidationMessage.error(TOTAL_AMOUNT + " should be " + total + " instead of " + invoice.getAmount()));
        }
    }
}
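To make the arithmetic of the validator easier to follow, here is a minimal standalone sketch of the same calculation in plain Java. It does not use the platform classes: InvoiceTotalExample, Item and expectedTotal are hypothetical names introduced only for this illustration, and null arguments stand in for fields that were not extracted.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone illustration, not part of the sample project.
public class InvoiceTotalExample {

    private static final BigDecimal HUNDRED = BigDecimal.valueOf(100);

    /** One invoice line; a null quantity is treated as 1. */
    public static class Item {
        final BigDecimal price;
        final BigDecimal quantity;

        public Item(String price, String quantity) {
            this.price = new BigDecimal(price);
            this.quantity = quantity == null ? null : new BigDecimal(quantity);
        }
    }

    /** Returns percent % of base, rounded to 2 decimal places (HALF_UP). */
    public static BigDecimal percentOf(BigDecimal base, BigDecimal percent) {
        return base.multiply(percent).divide(HUNDRED).setScale(2, RoundingMode.HALF_UP);
    }

    /** Items total, plus tax, minus discount (percentage discount takes precedence over a fixed amount). */
    public static BigDecimal expectedTotal(List<Item> items, BigDecimal taxPercent,
                                           BigDecimal discountPercent, BigDecimal discountAmount) {
        BigDecimal total = BigDecimal.ZERO;
        for (Item item : items) {
            BigDecimal quantity = item.quantity != null ? item.quantity : BigDecimal.ONE;
            total = total.add(item.price.multiply(quantity));
        }
        total = total.add(percentOf(total, taxPercent != null ? taxPercent : BigDecimal.ZERO));

        BigDecimal discount;
        if (discountPercent != null) {
            discount = percentOf(total, discountPercent);
        } else if (discountAmount != null) {
            discount = discountAmount;
        } else {
            discount = BigDecimal.ZERO;
        }
        return total.subtract(discount);
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(new Item("10.00", "2"), new Item("5.50", null));
        // Items: 20.00 + 5.50 = 25.50; tax 20% = 5.10 -> 30.60; discount 10% = 3.06 -> 27.54
        BigDecimal total = expectedTotal(items, new BigDecimal("20"), new BigDecimal("10"), null);
        System.out.println(total); // prints 27.54
    }
}
```

If the amount extracted from the document differs from this recalculated total, the validator reports an error for the total amount field.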
The document type JSON extended with custom postprocessor and validator definition:
With only the 5 classes above and a change to the document type, full-featured document processing is ready to run.
Below are the steps that are included in the OOTB document processing:
- obtain files from storage
- create a document record for every file in the IE_SAMPLE_DP_DOCUMENTS data store
- OCR the document
- call ML task for document
- do postprocessing
- do validation
- call HT for document (if validation failed)
- save all data into document record
If amounts validation fails (the total amount does not match the calculated result), a Human Task is created so that a human can fix the extraction errors manually:
In this particular case, the Human Task can easily be avoided with the OOTB "regexReplacement" postprocessors, which automatically fix the most typical OCR errors:
The AP run result in the IE_SAMPLE_DP_DOCUMENTS data store:
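The configuration of the platform "regexReplacement" postprocessors is not reproduced in this text. Purely as an illustration of the idea, the plain-Java sketch below applies a few ordered regex replacements to an OCR-ed amount string. The class OcrAmountCleanup and all of its rules are assumptions made up for this example; they are not the platform's rules or API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of regex-based OCR cleanup; not the platform postprocessor.
public class OcrAmountCleanup {

    // Assumed replacement rules (regex -> replacement), applied in insertion order.
    private static final Map<String, String> RULES = new LinkedHashMap<>();

    static {
        RULES.put("[Oo]", "0");          // letter O misread for zero
        RULES.put("[lI|]", "1");         // l, I or | misread for one
        RULES.put("S", "5");             // S misread for five
        RULES.put("[\\s$]", "");         // drop whitespace and the dollar sign
        RULES.put(",(\\d{2})$", ".$1");  // trailing decimal comma -> decimal point
    }

    /** Applies every rule in order and returns the cleaned value. */
    public static String clean(String raw) {
        String value = raw;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            value = value.replaceAll(rule.getKey(), rule.getValue());
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(clean("1O5,OO"));   // prints 105.00
        System.out.println(clean("$ l2S.50")); // prints 125.50
    }
}
```

Rules like these are only safe for fields that are expected to be numeric; applied to free text they would corrupt it.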
Step 2. OOTB Modification
Let's customize the default OOTB Document Processing flow: add an extra 'GetInvoiceResult' step after ML that persists the extracted invoice number and the validation result into the data store:
package eu.ibagroup.samples.iedp;

import eu.ibagroup.easyrpa.ap.dp.DocumentProcessor;
import eu.ibagroup.easyrpa.ap.iedp.IeDocumentProcessorBase;
import eu.ibagroup.easyrpa.engine.annotation.ApModuleEntry;
import eu.ibagroup.easyrpa.engine.apflow.TaskInput;
import eu.ibagroup.easyrpa.engine.apflow.TaskOutput;
import eu.ibagroup.samples.iedp.entity.IeDocument;
import eu.ibagroup.samples.iedp.task.GetInvoiceResult;
import lombok.extern.slf4j.Slf4j;
import org.slf4j.Logger;

import java.util.concurrent.CompletableFuture;

@ApModuleEntry(name = "IE DP Sample (Step 2)")
@Slf4j
public class InvoiceProcessingSample_2 extends DocumentProcessor<IeDocument> implements IeDocumentProcessorBase {

    @Override
    public Logger log() {
        return log;
    }

    @Override
    public CompletableFuture<TaskOutput> processDocument(TaskInput docInput) {
        // Run the default flow first, then append the custom task to it.
        return super.processDocument(docInput).thenCompose(execute(GetInvoiceResult.class));
    }
}
The GetInvoiceResult task obtains the model output JSON from the document, checks it for errors, and reads the "Invoice Number" entity. Here is its code:
package eu.ibagroup.samples.iedp.task;

import eu.ibagroup.easyrpa.ap.dp.tasks.to.ContextId;
import eu.ibagroup.easyrpa.ap.dp.tasks.to.DocumentId;
import eu.ibagroup.easyrpa.ap.iedp.validation.IeValidatorBase;
import eu.ibagroup.easyrpa.engine.annotation.ApTaskEntry;
import eu.ibagroup.easyrpa.engine.annotation.InputToOutput;
import eu.ibagroup.easyrpa.persistence.documentset.DocumentSet;
import eu.ibagroup.samples.iedp.entity.IeDocument;
import lombok.extern.slf4j.Slf4j;
import org.slf4j.Logger;

@ApTaskEntry(name = "Get Invoice Result")
@Slf4j
@InputToOutput(value = { ContextId.KEY, DocumentId.KEY })
public class GetInvoiceResult extends IeDocumentTask implements IeValidatorBase<IeDocument> {

    @Override
    public Logger log() {
        return log;
    }

    @Override
    public void execute() {
        IeDocument document = documentContext().getDocument();
        String invoiceNumber = getExtractedEntities().getValue("Invoice Number");
        boolean isValid = isValidDocument();
        document.setInvoiceNumber(invoiceNumber);
        // Note: the IeDocument entity shown in Step 1 declares only invoiceNumber;
        // this call assumes the entity also has an isValidInvoice field.
        document.setIsValidInvoice(isValid);
        documentContext().updateDocument(DocumentSet.Status.TAGGED_BY_MODEL);
    }
}
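The override above relies on CompletableFuture.thenCompose to append a step to the flow returned by the parent class. The following standalone sketch shows that chaining pattern with the standard library only. The names FlowChainingExample, baseFlow and extraStep are stand-ins invented for this illustration, and plain strings replace the platform's TaskInput and TaskOutput types.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical illustration of the chaining pattern; no platform classes are used.
public class FlowChainingExample {

    /** Stand-in for the default flow: completes with the state after the built-in steps. */
    public static CompletableFuture<String> baseFlow(String documentId) {
        return CompletableFuture.supplyAsync(() -> documentId + ":processed");
    }

    /** Stand-in for an extra task: takes the previous output and asynchronously produces its own. */
    public static Function<String, CompletableFuture<String>> extraStep(String name) {
        return previous -> CompletableFuture.supplyAsync(() -> previous + ":" + name);
    }

    /** The extra step starts only after the base flow has completed, and receives its output. */
    public static CompletableFuture<String> processDocument(String documentId) {
        return baseFlow(documentId).thenCompose(extraStep("GetInvoiceResult"));
    }

    public static void main(String[] args) {
        System.out.println(processDocument("doc-1").join()); // prints doc-1:processed:GetInvoiceResult
    }
}
```

thenCompose is used rather than thenApply because the appended step itself returns a future; thenCompose flattens the result into a single CompletableFuture instead of a nested one.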
The run history has our tasks in the document processing flow:
The AP run result in the IE_SAMPLE_DP_DOCUMENTS data store: