Develop Automation Process (Classification) using Document Processors

Preface

For the ML Automation Process we are going to use the platform Document Processors as a base.

In our step-by-step example we're going to implement EasyRPA Storage scanning for new incoming invoices.

The complete project is available in the samples Git repository: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/confluence-samples/classification-sample-dp

First, generate a project from the archetype; see Generate project from archetype for details. Then add the following dependency to the pom.xml of the generated project:

pom.xml
	 <dependencies>
		 . . . .
		<dependency>
			<groupId>eu.ibagroup</groupId>
			<artifactId>easy-rpa-aps</artifactId>
			<version>${rpaplatform.version}</version>
		 </dependency>
		 . . . .
	</dependencies>

Next, define the configuration for the AP run:

  • set datastore name for documents
  • set storage path for files
  • set run model
  • set OCR options
apm_run.properties
# EasyRPA Client Configuration File
inputFolder=cl_sample_dp/input
fileFilter=.*\\.pdf
removeImported=false
configuration={"DEFAULT": {"dataStore": "CL_SAMPLE_DP_DOCUMENTS","documentType": "Incoming Documents Classification","model": "incoming_documents_classification","runModel": "incoming_documents_classification,1.0","storagePath": "cl_sample_dp","bucket": "data","tesseractOptions": ["-l", "eng", "--psm", "12", "--oem", "3", "--dpi", "150"],"imageMagickOptions": ["-units", "PixelsPerInch", "-resample", "150", "-density", "150", "-quality", "100", "-background", "white", "-deskew", "40%", "-contrast", "-alpha", "flatten"]}}
classification_threshold=0.53
remoteExecutionService.enableLocalExecution=true
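
The double backslash in fileFilter is easy to misread, so here is a minimal plain-Java sketch of how java.util.Properties unescapes the value and how the resulting regular expression behaves. RunConfigSketch and its methods are hypothetical helpers, not part of the EasyRPA API, and the sketch assumes the platform matches the pattern against the whole file name, which the source does not state.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the EasyRPA API: shows how the values above
// are read by java.util.Properties and how the fileFilter regex behaves.
final class RunConfigSketch {

    // Loads key=value text the same way a .properties file is loaded.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Keeps only the names that match the whole regular expression
    // (whole-name matching is an assumption about the platform).
    static List<String> filter(String regex, List<String> names) {
        Pattern pattern = Pattern.compile(regex);
        return names.stream()
                .filter(name -> pattern.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // In Java source each backslash is doubled once more than in the file.
        Properties props = load("fileFilter=.*\\\\.pdf\nclassification_threshold=0.53\n");
        String regex = props.getProperty("fileFilter");
        double threshold = Double.parseDouble(props.getProperty("classification_threshold"));
        System.out.println(regex + " / " + threshold);
        System.out.println(filter(regex, Arrays.asList("a.pdf", "b.PDF", "c.pdf.bak")));
    }
}
```

Under these assumptions the loaded pattern is `.*\.pdf`, so the dot before the extension is literal and a name such as b.PDF is skipped unless the filter is made case-insensitive.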



Step 1. OOTB Document Processing

This step is completely the same as for Information Extraction: Step 1. Prepare input documents (IE)

Reuse OOTB code with no flow modification

The Automation Process (AP) class looks like the following:

InvoiceClassificationSample_1.java
package eu.ibagroup.samples.cldp;

import eu.ibagroup.easyrpa.ap.cldp.ClDocumentProcessorBase;
import eu.ibagroup.easyrpa.ap.dp.DocumentProcessor;
import eu.ibagroup.easyrpa.engine.annotation.ApModuleEntry;
import eu.ibagroup.samples.cldp.entity.ClDocument;
import lombok.extern.slf4j.Slf4j;
import org.slf4j.Logger;

@ApModuleEntry(name = "Classification DP Sample (Step 1)")
@Slf4j
public class InvoiceClassificationSample_1 extends DocumentProcessor<ClDocument> implements ClDocumentProcessorBase {

	@Override
	public Logger log() {
		return log;
	}

}

The entity defines two new fields, 'category' and 'score', for later use:

ClDocument.java
package eu.ibagroup.samples.cldp.entity;

import eu.ibagroup.easyrpa.ap.dp.entity.DpDocument;
import eu.ibagroup.easyrpa.persistence.annotation.Column;
import eu.ibagroup.easyrpa.persistence.annotation.Entity;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.ToString;

@Data
@NoArgsConstructor
@AllArgsConstructor
@Entity(value = "CL_SAMPLE_DP_DOCUMENTS")
@ToString
public class ClDocument extends DpDocument {

	@Column("category")
	private String category;

	@Column("score")
	private Double score;

}

The repository:

ClDocumentRepository.java
package eu.ibagroup.samples.cldp.repository;

import eu.ibagroup.easyrpa.ap.dp.repository.DpDocumentRepository;
import eu.ibagroup.samples.cldp.entity.ClDocument;

public interface ClDocumentRepository extends DpDocumentRepository<ClDocument> {
}

The custom postprocessor fills the entity's 'category' and 'score' fields after the ML task:

StoreClMlResult.java
package eu.ibagroup.samples.cldp.postprocessors;

import javax.inject.Inject;

import eu.ibagroup.easyrpa.ap.dp.annotation.PostProcessorMethod;
import eu.ibagroup.easyrpa.ap.dp.annotation.PostProcessorStrategies;
import eu.ibagroup.easyrpa.ap.dp.model.ClCategories;
import eu.ibagroup.easyrpa.ap.dp.postprocessing.BasePostProcessor;
import eu.ibagroup.easyrpa.ap.dp.postprocessing.ClPostProcessorBase;
import eu.ibagroup.samples.cldp.entity.ClDocument;
import eu.ibagroup.samples.cldp.repository.ClDocumentRepository;
import lombok.Getter;

@PostProcessorStrategies("classification")
public class StoreClMlResult extends BasePostProcessor<ClDocument> implements ClPostProcessorBase<ClDocument> {

	@Inject
	@Getter
	private ClDocumentRepository documentRepository;

	@PostProcessorMethod("storeClMlTaskResults")
	public void idpSampleStoreClMlTaskResults() {

		log().debug("Storing response from CL ML Task for document {} ", documentContext().getDocumentId());
		ClDocument document = documentContext().getDocument();

		ClCategories clCategories = getClCategories();
		String resultCategory = clCategories.getCategory();
		Double resultScore = clCategories.getCategoryScore(resultCategory);
		document.setCategory(resultCategory);
		document.setScore(resultScore);
	}

}
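
The postprocessor above stores one category and its score. A minimal sketch of what that selection could look like follows, assuming the model returns one score per category and that getCategory() yields the highest-scoring one; the source does not confirm this. TopCategorySketch is a hypothetical stand-in, and the real ClCategories class may store and expose scores differently.

```java
import java.util.Map;

// Hypothetical stand-in for the classification result; the real ClCategories
// class may store and expose scores differently.
final class TopCategorySketch {

    // Returns the category with the highest score, or null for an empty map.
    static String topCategory(Map<String, Double> scores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }
}
```

The returned category name would then be used to look up the score that ends up in the 'score' column.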

The custom validator checks the number of pages in the document and sends multi-page documents to human review:

ValidateClMlResult.java
package eu.ibagroup.samples.cldp.validators;

import javax.inject.Inject;

import eu.ibagroup.easyrpa.ap.dp.annotation.PostProcessorMethod;
import eu.ibagroup.easyrpa.ap.dp.annotation.PostProcessorStrategies;
import eu.ibagroup.easyrpa.ap.dp.model.HocrInputJson;
import eu.ibagroup.easyrpa.ap.dp.postprocessing.BasePostProcessor;
import eu.ibagroup.easyrpa.ap.dp.validation.ClValidatorBase;
import eu.ibagroup.easyrpa.ap.dp.validation.ValidationMessage;
import eu.ibagroup.samples.cldp.entity.ClDocument;
import eu.ibagroup.samples.cldp.repository.ClDocumentRepository;
import lombok.Getter;

@PostProcessorStrategies("classification")
public class ValidateClMlResult extends BasePostProcessor<ClDocument> implements ClValidatorBase<ClDocument> {

	@Inject
	@Getter
	private ClDocumentRepository documentRepository;

	@PostProcessorMethod("validateClMlResults")
	public void idpSampleValidateInvoiceAmounts() {
		if(new HocrInputJson(documentContext()).findPages().size()>1 ) {
			addMessages(ValidationMessage.error("Please review classified category for multi-page document"));
		}
	}
}


The document type JSON is extended with the custom postprocessor and validator definitions:

Products invoice.json
{
	"importStrategy": "OVERRIDE",
	"name": "Incoming Documents Classification",
	"description": "Incoming Documents Classification",
	"humanTaskTypeName": "Classification Task",
	"settings": {
		"appLanguage": "en",
		"taskTypeLabel": "Incoming Documents Classification",
		"multipleChoice": false,
		"categories": [
			"Invoice",
			"Remittance Advice"
		],
		"metadata": [
			{
				"name": "isInvalid",
				"markLabel": "INVALID Document",
				"description": "Select, if you have problem with the document"
			},
			{
				"name": "error_message",
				"label": "Problem explanation",
				"type": "textarea",
				"required": false
			},
			{
				"name": "notes",
				"label": "Document notes",
				"type": "textarea",
				"required": false
			}
		],
		"mlPostProcessors": [
			{
				"name": "storeClMlTaskResults"
			}
		],
		"validators": [
			{
				"name": "validateClMlResults"
			}
		]
	}
}

With only the 5 classes above and a change to the document type, full-featured document processing is ready to run.

The OOTB classification document processing includes the following steps:

  • obtain files from storage
  • create a document record in the datastore CL_SAMPLE_DP_DOCUMENTS for every file
  • OCR the document
  • call ML task for document
  • call postprocessors (if any)
  • call validators (the default OOTB one to check score against threshold + custom validators if any) 
  • call HT for document (if validation failed)
  • save all data into document record
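
The validation step in this list decides whether a document goes to a human task. A minimal sketch of that decision is shown below, combining the OOTB threshold check with the custom multi-page rule from ValidateClMlResult. ReviewDecisionSketch is a hypothetical helper, and treating scores strictly below classification_threshold as failures is an assumption; the platform may compare differently.

```java
// Hypothetical decision helper; the names and the comparison direction are assumptions.
final class ReviewDecisionSketch {

    // True when the document should be sent to a human task.
    static boolean needsHumanReview(double score, double threshold, int pageCount) {
        boolean lowConfidence = score < threshold; // assumed: scores below the threshold fail validation
        boolean multiPage = pageCount > 1;         // mirrors the custom validator's page check
        return lowConfidence || multiPage;
    }
}
```

With the configured threshold of 0.53, a single-page document scored 0.9 would pass, while a score of 0.4 or a three-page document would be routed to review.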


The AP run result:




Step 2. OOTB Modification

This step is completely the same as for Information Extraction: Step 2. OOTB Modification

Let's customize the default OOTB Document Processing flow by adding an extra 'GetResult' step after ML that persists the category and score results into the data store:

InvoiceClassificationSample_2.java
package eu.ibagroup.samples.cldp;

import eu.ibagroup.easyrpa.ap.cldp.ClDocumentProcessorBase;
import eu.ibagroup.easyrpa.ap.dp.DocumentProcessor;
import eu.ibagroup.easyrpa.engine.annotation.ApModuleEntry;
import eu.ibagroup.easyrpa.engine.apflow.TaskInput;
import eu.ibagroup.easyrpa.engine.apflow.TaskOutput;
import eu.ibagroup.samples.cldp.entity.ClDocument;
import eu.ibagroup.samples.cldp.task.GetResult;
import lombok.extern.slf4j.Slf4j;
import org.slf4j.Logger;

import java.util.concurrent.CompletableFuture;

@ApModuleEntry(name = "Classification DP Sample (Step 2)")
@Slf4j
public class InvoiceClassificationSample_2 extends DocumentProcessor<ClDocument> implements ClDocumentProcessorBase {

	@Override
	public Logger log() {
		return log;
	}

	@Override
	public CompletableFuture<TaskOutput> processDocument(TaskInput docInput) {
		return super.processDocument(docInput).thenCompose(execute(GetResult.class));
	}

}
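
The override above relies on CompletableFuture.thenCompose to append a step once the default flow has finished. The following plain-Java sketch illustrates only that chaining mechanism. The String payload and the defaultFlow and extraStep names are placeholders rather than EasyRPA types, and it assumes execute(...) returns a function that starts the next asynchronous step.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Plain-Java illustration of thenCompose chaining; the String payload and the
// step names are placeholders, not EasyRPA types.
final class ComposeSketch {

    // Stands in for the default flow, which completes asynchronously.
    static CompletableFuture<String> defaultFlow(String input) {
        return CompletableFuture.supplyAsync(() -> input + " -> ml");
    }

    // Stands in for execute(GetResult.class): a function that starts the next async step.
    static Function<String, CompletableFuture<String>> extraStep(String name) {
        return previous -> CompletableFuture.supplyAsync(() -> previous + " -> " + name);
    }

    // The extra step runs only after the default flow has produced its output.
    static CompletableFuture<String> processDocument(String input) {
        return defaultFlow(input).thenCompose(extraStep("getResult"));
    }

    public static void main(String[] args) {
        System.out.println(processDocument("doc-1").join());
    }
}
```

Because thenCompose flattens the nested future, the caller still receives a single CompletableFuture for the whole chain, which is why the override can return it directly.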

We read the model output JSON from the document and obtain the resulting category and score. Here is the GetResult code:

GetResult.java
package eu.ibagroup.samples.cldp.task;

import eu.ibagroup.easyrpa.ap.dp.model.ClCategories;
import eu.ibagroup.easyrpa.ap.dp.tasks.to.ContextId;
import eu.ibagroup.easyrpa.ap.dp.tasks.to.DocumentId;
import eu.ibagroup.easyrpa.ap.dp.validation.ClValidatorBase;
import eu.ibagroup.easyrpa.engine.annotation.ApTaskEntry;
import eu.ibagroup.easyrpa.engine.annotation.InputToOutput;
import eu.ibagroup.easyrpa.persistence.documentset.DocumentSet;
import eu.ibagroup.samples.cldp.entity.ClDocument;
import lombok.extern.slf4j.Slf4j;
import org.slf4j.Logger;

@ApTaskEntry(name = "Get Classification Result")
@Slf4j
@InputToOutput(value = { ContextId.KEY, DocumentId.KEY })
public class GetResult extends ClDocumentTask implements ClValidatorBase<ClDocument> {

	@Override
	public Logger log() {
		return log;
	}

	@Override
	public void execute() {
		ClDocument document = documentContext().getDocument();

		ClCategories clCategories = getClCategories();
		String resultCategory = clCategories.getCategory();
		Double resultScore = clCategories.getCategoryScore(resultCategory);
		document.setCategory(resultCategory);
		document.setScore(resultScore);

		documentContext().updateDocument(DocumentSet.Status.TAGGED_BY_MODEL);
	}

}

The task sets the category and score on the document and saves it. The run history shows our task in the document processing flow: