Intelligent Document Processing (IDP)
Overview
The Intelligent Document Processing (IDP) Sample addresses the processing of financial documents, such as invoices and remittance advices. These documents enable the fundamental transfer of value and currency in commerce and often come in a variety of forms and formats. The IDP Sample shows a way to automate accounts receivable processes such as billing and payment management through end-to-end processing of invoices and remittance advices, including PDFs and scanned images. It leverages machine learning to classify and extract data from incoming semi-structured documents of varying layouts.
A major challenge in accounts receivable processes is their complexity: they often involve numerous subsidiary companies and disparate financial systems. The IDP Sample addresses this challenge by automating the full cycle of document processing and transferring the data into a single open-source billing system, such as InvoicePlane. This gives companies the advantage of controlling their subsidiaries' finances and bundling all relevant invoices and payments in one central location.
Intelligent Document Processing Sample Lifecycle:
In this sample, the bot retrieves invoice and remittance advice documents from the dedicated folder in the File Storage, prepares the documents to be processed by the ML models, classifies the documents, extracts the relevant business data, creates human tasks to validate the extracted data, and imports the data into InvoicePlane, thus creating invoices and the corresponding payments in the InvoicePlane system.
Prerequisites
To set up and run the Intelligent Document Processing (IDP) Demo:
- Ensure that you have a Selenium node with a running Chrome driver.
- Upload the IDP Sample package to the Control Server. The package can be found in the following directory: https://<CS host>/nexus/repository/rpaplatform/eu/ibagroup/samples/ap/easy-rpa-idp-ap/<EasyRPA version>/easy-rpa-idp-ap-<EasyRPA version>-bin.zip
The source code can be found here: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/easy-rpa-ml-aps/easy-rpa-idp-ap
- Ensure the following details are provided for the IDP Sample automation process in the Automation Process Details tab:
- Capabilities: AP_RUN, SELENIUM;
- Repository ID: eu.ibagroup.samples.ap:easy-rpa-idp-ap:jar:full:<EasyRPA version>;
- Module Class: eu.ibagroup.sample.ml.idp.IdpSample.
- Install InvoicePlane locally. To download the latest version please visit https://www.invoiceplane.com/downloads. For InvoicePlane installation instructions please see Setup Invoice Plane Application. Installation steps are also described in https://github.com/InvoicePlane/InvoicePlane/blob/master/README.md.
- Provide the path to the installed InvoicePlane as invoiceplaneClientUrl in IDP Sample AP configuration parameters.
- Upload the demo test set into the File Storage bucket. Input documents (invoices and remittance advices) that need to be processed by the IDP Sample automation process have to be uploaded into the idp-sample folder of the File Storage before the automation process is launched. File Storage URL: https://<CS host>:8444/minio/data/idp_sample/
- To run this demo, only the IDP Sample automation process needs to be launched.
IDP Package Structure:
Folder | Description |
---|---|
IDP Sample | IDP automation process |
CL Document Processor | Standard classification automation process |
IE Document Processor | Standard information extraction automation process |
IDP_SAMPLE_CLASSIFICATION | Classification document set. Contains invoice and remittance advice test samples |
IDP_SAMPLE_INVOICE | Information extraction document set. Contains invoice test samples |
IDP_SAMPLE_REMITTANCE_ADVICE | Information extraction document set. Contains remittance advice test samples |
IDP Sample Document Classification | Classification document type. Defines document categories to be identified |
IDP Sample Invoice | Invoice information extraction document type. Defines the entities to be extracted from invoices. |
IDP Sample Remittance Advice | Remittance Advice information extraction document type. Defines the entities to be extracted from remittance advices |
Classification Task | Classification human task type. Defines the task input form in the Workspace |
Information Extraction Task | Information Extraction human task type. Defines the task input form in the Workspace |
easy-rpa-cldp-ap-<EasyRPA version>.jar | Root archive and dependencies. Contains code of Classification automation process |
easy-rpa-idp-ap-<EasyRPA version>.jar | Root archive and dependencies. Contains code of IDP Sample automation process |
easy-rpa-iedp-ap-<EasyRPA version>.jar | Root archive and dependencies. Contains code of Information Extraction automation process |
model.idp_classification-<version>.tar.gz | Classification ML model |
model.idp_ie_invoice-<version>.tar.gz | Invoice information extraction ML model |
model.idp_ie_remittance-<version>.tar.gz | Remittance Advice information extraction ML model |
invoiceplane.secrets | InvoicePlane login credentials |
Configuration Parameters for IDP Sample Automation Process:
Key | Default Value | Description |
---|---|---|
inputFolder | idp_sample/input | File Storage folder where input documents are stored. |
fileFilter | | Regular expression for files to select. |
cleanUpDemo | true | Boolean parameter for Data Store and InvoicePlane data purge |
configuration | { | JSON parameter that maps document types to the corresponding ML models. For each model it contains the model name, model version and document type name. Information extraction models also contain the parameter ‘task’, which defines the RPA task that works with the information extraction result and stores the extracted entities in the target system (InvoicePlane). If the parameter task is not set, the latest model result is stored only in the model_output_json column of the IDP_SAMPLE_DOCUMENTS data store. The parameter also contains the ImageMagick and Tesseract OCR settings. The scoreThreshold parameter defines the lower limit of the classification result score: if the score is lower than this threshold, the document is not sent straight to information extraction, and a classification human task is created in the Workspace first. |
invoiceplaneClientUrl | | InvoicePlane URL |
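The fileFilter parameter can be illustrated with a minimal sketch. It assumes that an empty filter selects every file and that the pattern must match the whole file name; how the sample actually applies the expression is not stated here, and select_files and the example pattern are hypothetical.

```python
import re


def select_files(file_names, file_filter):
    """Return the file names that match the fileFilter regular expression.

    Assumption: an empty filter selects everything, and the pattern must
    match the whole file name. The real sample may match differently.
    """
    if not file_filter:
        return list(file_names)
    pattern = re.compile(file_filter)
    return [name for name in file_names if pattern.fullmatch(name)]


# Hypothetical folder content and filter value.
names = ["invoice_001.pdf", "remittance_17.png", "notes.txt"]
selected = select_files(names, r".*\.(pdf|png)")
```

With this filter only the PDF and PNG files are selected, while notes.txt is skipped.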
Data Store for IDP Sample Automation Process:
Name | Columns |
---|---|
IDP_SAMPLE_DOCUMENTS | document_type, model_document_type, document_type_score, cl_result, ie_result, error_message, uuid, name, notes, status, url, s3_path, ocr_json, input_json, output_json, model_output_json, update_timestamp |
Columns description:
- document_type - final document type value (equals model_document_type if the classification confidence score is above the scoreThreshold set in configuration parameters).
- model_document_type - document type defined by classification model.
- document_type_score - model confidence score of document classification.
- cl_result - result of classification model execution on the document.
- ie_result - result of information extraction model execution on the document validated in Human Task.
- error_message - message displayed in case of an error.
- uuid - unique identifier of the document.
- name - input document name as it appears in a human task.
- notes - input document path inside file storage bucket.
- status - document processing status.
- url - input document file storage path.
- s3_path - input document path inside file storage bucket.
- ocr_json - result of OCR execution on the document.
- input_json - document input data for the latest human task.
- output_json - document output data of the latest human task.
- model_output_json - temporary field containing latest result of the executed model.
- update_timestamp - last update time of the data store record.
Secret Vault:
Alias | Value |
---|---|
invoiceplane.secrets | {"user": "admin@ibagroup.eu", "password": "o66Lc1Jn6Z"} |
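The alias value is a JSON object with the keys user and password. A minimal sketch of how such a value could be parsed and checked before login is shown below; parse_credentials is a hypothetical helper rather than an EasyRPA API, and the values are placeholders.

```python
import json


def parse_credentials(secret_value):
    """Parse a secret value of the form {"user": ..., "password": ...}."""
    data = json.loads(secret_value)
    missing = {"user", "password"} - set(data)
    if missing:
        raise ValueError(f"secret is missing keys: {sorted(missing)}")
    return data["user"], data["password"]


# Placeholder values for illustration only.
user, password = parse_credentials(
    '{"user": "user@example.com", "password": "changeme"}'
)
```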
Included Steps
Step 1. Purge Data
If the IDP Sample configuration parameter cleanUpDemo is set to ‘true’, the RPA bot removes all existing data records from InvoicePlane and the data store before starting the document processing workflow.
Step 2. Ingest Documents
The RPA bot retrieves documents from the dedicated folder in File Storage. It compiles a list of documents to be processed and creates records in the data store. The data store record of each document contains the initial name of the original document, the path to the File Storage folder with the original document, and an associated uuid. The result of document processing at each step of the IDP Sample automation process is also recorded in the data store. The status of a document that has just been retrieved for processing is 'NEW'.
When the list of documents has been generated, the RPA bot prepares batches of documents for processing. The number of documents in a batch is determined by the configuration parameter batchSize.
After this step a separate workflow of RPA and ML tasks is created for each document.
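The ingestion and batching described above can be sketched as follows. This is an illustration under assumptions, not the sample's code: DocumentRecord, ingest and make_batches are hypothetical names, the record holds only a few of the data store columns, and the file paths are made up.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class DocumentRecord:
    """Simplified stand-in for an IDP_SAMPLE_DOCUMENTS record."""
    name: str
    s3_path: str
    uuid: str = field(default_factory=lambda: str(uuid4()))
    status: str = "NEW"  # status of a freshly ingested document


def ingest(paths):
    """Create one record per input file path."""
    return [DocumentRecord(name=p.rsplit("/", 1)[-1], s3_path=p) for p in paths]


def make_batches(records, batch_size):
    """Split the record list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


# Hypothetical input paths.
paths = [f"idp_sample/input/doc_{i}.pdf" for i in range(5)]
records = ingest(paths)
batches = make_batches(records, batch_size=2)
```

Five documents with a batch size of 2 produce three batches, the last of which holds a single document.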
Step 3. Prepare Documents
In this step, input data for ML model execution is prepared. Document images are cleaned with ImageMagick and sent to Tesseract OCR for text recognition. The File Storage bucket name, Tesseract options and ImageMagick options are provided in configuration parameters. Files created as a result of processing the original document are saved to the same idp-sample File Storage folder where the original document is stored.
Step 4. Classify Documents
Once a document is prepared, a pretrained ML Classification model is executed to predict the document’s category: Invoice or Remittance Advice. The ML Classification model also provides a confidence score indicating how confident it is that the assigned classification tag is correct. The confidence score threshold is provided in configuration parameters. If the document is classified with a score that meets the threshold, it is sent to the next step for processing. If the classification score is below the threshold, a human task is created as an additional step to verify the accuracy of the document classification.
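The threshold decision can be sketched as below. The function name and the returned labels are hypothetical. A score equal to the threshold is treated as sufficient because the human task is described as being created only when the score is lower than the threshold; the sample's exact boundary handling is an assumption.

```python
def route_after_classification(score, score_threshold):
    """Decide the next step for a classified document.

    Scores lower than the threshold go to a classification human task,
    everything else goes straight to information extraction.
    """
    if score < score_threshold:
        return "CLASSIFICATION_HUMAN_TASK"
    return "INFORMATION_EXTRACTION"


# Hypothetical scores checked against a threshold of 0.8.
confident = route_after_classification(0.93, 0.8)
uncertain = route_after_classification(0.41, 0.8)
```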
Step 5. Extract Data
Depending on the result of the ML Classification model, the Invoice or Remittance Advice ML Information Extraction model is employed to extract the specific fields that need to be stored in the target system.
Step 6. Verify Extracted Data
This step enables human verification and corrections to ensure accuracy of data extracted from a document before it is imported into the target system. After the relevant business entities have been extracted from a document, a human task is created and needs to be completed in Workspace. It contains ML Information Extraction model output that humans can review, validate and correct. Once a document is verified it is picked up by the bot for further processing. The correct order of processing invoices and remittance advices needs to be observed, thus invoice human tasks need to be completed before remittance advice human tasks.
Step 7. Import Extracted Data into Target System
As soon as a human task is completed an RPA task is created to input the extracted data into InvoicePlane system to complete the end-to-end automation. Depending on the document category, either Invoice or Payment RPA task is called.
Invoice RPA Task:
This task is performed for the documents that were classified as 'Invoice' by ML Classification model.
The RPA bot opens the InvoicePlane url defined in invoiceplaneClientUrl setting. Login popup appears where the bot enters credentials from invoiceplane.secrets secret vault alias.
After login the RPA bot opens Tax Rate from Settings menu and checks if the invoice tax rate already exists in InvoicePlane. The search is performed by tax rate name, which equals the size of tax rate without the percent symbol. If the tax rate doesn’t exist in the system the bot creates it.
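The tax rate name used for the search can be derived as in the following sketch. It assumes the extracted value is a string such as "20%"; tax_rate_name is a hypothetical helper.

```python
def tax_rate_name(extracted_tax_rate):
    """Return the tax rate name: the rate value without the percent symbol."""
    return extracted_tax_rate.replace("%", "").strip()


name = tax_rate_name("20%")
```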
When the tax rate is set the bot opens View Clients from Clients menu and checks if the required client record already exists in InvoicePlane. To perform the search the extracted value ‘Company Name’ is input into the search field.
If the client record doesn’t exist in InvoicePlane catalogue the bot opens the Add Client form, fills in client’s business data that was extracted from the invoice document and clicks Save.
Once the client record is created the bot creates a new invoice. It opens Create Invoice form, inputs the extracted value 'Company Name' into the Client field, selects the existing 'Client' from the list that appears, selects 'Invoice Date' and clicks Submit button.
On the New Invoice page the bot opens Add Invoice Tax from the Options menu, inputs the extracted value 'Tax Rate' into the Invoice Tax Rate field, selects the 'Tax Rate' from the list that appears and clicks Submit button.
The bot changes invoice status to ‘Sent’, sets payment method as 'Cash' and selects 'Due Date'. To add products to the invoice the bot fills in Item, Description, Quantity, Price and Discount Rate fields with the extracted data.
The bot inputs the extracted 'Invoice Number' into Invoice # field and clicks Save.
The document receives status 'READY' in the data store.
A new invoice that replicates the data from the scanned document is successfully created in InvoicePlane.
If the bot encounters an incorrect data error, for example, the number of product items doesn’t correspond to the number of other product data such as quantity, description and price, the extracted invoice data is not imported to InvoicePlane and the error details are saved to the data store. The document status in the data store is changed to 'ERROR'.
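The consistency check from the example above can be sketched as follows. It assumes the extracted product data arrives as parallel lists; check_line_items and the message wording are hypothetical.

```python
def check_line_items(items, descriptions, quantities, prices):
    """Return an error message if the product data lists differ in length.

    Returns None when every list has the same number of entries.
    """
    counts = {
        "items": len(items),
        "descriptions": len(descriptions),
        "quantities": len(quantities),
        "prices": len(prices),
    }
    if len(set(counts.values())) != 1:
        return f"Inconsistent product data counts: {counts}"
    return None


# Two items but only one quantity: the document would end up in 'ERROR'.
error = check_line_items(["Pen", "Ink"], ["Blue pen", "Black ink"], ["3"], ["1.50", "4.00"])
status = "ERROR" if error else "READY"
```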
Payment RPA Task:
This task is performed for the documents that were classified as 'Remittance Advice' by ML classification model.
The RPA bot opens the InvoicePlane url defined in invoiceplaneClientUrl setting. Login popup appears where the bot enters credentials from invoiceplane.secrets secret vault alias.
After login the bot opens View Invoices from Invoices menu and checks if an invoice with the invoice number that equals the one extracted from remittance advice exists in InvoicePlane. To perform the search the extracted value ‘Invoice Number’ is input into the search field.
If the required invoice exists in InvoicePlane the bot opens Enter Payment from Options menu of the invoice. The bot inputs the values ‘Payment Amount’ and ‘Payment Date’ that were extracted from remittance advice documents into the corresponding fields and clicks Submit.
The document receives status 'READY' in the data store.
The payment for the invoice is successfully created in InvoicePlane.
The invoice receives status 'Paid' in InvoicePlane.
If the required invoice doesn’t exist in InvoicePlane the extracted remittance advice data is not imported, a payment is not created and the error details are saved to the data store. The document status in the data store is changed to 'ERROR'.
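The decision logic of the Payment RPA task can be sketched as below. The real bot drives the InvoicePlane web UI; here an in-memory dictionary stands in for the InvoicePlane invoice list, and import_payment, the dictionary layout and the error text are hypothetical.

```python
def import_payment(invoices, extracted):
    """Record a payment for an existing invoice or report an error.

    invoices: dict mapping invoice number to an invoice dict (stand-in
    for InvoicePlane). extracted: entities taken from the remittance advice.
    Returns the values to write to the data store record.
    """
    number = extracted["Invoice Number"]
    invoice = invoices.get(number)
    if invoice is None:
        # No matching invoice: nothing is imported and the error is kept.
        return {"status": "ERROR", "error_message": f"Invoice {number} not found"}
    invoice["payments"].append(
        {"amount": extracted["Payment Amount"], "date": extracted["Payment Date"]}
    )
    invoice["status"] = "Paid"
    return {"status": "READY", "error_message": None}


# Hypothetical data.
invoices = {"INV-100": {"status": "Sent", "payments": []}}
ok = import_payment(
    invoices,
    {"Invoice Number": "INV-100", "Payment Amount": "250.00", "Payment Date": "2021-03-01"},
)
missing = import_payment(
    invoices,
    {"Invoice Number": "INV-999", "Payment Amount": "10.00", "Payment Date": "2021-03-02"},
)
```

This lookup is also why invoice human tasks have to be completed before remittance advice human tasks: a payment can only be attached to an invoice that already exists.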
Step 8. Documents are saved in the relevant document set
Documents are stored in the corresponding document sets. Documents of the Invoice class are saved in the READY or ERROR status acquired previously (in Step 7) in the IDP_SAMPLE_INVOICE document set. Documents of the Remittance Advice class are saved in the READY or ERROR status acquired previously (in Step 7) in the IDP_SAMPLE_REMITTANCE_ADVICE document set.
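The mapping from document class to document set can be sketched as a simple lookup. The document set names come from the package structure above; the class labels and the function name are assumptions.

```python
DOCUMENT_SETS = {
    "Invoice": "IDP_SAMPLE_INVOICE",
    "Remittance Advice": "IDP_SAMPLE_REMITTANCE_ADVICE",
}


def target_document_set(document_type):
    """Return the document set in which a document of this class is saved."""
    try:
        return DOCUMENT_SETS[document_type]
    except KeyError:
        raise ValueError(f"Unknown document type: {document_type}") from None


invoice_set = target_document_set("Invoice")
```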
Document Set Version of the Sample
There is a Document Set version of the Intelligent Document Processing Sample: IDP Sample (on DocumentSet). It has the same functionality, with the difference that documents are stored in the corresponding document sets instead of the IDP_SAMPLE_DOCUMENTS data store.
The AP obtains documents from storage, adds them to the IDP_SAMPLE_CLASSIFICATION document set, and performs classification. It then creates the corresponding document in the IDP_SAMPLE_INVOICE or IDP_SAMPLE_REMITTANCE_ADVICE document set, performs information extraction, and so on. The sample demonstrates how to propagate documents between DocumentContexts in Document Processors AP development.