Overview

Finext Sample performs automatic processing of PDF documents. The purpose of the process is to extract the following basic financial information from documents using the rule-based model:

Country references (full names and ISO 3166 codes)
Currency references (full names and ISO 4217 codes)
IBANs
SWIFT codes
Numbers and amounts (written as digits or in words)
Email addresses

The Finext Sample lifecycle includes the following steps:

Step 1. Ingest Documents
Step 2. Prepare Documents
Step 3. Extract Data
Step 4. Verify Extracted Data

Prerequisites

In order to successfully set up and run Finext Sample Process:

Ensure that you have a running node with the "AP_RUN" capability.
Upload the Finext Sample package to the Control Server. The package can be found at the following location: http://<CS host>/nexus/repository/rpaplatform/eu/ibagroup/samples/ap/easy-rpa-finext-ap/<EasyRPA version>/easy-rpa-finext-ap-<EasyRPA version>-bin.zip
The source code can be found here: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/easy-rpa-ml-aps/easy-rpa-finext-ap
Ensure the following details are provided for the Finext automation process in the Automation Process Details tab:
Module class: eu.ibagroup.sample.ml.finext.FinextSample
Group Id: eu.ibagroup.samples.ap
Artifact Id: easy-rpa-finext-ap
Version Id: <EasyRPA version>

Finext Sample Process Package structure

Folder	Description
Finext Sample	Finext Sample automation process
IE Document Processor	Standard information extraction automation process
FINEXT_SAMPLE	Information extraction document set. Contains financial documents test samples
Finext Sample	Invoice information extraction document type. Defines the entities to be extracted from financial documents.
Information Extraction Task	Information Extraction human task type. Defines the task input form in the Workspace
easy-rpa-finext-ap-<EasyRPA version>.jar	Root archive and dependencies. Contains code of Finext Sample automation process
easy-rpa-iedp-ap-<EasyRPA version>.jar	Root archive and dependencies. Contains code of IE Document Processor automation process
easyrpaml_finext_model-<version>.tar.gz	Rule-based information extraction model
storage/data	Folder that contains documents to be uploaded in File Storage

Configuration Parameters for Finext Sample Automation Process:

Key	Default Value	Description
inputFolder	finext_sample/input	File Storage folder where input documents are stored.
fileFilter	1.*\.pdf	Regular expression for files to select.
configuration	See the JSON below	See the field descriptions after the JSON

{
  "finext": {
    "dataStore": "FINEXT_SAMPLE_DOCUMENTS",
    "documentType": "Finext Sample",
    "model": "ml_ie_finext_model",
    "runModel": "ml_ie_finext_model,2.5.1",
    "storagePath": "finext_sample",
    "exportDocumentSet": "FINEXT_SAMPLE",
    "bucket": "data",
    "tesseractOptions": ["-l", "eng", "--psm", "3", "--oem", "3", "--dpi", "800"],
    "imageMagickOptions": ["-resample", "450", "-density", "350", "-quality", "100", "-background", "white", "-alpha", "flatten"]
  }
}

dataStore - name of the data store in which input documents are stored

documentType - name of the document type used for the model call

runModel - name and version of the model to run, separated by a comma

storagePath - File Storage path used for the documents

imageMagickOptions - ImageMagick command-line options

tesseractOptions - Tesseract OCR command-line options
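
As an illustration of how these fields fit together, the following minimal sketch parses a trimmed copy of the configuration value and splits runModel into a name and a version. The helper load_finext_settings and the derived keys runModelName and runModelVersion are hypothetical and are not part of the sample; the comma-separated layout of runModel is inferred from the default value above.

```python
import json

# Trimmed copy of the default "configuration" value shown above.
RAW_CONFIGURATION = """
{
  "finext": {
    "dataStore": "FINEXT_SAMPLE_DOCUMENTS",
    "documentType": "Finext Sample",
    "runModel": "ml_ie_finext_model,2.5.1",
    "storagePath": "finext_sample",
    "tesseractOptions": ["-l", "eng", "--psm", "3", "--oem", "3", "--dpi", "800"]
  }
}
"""


def load_finext_settings(raw: str) -> dict:
    """Parse the JSON and return its "finext" section (hypothetical helper)."""
    settings = json.loads(raw)["finext"]
    # Assumption: runModel is "<model name>,<version>" with a single comma.
    name, _, version = settings["runModel"].partition(",")
    settings["runModelName"] = name.strip()
    settings["runModelVersion"] = version.strip()
    return settings


settings = load_finext_settings(RAW_CONFIGURATION)
print(settings["runModelName"], settings["runModelVersion"])
```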

Model rules

Country

The model includes a CSV file, config_countries.csv, that pairs each country code with its possible full names, for example:

US,United States of America
US,United States
US,USA
US,U.S.A

The processor extracts every occurrence of a country code or full name.
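
The lookup described above could be sketched as follows. This is not the model's actual code: the functions load_country_names and find_countries are hypothetical, and matching is assumed to be case-sensitive and limited to whole words.

```python
import csv
import io
import re

# Rows copied from the example above: country code, then a full name.
SAMPLE_CSV = """US,United States of America
US,United States
US,USA
US,U.S.A
"""


def load_country_names(csv_text: str) -> dict:
    """Map every known spelling, and the code itself, to the country code."""
    names = {}
    for code, name in csv.reader(io.StringIO(csv_text)):
        names[name] = code
        names[code] = code
    return names


def find_countries(text: str, names: dict) -> list:
    """Return (matched text, code) pairs in order of appearance."""
    # Longest spellings first, so "United States of America" is preferred
    # over the shorter "United States" at the same position.
    alternatives = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in alternatives) + r")(?!\w)"
    )
    return [(m.group(0), names[m.group(0)]) for m in pattern.finditer(text)]


names = load_country_names(SAMPLE_CSV)
print(find_countries("Shipped from the United States of America to the US.", names))
```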

Currency

The model includes a CSV file, config_currencies.csv, with pairs of a currency name or symbol and its ISO 4217 code, as follows:

US Dollar,USD
$,USD

The processor extracts every occurrence of a currency name, symbol, or code.
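
Currency matching differs from country matching mainly in that symbols such as $ have no word boundaries. The sketch below shows one way to handle both cases; find_currencies is a hypothetical function, and the boundary rule is an assumption rather than the model's documented behavior.

```python
import re

# Pairs copied from the example above: name or symbol, then ISO 4217 code.
CURRENCY_ROWS = [("US Dollar", "USD"), ("$", "USD")]


def find_currencies(text: str, rows=CURRENCY_ROWS) -> list:
    """Return (matched text, ISO code) pairs in order of appearance."""
    lookup = {}
    for name, code in rows:
        lookup[name] = code
        lookup[code] = code  # the code itself also counts as a mention

    parts = []
    for term in sorted(lookup, key=len, reverse=True):
        escaped = re.escape(term)
        if term[0].isalnum() and term[-1].isalnum():
            # Names and codes must not be glued to other letters or digits;
            # symbols such as "$" are matched wherever they occur.
            escaped = r"(?<![A-Za-z0-9])" + escaped + r"(?![A-Za-z0-9])"
        parts.append(escaped)
    pattern = re.compile("|".join(parts))
    return [(m.group(0), lookup[m.group(0)]) for m in pattern.finditer(text)]


print(find_currencies("Total: $1,200 (USD), payable in US Dollar."))
```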

IBAN

IBAN values are extracted by a regular expression and validated against the expected IBAN length for the country, taken from config_iban_countries_length.csv (IBAN length differs by country).
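
The two-stage approach, a regular expression for candidates followed by a length check, can be sketched as below. The pattern and the function find_ibans are illustrative assumptions, not the model's actual rules. The IBAN_LENGTHS table stands in for the lengths file and lists only four countries with their standard IBAN lengths; countries missing from it are rejected in this sketch.

```python
import re

# Stand-in for the lengths file: country code -> total IBAN length.
IBAN_LENGTHS = {"DE": 22, "GB": 22, "FR": 27, "NL": 18}

# Two-letter country code, two check digits, then 11 to 30 alphanumerics,
# optionally separated by single spaces (printed IBANs are often grouped).
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")


def find_ibans(text: str, lengths=IBAN_LENGTHS) -> list:
    """Return compact IBANs whose length matches the country's expected length."""
    valid = []
    for match in IBAN_CANDIDATE.finditer(text):
        compact = match.group(0).replace(" ", "")
        if lengths.get(compact[:2]) == len(compact):
            valid.append(compact)
    return valid


sample = (
    "Pay to DE89 3704 0044 0532 0130 00 or GB82WEST12345698765432 today. "
    "Too short: DE8937040044053201300."
)
print(find_ibans(sample))
```

Because the candidate pattern allows spaces, it can absorb an upper-case word that directly follows a grouped IBAN; such a candidate then fails the length check. A production rule set would likely also verify the check digits.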

SWIFT

SWIFT values are extracted by a regular expression.
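
The source does not give the expression itself. A plausible sketch, assuming the common SWIFT/BIC layout of a four-letter institution code, a two-letter country code, a two-character location code, and an optional three-character branch code, is shown below. The name find_swift_codes is hypothetical.

```python
import re

# 4 letters (institution) + 2 letters (country) + 2 alphanumerics (location)
# + optional 3 alphanumerics (branch), as a standalone upper-case token.
SWIFT_PATTERN = re.compile(r"\b[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?\b")


def find_swift_codes(text: str) -> list:
    """Return all tokens that have the shape of a SWIFT/BIC code."""
    return SWIFT_PATTERN.findall(text)


print(find_swift_codes("BIC: DEUTDEFF, branch DEUTDEFF500; ref AB12."))
```

A shape-only pattern like this one also accepts any upper-case word of 8 or 11 letters, so a real rule set would need extra context checks, for example a nearby "SWIFT" or "BIC" label.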

Amount

The model extracts numbers, written as digits or in words, that may be amounts. Examples:

3,000,000.00
5.000
two hundred twenty five
one thousand and four
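
The following sketch shows one way to find both kinds of values and to convert the written form to an integer. All names in it are hypothetical, and it covers English number words up to the millions only. Digit-based matches are returned as text because a value such as 5.000 is ambiguous: the separator may be a decimal point or a thousands separator.

```python
import re

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * (i + 2) for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}
SCALES = {"thousand": 1_000, "million": 1_000_000}

# Digits with optional "." or "," separated groups, e.g. 3,000,000.00 or 5.000.
DIGIT_AMOUNT = re.compile(r"\d+(?:[.,]\d+)*")

# One number word, then more number words joined by spaces, hyphens, or "and".
_WORD = r"\b(?:" + "|".join([*UNITS, *TENS, "hundred", *SCALES]) + r")\b"
WORD_AMOUNT = re.compile(
    _WORD + r"(?:(?:\s+and)?[\s-]+" + _WORD + r")*", re.IGNORECASE)


def words_to_number(phrase: str) -> int:
    """Convert a phrase such as 'one thousand and four' to an integer."""
    total, current = 0, 0
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current


def find_digit_amounts(text: str) -> list:
    """Return digit-based candidates exactly as written."""
    return DIGIT_AMOUNT.findall(text)


def find_word_amounts(text: str) -> list:
    """Return (phrase, value) pairs for numbers written in words."""
    return [(m.group(0), words_to_number(m.group(0)))
            for m in WORD_AMOUNT.finditer(text)]


text = ("Pay 3,000,000.00 now and 5.000 later; in words: "
        "two hundred twenty five, or one thousand and four.")
print(find_digit_amounts(text))
print(find_word_amounts(text))
```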

Email

Email addresses are extracted by a regular expression.
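
The exact expression is not given in the source. The sketch below uses a deliberately simplified pattern that covers common addresses but not every form allowed by the email standards; find_emails is a hypothetical name.

```python
import re

# Local part, "@", one or more domain labels, and an alphabetic top-level domain.
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def find_emails(text: str) -> list:
    """Return all substrings that look like email addresses."""
    return EMAIL_PATTERN.findall(text)


print(find_emails("Contact billing@example.com or j.doe+ap@mail.example.org."))
```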

Included Steps

Step 1. Ingest Documents

The RPA bot retrieves documents from the dedicated folder in File Storage and compiles a list of documents to be processed. A document that has just been retrieved for processing has the status 'NEW'.

Once the list of documents has been generated, the RPA bot prepares batches of documents for processing. The number of documents in a batch is determined by the configuration parameter batchSize.

After this step a separate workflow of RPA and ML tasks is created for each document.
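
The selection and batching logic of this step can be sketched as follows. The functions select_files and make_batches and the file names are hypothetical. Whether the platform requires fileFilter to match the whole file name or only part of it is not stated in the source, so the sketch assumes a full match.

```python
import re


def select_files(names: list, file_filter: str) -> list:
    """Keep the file names that match the fileFilter expression."""
    pattern = re.compile(file_filter)
    # Assumption: the expression must match the entire file name.
    return [name for name in names if pattern.fullmatch(name)]


def make_batches(items: list, batch_size: int) -> list:
    """Split the document list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical contents of the input folder.
folder = ["1_invoice.pdf", "10_statement.pdf", "2_invoice.pdf", "1_notes.txt"]
selected = select_files(folder, r"1.*\.pdf")
print(selected)
print(make_batches(["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"], 2))
```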

Step 2. Prepare Documents

In this step, the input data for the rule-based model is prepared. Files created while processing the original document are saved to the same finext_sample File Storage folder where the original document is stored.

Step 3. Extract Data

The Rule-based Information Extraction model is employed to extract the specific fields from financial documents.

Step 4. Verify Extracted Data

This step enables human verification and correction to ensure the accuracy of the extracted data. After the relevant business entities have been extracted from a document, a human task is created that must be completed in Workspace. The task contains the output of the rule-based information extraction model, which a reviewer can check, validate, and correct.

Financial Information Extraction Sample Process (FinextSample)