Financial Information Extraction Sample Process (FinextSample)
Overview
Finext Sample automatically processes PDF documents. The purpose of the process is to extract the following basic financial information from documents using a rule-based model:
- Country references (full names and ISO 3166 codes)
- Currency references (full names and ISO 4217 codes)
- IBANs
- SWIFT codes
- Numbers and amounts (written in digits or in words)
- Email addresses
The Finext Sample lifecycle consists of the steps described in the Included Steps section.
Prerequisites
To set up and run the Finext Sample process:
Ensure that you have a running node with the "AP_RUN" capability.
Upload the Finext Sample package to the Control Server. The package can be found at the following location: http://<CS host>/nexus/repository/rpaplatform/eu/ibagroup/samples/ap/easy-rpa-finext-ap/<EasyRPA version>/easy-rpa-finext-ap-<EasyRPA version>-bin.zip
The source code can be found here: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/easy-rpa-ml-aps/easy-rpa-finext-ap
Ensure the following details are provided for the Finext automation process in the Automation Process Details tab:
Module class: eu.ibagroup.sample.ml.finext.FinextSample
Group Id: eu.ibagroup.samples.ap
Artifact Id: easy-rpa-finext-ap
Version Id: <EasyRPA version>
Finext Sample Process Package structure
Folder | Description |
---|---|
Finext Sample | Finext Sample automation process |
IE Document Processor | Standard information extraction automation process |
FINEXT_SAMPLE | Information extraction document set. Contains financial documents test samples |
Finext Sample | Invoice information extraction document type. Defines the entities to be extracted from financial documents. |
Information Extraction Task | Information Extraction human task type. Defines the task input form in the Workspace |
easy-rpa-finext-ap-<EasyRPA version>.jar | Root archive and dependencies. Contains code of Finext Sample automation process |
easy-rpa-iedp-ap-<EasyRPA version>.jar | Root archive and dependencies. Contains code of IE Document Processor automation process |
easyrpaml_finext_model-<version>.tar.gz | Rule-based information extraction model |
storage/data | Folder that contains documents to be uploaded in File Storage |
Configuration Parameters for Finext Sample Automation Process:
Key | Default Value | Description |
---|---|---|
inputFolder | finext_sample/input | File Storage folder where input documents are stored. |
fileFilter | 1.*\.pdf | Regular expression for files to select. |
configuration | { | Settings object with the following keys: dataStore - the name of the datastore where input documents are stored; documentType - the name of the document type to use for the model call; runModel - the name and version of the model to run; storagePath - the path of the document in storage; imageMagickOptions - ImageMagick settings; tesseractOptions - Tesseract OCR settings. |
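To illustrate how a fileFilter value selects input files, here is a minimal Python sketch. It assumes the regular expression must match the whole file name; the actual matching semantics used by the process are not stated here, and the function name select_files and the sample file names are hypothetical.

```python
import re

def select_files(names, file_filter=r"1.*\.pdf"):
    """Return the file names that fully match the filter (assumed semantics)."""
    pattern = re.compile(file_filter)
    return [name for name in names if pattern.fullmatch(name)]

# Hypothetical folder listing, for illustration only.
names = ["1_invoice.pdf", "10.pdf", "2_invoice.pdf", "1_notes.txt"]
print(select_files(names))  # ['1_invoice.pdf', '10.pdf']
```

With the default value, only PDF files whose names start with "1" are selected; a broader filter such as `.*\.pdf` would select every PDF in the folder.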
Model rules
Country
The model includes a CSV file, config_countries.csv, that maps each country code to its possible full names, for example:
- US,United States of America
- US,United States
- US,USA
- US,U.S.A
The processor extracts all instances of the short or full name.
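The lookup described above can be sketched in Python as follows. This is not the model's actual code: the function names load_aliases and find_countries are hypothetical, the file layout (code, then name) is assumed from the examples, and matching is assumed to be case-sensitive on whole words.

```python
import csv
import io
import re

# Stand-in for config_countries.csv, using the rows shown above.
SAMPLE_CSV = "US,United States of America\nUS,United States\nUS,USA\nUS,U.S.A\n"

def load_aliases(csv_text):
    """Map every full name and every code to its country code."""
    aliases = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        code, name = row[0].strip(), row[1].strip()
        aliases[name] = code
        aliases[code] = code
    return aliases

def find_countries(text, aliases):
    """Return (matched text, country code) pairs in order of appearance."""
    # Longest alternatives first, so "United States of America"
    # is preferred over "United States".
    names = sorted(aliases, key=len, reverse=True)
    alternatives = "|".join(re.escape(n) for n in names)
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return [(m.group(0), aliases[m.group(0)]) for m in pattern.finditer(text)]

aliases = load_aliases(SAMPLE_CSV)
print(find_countries("Shipped from the United States of America to the USA.", aliases))
# [('United States of America', 'US'), ('USA', 'US')]
```

The same approach would apply to a name-to-code table such as config_currencies.csv, although symbols like "$" would need a pattern without word-boundary checks.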
Currency
The model includes a CSV file, config_currencies.csv, that pairs each short or full currency name (or symbol) with its abbreviation, for example:
- US Dollar,USD
- $,USD
The processor extracts all instances of the short/full name or abbreviation.
IBAN
IBAN values are extracted by regular expression and validated against the IBAN length listed for each country in config_iban_countries_length.csv (different countries have different lengths).
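A minimal Python sketch of this two-stage approach (regular expression, then length check) is shown below. The pattern, the function name extract_ibans, and the inline length table are assumptions for illustration; the model's own regular expression and the layout of its CSV file are not shown here. The sketch handles only IBANs written without spaces.

```python
import re

# Stand-in for the per-country length table (a few well-known values).
IBAN_LENGTHS = {"DE": 22, "GB": 22, "FR": 27, "NL": 18}

# Assumed candidate pattern: country code, two check digits,
# then 11 to 30 alphanumeric characters.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

def extract_ibans(text):
    """Return candidates whose length matches their country's IBAN length."""
    found = []
    for match in IBAN_CANDIDATE.finditer(text):
        candidate = match.group(0)
        if IBAN_LENGTHS.get(candidate[:2]) == len(candidate):
            found.append(candidate)
    return found

text = "Pay to DE89370400440532013000 or GB82WEST12345698765432, not DE8937040044053201300."
print(extract_ibans(text))
# ['DE89370400440532013000', 'GB82WEST12345698765432']
```

The last candidate in the sample text matches the pattern but is one character short for Germany, so the length check rejects it.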
SWIFT
SWIFT values are extracted by regular expression.
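As an illustration, the sketch below uses a pattern based on the common SWIFT/BIC layout: four letters for the institution, two for the country, two alphanumeric characters for the location, and an optional three-character branch code. The pattern and the function name extract_swift_codes are assumptions, not the model's actual rule.

```python
import re

# Assumed pattern: 8 characters, or 11 with the optional branch code.
SWIFT_PATTERN = re.compile(r"\b[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?\b")

def extract_swift_codes(text):
    """Return all substrings shaped like a SWIFT/BIC code."""
    return SWIFT_PATTERN.findall(text)

print(extract_swift_codes("BIC: DEUTDEFF, branch DEUTDEFF500"))
# ['DEUTDEFF', 'DEUTDEFF500']
```

A purely shape-based pattern like this also matches any uppercase word of eight letters, so a production rule would likely need additional context checks.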
Amount
The model extracts numbers, written in digits or in words, that may represent amounts. Examples:
- 3,000,000.00
- 5.000
- two hundred twenty five
- one thousand and four
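The examples above can be handled with a sketch like the following. It is an illustration under stated assumptions rather than the model's rules: NUMERIC and words_to_number are hypothetical names, only English number words up to millions are covered, and digit-form amounts are returned as strings because a value such as 5.000 can mean five or five thousand depending on locale.

```python
import re

# Digit-form numbers with optional "," or "." separators.
NUMERIC = re.compile(r"\d+(?:[.,]\d+)*")

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "million": 1_000_000}

def words_to_number(phrase):
    """Convert an English number phrase to an integer."""
    total, current = 0, 0
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current

print(NUMERIC.findall("Total 3,000,000.00 and 5.000"))  # ['3,000,000.00', '5.000']
print(words_to_number("two hundred twenty five"))       # 225
print(words_to_number("one thousand and four"))         # 1004
```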
Email
Email addresses are extracted by regular expression.
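A simplified pattern of this kind might look as follows in Python. The expression is a common approximation chosen for illustration, not the model's actual rule, and the name extract_emails is hypothetical.

```python
import re

# Simplified address pattern: local part, "@", domain, dot, top-level domain.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return all substrings shaped like an email address."""
    return EMAIL_PATTERN.findall(text)

print(extract_emails("Contact billing@example.com or j.doe+ap@mail.example.org."))
# ['billing@example.com', 'j.doe+ap@mail.example.org']
```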
Included Steps
Step 1. Ingest Documents
The RPA bot retrieves documents from the dedicated folder in File Storage and compiles a list of documents to be processed. Each document picked up for processing is assigned the status 'NEW'.
Once the list of documents has been generated, the RPA bot prepares batches of documents for processing. The number of documents in a batch is determined by the configuration parameter batchSize.
After this step a separate workflow of RPA and ML tasks is created for each document.
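The batching described in this step can be pictured with a short Python sketch. The function name make_batches and the sample file names are hypothetical; only the batchSize parameter comes from the process configuration.

```python
def make_batches(documents, batch_size):
    """Split the document list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [documents[i:i + batch_size]
            for i in range(0, len(documents), batch_size)]

# Hypothetical document list, for illustration only.
docs = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
print(make_batches(docs, 2))
# [['a.pdf', 'b.pdf'], ['c.pdf', 'd.pdf'], ['e.pdf']]
```

The last batch may be smaller than batchSize when the number of documents is not an exact multiple of it.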
Step 2. Prepare Documents
In this step, the input data for the rule-based model is prepared. Files produced while processing the original document are saved to the same finext_sample File Storage folder as the original document.
Step 3. Extract Data
The Rule-based Information Extraction model is employed to extract the specific fields from financial documents.
Step 4. Verify Extracted Data
This step allows a person to verify and correct the extracted data to ensure its accuracy. After the relevant business entities have been extracted from a document, a human task is created that must be completed in Workspace. The task contains the output of the Rule-based Information Extraction model, which the reviewer can check, validate, and correct.