Financial Information Extraction Sample Process (FinextSample)

Overview

Finext Sample automatically processes PDF documents. The process uses a rule-based model to extract the following basic financial information from documents:

  • Country references (full names and ISO 3166 codes)
  • Currency references (full names and ISO 4217 codes)
  • IBANs
  • SWIFT codes
  • Numbers and amounts (written in digits or in words)
  • Email addresses

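The Finext Sample lifecycle includes the following steps: Ingest Documents, Prepare Documents, Extract Data, and Verify Extracted Data.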

Prerequisites

To set up and run the Finext Sample process:

  1. Ensure that you have a running node with the "AP_RUN" capability.

  2. Upload the Finext Sample package to the Control Server. The package can be downloaded from: http://<CS host>/nexus/repository/rpaplatform/eu/ibagroup/samples/ap/easy-rpa-finext-ap/<EasyRPA version>/easy-rpa-finext-ap-<EasyRPA version>-bin.zip
    The source code can be found here: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/easy-rpa-ml-aps/easy-rpa-finext-ap

  3. Ensure that the following details are provided for the Finext automation process on the Automation Process Details tab:

    Module class: eu.ibagroup.sample.ml.finext.FinextSample

    Group Id: eu.ibagroup.samples.ap

    Artifact Id: easy-rpa-finext-ap

    Version Id: <EasyRPA version>

Finext Sample Process Package structure

Folder | Description
--- | ---
Finext Sample | Finext Sample automation process
IE Document Processor | Standard information extraction automation process
FINEXT_SAMPLE | Information extraction document set. Contains financial documents test samples
Finext Sample | Invoice information extraction document type. Defines the entities to be extracted from financial documents.
Information Extraction Task | Information Extraction human task type. Defines the task input form in the Workspace
easy-rpa-finext-ap-<EasyRPA version>.jar | Root archive and dependencies. Contains code of Finext Sample automation process
easy-rpa-iedp-ap-<EasyRPA version>.jar | Root archive and dependencies. Contains code of IE Document Processor automation process
easyrpaml_finext_model-<version>.tar.gz | Rule-based information extraction model
storage/data | Folder that contains documents to be uploaded in File Storage

Configuration Parameters for Finext Sample Automation Process:

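Key | Default Value | Description
--- | --- | ---
inputFolder | finext_sample/input | File Storage folder where input documents are stored.
fileFilter | 1.*\.pdf | Regular expression that selects the files to process.
configuration | JSON object shown below | See the key descriptions after the JSON value.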

{
    "finext": {
        "dataStore": "FINEXT_SAMPLE_DOCUMENTS",
        "documentType": "Finext Sample",
        "model": "ml_ie_finext_model",
        "runModel": "ml_ie_finext_model,2.5.1",
        "storagePath": "finext_sample",
        "exportDocumentSet": "FINEXT_SAMPLE",
        "bucket": "data",
        "tesseractOptions": ["-l", "eng", "--psm", "3", "--oem", "3", "--dpi", "800"],
        "imageMagickOptions": ["-resample", "450", "-density", "350", "-quality", "100", "-background", "white", "-alpha", "flatten"]
    }
}

dataStore - name of the data store in which input documents are stored

documentType - name of the document type used when calling the model

runModel - name and version of the model to run, separated by a comma

storagePath - path in File Storage used for the documents

imageMagickOptions - ImageMagick command-line options

tesseractOptions - Tesseract OCR command-line options
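
As an illustration only, the sketch below reads a shortened copy of the configuration value and applies the default fileFilter to a list of file names. The helper names (parse_run_model, select_files) and the sample file names are hypothetical, and it is an assumption that the filter must match the whole file name and is case-sensitive; the actual process may behave differently.

```python
import json
import re

# Shortened copy of the "configuration" default value shown above.
CONFIGURATION = """
{
    "finext": {
        "runModel": "ml_ie_finext_model,2.5.1",
        "storagePath": "finext_sample",
        "bucket": "data",
        "tesseractOptions": ["-l", "eng", "--psm", "3", "--oem", "3", "--dpi", "800"]
    }
}
"""

FILE_FILTER = r"1.*\.pdf"  # default value of fileFilter


def parse_run_model(value):
    """Split a 'name,version' string into its two parts (hypothetical helper)."""
    name, version = value.split(",", 1)
    return name.strip(), version.strip()


def select_files(names, pattern):
    """Keep the names that match the pattern in full (assumed matching mode)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


finext = json.loads(CONFIGURATION)["finext"]
model_name, model_version = parse_run_model(finext["runModel"])

# Hypothetical file names; only the first one matches the default filter.
selected = select_files(
    ["1_invoice.pdf", "2_invoice.pdf", "10.PDF", "1_notes.txt"], FILE_FILTER
)
```

Under these assumptions the default filter selects only PDF files whose names start with "1", which is worth keeping in mind when uploading your own documents to the input folder.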

Model rules

Country

The model includes a .csv file, config_countries.csv, that maps each country code to its possible full names, for example:

  • US,United States of America
  • US,United States
  • US,USA
  • US,U.S.A

The processor extracts all instances of the short or full name.
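
A dictionary lookup of this kind can be sketched as follows. This is not the model's code: the function names are hypothetical, and case-sensitive matching on word boundaries with a preference for the longest variant is an assumption.

```python
import csv
import io
import re

# Rows in the format shown above: code, name variant.
CONFIG_COUNTRIES = """US,United States of America
US,United States
US,USA
US,U.S.A
"""


def load_variants(text):
    """Return {surface form: code}; the code itself is also a surface form."""
    variants = {}
    for code, name in csv.reader(io.StringIO(text)):
        variants[name] = code
        variants[code] = code
    return variants


def find_countries(text, variants):
    """Return (start offset, matched text, code) for each non-overlapping match."""
    # Longest variants first, so "United States of America" wins over "United States".
    ordered = sorted(variants, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(v) for v in ordered) + r")(?!\w)"
    )
    return [(m.start(), m.group(), variants[m.group()]) for m in pattern.finditer(text)]


variants = load_variants(CONFIG_COUNTRIES)
hits = find_countries("Shipped from the United States of America to the USA.", variants)
```

The lookarounds keep the code "US" from matching inside longer tokens such as "USD".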

Currency

The model includes a .csv file, config_currencies.csv, that pairs currency names or symbols with their codes, for example:

  • US Dollar,USD
  • $,USD

The processor extracts all instances of the short/full name or abbreviation.

IBAN 

IBAN values are extracted by a regular expression and validated against the IBAN length for the country, taken from config_iban_countries_length.csv (IBAN length differs by country).
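
The two-stage approach (regular expression, then length check) can be sketched as below. The regular expression, the function names, and the "code,length" file layout are assumptions made for illustration; the model's actual pattern and file format are not shown in this document.

```python
import re

# Assumed file layout: country code, total IBAN length.
IBAN_LENGTHS_CSV = """DE,22
GB,22
FR,27
NL,18
"""

# Two letters, two check digits, then 11 to 30 alphanumerics,
# optionally separated by single spaces (illustrative pattern).
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")


def load_lengths(text):
    """Parse 'code,length' lines into a {code: length} dictionary."""
    lengths = {}
    for line in text.splitlines():
        code, length = line.split(",")
        lengths[code] = int(length)
    return lengths


def extract_ibans(text, lengths):
    """Return candidates whose length matches the length for their country."""
    found = []
    for match in IBAN_CANDIDATE.finditer(text):
        compact = match.group().replace(" ", "")
        if lengths.get(compact[:2]) == len(compact):
            found.append(compact)
    return found


lengths = load_lengths(IBAN_LENGTHS_CSV)
ibans = extract_ibans(
    "Pay to DE89 3704 0044 0532 0130 00 or GB82WEST12345698765432, "
    "not DE8937040044053201300.",
    lengths,
)
```

In this sketch the last candidate is dropped because it has 21 characters, while German IBANs have 22.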

SWIFT

SWIFT values are extracted by a regular expression.
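
For illustration, a pattern based on the general SWIFT/BIC layout (four-letter bank code, two-letter country code, two-character location code, optional three-character branch code) could look like the sketch below. The pattern and function name are assumptions, not the expression used by the model.

```python
import re

# 4 letters (bank) + 2 letters (country) + 2 alphanumerics (location)
# + optional 3 alphanumerics (branch). Illustrative pattern only.
SWIFT_PATTERN = re.compile(r"\b[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?\b")


def extract_swift_codes(text):
    """Return all substrings that have the shape of a SWIFT/BIC code."""
    return SWIFT_PATTERN.findall(text)


codes = extract_swift_codes("BIC: DEUTDEFF500, correspondent NEDSZAJJ.")
```

A shape-only pattern like this also matches any uppercase word of 8 or 11 letters, which is one reason the human verification step is useful.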

Amount

The model extracts numbers, written in digits or in words, that may represent amounts. Examples:

  • 3,000,000.00
  • 5.000
  • two hundred twenty five
  • one thousand and four
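
The sketch below shows one way to handle both forms: a regular expression for digit-based numbers and a small parser for English number words. It is an assumption-laden illustration (hypothetical names, English only, values up to millions), not the model's rule set. Digit-based matches are returned as text because a value such as 5.000 can be read as five or as five thousand depending on locale.

```python
import re

UNIT_WORDS = (
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen"
).split()
UNITS = {word: value for value, word in enumerate(UNIT_WORDS)}
TENS_WORDS = "twenty thirty forty fifty sixty seventy eighty ninety".split()
TENS = {word: (index + 2) * 10 for index, word in enumerate(TENS_WORDS)}
SCALES = {"thousand": 1_000, "million": 1_000_000}

# Digits with optional "." or "," separated groups, e.g. 3,000,000.00 or 5.000.
NUMERIC = re.compile(r"\d+(?:[.,]\d+)*")


def find_numeric(text):
    """Return digit-based numbers as they appear in the text."""
    return NUMERIC.findall(text)


def words_to_number(phrase):
    """Convert an English number phrase to an integer."""
    total = 0
    current = 0
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        elif word in SCALES:
            total += current * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word}")
    return total + current


numeric_hits = find_numeric("Total 3,000,000.00 or 5.000.")
first = words_to_number("two hundred twenty five")
second = words_to_number("one thousand and four")
```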

Email

Email addresses are extracted by a regular expression.
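
A pragmatic pattern of this kind is sketched below. It is an assumption for illustration and deliberately simpler than the full address grammar; the expression used by the model is not shown in this document.

```python
import re

# Local part, "@", one or more domain labels, and an alphabetic top-level domain.
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def extract_emails(text):
    """Return all substrings that look like email addresses."""
    return EMAIL_PATTERN.findall(text)


emails = extract_emails("Contact billing@example.com or j.doe+ap@mail.example.org.")
```

The trailing full stop of the sentence is not captured, because the pattern requires letters after the last dot.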

Included Steps

Step 1. Ingest Documents

The RPA bot retrieves documents from the dedicated folder in File Storage and compiles a list of documents to be processed. A document that has just been retrieved for processing has the status 'NEW'.

When the list of documents has been generated, the RPA bot prepares batches of documents for processing. The number of documents in a batch is determined by the configuration parameter batchSize.
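
The batching described above amounts to splitting the document list into consecutive chunks of at most batchSize items. A minimal sketch, with a hypothetical function name and sample file names:

```python
def make_batches(documents, batch_size):
    """Split the list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        documents[start:start + batch_size]
        for start in range(0, len(documents), batch_size)
    ]


batches = make_batches(["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"], 2)
```

The last batch may contain fewer documents than batchSize when the list does not divide evenly.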

After this step a separate workflow of RPA and ML tasks is created for each document.

Step 2. Prepare Documents

In this step, the input data for running the rule-based model is prepared. Files produced from the original document are saved to the same finext_sample File Storage folder as the original document.

Step 3. Extract Data

The rule-based information extraction model extracts the fields described under Model rules from the financial documents.

Step 4. Verify Extracted Data

This step lets a human verify and correct the extracted data. After the relevant business entities have been extracted from a document, a human task is created that must be completed in Workspace. The task contains the output of the rule-based information extraction model, which a reviewer can check, validate, and correct.