Accounting Information Extraction from Multi-Page Reports

Overview

The Accounting Info Extraction Sample addresses a common problem in accounting report processing: documents such as annual reports contain dozens of pages, of which only a few need to be processed. The sample demonstrates how to automate the extraction of only the relevant financial data from a report, together with auto-mapping of the extracted results into a data store. This is achieved by applying a dedicated page-filtering post-processor to the OCRed documents, followed by a dedicated auto-mapping validator that writes the valid extracted data directly into a data store.

Prerequisites

To successfully set up and run the Accounting Information Extraction sample:

  1. Ensure that you have a running node with "AP_RUN" capabilities.
  2. Upload the sample package to the Control Server. The package can be found at the following location: https://<CS host>/nexus/repository/rpaplatform/eu/ibagroup/samples/ap/easy-rpa-accounting-info-ap/<EasyRPA version>/easy-rpa-accounting-info-ap-<EasyRPA version>-bin.zip
    The source code can be found here: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/easy-rpa-ml-aps/easy-rpa-accounting-info-ap
  3. Ensure the following details are provided for the sample automation process in the Automation Process Details tab:

    Module class: eu.ibagroup.sample.ml.signature.recognition.SignatureRecognitionSample

    Group Id: eu.ibagroup.samples.ap

    Artifact Id: easy-rpa-accounting-info-ap

    Version Id: <EasyRPA version>

  4. Ensure the Control Server has the following model in the list of ML models:

    • annual_reports_info_extraction_model
  5. To run this demo, only the Accounting Info Extraction Sample automation process needs to be launched.



Package Structure:

Accounting Info Extraction Sample

Accounting Info Extraction Sample automation process.

Accounting Info Sample report

Report information extraction document type. It defines the entities to be extracted from reports and configures the combination of a post-processor and a validator that filters the input pages and auto-maps the extracted results into the data store (a conceptual sketch of the page filtering appears under Step 2 below):

	"preprocessPostProcessors": [
		{
		"name": "selectInputPagesWithKeywords",
		"keywords": {
		"annualProfitAndLossPage": "Profit and Loss Account for the year;after taxation;Turnover;Profit and Loss Account;Profit for the financial year;Profit/(Loss) for the financial year;before taxation;Loss for the financial year;Operating Profit;Operating Loss;Profit and Loss;Income for the financial year;Profit after taxation;Profit before taxation;Profit/(Loss) before taxation;Statement of Comprehensive income;Income statement;Statement of Profit or Loss;Profit for the year;Loss for the year;before income tax;(Loss)/profit for the financial period;(Loss)/profit for the financial period;(Loss)/profit before tax;Operating (loss)/profit",
		"balancePage": "Balance Sheet;Net assets;Total shareholder's funds;Total equity shareholders' funds;equity shareholders' funds;shareholders' funds;shareholders funds;Fixed assets;Investments;Net current liabilities;Current assets;Debtors;Net current assets;Capital and Reserves;Statement of Financial Position;Total Assets;Total Equity;Total Liabilities;Total Equity and Liabilities"
		}
		}
	],
	"validators": [
		{
		"name": "autoMapEntitiesToDocument"
		}	
	]

easy-rpa-accounting-info-ap-<EasyRPA version>.jar

Root archive and dependencies. Contains the code of the sample automation process.

annual_reports_info_extraction_model-1.0.tar

Trained model used to extract data from annual reports.

storage/data

Folder that contains the annual report PDF documents to be uploaded to File Storage.

Configuration Parameters for the Accounting Info Extraction Automation Process:

inputFolder

Default value: annual_reports_sample/input

File Storage folder where input documents and reference images are stored.

fileFilter

Default value: .*\.pdf

Regular expression used to select input files.

configuration

Default value:

{
  "DEFAULT": {
    "dataStore": "ACCOUNTING_INFO_SAMPLE_DOCUMENTS",
    "documentType": "Accounting Info Sample Report",
    "model": "annual_reports_info_extraction_model",
    "runModel": "annual_reports_info_extraction_model,1.0",
    "storagePath": "annual_reports_sample",
    "bucket": "data",
    "tesseractOptions": [
      "-l",
      "eng",
      "--psm",
      "6",
      "--oem",
      "3",
      "--dpi",
      "350"
    ],
    "imageMagickOptions": [
      "-units",
      "PixelsPerInch",
      "-resample",
      "350",
      "-density",
      "350",
      "-quality",
      "100",
      "-background",
      "white",
      "-deskew",
      "40%",
      "-alpha",
      "flatten"
    ]
  }
}


JSON parameter that maps document types to the corresponding ML models; it contains the model name, the model version to run, and the document type name.



Data Store for Sample:

Name: ACCOUNTING_INFO_SAMPLE_DOCUMENTS

Columns: units, operating_profit, profit_before_tax, tax_on_profit, year_profit, net_assets, uuid, name, notes, status, url, s3_path, ocr_json, input_json, output_json, model_output_json, update_timestamp

Columns description:
  • units - extracted units of the report figures.
  • operating_profit - extracted operating profit of the report.
  • profit_before_tax - extracted profit before tax.
  • tax_on_profit - extracted tax on profit.
  • year_profit - extracted profit for the year.
  • net_assets - extracted net assets amount.
  • uuid - unique identifier of the document.
  • name - input document name as it appears in a human task.
  • notes - input document path inside the file storage bucket.
  • status - document processing status.
  • url - input document file storage path.
  • s3_path - input document path inside the file storage bucket.
  • ocr_json - result of OCR execution on the document.
  • input_json - document input data for the latest human task.
  • output_json - document output data of the latest human task.
  • model_output_json - temporary field containing the latest result of the executed model.
  • update_timestamp - last update time of the data store record.

Included Steps

Step 1. Ingest Documents

The RPA bot extracts documents from the dedicated folder in File Storage. It compiles a list of documents to be processed and creates records in the data store. The data store record of each document contains the initial name of the original document, the path to the File Storage folder with the original document, and an associated uuid. The result of document processing at each step of the sample automation process is also recorded in the data store. The status of a document that has just been extracted for processing is 'NEW'.
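For illustration only, the initial record of an ingested document could be pictured as follows. The field names mirror the data store columns described above; the actual EasyRPA data store API is not shown, and the helper class is hypothetical:

	import java.io.File;
	import java.util.HashMap;
	import java.util.Map;
	import java.util.UUID;

	// Sketch of the initial data store record created for an ingested document.
	// Field names mirror the ACCOUNTING_INFO_SAMPLE_DOCUMENTS columns; the
	// actual EasyRPA data store API is not shown here.
	public class IngestRecordSketch {

	    public static Map<String, Object> newRecord(File document, String storageFolder) {
	        Map<String, Object> record = new HashMap<>();
	        record.put("uuid", UUID.randomUUID().toString());                 // associated uuid
	        record.put("name", document.getName());                           // initial document name
	        record.put("s3_path", storageFolder + "/" + document.getName());  // path inside File Storage
	        record.put("status", "NEW");                                      // freshly ingested document
	        return record;
	    }

	    public static void main(String[] args) {
	        System.out.println(newRecord(new File("report-2021.pdf"), "annual_reports_sample/input"));
	    }
	}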

When the list of documents has been generated, the RPA bot prepares batches of documents for processing. The number of documents in a batch is determined by the configuration parameter batchSize.
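The batching itself can be pictured as slicing the document list into chunks of batchSize; a minimal sketch, with a hypothetical generic document type:

	import java.util.ArrayList;
	import java.util.List;

	// Sketch: split the list of ingested documents into batches of batchSize.
	public class BatchingSketch {

	    public static <T> List<List<T>> toBatches(List<T> documents, int batchSize) {
	        List<List<T>> batches = new ArrayList<>();
	        for (int i = 0; i < documents.size(); i += batchSize) {
	            batches.add(documents.subList(i, Math.min(i + batchSize, documents.size())));
	        }
	        return batches;
	    }

	    public static void main(String[] args) {
	        List<String> docs = List.of("a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf");
	        System.out.println(toBatches(docs, 2)); // [[a.pdf, b.pdf], [c.pdf, d.pdf], [e.pdf]]
	    }
	}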

After this step, a separate workflow of RPA and ML tasks is created for each document.

Step 2. Prepare Documents

In this step, the input data for ML model execution is prepared. Document images are cleaned with ImageMagick and sent to Tesseract OCR for scanning. The File Storage bucket name, Tesseract options, and ImageMagick options are provided in the configuration parameters. Files created as a result of processing the original document are saved to the same annual_reports_sample File Storage folder where the original document is stored.
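To make these options concrete, the following rough sketch shows the equivalent command-line invocations wrapped in Java's ProcessBuilder. The file names are illustrative assumptions, and the platform performs these calls internally rather than as shown here:

	import java.io.IOException;
	import java.util.ArrayList;
	import java.util.List;

	// Sketch: apply the configured ImageMagick and Tesseract options to a single
	// page image. File names are illustrative; the sample's real invocation
	// mechanism is not shown in this document and may differ.
	public class OcrPreparationSketch {

	    static void run(List<String> command) throws IOException, InterruptedException {
	        new ProcessBuilder(command).inheritIO().start().waitFor();
	    }

	    public static void main(String[] args) throws Exception {
	        // ImageMagick: clean up the page image (imageMagickOptions from the configuration)
	        List<String> convert = new ArrayList<>(List.of("convert", "page-1.png"));
	        convert.addAll(List.of(
	                "-units", "PixelsPerInch", "-resample", "350", "-density", "350",
	                "-quality", "100", "-background", "white", "-deskew", "40%",
	                "-alpha", "flatten"));
	        convert.add("page-1-clean.png");
	        run(convert);

	        // Tesseract: OCR the cleaned image (tesseractOptions from the configuration)
	        run(List.of("tesseract", "page-1-clean.png", "page-1",
	                "-l", "eng", "--psm", "6", "--oem", "3", "--dpi", "350"));
	    }
	}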

Once the PDFs have been OCRed, the custom 'selectInputPagesWithKeywords' post-processor checks the pages and keeps only those that contain the given keywords (the keywords parameter). Only these pages proceed to the model execution step; the sketch below illustrates the idea.
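The exact matching rules of 'selectInputPagesWithKeywords' are internal to the platform; conceptually, though, the filter can be pictured as follows, assuming case-insensitive substring matching against the semicolon-separated keyword lists:

	import java.util.ArrayList;
	import java.util.Arrays;
	import java.util.List;
	import java.util.Map;

	// Conceptual sketch of keyword-based page filtering. Assumes a page is kept
	// if its OCR text contains any keyword from any semicolon-separated list;
	// the real post-processor's matching rules may differ.
	public class PageFilterSketch {

	    public static List<Integer> selectPages(List<String> pageTexts, Map<String, String> keywords) {
	        List<Integer> selected = new ArrayList<>();
	        for (int i = 0; i < pageTexts.size(); i++) {
	            String text = pageTexts.get(i).toLowerCase();
	            boolean keep = keywords.values().stream()
	                    .flatMap(list -> Arrays.stream(list.split(";")))
	                    .anyMatch(keyword -> text.contains(keyword.trim().toLowerCase()));
	            if (keep) {
	                selected.add(i);
	            }
	        }
	        return selected;
	    }

	    public static void main(String[] args) {
	        Map<String, String> keywords = Map.of(
	                "annualProfitAndLossPage", "Profit and Loss Account;Turnover;Operating Profit",
	                "balancePage", "Balance Sheet;Net assets;Total Equity");
	        List<String> pages = List.of(
	                "Directors' report ...",                    // page 0: dropped
	                "Profit and Loss Account for the year ...", // page 1: kept
	                "Balance Sheet as at 31 December ...");     // page 2: kept
	        System.out.println(selectPages(pages, keywords));   // [1, 2]
	    }
	}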

Step 3. Extract Data

The provided annual_reports_info_extraction_model is employed to extract the specific fields that need to be stored in the target system.

Step 4. Validate Extracted Data

At this stage, two validators are employed:

  • an OOTB validator that ensures all mandatory data is present in the extraction results
  • autoMapEntitiesToDocument - for a successfully validated document, it attempts to auto-map the extracted values into the 'AccountingInfo' Datastore Entity fields (a conceptual sketch follows this list).
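The mapping rules of autoMapEntitiesToDocument are likewise platform-internal. Conceptually, auto-mapping can be pictured as normalizing extracted entity names to the snake_case column names of the data store record; the entity names and values below are hypothetical:

	import java.util.HashMap;
	import java.util.Locale;
	import java.util.Map;

	// Conceptual sketch of auto-mapping extracted entities onto data store
	// columns by normalizing entity names to snake_case. The real validator's
	// mapping rules are internal to the platform and may differ.
	public class AutoMapSketch {

	    static String toColumnName(String entityName) {
	        return entityName.trim().toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "_");
	    }

	    public static Map<String, String> autoMap(Map<String, String> extractedEntities) {
	        Map<String, String> record = new HashMap<>();
	        extractedEntities.forEach((name, value) -> record.put(toColumnName(name), value));
	        return record;
	    }

	    public static void main(String[] args) {
	        // Hypothetical model output for one document
	        Map<String, String> entities = Map.of(
	                "Operating Profit", "1,250,000",
	                "Profit Before Tax", "1,100,000",
	                "Net Assets", "4,870,000");
	        // Prints e.g. {operating_profit=1,250,000, profit_before_tax=1,100,000, net_assets=4,870,000}
	        System.out.println(autoMap(entities));
	    }
	}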

Step 5. Verify Extracted Data

This step enables human verification and correction to ensure the accuracy of the data extracted from a document before it is imported into the target system. After the relevant business entities have been extracted from a document, a human task is created that needs to be completed in Workspace. It contains the ML Information Extraction model output, together with messages from the validators, which a human can review, validate, and correct.