Information Extraction TEXT Sample Process (IeTEXTSample)
Overview
IeTEXT Sample performs automatic processing of TXT documents. The purpose of the process is to extract the following fields from the TXT order invoice documents:
Invoice Number
Invoice Date
Due Date
Company Name
Street Address
City
Zip Code
Phone Number
Email
Product Name
Product Description
Quantity
Price
Tax Rate
Discount Rate
Total Discount
Total Amount
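The IeTEXT Sample lifecycle includes the following steps: Ingest Documents, Prepare Documents, Extract Data, and Verify Extracted Data.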
Prerequisites
To set up and run the IeTEXT Sample process:
Ensure that you have a running node with the "AP_RUN" capability.
Upload the IeTEXT Sample package to the Control Server. The package can be found at the following location: http://<CS host>/nexus/repository/rpaplatform/eu/ibagroup/samples/ap/easy-rpa-ietext-ap/<EasyRPA version>/easy-rpa-ietext-ap-<EasyRPA version>-bin.zip
The source code can be found here: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/easy-rpa-ml-aps/easy-rpa-ietext-ap
Ensure the following details are provided for the IeTEXT automation process in the Automation Process Details tab:
Module class: eu.ibagroup.sample.ml.ietext.IeTextSample
Group Id: eu.ibagroup.samples.ap
Artifact Id: easy-rpa-ietext-ap
Version Id: <EasyRPA version>
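The package location given in the prerequisites follows a fixed pattern, so it can be assembled from the Control Server host and the EasyRPA version. The following is a minimal Python sketch of that pattern only; the function name `package_url` and the sample host and version values are hypothetical and not part of the platform.

```python
def package_url(cs_host: str, version: str) -> str:
    """Build the IeTEXT Sample package URL from the pattern in the prerequisites."""
    base = f"http://{cs_host}/nexus/repository/rpaplatform"
    group_path = "eu/ibagroup/samples/ap"   # Group Id with dots replaced by slashes
    artifact = "easy-rpa-ietext-ap"         # Artifact Id
    return f"{base}/{group_path}/{artifact}/{version}/{artifact}-{version}-bin.zip"


# Hypothetical host and version, for illustration only.
print(package_url("cs.example.com", "1.0.0"))
```

Note that the version appears twice in the URL, once as a directory and once in the archive name, and both must match the Version Id entered in the Automation Process Details tab.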
IeText Sample Process Package structure
Artefact | Description |
---|---|
TEXT IE Sample | IeText Sample automation process |
IE_TEXT_SAMPLE_DOCUMENTS | Datastore used for extraction automation process |
IE_TEXT_SAMPLE | Information extraction document set. Contains order invoice test samples |
TEXT IE Sample | Invoice information extraction document type. Defines the entities to be extracted from order invoices. |
HTML Information Extraction Task | Information Extraction human task type. Defines the task input form in the Workspace |
easy-rpa-ietext-ap-<EasyRPA version>.jar | Root archive and dependencies. Contains code of IeTEXT Sample automation process |
html_ie_invoice_pre-<version>.tar | ML model for information extraction from TEXT order invoices |
storage/data | Folder that contains documents to be uploaded in File Storage |
Configuration Parameters for IeTEXT Sample Automation Process:
Key | Default Value | Description |
---|---|---|
inputFolder | ietext_sample/input | File Storage folder where input documents are stored. |
fileFilter | .*\.txt | Regular expression for files to select. |
configuration | { | JSON parameter that maps document types to the corresponding ML models; for each model it contains the model name, model version, and document type name. Keys: dataStore - the name of the datastore where input documents are stored; documentType - the name of the document type to use for classification; runModel - the name and version of the model to run; storagePath - the document path in File Storage. |
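To illustrate how the default fileFilter value behaves, the sketch below applies the regular expression `.*\.txt` to a few file names. It assumes the expression must match the whole name and is case-sensitive; how the platform actually applies the filter is not stated here. The function `select_files` and the file names are hypothetical.

```python
import re

FILE_FILTER = r".*\.txt"  # default value of the fileFilter parameter


def select_files(names, pattern=FILE_FILTER):
    """Keep only the names that the pattern matches in full (assumed behavior)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


# Hypothetical file names, for illustration only.
names = ["invoice_001.txt", "invoice_002.TXT", "notes.txt.bak", "readme.md"]
print(select_files(names))  # ['invoice_001.txt']
```

Under these assumptions, files with an upper-case extension or a trailing suffix are skipped, so a broader expression would be needed to pick them up.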
Included Steps
Step 1. Ingest Documents
The RPA bot retrieves documents from the dedicated folder in File Storage and compiles a list of documents to be processed. A document that has just been picked up for processing has the status 'NEW'.
Once the list has been generated, the RPA bot prepares batches of documents for processing. The number of documents in a batch is determined by the configuration parameter batchSize.
After this step a separate workflow of RPA and ML tasks is created for each TXT document.
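The batching described in this step can be sketched as splitting the document list into consecutive groups of at most batchSize items. This is an illustration of the idea only, not the platform's implementation; the function `make_batches` and the document names are hypothetical.

```python
def make_batches(documents, batch_size):
    """Split the document list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]


# Hypothetical document names, for illustration only.
docs = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]
for batch in make_batches(docs, 2):
    print(batch)
```

With five documents and a batch size of 2, the last batch holds the single remaining document.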
Step 2. Prepare Documents
In this step, input data for ML model execution is prepared. As part of the preparation, TXT→HTML conversion is performed (inside the HTML IE Document Processor itself; no additional setup is required). Files created while processing the original document are saved to the same ietext_sample File Storage folder where the original document is stored.
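The details of the TXT→HTML conversion inside the HTML IE Document Processor are not described here. As a rough idea of what such a conversion involves, the sketch below escapes the text and wraps it in a `<pre>` element so that line breaks and spacing survive. The function `txt_to_html` and the page layout are assumptions made for illustration.

```python
import html


def txt_to_html(text: str, title: str = "document") -> str:
    """Wrap plain text in a minimal HTML page, escaping markup characters."""
    body = html.escape(text)  # turns &, <, > and quotes into HTML entities
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body><pre>{body}</pre></body></html>"
    )


# Hypothetical invoice fragment, for illustration only.
print(txt_to_html("Invoice Number: 42\nTotal < 100 & paid", title="invoice"))
```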
Step 3. Extract Data
The ML Information Extraction model extracts the specified fields from the HTML order invoice documents.
Step 4. Verify Extracted Data
This step enables human verification and correction to ensure the accuracy of the extracted data. After the relevant business entities have been extracted from a document, a human task is created that must be completed in Workspace. The task contains the ML Information Extraction model output, which the reviewer can check, validate, and correct.
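Conceptually, verification overlays the reviewer's corrections on the model output: fields the reviewer changes replace the extracted values, and the rest are kept as extracted. The sketch below shows that idea with plain dictionaries. The actual data format of the Workspace task is not described here, so the function `apply_corrections` and the sample values are hypothetical.

```python
def apply_corrections(extracted, corrections):
    """Return the verified fields and the sorted names of fields the reviewer changed."""
    verified = dict(extracted)  # copy so the model output stays untouched
    changed = sorted(k for k, v in corrections.items() if extracted.get(k) != v)
    verified.update(corrections)
    return verified, changed


# Hypothetical model output and reviewer input, for illustration only.
extracted = {"Invoice Number": "INV-0O1", "Total Amount": "120.00"}
corrections = {"Invoice Number": "INV-001"}

verified, changed = apply_corrections(extracted, corrections)
print(verified)
print(changed)
```

Keeping the original model output separate from the verified result makes it possible to see which fields needed correction.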