Information Extraction HTML Sample Process (HTML IE Sample)

Overview

The HTML IE Sample process automatically processes HTML documents. Its purpose is to extract the following fields from HTML invoice documents:

  • Invoice Number

  • Invoice Date

  • Due Date

  • Company Name

  • Street Address

  • City

  • Zip Code

  • Phone Number

  • Email

  • Product Name

  • Product Description

  • Quantity

  • Price

  • Tax Rate

  • Discount Rate

  • Total Discount

  • Total Amount

HTML IE Sample Lifecycle includes:

Prerequisites

To set up and run the HTML IE Sample process:

  1. Ensure that you have a running node with the "AP_RUN" capability.

  2. Upload the HTML IE Sample package to the Control Server. The package is available at the following location: http://<CS host>/nexus/repository/rpaplatform/eu/ibagroup/samples/ap/easy-rpa-iehml-ap/<EasyRPA version>/easy-rpa-iehml-ap-<EasyRPA version>-bin.zip
    The source code is available here: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/easy-rpa-ml-aps/easy-rpa-iehml-ap

  3. Ensure that the following details are provided for the HTML IE Sample automation process in the Automation Process Details tab:

    Module class: eu.ibagroup.sample.ml.iehtml.IeHtmlSample

    Group Id: eu.ibagroup.samples.ap

    Artifact Id: easy-rpa-iehml-ap

    Version Id: <EasyRPA version>

  4. To run this demo, launch the HTML IE Sample automation process. To run the OpenAI information extraction demo, launch the corresponding Document Processor steps from the IE_HTML_OPENAI_SAMPLE document set.
  5. For OpenAI information extraction from loan agreements, the required Document Processor steps must be initiated from the IE_HTML_LOAN_OPENAI_SAMPLE document set. The default prompt configuration is designed not only to extract information but also to summarize specific paragraphs of the loan agreements.
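The package location given in step 2 follows a fixed pattern, so it can be assembled from the Control Server host and the EasyRPA version. The sketch below is only an illustration: the function name, the host `cs.example.local`, and the version `2.5.0` are hypothetical, and it assumes that both version placeholders in the path take the same value.

```python
def package_url(cs_host: str, version: str) -> str:
    """Build the download URL of the HTML IE Sample package.

    Assumption: both version placeholders in the documented path
    hold the same EasyRPA version string.
    """
    artifact = "easy-rpa-iehml-ap"  # Artifact Id from step 3
    return (
        f"http://{cs_host}/nexus/repository/rpaplatform/"
        f"eu/ibagroup/samples/ap/{artifact}/{version}/"
        f"{artifact}-{version}-bin.zip"
    )


# Hypothetical host and version, used only for illustration.
print(package_url("cs.example.local", "2.5.0"))
```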

IeHTML Sample Process Package structure

| Folder | Description |
|---|---|
| HTML IE Sample | HTML IE Sample automation process |
| HTML IE Document Processor | Standard HTML information extraction automation process |
| IE_HTML_SAMPLE_DOCUMENTS | Data Store for information extraction results |
| IE_HTML_SAMPLE | Information Extraction document set. Contains invoice test samples. |
| IE_HTML_OPENAI_SAMPLE | Information Extraction document set for the OpenAI model. Contains invoice test samples. |
| IE_HTML_LOAN_OPENAI_SAMPLE | Information Extraction document set for the OpenAI model. Contains loan agreement test samples. |
| HTML IE Invoice | Invoice information extraction document type. Defines the entities to be extracted from invoices. |
| HTML IE Loan Agreement OpenAI Sample | Loan Agreement information extraction document type. Defines the entities to be extracted from loan agreements. |
| HTML Information Extraction Task | Information Extraction human task type. Defines the task input form in the Workspace. |
| html_ie_invoice_model-1.2.tar | Information Extraction (IE) model |
| ml_iehtml_openai_model.tar | Information Extraction (IE) OpenAI model |
| storage/data | Folder that contains documents to be uploaded to File Storage |

Configuration Parameters for IeHTML Sample Automation Process:

| Key | Default Value | Description |
|---|---|---|
| inputFolder | iehtml_sample/input | File Storage folder where input documents are stored. |
| fileFilter | .*\.html | Regular expression that selects the files to process. |
| configuration | See the JSON below. | JSON parameter that maps document types to the corresponding ML models. It contains the model name, model version, and document type name for each model. |

Default value of the configuration parameter:

{
  "DEFAULT": {
    "dataStore": "IE_HTML_SAMPLE_DOCUMENTS",
    "documentType": "HTML IE Sample",
    "model": "html_ie_invoice_model",
    "runModel": "html_ie_invoice_model,1.2",
    "storagePath": "iehtml_sample",
    "exportDocumentSet": "IE_HTML_SAMPLE",
    "bucket": "data"
  }
}

Fields of the configuration parameter:

  • dataStore - the name of the Data Store in which input documents are stored

  • documentType - the name of the document type to use for classification

  • runModel - the name and version of the model to run

  • storagePath - the document path to use in File Storage
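As an illustration of how the configuration parameter might be read, the following sketch parses the default JSON and splits runModel into a model name and a version. It is not the process's actual implementation: the function `resolve_settings`, the keys `runModelName` and `runModelVersion`, and the fallback to the DEFAULT entry for unknown keys are all assumptions made for this example.

```python
import json

# Default value of the "configuration" parameter, with the outer
# object closed so that it is valid JSON.
CONFIGURATION = """
{
  "DEFAULT": {
    "dataStore": "IE_HTML_SAMPLE_DOCUMENTS",
    "documentType": "HTML IE Sample",
    "model": "html_ie_invoice_model",
    "runModel": "html_ie_invoice_model,1.2",
    "storagePath": "iehtml_sample",
    "exportDocumentSet": "IE_HTML_SAMPLE",
    "bucket": "data"
  }
}
"""


def resolve_settings(raw: str, key: str = "DEFAULT") -> dict:
    """Return the settings for `key`, with runModel split into parts."""
    config = json.loads(raw)
    # Assumption: an unknown key falls back to the DEFAULT entry.
    settings = dict(config.get(key, config["DEFAULT"]))
    # Assumption: runModel has the form "<model name>,<model version>".
    name, version = (part.strip() for part in settings["runModel"].split(",", 1))
    settings["runModelName"] = name        # hypothetical helper key
    settings["runModelVersion"] = version  # hypothetical helper key
    return settings


settings = resolve_settings(CONFIGURATION)
print(settings["runModelName"], settings["runModelVersion"])
```

Splitting on the first comma only keeps the sketch tolerant of version strings that might themselves contain commas, although the default value does not need this.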

Included Steps

Step 1. Ingest Documents

The RPA bot retrieves documents from the dedicated folder in File Storage and compiles a list of documents to be processed. A document that has just been retrieved for processing has the status 'NEW'.
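The selection of input documents can be sketched with the default fileFilter expression `.*\.html`. This is a minimal illustration, assuming that the whole file name must match the expression and that matching is case-sensitive; the function name, the file names, and the dictionary layout of a document record are hypothetical.

```python
import re

FILE_FILTER = r".*\.html"  # default value of the fileFilter parameter


def select_files(names, pattern=FILE_FILTER):
    """Return the file names that match the filter expression."""
    regex = re.compile(pattern)
    # Assumption: the entire name has to match, so "x.html.bak" is skipped.
    return [name for name in names if regex.fullmatch(name)]


# Hypothetical folder listing, used only for illustration.
listing = ["invoice_001.html", "invoice_002.HTML", "notes.txt", "invoice_003.html.bak"]
selected = select_files(listing)

# Each selected document starts in the 'NEW' status.
documents = [{"name": name, "status": "NEW"} for name in selected]
print(documents)
```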

When the list of documents has been generated, the RPA bot prepares batches of documents for processing. The number of documents in a batch is determined by the configuration parameter batchSize.
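Batching by batchSize can be sketched as splitting the document list into consecutive slices. The function name and the sample file names are hypothetical, and the sketch assumes that the last batch may hold fewer documents than batchSize.

```python
def make_batches(documents, batch_size):
    """Split the document list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        documents[start:start + batch_size]
        for start in range(0, len(documents), batch_size)
    ]


# Hypothetical document names, used only for illustration.
batches = make_batches(["a.html", "b.html", "c.html", "d.html", "e.html"], 2)
print(batches)
```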

After this step, a separate workflow of RPA and ML tasks is created for each document.

Step 2. Prepare Documents

In this step, the input data for ML model execution is prepared. Files created while processing the original document are saved to the same iehtml_sample File Storage folder where the original document is stored.

Step 3. Extract Data

The ML Information Extraction model extracts the fields listed in the Overview from the HTML invoice documents.

Step 4. Verify Extracted Data

This step enables human verification and correction to ensure the accuracy of the extracted data. After the relevant business entities have been extracted from a document, a human task is created that must be completed in Workspace. The task contains the output of the ML Information Extraction model, which a human can review, validate, and correct.
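Conceptually, the verified result is the model output with the reviewer's changes applied on top. The sketch below illustrates that idea only; the function name, the field values, and the convention that a correction of `None` removes a field are assumptions and do not describe the Workspace task format.

```python
def apply_corrections(extracted: dict, corrections: dict) -> dict:
    """Overlay reviewer corrections on the fields extracted by the model."""
    verified = dict(extracted)  # leave the model output untouched
    for field, value in corrections.items():
        if value is None:
            # Assumption: None means the reviewer rejected the field.
            verified.pop(field, None)
        else:
            verified[field] = value
    return verified


# Hypothetical model output and reviewer input, for illustration only.
model_output = {"Invoice Number": "INV-0042", "Due Date": "2024-01-31", "City": "Minsk"}
reviewer_input = {"Due Date": "2024-02-29", "City": None}
verified = apply_corrections(model_output, reviewer_input)
print(verified)
```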