Classification HTML Sample Process (ClHTMLSample)

Overview

The ClHTML Sample classifies HTML documents: it uses the Classification ML model to determine which of the following classes a document belongs to:

  • Application invoice
  • Payment Details

ClHTML Sample Lifecycle includes:

Prerequisites

To set up and run the ClHTML Sample process:

  1. Ensure that you have a running node with the "AP_RUN" capability.
  2. Upload the ClHTML Sample package to the Control Server. The package is available at: http://<CS host>/nexus/repository/rpaplatform/eu/ibagroup/samples/ap/easy-rpa-clhml-ap/<EasyRPA version>/easy-rpa-clhml-ap-<EasyRPA version>-bin.zip
    The source code is available at: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/easy-rpa-ml-aps/easy-rpa-clhml-ap
  3. Ensure that the following details are provided for the ClHTML automation process in the Automation Process Details tab:

    Module class: eu.ibagroup.sample.ml.clhtml.ClHtmlSample

    Group Id: eu.ibagroup.samples.ap

    Artifact Id: easy-rpa-clhml-ap

    Version Id: <EasyRPA version>

    ClHTML Sample Process Package structure

    Folder - Description

    HTML CL Sample - ClHTML Sample automation process

    HTML CL Document Processor - Standard HTML classification automation process

    CL_HTML_SAMPLE - Classification document set. Contains HTML test samples

    HTML Classification Sample - HTML document classification document type. Defines the classes of the documents to determine

    HTML Classification Task - Classification human task type. Defines the task input form in the Workspace

    easy-rpa-clhml-ap-<EasyRPA version>.jar - Root archive and dependencies. Contains code of ClHTML Sample automation process

    easy-rpa-clhtmldp-ap-<version>.tar.gz - Classification ML model

    storage/data - Folder that contains documents to be uploaded in File Storage

    Configuration Parameters for ClHTML Sample Automation Process:

    inputFolder
        Default value: clhtml_sample/input
        Description: File Storage folder where input documents are stored.

    fileFilter
        Default value: .*\.html
        Description: Regular expression for files to select.

    configuration
        Default value:

        {
            "html-cl": {
                "dataStore": "CL_HTML_SAMPLE_DOCUMENTS",
                "documentType": "HTML Classification Sample",
                "model": "html_cl_model",
                "runModel": "html_cl_model,1.1",
                "storagePath": "clhtml_sample",
                "exportDocumentSet": "CL_HTML_SAMPLE",
                "bucket": "data"
            }
        }

    dataStore - the datastore name where to store input documents

    documentType - the name of the document type to use for classification. The document type's scoreThreshold setting defines the lower limit for the classification score: a document category is accepted as correct only when its score is above this limit

    runModel - the model name and version to run

    storagePath - the storage path to use for documents
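The parameter descriptions above can be illustrated with a minimal sketch that reads the default configuration value and splits runModel into a model name and a version. The "name,version" layout is inferred from the default value and the description above; the helper name parse_run_model is hypothetical, and the platform reads this configuration itself, so this only shows how the value is structured.

```python
import json

# Default value of the "configuration" parameter, copied from this page.
RAW_CONFIGURATION = """
{
    "html-cl": {
        "dataStore": "CL_HTML_SAMPLE_DOCUMENTS",
        "documentType": "HTML Classification Sample",
        "model": "html_cl_model",
        "runModel": "html_cl_model,1.1",
        "storagePath": "clhtml_sample",
        "exportDocumentSet": "CL_HTML_SAMPLE",
        "bucket": "data"
    }
}
"""


def parse_run_model(run_model):
    """Split a "name,version" string into (name, version).

    Assumption: the first comma separates the model name from its version.
    """
    name, version = (part.strip() for part in run_model.split(",", 1))
    return name, version


config = json.loads(RAW_CONFIGURATION)["html-cl"]
model_name, model_version = parse_run_model(config["runModel"])
print(f"data store: {config['dataStore']}")
print(f"model to run: {model_name}, version {model_version}")
```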

    Included Steps

    Step 1. Ingest Documents

    The RPA bot retrieves documents from the dedicated folder in File Storage and compiles a list of documents to be processed. A document that has just been retrieved for processing has the status 'NEW'.

    Once the list of documents has been generated, the RPA bot splits it into batches for processing. The number of documents in a batch is determined by the configuration parameter batchSize.

    After this step, a separate workflow of RPA and ML tasks is created for each document.
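The selection and batching described in this step can be sketched as follows. This is an illustration only, not the platform's implementation: the function names and file names are hypothetical, and it assumes that fileFilter must match the whole file name.

```python
import re


def select_documents(file_names, file_filter=r".*\.html"):
    """Keep the file names that match the fileFilter regular expression.

    Assumption: the expression has to match the entire file name.
    """
    pattern = re.compile(file_filter)
    return [name for name in file_names if pattern.fullmatch(name)]


def make_batches(documents, batch_size):
    """Split the document list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [documents[i:i + batch_size]
            for i in range(0, len(documents), batch_size)]


# Hypothetical contents of the input folder.
files = ["invoice_1.html", "readme.txt", "payment_1.html", "invoice_2.html"]
selected = select_documents(files)
for batch in make_batches(selected, batch_size=2):
    print(batch)
```

With a batchSize of 2, the three selected documents end up in two batches, the last of which holds a single document.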

    Step 2. Prepare Documents

    In this step, the input data for ML model execution is prepared. Files created while processing the original document are saved to the same ADD File Storage folder where the original document is stored.

    Step 3. Extract Data

    Once documents are prepared, a pre-trained ML Classification model is executed to predict a document’s category. The ML Classification model also provides a confidence measure indicating how confident it is that the assigned classification tag is correct. The confidence score threshold is provided in the settings of the Document type.
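The role of the confidence score threshold can be sketched as below. The function name, the example scores, and the threshold of 0.8 are hypothetical, and the strict "greater than" comparison is an assumption based on the wording "above which" in the scoreThreshold description.

```python
from typing import Optional


def accept_category(category: str, score: float,
                    score_threshold: float) -> Optional[str]:
    """Return the predicted category when its score is above the threshold.

    Returns None when the score does not exceed the threshold, meaning the
    prediction is not accepted as correct.
    """
    return category if score > score_threshold else None


# Hypothetical model outputs checked against a hypothetical threshold of 0.8.
print(accept_category("Payment Details", 0.93, 0.8))
print(accept_category("Application invoice", 0.41, 0.8))
```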

    Step 4. Verify Extracted Data

    This step enables human verification and correction to ensure the accuracy of classification. After the relevant classes have been defined for a document, a human task is created and must be completed in the Workspace. The task contains the ML Classification model output, which a human can review, validate, and correct.