Classification HTML Sample Process (ClHTMLSample)
Overview
ClHTML Sample classifies HTML documents: it uses the Classification ML model to determine which of the following classes a document belongs to:
- Application invoice
- Payment Details
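The ClHTML Sample lifecycle consists of the steps described under Included Steps: Ingest Documents, Prepare Documents, Extract Data and Verify Extracted Data.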
Prerequisites
To set up and run the ClHTML Sample Process:
- Ensure that you have a running node with the "AP_RUN" capabilities.
- Upload the ClHTML Sample package to the Control Server. The package can be found at the following location: http://<CS host>/nexus/repository/rpaplatform/eu/ibagroup/samples/ap/easy-rpa-clhml-ap/<EasyRPA version>/easy-rpa-clhml-ap-<EasyRPA version>-bin.zip
The source code can be found here: https://code.easyrpa.eu/easyrpa/easy-rpa-samples/-/tree/dev/easy-rpa-ml-aps/easy-rpa-clhml-ap
Ensure the following details are provided for the ClHTML automation process in the Automation Process Details tab:
Module class: eu.ibagroup.sample.ml.clhtml.ClHtmlSample
Group Id: eu.ibagroup.samples.ap
Artifact Id: easy-rpa-clhml-ap
Version Id: <EasyRPA version>
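The package location given in the prerequisites follows a fixed pattern, so it can be assembled from the Control Server host and the EasyRPA version. The sketch below is only an illustration: the function name `package_url` and the example host and version values are assumptions, not part of the platform.

```python
def package_url(cs_host: str, version: str) -> str:
    """Build the package URL using the pattern quoted in the prerequisites."""
    artifact = "easy-rpa-clhml-ap"  # Artifact Id from the process details
    base = f"http://{cs_host}/nexus/repository/rpaplatform/eu/ibagroup/samples/ap"
    return f"{base}/{artifact}/{version}/{artifact}-{version}-bin.zip"


# Hypothetical host and version, used for illustration only.
print(package_url("cs.example.com", "1.0.0"))
```

The same version value is used twice in the path, once as the directory name and once in the archive file name.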
ClHTML Sample Process Package structure
| Folder | Description |
| --- | --- |
| HTML CL Sample | ClHTML Sample automation process |
| HTML CL Document Processor | Standard HTML classification automation process |
| CL_HTML_SAMPLE | Classification document set. Contains HTML test samples |
| HTML Classification Sample | HTML document classification document type. Defines the classes of the documents to determine |
| HTML Classification Task | Classification human task type. Defines the task input form in the Workspace |
| easy-rpa-clhml-ap-<EasyRPA version>.jar | Root archive and dependencies. Contains code of ClHTML Sample automation process |
| easy-rpa-clhtmldp-ap-<version>.tar.gz | Classification ML model |
| storage/data | Folder that contains documents to be uploaded in File Storage |
Configuration Parameters for the ClHTML Sample Automation Process:
| Key | Default Value | Description |
| --- | --- | --- |
| inputFolder | clhtml_sample/input | File Storage folder where input documents are stored. |
| fileFilter | .*\.html | Regular expression for files to select. |
| configuration | JSON shown below | Classification settings. The keys are described after the JSON. |

Default value of configuration:
{
  "html-cl": {
    "dataStore": "CL_HTML_SAMPLE_DOCUMENTS",
    "documentType": "HTML Classification Sample",
    "model": "html_cl_model",
    "runModel": "html_cl_model,1.1",
    "storagePath": "clhtml_sample",
    "exportDocumentSet": "CL_HTML_SAMPLE",
    "bucket": "data"
  }
}
dataStore - the name of the data store in which input documents are stored
documentType - the name of the document type to use for classification. The document type setting scoreThreshold defines the lower limit of the classification score: the predicted category is accepted as correct only when its score is above this limit
runModel - the name and version of the model to run
storagePath - the document path on the storage to use
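To illustrate how the configuration value can be read, here is a minimal Python sketch that parses the default JSON and splits runModel into a model name and a version. How the platform itself parses this value is not stated here; the comma split is an assumption based on the "name,version" form of the default, and `parse_run_model` is a hypothetical helper.

```python
import json

# Default value of the "configuration" parameter, copied from the table above.
CONFIGURATION = """
{
  "html-cl": {
    "dataStore": "CL_HTML_SAMPLE_DOCUMENTS",
    "documentType": "HTML Classification Sample",
    "model": "html_cl_model",
    "runModel": "html_cl_model,1.1",
    "storagePath": "clhtml_sample",
    "exportDocumentSet": "CL_HTML_SAMPLE",
    "bucket": "data"
  }
}
"""


def parse_run_model(value: str) -> tuple[str, str]:
    """Split a 'name,version' string (assumed format) into its two parts."""
    name, _, version = value.partition(",")
    return name.strip(), version.strip()


settings = json.loads(CONFIGURATION)["html-cl"]
model_name, model_version = parse_run_model(settings["runModel"])
print(settings["documentType"], model_name, model_version)
```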
Included Steps
Step 1. Ingest Documents
The RPA bot extracts documents from the dedicated folder in File Storage and compiles a list of documents to be processed. A document that has just been extracted for processing has the status 'NEW'.
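The selection of input files can be sketched as follows. This is an illustration, not the bot's actual code: `ingest_documents` is a hypothetical helper, the file names are invented, and matching the whole file name against the fileFilter expression is an assumption.

```python
import re


def ingest_documents(names, file_filter=r".*\.html"):
    """Keep the names that fully match the filter and mark them as NEW."""
    pattern = re.compile(file_filter)
    return [
        {"name": name, "status": "NEW"}
        for name in names
        if pattern.fullmatch(name)  # assumption: the whole name must match
    ]


# Invented file names, for illustration only.
names = ["invoice_01.html", "notes.txt", "payment.html", "page.html.bak"]
for document in ingest_documents(names):
    print(document["name"], document["status"])
```

With full-name matching, a file such as page.html.bak is skipped because the name does not end in .html.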
Once the list of documents has been generated, the RPA bot prepares batches of documents for processing. The number of documents in a batch is determined by the configuration parameter batchSize.
After this step, a separate workflow of RPA and ML tasks is created for each document.
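Splitting the document list into batches of batchSize items can be sketched like this. The helper name `make_batches` and the sample document names are assumptions made for the example; the platform's own batching code is not shown here.

```python
def make_batches(documents, batch_size):
    """Split the list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        documents[start:start + batch_size]
        for start in range(0, len(documents), batch_size)
    ]


# Five invented documents split with a hypothetical batchSize of 2.
documents = [f"doc_{i}.html" for i in range(1, 6)]
for batch in make_batches(documents, 2):
    print(batch)
```

The last batch may be smaller than batchSize when the number of documents is not an exact multiple of it.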
Step 2. Prepare Documents
In this step, the input data for ML model execution is prepared. Files created as a result of processing the original document are saved to the same File Storage folder where the original document is stored.
Step 3. Extract Data
Once documents are prepared, a pre-trained ML Classification model is executed to predict each document’s category. The model also provides a confidence score indicating how confident it is that the assigned classification tag is correct. The confidence score threshold is set in the settings of the Document type.
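The role of the confidence threshold can be illustrated with a small Python sketch. The score dictionary, the threshold value 0.8 and the helper `select_category` are hypothetical; the real model output format is not described here, and treating "above" as a strict comparison is an assumption.

```python
def select_category(scores, score_threshold):
    """Return (category or None, best score) for one document.

    The top-scoring category is accepted only when its score is strictly
    above the threshold (assumed interpretation of scoreThreshold).
    """
    category, score = max(scores.items(), key=lambda item: item[1])
    accepted = category if score > score_threshold else None
    return accepted, score


# Invented scores for the two classes of this sample.
confident = {"Application invoice": 0.93, "Payment Details": 0.07}
uncertain = {"Application invoice": 0.55, "Payment Details": 0.45}

print(select_category(confident, 0.8))
print(select_category(uncertain, 0.8))
```

In the second case no category passes the threshold, which is the kind of result that benefits most from human verification.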
Step 4. Verify Extracted Data
This step enables human verification and correction to ensure the accuracy of classification. After the relevant classes have been determined for a document, a human task is created and must be completed in the Workspace. The task contains the ML Classification model output, which a person can review, validate and correct.