Machine learning (ML) is a method of data analysis that automates analytical model building. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

The EasyRPA platform provides developers with the infrastructure to train ML models and use them for data processing.

Let's look at the EasyRPA artifacts related to machine learning.

Document Sets
Human Task Type
- Platforms Human Task Types
Document Processor
- Platforms Document Processors
ML Container
Model Repository
ML Models

Document Sets

Work on ML models starts with data: preferably, lots of data (documents) for which the target answer is known. Documents with an assigned target answer are called labeled or tagged data. For example, for the email classification problem, the target is a label that indicates whether an email is spam or not spam. The ML algorithm learns from the labeled examples that we provide.

Often, data is not readily available in a labeled form. Collecting and labeling data is often the most important step in solving an ML problem. The example data should be representative of the data that you will have when you use the model to make predictions. For example, if you want to predict whether an email is spam or not, you must collect both positive examples (spam emails) and negative examples (non-spam emails) so that the machine learning algorithm can find patterns that distinguish between the two types of email.

Once you have the labeled data, you might need to convert it to a format that is acceptable to your algorithm or software: a file called a training set. The next step is to train the model. This process invokes the ML algorithm with the training data and all the required parameters and configurations. As a result, an ML model is created and the training log is stored in the system.
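
The labeling and conversion steps described above can be sketched in Python. This is only an illustration: the `LabeledDocument` class, the `build_training_set` helper, and the JSON Lines layout are hypothetical and are not the training set format that EasyRPA uses.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class LabeledDocument:
    """A document together with its target answer (hypothetical structure)."""
    text: str
    label: str


def build_training_set(documents, required_labels):
    """Serialize labeled documents to JSON Lines, one example per line.

    Raises ValueError if a required label has no examples, because a model
    cannot learn to distinguish a class it has never seen.
    """
    counts = Counter(doc.label for doc in documents)
    missing = [label for label in required_labels if counts[label] == 0]
    if missing:
        raise ValueError(f"no examples for labels: {missing}")
    return "\n".join(
        json.dumps({"text": doc.text, "label": doc.label}) for doc in documents
    )


documents = [
    LabeledDocument("Win a free prize now", "spam"),
    LabeledDocument("Meeting moved to 3 pm", "not_spam"),
]
training_set = build_training_set(documents, required_labels=["spam", "not_spam"])
print(training_set)
```

The check for missing labels mirrors the requirement above that both positive and negative examples are collected before training.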

Human Task Type

Human Task Type (HTT) is a special JavaScript application (provided as a zip package) that controls the life cycle of document processing by a human. It handles all UI activities on a document to prepare outputJson from inputJson in Workspace.

Every HTT defines its own input/output JSON format and has a unique HTT type id.

Platforms Human Task Types

HTT Name	HTT type	inputJson type	outputJson type	Description
Classification Task	classification	hOCR input json	cl output json	Handles OCRed document classification scenario
Information Extraction Task	ie	hOCR input json	IE hOCR output json	Handles OCRed document information extraction scenario
HTML Classification Task	html-classification	HTML input json	cl output json	Handles HTML document classification scenario
HTML Information Extraction Task	html-ie	HTML input json	IE HTML output json	Handles HTML document information extraction scenario
Form Task	form	forms input json	forms output json	Handles forms scenario
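
Because every HTT type id maps to one input and one output JSON type, the table can be read as a lookup. The sketch below copies the four classification and information extraction rows into a Python dictionary; the `HumanTaskType` tuple and the `formats_for` helper are hypothetical names used only for illustration and are not part of the platform API.

```python
from typing import NamedTuple


class HumanTaskType(NamedTuple):
    name: str
    input_json: str
    output_json: str


# Values copied from the table above, keyed by HTT type id.
HUMAN_TASK_TYPES = {
    "classification": HumanTaskType(
        "Classification Task", "hOCR input json", "cl output json"),
    "ie": HumanTaskType(
        "Information Extraction Task", "hOCR input json", "IE hOCR output json"),
    "html-classification": HumanTaskType(
        "HTML Classification Task", "HTML input json", "cl output json"),
    "html-ie": HumanTaskType(
        "HTML Information Extraction Task", "HTML input json", "IE HTML output json"),
}


def formats_for(htt_type):
    """Return the (inputJson type, outputJson type) pair for an HTT type id."""
    try:
        htt = HUMAN_TASK_TYPES[htt_type]
    except KeyError:
        raise ValueError(f"unknown HTT type: {htt_type!r}") from None
    return htt.input_json, htt.output_json


print(formats_for("html-ie"))
```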

Nexus Link: https://<CS URL>/nexus/repository/rpaplatform/eu/ibagroup/easy-rpa-aps/<version>/easy-rpa-aps-<version>-bin.zip

Please note that these Human Task Types are not loaded into Control Server by default. They are contained in the automation process package available at the Nexus link above. To upload only a Human Task Type, upload the corresponding automation process package and, in the 'Please, resolve conflicts' window, skip all entities except the Human Task Type. For more information, see Upload Automation Process Package.

Document Processor

Document Processor is a special automation process that controls the life cycle of document sets and handles all data transformations required to collect data, apply OCR, label documents in Workspace, prepare training data sets, and train models.

See the Document Processors article for more details.

Platforms Document Processors

AP Name	HTT type	Description
CL Document Processor	classification	Handles image/pdf OCR in hOCR format and classification
IE Document Processor	ie	Handles image/pdf OCR in hOCR format and IE tagging
HTML CL Document Processor	html-classification	Handles classification of HTML documents.
HTML IE Document Processor	html-ie	Handles HTML and TXT input formats; converts TXT to HTML for IE tagging.

Nexus Link: https://<CS URL>/nexus/repository/rpaplatform/eu/ibagroup/easy-rpa-aps/<version>/easy-rpa-aps-<version>-bin.zip

Please note that these Document Processors are not loaded into Control Server by default. Download the required document processor from your Nexus using the link above and load the *.zip file into Control Server. For more information on how to upload a .zip file to Control Server, see Upload Automation Process Package.

ML Container

ML Container is a service of the platform that is responsible for two main tasks:

Model Training: The Control Server sends a train request to the ML Container that typically contains a training set, model configuration files, training parameters, and so on. The ML Container selects a training algorithm suitable for the specified model type and trains the model with the specified parameters. The result of the training process is a package that is uploaded into the Model Repository.
Model Execution: The Control Server sends a process request to the ML Container that typically specifies the model to be executed and the input JSON. The ML Container retrieves the specified model from the Model Repository, if required, and executes it. The result of the execution is placed in the output JSON.

The ML Container is a separate Docker container that communicates with the Control Server via a message queue.
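
The two request flows can be illustrated with a small in-memory simulation. Everything in this sketch is an assumption made for illustration: the request fields, the `handle_request` function, the toy keyword "training algorithm", and the use of Python's `queue.Queue` and a dictionary in place of the real message queue and Model Repository. It does not reflect the actual message format used between the Control Server and the ML Container.

```python
import queue
from collections import Counter

# In-memory stand-ins for the message queue and the Model Repository.
requests = queue.Queue()
model_repository = {}


def handle_request(request):
    """Handle one train or process request (hypothetical message layout)."""
    key = (request["model_name"], request["model_version"])
    if request["type"] == "train":
        # Toy training algorithm: remember the label each word was seen with.
        model = {}
        for text, label in request["training_set"]:
            for word in text.lower().split():
                model[word] = label
        model_repository[key] = model  # "upload" the training result
        return {"status": "trained"}
    if request["type"] == "process":
        model = model_repository[key]  # retrieve the specified model
        words = request["input_json"]["text"].lower().split()
        votes = Counter(model[word] for word in words if word in model)
        label = votes.most_common(1)[0][0] if votes else None
        return {"output_json": {"label": label}}
    raise ValueError(f"unknown request type: {request['type']!r}")


# The "Control Server" side: enqueue a train request, then a process request.
requests.put({
    "type": "train",
    "model_name": "demo_model",
    "model_version": "1.0.0",
    "training_set": [("free prize now", "spam"),
                     ("meeting agenda attached", "not_spam")],
})
requests.put({
    "type": "process",
    "model_name": "demo_model",
    "model_version": "1.0.0",
    "input_json": {"text": "Claim your free prize"},
})

# The "ML Container" side: consume requests in order.
responses = []
while not requests.empty():
    responses.append(handle_request(requests.get()))
print(responses)
```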

Model Repository

The Model Repository is the platform's Nexus pypi-group Python repository that stores all the ML artifacts. It is a group repository that combines the following repositories:

pypi-proxy - proxies the global https://pypi.org/
pypi-rpaplatform - contains platform ML artifacts that cannot be overridden by users
rpa-model - contains all user ML artifacts; the CS uploads new ML artifacts to it
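
One way to picture a group repository is as an ordered lookup across its members. The sketch below is a simplified model, not a description of how Nexus is configured on the platform: the lookup order, the sample repository contents, the `upload_user_artifact` and `resolve` helpers, and the rule that a user upload is rejected when the same name and version exist in pypi-rpaplatform are all assumptions made for illustration.

```python
# Member repositories of the group, as listed above.
# The lookup order is an assumption made for this sketch.
GROUP_MEMBERS = ["pypi-rpaplatform", "rpa-model", "pypi-proxy"]

# Sample contents: sets of (name, version) pairs.
repositories = {
    "pypi-rpaplatform": {("ml_cl_spacy3_model", "3.3.0")},
    "rpa-model": set(),
    "pypi-proxy": {("spacy", "3.1.1")},
}


def upload_user_artifact(name, version):
    """Upload to rpa-model, refusing to shadow a platform artifact."""
    if (name, version) in repositories["pypi-rpaplatform"]:
        raise PermissionError(f"{name} {version} is a platform artifact")
    repositories["rpa-model"].add((name, version))


def resolve(name, version):
    """Return the first member repository that holds the artifact, or None."""
    for member in GROUP_MEMBERS:
        if (name, version) in repositories[member]:
            return member
    return None


upload_user_artifact("my_invoice_model", "1.0.0")  # hypothetical user model
print(resolve("my_invoice_model", "1.0.0"))
print(resolve("ml_cl_spacy3_model", "3.3.0"))
```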

ML Models

The platform defines the following concepts:

trainer (model trainer) - a Python package that is executed in the platform's ML Container. It implements the platform's Model Training and Model Execution flows. Its name/version is specified for model training.
model (trained model) - a Python package that is created by a trainer during the training process. It references its trainer internally, and together they provide the data during data processing. Its name/version is specified for data extraction.

Each trainer/model is uniquely identified by its name and version. The platform's ML artifacts are placed into the pypi-rpaplatform model repository. The CS also provides a way to export existing models as a file and import pre-trained models into the system.
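
The relationship between trainers and models can be sketched with two small data classes. The class names, the field names, and the model name `invoice_ie_model` are hypothetical; the trainer name and version are taken from the platform trainers listed on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactId:
    """Name and version together identify a trainer or a model."""
    name: str
    version: str


@dataclass(frozen=True)
class TrainedModel:
    """A trained model keeps a reference to the trainer that produced it."""
    model_id: ArtifactId
    trainer_id: ArtifactId


trainer = ArtifactId("ml_ie_spacy3_model", "3.3.0")
# "invoice_ie_model" is a hypothetical name for a user-trained model.
model = TrainedModel(ArtifactId("invoice_ie_model", "1.0.0"), trainer)

# Frozen dataclasses are hashable, so an identifier can key a registry.
registry = {model.model_id: model}
found = registry[ArtifactId("invoice_ie_model", "1.0.0")]
print(found.trainer_id.name)
```

Using the name/version pair as the key reflects the rule that two artifacts with the same name but different versions are distinct.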

Every model works with a specific input/output data format. Because the main purpose of calling a model is to perform work that would otherwise be done by a human, the platform provides models for the Human Task Types listed above. Models and Human Task Types are linked by HTT type.

Models are also classified by the following types:

Platforms Classification Models

Trainer	Version	HTT type	Notes
ml_cl_spacy3_model	3.3.0	classification	Dynamic language loading support. Added lemmatization using pre-trained spaCy models. Uses spaCy 3.1.1.
ml_cl_openai_model	3.3.0	classification	Uses OpenAI to classify hOCR documents
ml_clhtml_spacy3_model	3.3.0	html-classification	Dynamic language loading support. Added lemmatization using pre-trained spaCy models. Uses spaCy 3.1.1.
ml_clhtml_openai_model	3.3.0	html-classification	Uses OpenAI to classify HTML documents

Platforms Information Extraction Models

Trainer	Version	HTT type	Notes
ml_ie_spacy2_model	3.3.0	ie	Uses spaCy 2.3.4 for IE. Supports dynamic language loading.
ml_ie_spacy3_model	3.3.0	ie	Uses spaCy 3.1.1 for IE. Supports dynamic language loading.
ml_ie_openai_model	3.3.0	ie	OpenAI information extraction using an HTML representation of the hOCR document. Can be used as a trained model; in this case it receives the OpenAI prompt in the MlTask configuration.
ml_iehtml_spacy2_model	3.3.0	html-ie	Uses spaCy 2.3.4 for IE. Supports dynamic language loading.
ml_iehtml_spacy3_model	3.3.0	html-ie	Uses spaCy 3.1.1 for IE. Supports dynamic language loading.
ml_iehtml_openai_model	3.3.0	html-ie	OpenAI Information Extraction using minified html representation of source HTML document.

Platforms Custom Models

All custom models are provided already trained (they do not support the training process), so their name/version should be used directly for the extraction process.

Trainer	Version	HTT type	Description
ml_ie_finext_model	3.3.0	ie	Financial Information Model - FINIX, Trainer and trained model
ml_signature_detection_yolo5_model	3.3.0	ie	Signature detection Model, Trainer and trained model

Nexus Link: https://<CS URL>/nexus/repository/rpaplatform/eu/ibagroup/easy-rpa-aps/<version>/easy-rpa-aps-<version>-bin.zip

Please note that ML Models are not loaded into Control Server by default. Download the required model from your Nexus using the link above and load the *.zip file into Control Server. For more information on how to upload a .zip file to Control Server, see Upload Automation Process Package.
