Concepts and Entities

Machine learning (ML) is a method of data analysis that automates analytical model building. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

The EasyRPA platform provides developers with infrastructure to train ML models and use them for data processing.

Let's look at the EasyRPA artifacts related to machine learning.

Document Sets

Work on ML models starts with data - preferably, lots of data (documents) for which the target answer is known. Documents with an assigned target answer are called labeled or tagged data. For example, for the email classification problem, the target is a label that indicates whether an email is spam or not spam. The ML algorithm learns from the labeled examples that we provide.

Often, data is not readily available in a labeled form. Collecting and labeling data is often the most important step in solving an ML problem. The example data should be representative of the data that the model will see when it makes predictions. For example, if you want to predict whether an email is spam or not, you must collect both positive examples (spam emails) and negative examples (non-spam emails) so that the machine learning algorithm can find patterns that distinguish between the two types of email.

Once you have the labeled data, you might need to convert it to a format that is acceptable to your algorithm or software - a file called a training set. The next step is to actually train a model. This process invokes the ML algorithm with the training data and all the required parameters and configurations. As a result, an ML model is created and a training log is stored in the system.
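
The step from labeled documents to a training set can be sketched as follows. This is a minimal illustration, not the platform's actual training set format: the LabeledDocument type, the to_training_set and label_counts helpers, and the JSON Lines layout are all assumptions made for this example.

```python
import json
from dataclasses import dataclass


@dataclass
class LabeledDocument:
    """A document paired with its target answer (hypothetical structure)."""
    text: str
    label: str  # e.g. "spam" or "not_spam"


def to_training_set(documents):
    """Serialize labeled documents to JSON Lines, one example per line.

    The JSON Lines layout is an assumption for illustration only.
    """
    return "\n".join(
        json.dumps({"text": doc.text, "label": doc.label}) for doc in documents
    )


def label_counts(documents):
    """Count examples per label to check that every class is represented."""
    counts = {}
    for doc in documents:
        counts[doc.label] = counts.get(doc.label, 0) + 1
    return counts


documents = [
    LabeledDocument("Win a free prize now", "spam"),
    LabeledDocument("Meeting moved to 3 pm", "not_spam"),
    LabeledDocument("Cheap loans, click here", "spam"),
]

training_set = to_training_set(documents)
counts = label_counts(documents)
```

Checking the per-label counts before training is a cheap way to confirm that both positive and negative examples were actually collected.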

Human Task Type

A Human Task Type (HTT) is a special JavaScript application (provided as a zip package) that controls the life cycle of document processing by a human. It handles all UI activities on a document that are needed to prepare outputJson from inputJson in Workspace.

Every HTT defines its own input/output JSON format and has a unique HTT type id.

Platform Human Task Types

| HTT Name | HTT type | inputJson type | outputJson type | Description |
|---|---|---|---|---|
| Classification Task | classification | hOCR input JSON | CL output JSON | Handles the OCRed document classification scenario |
| Information Extraction Task | ie | hOCR input JSON | IE hOCR output JSON | Handles the OCRed document information extraction scenario |
| HTML Classification Task | html-classification | HTML input JSON | CL output JSON | Handles the HTML document classification scenario |
| HTML Information Extraction Task | html-ie | HTML input JSON | IE HTML output JSON | Handles the HTML document information extraction scenario |
| Form Task | form | forms input JSON | forms output JSON | Handles the forms scenario |

Nexus Link: https://<CS URL>/nexus/repository/rpaplatform/eu/ibagroup/easy-rpa-aps/<version>/easy-rpa-aps-<version>-bin.zip
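
The link template above has two placeholders, the Control Server URL and the package version. A minimal sketch of filling it in is shown below; the nexus_package_url helper is hypothetical, and the host name and version used in the call are placeholders rather than real values.

```python
def nexus_package_url(cs_url, version):
    """Fill the Nexus link template with a Control Server URL and a version."""
    base = cs_url.rstrip("/")  # tolerate a trailing slash in the CS URL
    return (
        f"{base}/nexus/repository/rpaplatform/eu/ibagroup/easy-rpa-aps/"
        f"{version}/easy-rpa-aps-{version}-bin.zip"
    )


# Both arguments are placeholders, not a real endpoint or release.
url = nexus_package_url("https://cs.example.com/", "1.0.0")
```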

Please note that these Human Task Types are not loaded into Control Server by default. They are contained in the automation process package available at the Nexus link above. To upload only a Human Task Type, upload the corresponding automation process package and, in the 'Please, resolve conflicts' window, skip all the entities except the Human Task Type. For more information, please visit Upload Automation Process Package.

Document Processor

A Document Processor is a special automation process that controls the life cycle of document sets. It handles all the data transformations required to collect data and apply OCR, label documents in Workspace, prepare training data sets, and train models.

See the Document Processors article for more details.

Platform Document Processors

| AP Name | HTT type | Description |
|---|---|---|
| CL Document Processor | classification | Handles image/PDF OCR in hOCR format and classification |
| IE Document Processor | ie | Handles image/PDF OCR in hOCR format and IE tagging |
| HTML CL Document Processor | html-classification | Handles HTML format classification |
| HTML IE Document Processor | html-ie | Handles HTML and TXT input formats; converts TXT to HTML for IE tagging |

Nexus Link: https://<CS URL>/nexus/repository/rpaplatform/eu/ibagroup/easy-rpa-aps/<version>/easy-rpa-aps-<version>-bin.zip

Please note that these Document Processors are not loaded into Control Server by default. Please download the required document processor from your Nexus using the link above and load the *.zip file into Control Server. For more information on how to upload a .zip file to Control Server, please visit Upload Automation Process Package.

ML Container

The ML Container is a platform service that is responsible for two main tasks:

  • Model Training: The Control Server sends a train request to the ML Container. The request typically contains a training set, model configuration files, training parameters, and so on. The ML Container selects the training algorithm suitable for the specified model type and trains the model with the specified parameters. The result of the training process is a package that is uploaded into the Model Repository.
  • Model Execution: The Control Server sends a process request to the ML Container. The request typically specifies a model to be executed and an input JSON. The ML Container retrieves the specified model from the Model Repository, if required, and executes it. The result of the execution is placed in the output JSON.

The ML Container is a separate Docker container that communicates with the Control Server via a message queue.
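
The two request flows can be sketched with in-memory stand-ins. This is only an illustration under stated assumptions: the message fields (type, model_type, model, training_set, parameters, input_json), the toy training algorithm, and the use of Python's queue.Queue in place of a real message broker are all made up for the example and do not describe the platform's actual protocol.

```python
import queue

# In-memory stand-ins; the real platform uses a message queue and a Nexus repository.
requests = queue.Queue()
model_repository = {}


def train_majority_label(training_set, parameters):
    """Toy 'training algorithm': remember the most frequent label."""
    labels = [example["label"] for example in training_set]
    return {"predict": max(set(labels), key=labels.count)}


# Hypothetical mapping from model type to training algorithm.
TRAINERS = {"classification": train_majority_label}


def handle(message):
    """Dispatch one request along the lines of the two tasks described above."""
    if message["type"] == "train":
        trainer = TRAINERS[message["model_type"]]  # select a suitable algorithm
        package = trainer(message["training_set"], message.get("parameters", {}))
        model_repository[message["model"]] = package  # upload the resulting package
        return {"status": "trained", "model": message["model"]}
    if message["type"] == "process":
        package = model_repository[message["model"]]  # retrieve the model
        return {"status": "processed", "output_json": {"label": package["predict"]}}
    raise ValueError(f"unknown request type: {message['type']}")


requests.put({
    "type": "train",
    "model_type": "classification",
    "model": ("demo_model", "0.0.1"),  # made-up name and version
    "training_set": [{"label": "invoice"}, {"label": "invoice"}, {"label": "receipt"}],
})
requests.put({
    "type": "process",
    "model": ("demo_model", "0.0.1"),
    "input_json": {"text": "Total due: 100"},
})

responses = []
while not requests.empty():
    responses.append(handle(requests.get()))
```

Keeping training and execution behind one queue means the sender only needs to know the message shape, not where or how the model runs.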

Model Repository

The Model Repository is the platform's Nexus pypi-group Python repository that stores all the ML artifacts. It is a group repository that covers the following repositories:

  • pypi-proxy - proxies the global https://pypi.org/
  • pypi-rpaplatform - contains platform ML artifacts that cannot be overridden by users
  • rpa-model - contains all user ML artifacts; the CS uses it to upload new ML artifacts
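
One way to picture a group repository is as an ordered lookup across its members. The sketch below assumes that members are consulted in the order listed above and that the first match wins; the actual member order and the mechanism that protects platform artifacts are not stated here, so treat this as an assumption. All artifact entries, including the clashing user upload, are made up for illustration.

```python
# Hypothetical contents; keys are (name, version) pairs.
MEMBERS = [
    ("pypi-proxy", {("requests", "2.31.0"): "proxied from pypi.org"}),
    ("pypi-rpaplatform", {("ml_cl_spacy3_model", "3.2.0"): "platform build"}),
    ("rpa-model", {
        ("ml_cl_spacy3_model", "3.2.0"): "user upload with a clashing name",
        ("my_trained_model", "0.0.1"): "user build",
    }),
]


def resolve(name, version):
    """Return (member repository, artifact) for the first member that has it."""
    for repo_name, artifacts in MEMBERS:
        if (name, version) in artifacts:
            return repo_name, artifacts[(name, version)]
    raise LookupError(f"{name}=={version} not found in any member repository")


platform_hit = resolve("ml_cl_spacy3_model", "3.2.0")
user_hit = resolve("my_trained_model", "0.0.1")
```

Under this assumed ordering, a user artifact with the same name and version as a platform artifact is never returned, because the platform repository is checked first.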

ML Models

The platform defines the following concepts:

  • trainer (model trainer) - a Python package that is executed in the platform's ML Container. It implements the platform's Model Training and Model Execution flows. Its name/version is specified in model training.
  • model (trained model) - a Python package that is created by a trainer during the training process. It refers to its trainer internally, and together they provide data during data processing. Its name/version is specified in data extraction.

Each trainer/model is uniquely identified by its name and version. The platform's ML artifacts are placed into the pypi-rpaplatform model repository. The CS also provides a way to export existing models as a file and import pre-trained models into the system.
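
The identification rule and the model-to-trainer reference can be sketched with two small value types. ArtifactId and TrainedModel are hypothetical names introduced for this example, and invoice_ie_model is a made-up model name; only the trainer name and version come from the platform tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactId:
    """Name and version together identify a trainer or a model."""
    name: str
    version: str


@dataclass(frozen=True)
class TrainedModel:
    """A trained model keeps a reference to the trainer that produced it."""
    id: ArtifactId
    trainer: ArtifactId


trainer = ArtifactId("ml_ie_spacy3_model", "3.2.0")
model = TrainedModel(ArtifactId("invoice_ie_model", "0.0.1"), trainer)  # made-up model

# Frozen dataclasses are hashable, so an id can serve as a dictionary key.
registry = {model.id: model}
same_id = ArtifactId("invoice_ie_model", "0.0.1")
```

Because equality is based on both fields, two versions of the same model name are treated as distinct artifacts.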

Every model works with a specific input/output data format. Because the main purpose of calling a model is to perform work that a human would otherwise do, the platform provides models for the Human Task Types listed above. Models and Human Task Types are linked by HTT type.

Models are also classified by the following types:

Platform Classification Models

| Trainer | Version | HTT type | Notes |
|---|---|---|---|
| ml_cl_spacy3_model | 3.2.0 | classification | Dynamic language loading support. Added lemmatization using pre-trained spaCy models. spaCy 3.1.1. |
| ml_clhtml_spacy3_model | 3.2.0 | html-classification | Dynamic language loading support. Added lemmatization using pre-trained spaCy models. spaCy 3.1.1. |

Platform Information Extraction Models

| Trainer | Version | HTT type | Notes |
|---|---|---|---|
| ml_ie_spacy2_model | 3.2.0 | ie | Uses spaCy 2.3.4 for IE. Supports dynamic language loading. |
| ml_ie_spacy3_model | 3.2.0 | ie | Uses spaCy 3.1.1 for IE. Supports dynamic language loading. |
| ml_ie_openai_model | 3.2.0 | ie | OpenAI information extraction using an HTML representation of the hOCR document. Can be used as a trained model; in this case it receives the OpenAI prompt in the MlTask configuration. |
| ml_iehtml_spacy2_model | 3.2.0 | html-ie | Uses spaCy 2.3.4 for IE. Supports dynamic language loading. |
| ml_iehtml_spacy3_model | 3.2.0 | html-ie | Uses spaCy 3.1.1 for IE. Supports dynamic language loading. |
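
Since models and Human Task Types are linked by HTT type, the classification and information extraction tables above can be read as a lookup from HTT type to available trainers. A minimal sketch follows; the trainer names, versions, and HTT types are copied from those tables, while the PLATFORM_TRAINERS structure and the trainers_for helper are hypothetical and not part of any platform API.

```python
# (trainer name, version) -> HTT type, copied from the two tables above.
PLATFORM_TRAINERS = {
    ("ml_cl_spacy3_model", "3.2.0"): "classification",
    ("ml_clhtml_spacy3_model", "3.2.0"): "html-classification",
    ("ml_ie_spacy2_model", "3.2.0"): "ie",
    ("ml_ie_spacy3_model", "3.2.0"): "ie",
    ("ml_ie_openai_model", "3.2.0"): "ie",
    ("ml_iehtml_spacy2_model", "3.2.0"): "html-ie",
    ("ml_iehtml_spacy3_model", "3.2.0"): "html-ie",
}


def trainers_for(htt_type):
    """List trainer names whose models serve the given HTT type, sorted by name."""
    return sorted(
        name for (name, _version), htt in PLATFORM_TRAINERS.items() if htt == htt_type
    )


ie_trainers = trainers_for("ie")
form_trainers = trainers_for("form")  # no trainer in these tables targets forms
```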

Platform Custom Models

All the custom models are provided already trained (they do not support the training process), so their name/version should be used for the extraction process.

| Trainer | Version | HTT type | Description |
|---|---|---|---|
| ml_ie_finext_model | 3.2.0 | ie | Financial Information Model - FINIX. Trainer and trained model. |
| ml_signature_detection_yolo5_model | 3.2.0 | ie | Signature detection model. Trainer and trained model. |

Nexus Link: https://<CS URL>/nexus/repository/rpaplatform/eu/ibagroup/easy-rpa-aps/<version>/easy-rpa-aps-<version>-bin.zip

Please note that ML Models are not loaded into Control Server by default. Please download the required model from your Nexus using the link above and load the *.zip file into Control Server. For more information on how to upload a .zip file to Control Server, please visit Upload Automation Process Package.