Concepts and Entities
Machine learning (ML) is a method of data analysis that automates analytical model building. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
The EasyRPA platform provides developers with the infrastructure to train ML models and use them for data processing.
Let's look at the EasyRPA artifacts related to machine learning.
Document Sets
Work on ML models starts with data - preferably, lots of data (documents) for which the target answer is known. Documents with an assigned target answer are called labeled or tagged data. For example, for the email classification problem, the target is a label that indicates whether an email is spam or not spam. The ML algorithm learns from the labeled examples that we provide.
Often, data is not readily available in a labeled form. Collecting and labeling data is often the most important step in solving an ML problem. The example data should be representative of the data that you will have when you are using the model to make a prediction. For example, if you want to predict whether an email is spam or not, you must collect both positive examples (spam emails) and negative examples (non-spam emails) so that the machine learning algorithm can find patterns that distinguish between the two types of email.
Once you have the labeled data, you might need to convert it to a format that is acceptable to your algorithm or software - a file called a training set. The next step is to actually train a model. This process invokes the ML algorithm with the training data and all the required parameters and configurations. As a result, an ML model is created and a training log is stored in the system.
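To make the labeling and conversion steps concrete, here is a minimal sketch that checks a set of labeled emails for coverage of both classes and serializes it into a training set. The record fields (`text`, `label`), the label names, and the JSON Lines layout are assumptions chosen for illustration; they are not the training set format used by EasyRPA.

```python
import json
from collections import Counter

# Hypothetical labeled examples; field and label names are illustrative only.
labeled_emails = [
    {"text": "Win a free prize now", "label": "spam"},
    {"text": "Meeting moved to 3 pm", "label": "not_spam"},
    {"text": "Cheap loans, click here", "label": "spam"},
    {"text": "Quarterly report attached", "label": "not_spam"},
]


def build_training_set(examples, required_labels):
    """Serialize labeled examples to JSON Lines after checking label coverage."""
    counts = Counter(example["label"] for example in examples)
    # Every required label needs at least one example, otherwise the
    # algorithm has nothing to learn that class from.
    missing = [label for label in required_labels if counts[label] == 0]
    if missing:
        raise ValueError(f"no examples for labels: {missing}")
    return "\n".join(json.dumps(example, sort_keys=True) for example in examples)


training_set = build_training_set(labeled_emails, ["spam", "not_spam"])
print(training_set.splitlines()[0])
```

Rejecting a data set that lacks one of the classes early is cheaper than discovering the gap after a training run.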
Human Task Type
A Human Task Type (HTT) is a special JavaScript application (provided as a zip package) that controls the life cycle of document processing by a human. It handles all UI activities on a document in Workspace to prepare outputJson from inputJson.
Every HTT defines its own input/output JSON format and has a unique HTT type id.
Platform Human Task Types
HTT Name | HTT type | inputJson type | outputJson type | Description |
---|---|---|---|---|
Classification Task | classification | hOCR input json | cl output json | Handles OCRed document classification scenario |
Information Extraction Task | ie | hOCR input json | IE hOCR output json | Handles OCRed document information extraction scenario |
HTML Classification Task | html-classification | HTML input json | cl output json | Handles HTML document classification scenario |
HTML Information Extraction Task | html-ie | HTML input json | IE HTML output json | Handles HTML document information extraction scenario |
Form Task | form | forms input json | forms output json | Handles forms scenario |
Please note that these Human Task Types are not loaded into Control Server by default. They are contained in the automation process packages mentioned in the table above ("Automation Process Nexus Link" column). To upload only a Human Task Type, upload the corresponding automation process package and, in the 'Please, resolve conflicts' window, skip all the entities except the Human Task Type. For more information, please visit Upload Automation Process Package.
Document Processor
A Document Processor is a special automation process that controls the life cycle of document sets and handles all the data transformations required to collect data, apply OCR, label documents in Workspace, prepare training data sets, and train models.
See more details in the Document Processors article.
Platform Document Processors
AP Name | HTT type | Description |
---|---|---|
CL Document Processor | classification | Handles image/pdf OCR in hOCR format and classification |
IE Document Processor | ie | Handles image/pdf OCR in hOCR format and IE tagging |
HTML CL Document Processor | html-classification | Handles html format classification. |
HTML IE Document Processor | html-ie | Handles html and txt input format, converts txt to html for IE tagging. |
Please note that these Document Processors are not loaded into Control Server by default. Download the required Document Processor from your Nexus using the links in the table above and load the *.zip file into Control Server. For more information on how to upload a .zip file to Control Server, please visit Upload Automation Process Package.
ML Container
ML Container is a service of the platform that is responsible for two main tasks:
- Model Training: The Control Server sends a train request to the ML Container that typically contains a training set, model configuration files, training parameters, and so on. The ML Container selects the training algorithm suitable for the specified model type and trains the model with the specified parameters. The result of the training process is a package that is uploaded into the Model Repository.
- Model Execution: The Control Server sends a process request to the ML Container that typically specifies the model to be executed and the input JSON. The ML Container retrieves the specified model from the Model Repository, if required, and executes it. The result of the execution is placed in the output JSON.
The ML Container is a separate Docker container that communicates with the Control Server via a message queue.
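The two request kinds described above can be sketched as plain data objects passed over a queue. Everything here is an assumption made for illustration: the class and field names, the JSON envelope, the in-process `queue.Queue` standing in for the real message queue, and the dictionary standing in for the Model Repository. It is not the platform's actual message format.

```python
import json
import queue
from dataclasses import asdict, dataclass, field


@dataclass
class TrainRequest:
    # Hypothetical fields: what to train and with which data and parameters.
    model_name: str
    model_version: str
    training_set: list
    parameters: dict = field(default_factory=dict)


@dataclass
class ProcessRequest:
    # Hypothetical fields: which model to execute and the input JSON.
    model_name: str
    model_version: str
    input_json: dict


def encode(request):
    """Wrap a request in an envelope naming its kind, then serialize to JSON."""
    kind = "train" if isinstance(request, TrainRequest) else "process"
    return json.dumps({"kind": kind, "body": asdict(request)})


def handle(message, repository):
    """Toy dispatcher: a train request stores a model, a process request runs one."""
    envelope = json.loads(message)
    body = envelope["body"]
    key = (body["model_name"], body["model_version"])
    if envelope["kind"] == "train":
        # Stand-in for real training: record how many examples were seen.
        repository[key] = {"examples_seen": len(body["training_set"])}
        return {"stored": list(key)}
    if key not in repository:
        raise KeyError(f"model not found: {key}")
    # Stand-in for real execution: echo the input back as output JSON.
    return {"model": list(key), "output_json": {"echo": body["input_json"]}}


bus = queue.Queue()   # stand-in for the message queue
repository = {}       # stand-in for the Model Repository
bus.put(encode(TrainRequest("demo_model", "0.0.1", [{"text": "a", "label": "x"}])))
bus.put(encode(ProcessRequest("demo_model", "0.0.1", {"text": "hello"})))
results = [handle(bus.get(), repository) for _ in range(2)]
print(results)
```

Keeping both request kinds on one queue with a `kind` discriminator is only one possible design; separate queues per request kind would work equally well in this sketch.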
Model Repository
The Model Repository is the platform's Nexus pypi-group Python repository that stores all the ML artifacts. It is a group repository that covers the following repositories:
- pypi-proxy - proxies the global https://pypi.org/
- pypi-rpaplatform - contains platform ML artifacts that cannot be overridden by users
- rpa-model - contains all user ML artifacts; the Control Server (CS) uses it to upload new ML artifacts
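The idea of a group repository that reads from several members but accepts uploads into only one can be modeled in a few lines. This is a conceptual sketch, not a description of how Nexus resolves packages: the `GroupRepository` class is hypothetical, and the member search order (platform repository first, so that a user upload cannot shadow a platform artifact) is an assumption made to illustrate the "cannot be overridden" rule.

```python
class GroupRepository:
    """Toy model of a group repository with ordered members (hypothetical)."""

    def __init__(self, member_names, upload_target):
        if upload_target not in member_names:
            raise ValueError(f"unknown upload target: {upload_target}")
        # Dict insertion order is the search order used by resolve().
        self.members = {name: {} for name in member_names}
        self.upload_target = upload_target

    def upload(self, name, version, artifact):
        """New artifacts always go into the single upload target."""
        self.members[self.upload_target][(name, version)] = artifact

    def resolve(self, name, version):
        """Return (member name, artifact) from the first member that has it."""
        for repo_name, artifacts in self.members.items():
            if (name, version) in artifacts:
                return repo_name, artifacts[(name, version)]
        raise LookupError(f"{name} {version} not found in any member")


# Assumed search order: platform artifacts first, then user artifacts, then the proxy.
group = GroupRepository(
    ["pypi-rpaplatform", "rpa-model", "pypi-proxy"], upload_target="rpa-model"
)
group.members["pypi-rpaplatform"][("ml_ie_spacy3_model", "3.2.0")] = "platform build"

# A user upload with the same name and version lands in rpa-model
# and does not replace the platform artifact.
group.upload("ml_ie_spacy3_model", "3.2.0", "user build")
group.upload("my_invoice_model", "1.0.0", "user model")  # hypothetical model name

print(group.resolve("ml_ie_spacy3_model", "3.2.0"))
print(group.resolve("my_invoice_model", "1.0.0"))
```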
ML Models
The platform defines the following concepts:
- trainer (model trainer) - a Python package that is executed in the platform's ML Container. It implements the platform's Model Training and Model Execution flows. Its name/version is specified in model training.
- model (trained model) - a Python package that is created by a trainer during the training process. It references its trainer internally, and together they process data during data processing. Its name/version is specified in data extraction.
Each trainer/model is uniquely identified by its name and version. The platform's ML artifacts are placed into the pypi-rpaplatform model repository. The CS also provides a way to export existing models as a file and import pre-trained models into the system.
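Identification by name and version means that the pair, not the name alone, acts as the key. A minimal sketch, assuming a hypothetical `ArtifactId` type and an in-memory registry (neither is part of the platform API; `invoice_model` is a made-up model name):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactId:
    """Name plus version; frozen so instances are hashable and usable as dict keys."""
    name: str
    version: str


def register(registry, artifact_id, package):
    """Add a package, refusing to reuse an existing name/version pair."""
    if artifact_id in registry:
        raise ValueError(
            f"already registered: {artifact_id.name} {artifact_id.version}"
        )
    registry[artifact_id] = package


registry = {}
register(registry, ArtifactId("ml_ie_spacy3_model", "3.2.0"), "trainer package")
# The same model name may appear with different versions.
register(registry, ArtifactId("invoice_model", "1.0.0"), "trained model package")
register(registry, ArtifactId("invoice_model", "1.1.0"), "retrained model package")
print(len(registry))
```

Retraining under a new version leaves the earlier model addressable, so a data extraction step that pins `1.0.0` keeps working in this sketch.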
Every model works with a specific input/output data format. Because the main purpose of calling a model is to perform work that a human would otherwise do, the platform provides models for the Human Task Types listed above. Models and Human Task Types are linked by HTT type.
Models are also classified by the following types:
- Classification Models
- Information Extraction Models (IE)
- Custom models
Platform Classification Models
Trainer | Version | HTT type | Notes |
---|---|---|---|
ml_cl_spacy3_model | 3.2.0 | classification | Dynamic language loading support. Added lemmatization using pre-trained Spacy models. Uses Spacy 3.1.1. |
ml_clhtml_spacy3_model | 3.2.0 | html-classification | Dynamic language loading support. Added lemmatization using pre-trained Spacy models. Uses Spacy 3.1.1. |
Platform Information Extraction Models
Trainer | Version | HTT type | Notes |
---|---|---|---|
ml_ie_spacy2_model | 3.2.0 | ie | Uses Spacy 2.3.4 for IE. Supports dynamic language loading. |
ml_ie_spacy3_model | 3.2.0 | ie | Uses Spacy 3.1.1 for IE. Supports dynamic language loading. |
ml_ie_openai_model | 3.2.0 | ie | OpenAI information extraction using an HTML representation of the hOCR document. Can be used as a trained model; in this case it receives the OpenAI prompt in the MlTask configuration. |
ml_iehtml_spacy2_model | 3.2.0 | html-ie | Uses Spacy 2.3.4 for IE. Supports dynamic language loading. |
ml_iehtml_spacy3_model | 3.2.0 | html-ie | Uses Spacy 3.1.1 for IE. Supports dynamic language loading. |
Platform Custom Models
All custom models are provided already trained (they do not support the training process), so their name/version should be used directly for the extraction process.
Trainer | Version | HTT type | Description |
---|---|---|---|
ml_ie_finext_model | 3.2.0 | ie | Financial Information Model - FINIX, Trainer and trained model |
ml_signature_detection_yolo5_model | 3.2.0 | ie | Signature detection Model, Trainer and trained model |
Please note that ML Models are not loaded into Control Server by default. Download the required model from your Nexus using the links in the tables above and load the *.zip file into Control Server. For more information on how to upload a .zip file to Control Server, please visit Upload Automation Process Package.
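Because models and Human Task Types are linked by HTT type, the model tables on this page can be read as a lookup from HTT type to the trainers that serve it. The sketch below copies the names, versions, and HTT types from those tables; the tuple layout and the `trainers_by_htt_type` helper are illustrative assumptions, not a platform API.

```python
from collections import defaultdict

# (trainer name, version, HTT type, category) rows copied from the tables on this page.
PLATFORM_TRAINERS = [
    ("ml_cl_spacy3_model", "3.2.0", "classification", "classification"),
    ("ml_clhtml_spacy3_model", "3.2.0", "html-classification", "classification"),
    ("ml_ie_spacy2_model", "3.2.0", "ie", "information extraction"),
    ("ml_ie_spacy3_model", "3.2.0", "ie", "information extraction"),
    ("ml_ie_openai_model", "3.2.0", "ie", "information extraction"),
    ("ml_iehtml_spacy2_model", "3.2.0", "html-ie", "information extraction"),
    ("ml_iehtml_spacy3_model", "3.2.0", "html-ie", "information extraction"),
    ("ml_ie_finext_model", "3.2.0", "ie", "custom"),
    ("ml_signature_detection_yolo5_model", "3.2.0", "ie", "custom"),
]


def trainers_by_htt_type(rows):
    """Group (name, version) pairs by the HTT type they are linked to."""
    index = defaultdict(list)
    for name, version, htt_type, _category in rows:
        index[htt_type].append((name, version))
    return dict(index)


index = trainers_by_htt_type(PLATFORM_TRAINERS)
for htt_type, trainers in sorted(index.items()):
    print(htt_type, [name for name, _version in trainers])
```

Note that the `form` HTT type has no entry: none of the model tables lists a trainer for it.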