Classification Models
Overview
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically.
EasyRPA includes implementations of both single-label classification and multi-label classification.
To see the difference between the two methods, consider a simple example with 3 categories of documents. Whichever type of classification is used, the model's response for a single document is always a vector of 3 values:
- Single-label classification - each category is assigned a value from 0 to 1; the sum of all probabilities equals 1
- Multi-label classification - each category is assigned a value from 0 to 1 independently of the others; therefore, the sum of all probabilities is in the 0..3 range
For example, suppose we classify articles into three categories: sports, business, press. An article about a sports magazine might receive the following scores:
- Single-label classification - [sports = 0.3, business = 0.0, press = 0.7]
- Multi-label classification - [sports = 0.8, business = 0.0, press = 0.9]
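The difference between the two score vectors can also be shown numerically. The sketch below assumes the model produces one raw output (logit) per category, and uses softmax for single-label scoring and a per-category sigmoid for multi-label scoring. These are common choices for the two modes, not a statement of how EasyRPA computes its scores, and the logit values are made up for illustration.

```python
import math

CATEGORIES = ["sports", "business", "press"]

def softmax(logits):
    """Single-label style: scores compete with each other and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Multi-label style: each score is computed independently of the others."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Hypothetical raw outputs for one document (illustrative values only).
logits = [1.4, -6.0, 2.2]

single = dict(zip(CATEGORIES, softmax(logits)))
multi = dict(zip(CATEGORIES, sigmoid(logits)))

print("single-label:", {k: round(v, 2) for k, v in single.items()})
print("multi-label: ", {k: round(v, 2) for k, v in multi.items()})
print("sums:", round(sum(single.values()), 2), round(sum(multi.values()), 2))
```

With the same raw outputs, the single-label scores always add up to 1, while the multi-label scores can add up to anything between 0 and the number of categories.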
Document Classification as a Pipeline
Document classification involves two processes: model training and classification. Let's take a closer look at the stages that make up each of them.
Model Training Process
Model training
This step of EasyRPA involves training the ML model using the provided training set. The system runs training for a specified number of iterations and selects the best model.
The process developer can specify the model type (single-label or multi-label), the number of iterations, and other settings in a configuration JSON file.
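The "train for a number of iterations and keep the best model" idea can be sketched as a simple loop. This is a minimal illustration, not the EasyRPA trainer: the functions `train_one_iteration` and `evaluate` are hypothetical callables, and the sketch assumes that "best" means the highest evaluation score, which the description above does not specify.

```python
import random

def train_and_select_best(train_one_iteration, evaluate, iterations=30):
    """Run `iterations` training passes and keep the best-scoring snapshot.

    Both callables are hypothetical stand-ins for a real trainer:
    train_one_iteration(i) returns a model snapshot, evaluate(model)
    returns a score where higher is assumed to be better.
    """
    best_score = float("-inf")
    best_model = None
    for i in range(iterations):
        model = train_one_iteration(i)
        score = evaluate(model)
        if score > best_score:
            best_score, best_model = score, model
    return best_model, best_score

# Toy stand-ins so the sketch runs without any ML library.
rng = random.Random(0)

def toy_train(i):
    return {"iteration": i, "quality": rng.random()}

def toy_evaluate(model):
    return model["quality"]

best_model, best_score = train_and_select_best(toy_train, toy_evaluate, iterations=30)
print("best iteration:", best_model["iteration"], "score:", round(best_score, 3))
```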
Package creation
The spaCy model is packaged together with its configuration files and uploaded to the Nexus repository.
Classification Process
Model execution
The classification model is executed for a single document.
The result of model execution is a JSON with two sections:
- the scores section contains key-value pairs with category names and their probabilities
- the categories section contains a single category in the case of single-label classification, or, in the case of multi-label classification, a list of the categories whose probability is higher than the "scoreTreshold" value set in the Document type Settings.
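The shape of such a result can be sketched as follows. This is an illustration under stated assumptions, not EasyRPA's implementation: `build_result` is a hypothetical helper, the single category is assumed to be the highest-scoring one, it is represented here as a one-element list (the actual JSON may use a different shape), and the threshold of 0.5 is an arbitrary example value.

```python
import json

def build_result(scores, single_label, score_threshold=0.5):
    """Build a result with `scores` and `categories` sections (hypothetical helper)."""
    if single_label:
        # Assumption: the single category is the one with the highest score.
        categories = [max(scores, key=scores.get)]
    else:
        # Keep every category whose probability is higher than the threshold.
        categories = [name for name, p in scores.items() if p > score_threshold]
    return {"scores": scores, "categories": categories}

single_result = build_result(
    {"sports": 0.3, "business": 0.0, "press": 0.7}, single_label=True
)
multi_result = build_result(
    {"sports": 0.8, "business": 0.0, "press": 0.9}, single_label=False, score_threshold=0.5
)

print(json.dumps(single_result))
print(json.dumps(multi_result))
```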
Model Training Configuration File
To train a classification model, you need to provide a JSON file that defines the configuration parameters for the training process.
This file is required. The developer can specify the following settings in it:
- ocr_fixes(list of objects)(optional) - values that should be replaced with other values are defined here. For example, the value "G4LD" can be replaced with the value "64LD".
- trainer_name(string)(required) - a Python artifact that produces model packages for processing with a specific model type. It contains two modules: a module for training on tagged data and generating a trained model package, and a module for downloading the trained model from Nexus or from the cache and running it on the input data. Please refer to Out of the box CL models and Out of the box CLHTML models for more details.
- trainer_version(string)(required) - a trainer version. Please refer to Out of the box CL models and Out of the box CLHTML models for more details.
- trainer_description(string)(required) - a trainer description.
- single_label(boolean)(optional) - the type of classification model: 'true' for single-label classification, 'false' for multi-label classification. The default value is 'true'.
- lang(string)(optional) - the language of input data. The default value is 'en'.
- lemmatization(boolean)(optional) - a technique used to reduce words to a normalized form. In lemmatization, the transformation uses components of pre-trained spaCy models (part-of-speech, context, normalization table) to map different variants of a word back to its root form. With this approach, we are able to reduce non-trivial inflections such as "is", "was", "were" back to the root "be". Lemmatization can potentially increase the quality of the classification model. The default value is 'false'.
- iterations(number)(optional) - number of iterations of model training on a given training set. The default value is '30'.
- categories(list of strings)(optional) - the categories to which a document can belong.
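The settings above can be put together as in the following sketch, which parses a configuration and fills in the documented defaults. The helper `resolve_config` is hypothetical and not part of EasyRPA, and the trainer name and version are placeholders to be taken from the out of the box model pages. The ocr_fixes setting is left out because the field names of its objects are not described in this section.

```python
import json

# Required settings and documented defaults, as listed above.
REQUIRED = ("trainer_name", "trainer_version", "trainer_description")
DEFAULTS = {
    "single_label": True,
    "lang": "en",
    "lemmatization": False,
    "iterations": 30,
}

def resolve_config(raw):
    """Check required settings and apply defaults (hypothetical helper)."""
    missing = [key for key in REQUIRED if key not in raw]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    return {**DEFAULTS, **raw}

# Example configuration; trainer values are placeholders, not real artifact names.
raw_config = json.loads("""
{
  "trainer_name": "<trainer name>",
  "trainer_version": "<trainer version>",
  "trainer_description": "Article classifier",
  "single_label": false,
  "iterations": 50,
  "categories": ["sports", "business", "press"]
}
""")

config = resolve_config(raw_config)
print(json.dumps(config, indent=2))
```

Settings that are present in the file override the defaults, so this example trains a multi-label model for 50 iterations while keeping the default language and leaving lemmatization off.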