Classification Models
Overview
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically.
EasyRPA includes implementations of both Single-label Classification and Multi-label Classification.
Let's look at the difference between these two methods with a very simple example, assuming we have 3 categories of documents. No matter which type of classification is used, the model's response for a single document is always a vector of 3 values:
- Single-label classification - each category is assigned a value from 0 to 1; the sum of all probabilities equals 1
- Multi-label classification - each category is assigned a value from 0 to 1 independently of the others; therefore, the sum of all probabilities is in the 0..3 range
For example, suppose we classify articles into three categories: sports, business, and press. An article about a sports magazine might have the following scores:
- Single-label classification - [sports = 0.3, business = 0.0, press = 0.7]
- Multi-label classification - [sports = 0.8, business = 0.0, press = 0.9]
Currently the platform has the same sets of models as Information Extraction Models, so refer to the articles below for the input JSON structure:
- hOCR source base
- HTML source base
Spacy CL Models
The platform uses spaCy NLP internally for data processing in the following models:
- ml_cl_spacy3_model
- ml_clhtml_spacy3_model
Document Classification as a Pipeline
Let's take a closer look at both processes, model training and execution, and investigate the stages that make up each of them.
Model Training Process
Model training
This step of EasyRPA involves training the ML model using the provided training set. The system runs training for a specified number of iterations and selects the best model.
The process developer can specify the model type (single- vs multi-label), the number of iterations, etc. using a JSON configuration file.
Package creation
The spaCy model is packaged together with its configuration files and uploaded to the Nexus repository.
Classification Process
Model execution
The classification model is executed for a single document.
The result of model execution is a JSON with two sections:
- the scores section contains key-value pairs with category names and their probabilities
- the categories section contains a single category in the case of single-label classification, or, for multi-label classification, the list of categories whose probability is higher than the "scoreTreshold" value set in the Document Type Settings.
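For the sports/business/press example above, a multi-label result might look like the following sketch (the section names come from the description above; the exact field layout is an assumption):
```json
{
  "scores": {
    "sports": 0.8,
    "business": 0.0,
    "press": 0.9
  },
  "categories": ["sports", "press"]
}
```
Here both "sports" and "press" exceed the configured "scoreTreshold", so both appear in the categories section; with single-label classification only one category would be returned.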
Model Training Configuration File
To train a Classification model you need to provide a JSON that defines configuration parameters for the training process.
Let's take a closer look at these configuration settings.
The developer can specify model settings in the configuration file. This file is required and has the following settings:
- ocr_fixes(list of objects)(optional) - defines values that should be replaced with other values. In the example below, the value "G4LD" will be replaced with the value "64LD".
- trainer_name(string)(required) - a Python artifact that produces model packages for processing with a specific model type. It contains two modules: a module for training on tagged data and generating a trained model package, and a module for downloading the trained model from Nexus or from the cache and running it on the input data. Please refer to Out of the box CL models and Out of the box CLHTML models for more details.
- trainer_version(string)(required) - the trainer version. Please refer to Out of the box CL models and Out of the box CLHTML models for more details.
- trainer_description(string)(required) - a trainer description.
- lang(string)(optional) - the language of input data. The default value is 'en'.
- lemmatization(boolean)(optional) - a technique used to reduce words to a normalized form. In lemmatization, the transformation uses pre-trained spaCy model components (part-of-speech, context, normalization table) to map different variants of a word back to its root form. With this approach we are able to reduce non-trivial inflections such as "is", "was", "were" back to the root "be". Lemmatization can potentially increase the quality of the classification model. The default value is 'false'.
- iterations(number)(optional) - number of iterations of model training on a given training set. The default value is '30'.
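Below is an illustrative configuration file assembled from the settings above. The values, and the key names inside the ocr_fixes objects, are examples rather than platform defaults:
```json
{
  "ocr_fixes": [
    { "value": "G4LD", "replacement": "64LD" }
  ],
  "trainer_name": "ml_cl_spacy3_model",
  "trainer_version": "1.0.0",
  "trainer_description": "Single-label classifier for incoming documents",
  "lang": "en",
  "lemmatization": true,
  "iterations": 30
}
```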
Model Training Data File
To train spaCy Classification models, the system provides train_data.json.
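A minimal train_data.json might look like the sketch below; the per-document training records are omitted here, and their structure depends on the source type (hOCR or HTML):
```json
{
  "categories": ["sports", "business", "press"],
  "single_label": true
}
```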
Where:
- categories(list of strings)(optional) - the categories to which a document can belong.
- single_label(boolean)(optional) - the type of classification model (multi-label or single-label classification). The default value is 'true'.
OpenAI CL Models
The OpenAI models use the OpenAI API to call an LLM for request processing. The idea behind these models is to minify the input document (HTML or hOCR, depending on the renderer specified in the model configuration) and then send a request to OpenAI that classifies the document.
Currently the platform has the following OpenAI CL models:
- ml_cl_openai_model - uses hOCR source base
- ml_clhtml_openai_model - uses HTML source base
The model configuration is very similar to the IE OpenAI one, so we omit the common configuration settings and describe only the differences.
ml_cl_openai_model
Here is the model's default chat with OpenAI:
You are an expert in HTML document classification. You have the following categories (name per line): ``` Invoice Remittance Advice ``` Your answer must be a JSON object where each key is a category and each value is a double (between 0 and 1) representing the probability the document belongs to that category. Ensure that the sum of all scores across categories for a document is always equal to 1. The input HTML document to classify is: ```html {html} ```
Model Training
Here is the model's default training config:
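(The snippet below is an illustrative sketch rather than the literal default: the message roles, placeholder usage, environment alias, and model name are assumptions based on the settings described afterwards.)
```json
{
  "prompts_config": {
    "messages": [
      { "role": "system", "content": "You are an expert in HTML document classification. ..." },
      { "role": "user", "content": "The input HTML document to classify is: {html}" }
    ]
  },
  "environment": "openai_env",
  "temperature": 0.0,
  "open_ai_model": "gpt-4o-mini",
  "track_into_langfuse": false,
  "debug": false,
  "hocr2html": {
    "renderer": "default"
  }
}
```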
where:
- prompts_config - the default prompts configuration saved into the trained model
- messages - the prompt messages structure used when sending requests to the OpenAI API
- html - the simplified HTML of the document that the model creates and injects into the prompt context
- environment - a secret vault alias that stores a JSON with environment variables to set before calling the LLM API
- temperature - the request temperature; it depends on the LLM model, but can usually be gradated as follows: Coding / Math - 0.0; Data Cleaning / Data Analysis - 1.0; Creative Writing / Poetry - 1.5
- open_ai_model - the OpenAI model to use (required)
- track_into_langfuse - if true, the OpenAI conversation is tracked in Langfuse
- debug - a boolean that switches debug messages on
- hocr2html - the hOCR to HTML rendering configuration
HOCR to HTML rendering configuration
The hocr2html rendering algorithms are similar to those of the ml_ie_openai_model model, but only the following renderers exist:
- default - puts words in the order in which they appear in the hOCR
- table - puts words according to the recognized table layout
Sample
The Intelligent Document Processing (IDP) contains the document set IDP_SAMPLE_CLASSIFICATION_OPENAI, which is configured to work with the ml_cl_openai_model.
ml_clhtml_openai_model
Here is the model's default chat with OpenAI:
You are an expert in document classification. You have the following categories (name per line): ``` Application invoice Payment Details ``` Your answer must be a JSON object where each key is a category and each value is a double (between 0 and 1) representing the probability the document belongs to that category. Ensure that the sum of all scores across categories for a document is always equal to 1. Your input to classify is: ```txt {text} ```
HTML to Text rendering
To classify a document we only need its text, and since HTML-to-text conversion is a trivial operation, there is no additional configuration here.
Model Training
Here is the model's default training config:
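(The structure mirrors the ml_cl_openai_model sketch above, except that the prompt uses the {text} placeholder and there is no hocr2html section; the example values remain assumptions.)
```json
{
  "prompts_config": {
    "messages": [
      { "role": "system", "content": "You are an expert in document classification. ..." },
      { "role": "user", "content": "Your input to classify is: {text}" }
    ]
  },
  "environment": "openai_env",
  "temperature": 0.0,
  "open_ai_model": "gpt-4o-mini",
  "track_into_langfuse": false,
  "debug": false
}
```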
where:
- prompts_config - the default prompts configuration saved into the trained model
- messages - the prompt messages structure used when sending requests to the OpenAI API
- text - the document text that the model creates from the HTML and injects into the prompt context
- environment - a secret vault alias that stores a JSON with environment variables to set before calling the LLM API
- temperature - the request temperature; it depends on the LLM model, but can usually be gradated as follows: Coding / Math - 0.0; Data Cleaning / Data Analysis - 1.0; Creative Writing / Poetry - 1.5
- open_ai_model - the OpenAI model to use (required)
- track_into_langfuse - if true, the OpenAI conversation is tracked in Langfuse
- debug - a boolean that switches debug messages on
Sample
The Classification HTML Sample contains the document set CL_HTML_OPENAI SAMPLE, which is configured to work with the ml_clhtml_openai_model.