Classification Models
Overview
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically.
EasyRPA includes implementations of both Single-label Classification and Multi-label Classification.
Let's look at the difference between these two methods with a very simple example, assuming we have 3 categories of documents. No matter which type of classification is used, the model's response for a single document is always a vector of 3 values:
- Single-label classification - each category is assigned a value from 0 to 1; the sum of all probabilities equals 1
- Multi-label classification - each category is assigned a value from 0 to 1 independently of the others; therefore, the sum of all probabilities is in the 0..3 range
For example, suppose we classify articles into three categories: sports, business, and press. An article about a sports magazine might have the following scores:
- Single-label classification - [sports = 0.3, business = 0.0, press = 0.7]
- Multi-label classification - [sports = 0.8, business = 0.0, press = 0.9]
Currently the platform has the same sets of models as Information Extraction Models, so refer to the articles below for the input JSON structure:
- hOCR source base
- HTML source base
Spacy CL Models
The platform uses spaCy NLP internally for data processing in the following models:
- ml_cl_spacy3_model
- ml_clhtml_spacy3_model
Document Classification as a Pipeline
Let's take a closer look at both processes, model training and execution, and investigate the stages that make up each of them.
Model Training Process
Model training
This step of EasyRPA involves training the ML model using the provided training set. The system runs training for a specified number of iterations and selects the best model.
The process developer can specify the model type (single- vs multi-label), the number of iterations, etc. using a JSON configuration file.
Package creation
The spaCy model is packaged together with its configuration files and uploaded to the Nexus repository.
Classification Process
Model execution
The classification model is executed for a single document.
The result of model execution is a JSON with two sections:
- the scores section contains key-value pairs with category names and their probabilities
- the categories section contains a single category in the case of single-label classification, or, for multi-label classification, the list of categories whose probability is higher than the "scoreTreshold" value set in the Document Type Settings.
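For the sports/business/press example above, a multi-label result might look like the following sketch (the section names come from the description above; the exact field layout is an assumption):
```json
{
  "scores": {
    "sports": 0.8,
    "business": 0.0,
    "press": 0.9
  },
  "categories": ["sports", "press"]
}
```
Here both "sports" and "press" exceed the configured "scoreTreshold", so both appear in the categories section; with single-label classification only one category would be returned.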
Model Training Configuration File
To train a Classification model you need to provide a JSON that defines configuration parameters for the training process.
Let's take a closer look at these configuration settings.
The developer can specify model settings in the configuration file. This file is required and has the following settings:
- ocr_fixes(list of objects)(optional) - defines values that should be replaced with other values. In the example below, the value "G4LD" will be replaced with the value "64LD".
- trainer_name(string)(required) - a Python artifact that produces model packages for processing with a specific model type. It contains two modules: a module for training on tagged data and generating a trained model package, and a module for downloading the trained model from Nexus or from the cache and running it on the input data. Please refer to Out of the box CL models and Out of the box CLHTML models for more details.
- trainer_version(string)(required) - the trainer version. Please refer to Out of the box CL models and Out of the box CLHTML models for more details.
- trainer_description(string)(required) - a trainer description.
- lang(string)(optional) - the language of input data. The default value is 'en'.
- lemmatization(boolean)(optional) - a technique used to reduce words to a normalized form. In lemmatization, the transformation uses pre-trained spaCy model components (part-of-speech, context, normalization table) to map different variants of a word back to its root form. With this approach we are able to reduce non-trivial inflections such as "is", "was", "were" back to the root "be". Lemmatization can potentially increase the quality of the classification model. The default value is 'false'.
- iterations(number)(optional) - number of iterations of model training on a given training set. The default value is '30'.
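Below is an illustrative configuration file assembled from the settings above. The values, and the key names inside the ocr_fixes objects, are examples rather than platform defaults:
```json
{
  "ocr_fixes": [
    { "value": "G4LD", "replacement": "64LD" }
  ],
  "trainer_name": "ml_cl_spacy3_model",
  "trainer_version": "1.0.0",
  "trainer_description": "Single-label classifier for incoming documents",
  "lang": "en",
  "lemmatization": true,
  "iterations": 30
}
```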
Model Training Data File
To train spaCy Classification models, the system provides train_data.json.
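A minimal train_data.json might look like the sketch below; the per-document training records are omitted here, and their structure depends on the source type (hOCR or HTML):
```json
{
  "categories": ["sports", "business", "press"],
  "single_label": true
}
```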
Where:
- categories(list of strings)(optional) - the categories to which a document can belong.
- single_label(boolean)(optional) - the type of classification model (multi-label or single-label classification). The default value is 'true'.
OpenAI CL Models
The OpenAI models use the OpenAI API to call an LLM for request processing. The idea behind these models is to minify the input document (HTML or hOCR, depending on the renderer specified in the model configuration) and then send a request to OpenAI that classifies the document.
Currently the platform has the following OpenAI CL models:
- ml_cl_openai_model - uses hOCR source base
- ml_clhtml_openai_model - uses HTML source base
The model configuration is very similar to the IE OpenAI one, so we omit the common configuration settings and describe only the differences.
ml_cl_openai_model
Here is the model's default chat with OpenAI:
You are an expert in HTML document classification. You have the following categories (name per line): ``` Invoice Remittance Advice ``` Your answer must be a JSON object where each key is a category and each value is a double (between 0 and 1) representing the probability the document belongs to that category. Ensure that the sum of all scores across categories for a document is always equal to 1. The input HTML document to classify is: ```html {html} ```
Model Training
Here is the model's default training config:
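(The snippet below is an illustrative sketch rather than the literal default: the message roles, placeholder usage, environment alias, and model name are assumptions based on the settings described afterwards.)
```json
{
  "prompts_config": {
    "messages": [
      { "role": "system", "content": "You are an expert in HTML document classification. ..." },
      { "role": "user", "content": "The input HTML document to classify is: {html}" }
    ]
  },
  "environment": "openai_env",
  "temperature": 0.0,
  "open_ai_model": "gpt-4o-mini",
  "track_into_langfuse": false,
  "debug": false,
  "hocr2html": {
    "renderer": "default"
  }
}
```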
where:
- prompts_config - the default prompts configuration saved into the trained model
- messages - the prompt messages structure used when sending requests to the OpenAI API
- html - the simplified HTML of the document that the model creates and injects into the prompt context
- environment - a secret vault alias that stores a JSON with environment variables to set before calling the LLM API
- temperature - the request temperature; it depends on the LLM model, but can usually be gradated as follows: Coding / Math - 0.0; Data Cleaning / Data Analysis - 1.0; Creative Writing / Poetry - 1.5
- open_ai_model - the OpenAI model to use (required)
- track_into_langfuse - if true, the OpenAI conversation is tracked in Langfuse
- debug - a boolean that switches debug messages on
- hocr2html - the hOCR to HTML rendering configuration
HOCR to HTML rendering configuration
The hocr2html rendering algorithms are similar to those of the ml_ie_openai_model model, but only the following renderers exist:
- default - puts words in the order in which they appear in the hOCR
- table - puts words according to the recognized table layout
Sample
The Intelligent Document Processing (IDP) contains the document set IDP_SAMPLE_CLASSIFICATION_OPENAI, which is configured to work with the ml_cl_openai_model.
ml_clhtml_openai_model
Here is the model's default chat with OpenAI:
You are an expert in document classification. You have the following categories (name per line): ``` Application invoice Payment Details ``` Your answer must be a JSON object where each key is a category and each value is a double (between 0 and 1) representing the probability the document belongs to that category. Ensure that the sum of all scores across categories for a document is always equal to 1. Your input to classify is: ```txt {text} ```
HTML to Text rendering
To classify a document we only need its text, and since HTML-to-text conversion is a trivial operation, there is no additional configuration here.
Model Training
Here is the model's default training config:
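(The structure mirrors the ml_cl_openai_model sketch above, except that the prompt uses the {text} placeholder and there is no hocr2html section; the example values remain assumptions.)
```json
{
  "prompts_config": {
    "messages": [
      { "role": "system", "content": "You are an expert in document classification. ..." },
      { "role": "user", "content": "Your input to classify is: {text}" }
    ]
  },
  "environment": "openai_env",
  "temperature": 0.0,
  "open_ai_model": "gpt-4o-mini",
  "track_into_langfuse": false,
  "debug": false
}
```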
where:
- prompts_config - the default prompts configuration saved into the trained model
- messages - the prompt messages structure used when sending requests to the OpenAI API
- text - the document text that the model creates from the HTML and injects into the prompt context
- environment - a secret vault alias that stores a JSON with environment variables to set before calling the LLM API
- temperature - the request temperature; it depends on the LLM model, but can usually be gradated as follows: Coding / Math - 0.0; Data Cleaning / Data Analysis - 1.0; Creative Writing / Poetry - 1.5
- open_ai_model - the OpenAI model to use (required)
- track_into_langfuse - if true, the OpenAI conversation is tracked in Langfuse
- debug - a boolean that switches debug messages on
Sample
The Classification HTML Sample contains the document set CL_HTML_OPENAI SAMPLE, which is configured to work with the ml_clhtml_openai_model.