Information Extraction Models

Overview

Information extraction (IE) is the automated retrieval of specific information related to a selected topic from input data. Information extraction tools make it possible to pull information from text documents, databases, websites or multiple sources.

EasyRPA provides infrastructure to create and run machine learning models that extract information from PDF, images, TXT and HTML documents.

Currently, the platform provides the following sets of IE models:

  • Models based on an hOCR source
  • Models based on an HTML source

hOCR source

The input documents are PDFs and images, which are converted into hOCR by the platform's OCR. The model input is a JSON document that references the OCR output for each page:

{

	"images": [
	{
		"text_src": "https://dev.rpaplatform.org/api/v1/s3/proxy/data/document_set/670bed93-6492-4d35-8f6f-e8ae76e80b39/17697330-2f97-4e97-98e2-6021f2b65bfe.pdf.ocr-page000.txt",
		"json_src": "https://dev.rpaplatform.org/api/v1/s3/proxy/data/document_set/670bed93-6492-4d35-8f6f-e8ae76e80b39/17697330-2f97-4e97-98e2-6021f2b65bfe.pdf.ocr-page000.json",
		"hocr_src": "https://dev.rpaplatform.org/api/v1/s3/proxy/data/document_set/670bed93-6492-4d35-8f6f-e8ae76e80b39/17697330-2f97-4e97-98e2-6021f2b65bfe.pdf.ocr-page000.html",
		"content": "https://dev.rpaplatform.org/api/v1/s3/proxy/data/document_set/670bed93-6492-4d35-8f6f-e8ae76e80b39/17697330-2f97-4e97-98e2-6021f2b65bfe.pdf.ocr-page000.jpg",
		"dimensions": {
			"width": "1532",
			"height": "1982"
		}
	}
	]
}

Here images[] lists the pages of the source document. Each page entry contains:

  • text_src - the plain text produced by OCR
  • hocr_src - the OCR result in hOCR format
  • json_src - the hOCR file represented as JSON for the IE Human Task Type
  • content - the source page image
  • dimensions - the width and height of the page image

A sample hOCR file for a single page:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
	<title></title>
	<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
	<meta name="ocr-system" content="tesseract 5.3.0">
	<meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf">
 </head>
 <body>
	<div class="ocr_page" id="page_1" title="image &quot;17697330-2f97-4e97-98e2-6021f2b65bfe.pdf_000.jpg&quot;; bbox 0 0 1532 1982; ppageno 0; scan_res 180 180">
	 <div class="ocr_carea" id="block_1_1" title="bbox 1078 114 1460 148">
	<p class="ocr_par" id="par_1_1" lang="eng" title="bbox 1078 114 1460 148"><span class="ocr_line" id="line_1_1" title="bbox 1078 114 1460 148; baseline 0 -8; x_size 34; x_descenders 8; x_ascenders 8"> <span class="ocrx_word" id="word_1_1" title="bbox 1078 115 1189 140; x_wconf 95">Boston</span> <span class="ocrx_word" id="word_1_2" title="bbox 1204 114 1275 140; x_wconf 95">Park</span> <span class="ocrx_word" id="word_1_3" title="bbox 1285 114 1349 148; x_wconf 96">City</span> <span class="ocrx_word" id="word_1_4" title="bbox 1359 115 1460 148; x_wconf 95">Group</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_2" title="bbox 1207 157 1459 178">
	<p class="ocr_par" id="par_1_2" lang="eng" title="bbox 1207 157 1459 178"><span class="ocr_line" id="line_1_2" title="bbox 1207 157 1459 178; baseline 0 0; x_size 26.310345; x_descenders 5.3103447; x_ascenders 7"> <span class="ocrx_word" id="word_1_5" title="bbox 1207 158 1278 178; x_wconf 96">03093</span> <span class="ocrx_word" id="word_1_6" title="bbox 1289 158 1413 178; x_wconf 95">Tennessee</span> <span class="ocrx_word" id="word_1_7" title="bbox 1425 157 1459 178; x_wconf 95">Hill</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_3" title="bbox 1274 195 1460 219">
	<p class="ocr_par" id="par_1_3" lang="eng" title="bbox 1274 195 1460 219"><span class="ocr_line" id="line_1_3" title="bbox 1274 195 1460 219; baseline 0 -4; x_size 24; x_descenders 4; x_ascenders 6"> <span class="ocrx_word" id="word_1_8" title="bbox 1274 195 1360 219; x_wconf 95">Boston,</span> <span class="ocrx_word" id="word_1_9" title="bbox 1378 195 1460 215; x_wconf 96">38-512</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_4" title="bbox 1240 231 1461 255">
	<p class="ocr_par" id="par_1_4" lang="eng" title="bbox 1240 231 1461 255"><span class="ocr_line" id="line_1_4" title="bbox 1240 231 1461 255; baseline 0 -4; x_size 27.282051; x_descenders 6.8205128; x_ascenders 6.8205128"> <span class="ocrx_word" id="word_1_10" title="bbox 1240 231 1270 251; x_wconf 96">+7</span> <span class="ocrx_word" id="word_1_11" title="bbox 1280 231 1337 255; x_wconf 96">(752)</span> <span class="ocrx_word" id="word_1_12" title="bbox 1349 231 1461 251; x_wconf 96">199-8334</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_5" title="bbox 1237 267 1460 294">
	<p class="ocr_par" id="par_1_5" lang="eng" title="bbox 1237 267 1460 294"><span class="ocr_line" id="line_1_5" title="bbox 1237 267 1460 294; baseline 0 -6; x_size 27; x_descenders 6; x_ascenders 7"> <span class="ocrx_word" id="word_1_13" title="bbox 1237 267 1460 294; x_wconf 91">allstate@corp.com</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_6" title="bbox 931 413 1116 432">
	<p class="ocr_par" id="par_1_6" lang="eng" title="bbox 931 413 1116 432"><span class="ocr_line" id="line_1_6" title="bbox 931 413 1116 432; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_14" title="bbox 931 413 1013 432; x_wconf 95">Invoice</span> <span class="ocrx_word" id="word_1_15" title="bbox 1022 413 1116 432; x_wconf 95">Number:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_7" title="bbox 1159 414 1291 432">
	<p class="ocr_par" id="par_1_7" lang="eng" title="bbox 1159 414 1291 432"><span class="ocr_line" id="line_1_7" title="bbox 1159 414 1291 432; baseline 0 0; x_size 24.666666; x_descenders 6.1666665; x_ascenders 6.1666665"> <span class="ocrx_word" id="word_1_16" title="bbox 1159 414 1208 432; x_wconf 86">8603</span> <span class="ocrx_word" id="word_1_17" title="bbox 1215 414 1291 432; x_wconf 86">163534</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_8" title="bbox 121 412 349 465">
	<p class="ocr_par" id="par_1_8" lang="eng" title="bbox 121 412 349 465"><span class="ocr_line" id="line_1_8" title="bbox 121 412 349 465; baseline 0.004 -1; x_size 58.426666; x_descenders 5.4266663; x_ascenders 16"> <span class="ocrx_word" id="word_1_18" title="bbox 121 412 349 465; x_wconf 94">Invoice</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_9" title="bbox 972 463 1117 482">
	<p class="ocr_par" id="par_1_9" lang="eng" title="bbox 972 463 1117 482"><span class="ocr_line" id="line_1_9" title="bbox 972 463 1117 482; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_19" title="bbox 972 463 1055 482; x_wconf 96">Invoice</span> <span class="ocrx_word" id="word_1_20" title="bbox 1064 464 1117 482; x_wconf 89">Date:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_10" title="bbox 1168 464 1290 485">
	<p class="ocr_par" id="par_1_10" lang="eng" title="bbox 1168 464 1290 485"><span class="ocr_line" id="line_1_10" title="bbox 1168 464 1290 485; baseline 0 -3; x_size 24.666666; x_descenders 6.1666665; x_ascenders 6.1666665"> <span class="ocrx_word" id="word_1_21" title="bbox 1168 464 1290 485; x_wconf 96">12/02/2020</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_11" title="bbox 961 514 1117 538">
	<p class="ocr_par" id="par_1_11" lang="eng" title="bbox 961 514 1117 538"><span class="ocr_line" id="line_1_11" title="bbox 961 514 1117 538; baseline 0 -6; x_size 24; x_descenders 6; x_ascenders 5"> <span class="ocrx_word" id="word_1_22" title="bbox 961 514 1063 538; x_wconf 96">Payment</span> <span class="ocrx_word" id="word_1_23" title="bbox 1072 514 1117 532; x_wconf 87">Due:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_12" title="bbox 1168 514 1290 535">
	<p class="ocr_par" id="par_1_12" lang="eng" title="bbox 1168 514 1290 535"><span class="ocr_line" id="line_1_12" title="bbox 1168 514 1290 535; baseline 0 -3; x_size 24.666666; x_descenders 6.1666665; x_ascenders 6.1666665"> <span class="ocrx_word" id="word_1_24" title="bbox 1168 514 1290 535; x_wconf 96">12/04/2020</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_13" title="bbox 124 556 390 581">
	<p class="ocr_par" id="par_1_13" lang="eng" title="bbox 124 556 390 581"><span class="ocr_line" id="line_1_13" title="bbox 124 556 390 581; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_25" title="bbox 124 556 178 575; x_wconf 96">Duke</span> <span class="ocrx_word" id="word_1_26" title="bbox 187 556 253 581; x_wconf 96">Realty</span> <span class="ocrx_word" id="word_1_27" title="bbox 261 557 390 581; x_wconf 96">Corporation</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_14" title="bbox 123 589 257 608">
	<p class="ocr_par" id="par_1_14" lang="eng" title="bbox 123 589 257 608"><span class="ocr_line" id="line_1_14" title="bbox 123 589 257 608; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_28" title="bbox 123 590 147 608; x_wconf 96">86</span> <span class="ocrx_word" id="word_1_29" title="bbox 156 590 204 608; x_wconf 96">Vera</span> <span class="ocrx_word" id="word_1_30" title="bbox 213 589 257 608; x_wconf 96">Trail</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_15" title="bbox 122 623 256 644">
	<p class="ocr_par" id="par_1_15" lang="eng" title="bbox 122 623 256 644"><span class="ocr_line" id="line_1_15" title="bbox 122 623 256 644; baseline 0 -3; x_size 21; x_descenders 3; x_ascenders 5"> <span class="ocrx_word" id="word_1_31" title="bbox 122 623 176 644; x_wconf 95">Vera,</span> <span class="ocrx_word" id="word_1_32" title="bbox 194 623 256 641; x_wconf 94">11305</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_16" title="bbox 125 657 352 679">
	<p class="ocr_par" id="par_1_16" lang="eng" title="bbox 125 657 352 679"><span class="ocr_line" id="line_1_16" title="bbox 125 657 352 679; baseline 0 -4; x_size 24.622223; x_descenders 6.1555557; x_ascenders 6.1555557"> <span class="ocrx_word" id="word_1_33" title="bbox 125 657 178 675; x_wconf 96">+506</span> <span class="ocrx_word" id="word_1_34" title="bbox 188 657 240 679; x_wconf 96">(796)</span> <span class="ocrx_word" id="word_1_35" title="bbox 250 657 352 675; x_wconf 96">883-5554</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_17" title="bbox 124 689 360 714">
	<p class="ocr_par" id="par_1_17" lang="eng" title="bbox 124 689 360 714"><span class="ocr_line" id="line_1_17" title="bbox 124 689 360 714; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_36" title="bbox 124 689 360 714; x_wconf 72">Icossans8i@dmoz.org</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_18" title="bbox 967 713 1117 731">
	<p class="ocr_par" id="par_1_18" lang="eng" title="bbox 967 713 1117 731"><span class="ocr_line" id="line_1_18" title="bbox 967 713 1117 731; baseline 0 0; x_size 23.296295; x_descenders 5.2962961; x_ascenders 5"> <span class="ocrx_word" id="word_1_37" title="bbox 967 713 1063 731; x_wconf 96">Amount</span> <span class="ocrx_word" id="word_1_38" title="bbox 1072 713 1117 731; x_wconf 96">Due:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_19" title="bbox 1153 711 1249 734">
	<p class="ocr_par" id="par_1_19" lang="eng" title="bbox 1153 711 1249 734"><span class="ocr_line" id="line_1_19" title="bbox 1153 711 1249 734; baseline 0 -3; x_size 24.799999; x_descenders 6.1999998; x_ascenders 6.1999998"> <span class="ocrx_word" id="word_1_39" title="bbox 1153 711 1164 734; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_40" title="bbox 1174 713 1249 731; x_wconf 96">504.90</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_20" title="bbox 74 821 124 839">
	<p class="ocr_par" id="par_1_20" lang="eng" title="bbox 74 821 124 839"><span class="ocr_line" id="line_1_20" title="bbox 74 821 124 839; baseline 0 0; x_size 23.296295; x_descenders 5.2962961; x_ascenders 5"> <span class="ocrx_word" id="word_1_41" title="bbox 74 821 124 839; x_wconf 95">Item</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_21" title="bbox 355 820 492 845">
	<p class="ocr_par" id="par_1_21" lang="eng" title="bbox 355 820 492 845"><span class="ocr_line" id="line_1_21" title="bbox 355 820 492 845; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_42" title="bbox 355 820 492 845; x_wconf 96">Description</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_22" title="bbox 950 820 1055 845">
	<p class="ocr_par" id="par_1_22" lang="eng" title="bbox 950 820 1055 845"><span class="ocr_line" id="line_1_22" title="bbox 950 820 1055 845; baseline -0.019 -4; x_size 26; x_descenders 6; x_ascenders 7"> <span class="ocrx_word" id="word_1_43" title="bbox 950 820 1055 845; x_wconf 96">Quantity</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_23" title="bbox 1165 820 1224 839">
	<p class="ocr_par" id="par_1_23" lang="eng" title="bbox 1165 820 1224 839"><span class="ocr_line" id="line_1_23" title="bbox 1165 820 1224 839; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_44" title="bbox 1165 820 1224 839; x_wconf 96">Price</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_24" title="bbox 1348 820 1409 839">
	<p class="ocr_par" id="par_1_24" lang="eng" title="bbox 1348 820 1409 839"><span class="ocr_line" id="line_1_24" title="bbox 1348 820 1409 839; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_45" title="bbox 1348 820 1409 839; x_wconf 96">Total</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_25" title="bbox 81 892 265 911">
	<p class="ocr_par" id="par_1_25" lang="eng" title="bbox 81 892 265 911"><span class="ocr_line" id="line_1_25" title="bbox 81 892 265 911; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_46" title="bbox 81 892 129 911; x_wconf 96">Beef</span> <span class="ocrx_word" id="word_1_47" title="bbox 138 892 202 911; x_wconf 96">cheek</span> <span class="ocrx_word" id="word_1_48" title="bbox 212 892 265 911; x_wconf 96">fresh</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_26" title="bbox 361 892 754 917">
	<p class="ocr_par" id="par_1_26" lang="eng" title="bbox 361 892 754 917"><span class="ocr_line" id="line_1_26" title="bbox 361 892 754 917; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_49" title="bbox 361 893 430 917; x_wconf 96">Scoop</span> <span class="ocrx_word" id="word_1_50" title="bbox 440 893 524 917; x_wconf 93">organic</span> <span class="ocrx_word" id="word_1_51" title="bbox 533 892 629 917; x_wconf 92">tastyzilla</span> <span class="ocrx_word" id="word_1_52" title="bbox 640 892 691 917; x_wconf 96">flaky</span> <span class="ocrx_word" id="word_1_53" title="bbox 700 892 754 911; x_wconf 96">fresh</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_27" title="bbox 978 893 1035 911">
	<p class="ocr_par" id="par_1_27" lang="eng" title="bbox 978 893 1035 911"><span class="ocr_line" id="line_1_27" title="bbox 978 893 1035 911; baseline 0 0; x_size 24.666666; x_descenders 6.1666665; x_ascenders 6.1666665"> <span class="ocrx_word" id="word_1_54" title="bbox 978 893 1035 911; x_wconf 96">19.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_28" title="bbox 1128 891 1199 914">
	<p class="ocr_par" id="par_1_28" lang="eng" title="bbox 1128 891 1199 914"><span class="ocr_line" id="line_1_28" title="bbox 1128 891 1199 914; baseline 0 -3; x_size 24.833334; x_descenders 6.2083335; x_ascenders 6.2083335"> <span class="ocrx_word" id="word_1_55" title="bbox 1128 891 1199 914; x_wconf 95">$18.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_29" title="bbox 1308 891 1393 914">
	<p class="ocr_par" id="par_1_29" lang="eng" title="bbox 1308 891 1393 914"><span class="ocr_line" id="line_1_29" title="bbox 1308 891 1393 914; baseline 0 -3; x_size 24.799999; x_descenders 6.1999998; x_ascenders 6.1999998"> <span class="ocrx_word" id="word_1_56" title="bbox 1308 891 1393 914; x_wconf 96">$342.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_30" title="bbox 81 985 210 1004">
	<p class="ocr_par" id="par_1_30" lang="eng" title="bbox 81 985 210 1004"><span class="ocr_line" id="line_1_30" title="bbox 81 985 210 1004; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_57" title="bbox 81 985 145 1004; x_wconf 93">Island</span> <span class="ocrx_word" id="word_1_58" title="bbox 155 986 210 1004; x_wconf 96">oasis</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_31" title="bbox 361 985 817 1010">
	<p class="ocr_par" id="par_1_31" lang="eng" title="bbox 361 985 817 1010"><span class="ocr_line" id="line_1_31" title="bbox 361 985 817 1010; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_59" title="bbox 361 985 478 1004; x_wconf 96">Treehouse</span> <span class="ocrx_word" id="word_1_60" title="bbox 489 985 522 1004; x_wconf 96">red</span> <span class="ocrx_word" id="word_1_61" title="bbox 534 991 601 1004; x_wconf 96">renew</span> <span class="ocrx_word" id="word_1_62" title="bbox 609 985 660 1004; x_wconf 96">food</span> <span class="ocrx_word" id="word_1_63" title="bbox 672 985 725 1004; x_wconf 96">brew</span> <span class="ocrx_word" id="word_1_64" title="bbox 735 985 817 1010; x_wconf 95">healthy</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_32" title="bbox 984 986 1028 1004">
	<p class="ocr_par" id="par_1_32" lang="eng" title="bbox 984 986 1028 1004"><span class="ocr_line" id="line_1_32" title="bbox 984 986 1028 1004; baseline 0 0; x_size 24.666666; x_descenders 6.1666665; x_ascenders 6.1666665"> <span class="ocrx_word" id="word_1_65" title="bbox 984 986 1028 1004; x_wconf 95">3.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_33" title="bbox 1128 984 1199 1007">
	<p class="ocr_par" id="par_1_33" lang="eng" title="bbox 1128 984 1199 1007"><span class="ocr_line" id="line_1_33" title="bbox 1128 984 1199 1007; baseline 0 -3; x_size 24.833334; x_descenders 6.2083335; x_ascenders 6.2083335"> <span class="ocrx_word" id="word_1_66" title="bbox 1128 984 1199 1007; x_wconf 96">$39.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_34" title="bbox 1308 984 1393 1007">
	<p class="ocr_par" id="par_1_34" lang="eng" title="bbox 1308 984 1393 1007"><span class="ocr_line" id="line_1_34" title="bbox 1308 984 1393 1007; baseline 0 -3; x_size 24.799999; x_descenders 6.1999998; x_ascenders 6.1999998"> <span class="ocrx_word" id="word_1_67" title="bbox 1308 984 1393 1007; x_wconf 96">$117.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_35" title="bbox 1040 1373 1138 1392">
	<p class="ocr_par" id="par_1_35" lang="eng" title="bbox 1040 1373 1138 1392"><span class="ocr_line" id="line_1_35" title="bbox 1040 1373 1138 1392; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_68" title="bbox 1040 1373 1138 1392; x_wconf 96">Subtotal:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_36" title="bbox 1255 1372 1348 1395">
	<p class="ocr_par" id="par_1_36" lang="eng" title="bbox 1255 1372 1348 1395"><span class="ocr_line" id="line_1_36" title="bbox 1255 1372 1348 1395; baseline 0 -3; x_size 24.799999; x_descenders 6.1999998; x_ascenders 6.1999998"> <span class="ocrx_word" id="word_1_69" title="bbox 1255 1372 1265 1395; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_70" title="bbox 1275 1374 1348 1392; x_wconf 93">459.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_37" title="bbox 1040 1431 1173 1449">
	<p class="ocr_par" id="par_1_37" lang="eng" title="bbox 1040 1431 1173 1449"><span class="ocr_line" id="line_1_37" title="bbox 1040 1431 1173 1449; baseline 0 0; x_size 24.444445; x_descenders 6.1111112; x_ascenders 6.1111112"> <span class="ocrx_word" id="word_1_71" title="bbox 1040 1431 1077 1449; x_wconf 94">Tax</span> <span class="ocrx_word" id="word_1_72" title="bbox 1096 1431 1173 1449; x_wconf 94">10.00%</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_38" title="bbox 1255 1429 1334 1452">
	<p class="ocr_par" id="par_1_38" lang="eng" title="bbox 1255 1429 1334 1452"><span class="ocr_line" id="line_1_38" title="bbox 1255 1429 1334 1452; baseline 0 -3; x_size 24.833334; x_descenders 6.2083335; x_ascenders 6.2083335"> <span class="ocrx_word" id="word_1_73" title="bbox 1255 1429 1265 1452; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_74" title="bbox 1275 1431 1334 1449; x_wconf 95">45.90</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_39" title="bbox 1040 1486 1100 1505">
	<p class="ocr_par" id="par_1_39" lang="eng" title="bbox 1040 1486 1100 1505"><span class="ocr_line" id="line_1_39" title="bbox 1040 1486 1100 1505; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_75" title="bbox 1040 1486 1100 1505; x_wconf 96">Total:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_40" title="bbox 1255 1485 1348 1508">
	<p class="ocr_par" id="par_1_40" lang="eng" title="bbox 1255 1485 1348 1508"><span class="ocr_line" id="line_1_40" title="bbox 1255 1485 1348 1508; baseline 0 -3; x_size 24.799999; x_descenders 6.1999998; x_ascenders 6.1999998"> <span class="ocrx_word" id="word_1_76" title="bbox 1255 1485 1265 1508; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_77" title="bbox 1277 1487 1348 1505; x_wconf 83">504.90</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_41" title="bbox 1039 1544 1200 1562">
	<p class="ocr_par" id="par_1_41" lang="eng" title="bbox 1039 1544 1200 1562"><span class="ocr_line" id="line_1_41" title="bbox 1039 1544 1200 1562; baseline 0 0; x_size 23.296295; x_descenders 5.2962961; x_ascenders 5"> <span class="ocrx_word" id="word_1_78" title="bbox 1039 1544 1138 1562; x_wconf 96">Amount</span> <span class="ocrx_word" id="word_1_79" title="bbox 1147 1544 1200 1562; x_wconf 96">Due:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_42" title="bbox 1255 1542 1355 1565">
	<p class="ocr_par" id="par_1_42" lang="eng" title="bbox 1255 1542 1355 1565"><span class="ocr_line" id="line_1_42" title="bbox 1255 1542 1355 1565; baseline 0 -3; x_size 24.799999; x_descenders 6.1999998; x_ascenders 6.1999998"> <span class="ocrx_word" id="word_1_80" title="bbox 1255 1542 1266 1565; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_81" title="bbox 1278 1544 1355 1562; x_wconf 96">504.90</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_43" title="bbox 1024 1524 1477 1581">
	<p class="ocr_par" id="par_1_43" lang="eng" title="bbox 1024 1524 1477 1581"><span class="ocr_line" id="line_1_43" title="bbox 1024 1524 1477 1581; textangle 90; x_size 605.33331; x_descenders 151.33333; x_ascenders 151.33333"> <span class="ocrx_word" id="word_1_82" title="bbox 1024 1524 1477 1581; x_wconf 90">|</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_44" title="bbox 116 1660 697 1685">
	<p class="ocr_par" id="par_1_44" lang="eng" title="bbox 116 1660 697 1685"><span class="ocr_line" id="line_1_44" title="bbox 116 1660 697 1685; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_83" title="bbox 116 1660 173 1679; x_wconf 96">Make</span> <span class="ocrx_word" id="word_1_84" title="bbox 182 1660 204 1679; x_wconf 96">all</span> <span class="ocrx_word" id="word_1_85" title="bbox 213 1660 285 1679; x_wconf 96">checks</span> <span class="ocrx_word" id="word_1_86" title="bbox 294 1660 378 1685; x_wconf 96">payable</span> <span class="ocrx_word" id="word_1_87" title="bbox 386 1662 408 1679; x_wconf 96">to</span> <span class="ocrx_word" id="word_1_88" title="bbox 424 1660 478 1679; x_wconf 96">Duke</span> <span class="ocrx_word" id="word_1_89" title="bbox 488 1660 553 1685; x_wconf 96">Realty</span> <span class="ocrx_word" id="word_1_90" title="bbox 561 1661 697 1685; x_wconf 96">Corporation.</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_45" title="bbox 116 1696 1289 1721">
	<p class="ocr_par" id="par_1_45" lang="eng" title="bbox 116 1696 1289 1721"><span class="ocr_line" id="line_1_45" title="bbox 116 1696 1289 1721; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_91" title="bbox 116 1696 128 1715; x_wconf 89">If</span> <span class="ocrx_word" id="word_1_92" title="bbox 135 1702 173 1721; x_wconf 96">you</span> <span class="ocrx_word" id="word_1_93" title="bbox 184 1696 233 1715; x_wconf 96">have</span> <span class="ocrx_word" id="word_1_94" title="bbox 242 1702 280 1721; x_wconf 96">any</span> <span class="ocrx_word" id="word_1_95" title="bbox 288 1697 393 1721; x_wconf 96">questions</span> <span class="ocrx_word" id="word_1_96" title="bbox 401 1697 521 1721; x_wconf 96">concerning</span> <span class="ocrx_word" id="word_1_97" title="bbox 531 1696 568 1715; x_wconf 95">this</span> <span class="ocrx_word" id="word_1_98" title="bbox 577 1697 653 1715; x_wconf 95">Invoice</span> <span class="ocrx_word" id="word_1_99" title="bbox 663 1696 730 1721; x_wconf 96">please</span> <span class="ocrx_word" id="word_1_100" title="bbox 739 1698 819 1715; x_wconf 96">contact</span> <span class="ocrx_word" id="word_1_101" title="bbox 835 1697 876 1715; x_wconf 96">Lina</span> <span class="ocrx_word" id="word_1_102" title="bbox 887 1696 989 1721; x_wconf 96">Upchurch</span> <span class="ocrx_word" id="word_1_103" title="bbox 999 1702 1025 1715; x_wconf 94">on</span> <span class="ocrx_word" id="word_1_104" title="bbox 1044 1697 1083 1715; x_wconf 94">+52</span> <span class="ocrx_word" id="word_1_105" title="bbox 1093 1697 1145 1719; x_wconf 96">(915)</span> <span class="ocrx_word" id="word_1_106" title="bbox 1155 1697 1256 1715; x_wconf 96">649-1513</span> <span class="ocrx_word" id="word_1_107" title="bbox 1266 1702 1289 1715; x_wconf 96">or</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_46" title="bbox 116 1732 375 1757">
	<p class="ocr_par" id="par_1_46" lang="eng" title="bbox 116 1732 375 1757"><span class="ocr_line" id="line_1_46" title="bbox 116 1732 375 1757; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_108" title="bbox 116 1732 375 1757; x_wconf 91">lupchurchgf@lycos.com</span> </span></p>
	 </div>
	</div>
 </body>
</html>
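
In the hOCR sample above, every recognized word is an ocrx_word span whose title attribute carries the bounding box ("bbox x0 y0 x1 y1") and the recognition confidence ("x_wconf N"). The following is a minimal sketch of how such a file could be read with the Python standard library. The names HocrWordParser and parse_hocr_words are illustrative only and are not part of EasyRPA.

```python
import re
from html.parser import HTMLParser

# Properties found in the hOCR title attribute of a word span.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WCONF_RE = re.compile(r"x_wconf (\d+)")


class HocrWordParser(HTMLParser):
    """Collects id, text, bounding box and confidence of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None when outside a word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            title = attrs.get("title", "")
            bbox = BBOX_RE.search(title)
            conf = WCONF_RE.search(title)
            self._current = {
                "id": attrs.get("id"),
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "conf": int(conf.group(1)) if conf else None,
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumption: word spans do not contain nested spans.
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def parse_hocr_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Two words taken from the invoice number line of the sample page.
sample = (
    '<span class="ocr_line" id="line_1_7" title="bbox 1159 414 1291 432">'
    '<span class="ocrx_word" id="word_1_16" title="bbox 1159 414 1208 432; x_wconf 86">8603</span> '
    '<span class="ocrx_word" id="word_1_17" title="bbox 1215 414 1291 432; x_wconf 86">163534</span>'
    '</span>'
)
words = parse_hocr_words(sample)
low_confidence = [w["text"] for w in words if w["conf"] is not None and w["conf"] < 90]
```

Filtering on x_wconf, as in the last line, is one way to spot words that may need human review before they are used for training.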

The result of model execution is a list of entities. Each entity consists of a label name, an index, a confidence score, the label content, and the OCR words that match the entity region.

{
	"entities": [
		{
			"name": "DebitNoteId",
			"words": [
				{
					"bbox": [
						0.8227450980392157,
						0.1696969696969697,
						0.9168627450980392,
						0.18262626262626264
					],
					"id": "page0_area2_paragraph2_line3_word15",
					"page": 0,
					"content": "DNT6268231"
				}
			],
			"index": 0,
			"score": 1,
			"content": "DNT6268231"
		},
		{
			"name": "DebitNoteDate",
			"words": [
				{
					"bbox": [
						0.8227450980392157,
						0.19818181818181818,
						0.92,
						0.21151515151515152
					],
					"id": "page0_area2_paragraph2_line4_word23",
					"page": 0,
					"content": "2018-06-30"
				}
			],
			"index": 0,
			"score": 0.97,
			"content": "2018-06-30"
		}
	]
}
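
The bbox values in the result above all lie between 0 and 1, which suggests they are fractions of the page size. The sketch below assumes the order [x0, y0, x1, y1] and scales the boxes back to pixels. That interpretation, the page_sizes mapping, and the function names are assumptions made for illustration, not a documented EasyRPA API.

```python
def to_pixels(bbox, width, height):
    """Scale a normalized [x0, y0, x1, y1] box to integer pixel coordinates."""
    x0, y0, x1, y1 = bbox
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


def entity_boxes(result, page_sizes, min_score=0.0):
    """Yield (name, content, page, pixel_bbox) for each word of each entity.

    page_sizes maps a page index to (width, height) in pixels (hypothetical input).
    Entities scored below min_score are skipped.
    """
    for entity in result["entities"]:
        if entity["score"] < min_score:
            continue
        for word in entity["words"]:
            width, height = page_sizes[word["page"]]
            yield (entity["name"], word["content"], word["page"],
                   to_pixels(word["bbox"], width, height))


# Tiny result with easy-to-check numbers; the structure mirrors the sample above.
result = {
    "entities": [
        {
            "name": "DebitNoteId",
            "words": [{"bbox": [0.5, 0.25, 0.75, 0.5], "id": "w1",
                       "page": 0, "content": "DNT6268231"}],
            "index": 0,
            "score": 1,
            "content": "DNT6268231",
        },
        {
            "name": "DebitNoteDate",
            "words": [{"bbox": [0.1, 0.1, 0.2, 0.2], "id": "w2",
                       "page": 0, "content": "2018-06-30"}],
            "index": 0,
            "score": 0.4,
            "content": "2018-06-30",
        },
    ]
}
boxes = list(entity_boxes(result, page_sizes={0: (1000, 2000)}, min_score=0.9))
```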

HTML source

The input documents are HTML files (TXT files are converted into HTML).

{
  "html_src": "https://dev.rpaplatform.org/api/v1/s3/proxy/data/document_set/cb8b87a4-d08a-4307-913c-0243ec6f684d/1eda92d9-5027-4526-ae40-bcf985c4c4f7_minified.html"
}

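The result of model execution is tagged HTML: the source HTML in which each extracted value is wrapped in an rpa-selection tag that carries the label (data-type), the content (data-content), and the order (data-order, used when a label has multiple values).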

<html>
	<head>
	</head>
	<body>
		<div class="grid-body">
			<div class="invoice-title">
				<div class="row">
					<div class="col-xs-12"></div>
				</div><br />
				<div class="row">
					<div class="col-xs-12">
						<h2>invoice<br /><span class="small">order #<rpa-selection data-content="1097"
									data-order="0" data-type="Order Number">1097</rpa-selection></span></h2>
					</div>
				</div>
			</div>
			<hr />
			<div class="row">
				<div class="col-xs-6">
					<address><strong>Billed To:</strong><br />
						<rpa-selection data-content="Costco Wholesale" data-order="0" data-type="Name">
							Costco Wholesale</rpa-selection><br />
						<rpa-selection data-content="999 Lake Drive" data-order="0" data-type="Addresses">
							999 Lake Drive</rpa-selection><br />
						<rpa-selection data-content="Issaquah, WA 98027" data-order="0"
							data-type="Addresses">Issaquah, WA 98027</rpa-selection><br /><abbr
							title="Phone">P:</abbr>
						<rpa-selection data-content="(222) 417-0141" data-order="0"
							data-type="Phone Numbers">(222) 417-0141</rpa-selection>
					</address>
				</div>
			</div>
		</div>
	</body>
</html>
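
To consume the tagged HTML, the rpa-selection tags can be collected and grouped by label. The following is a minimal sketch using only the Python standard library. The class and function names are illustrative, and keeping values in document order is an assumption about how the result is meant to be read.

```python
from collections import defaultdict
from html.parser import HTMLParser


class RpaSelectionParser(HTMLParser):
    """Collects the attributes of every rpa-selection tag in document order."""

    def __init__(self):
        super().__init__()
        self.selections = []

    def handle_starttag(self, tag, attrs):
        if tag == "rpa-selection":
            attrs = dict(attrs)
            self.selections.append({
                "label": attrs.get("data-type"),
                "order": int(attrs.get("data-order") or 0),
                "content": attrs.get("data-content", ""),
            })


def extract_selections(tagged_html):
    """Return {label: [content, ...]} with values kept in document order."""
    parser = RpaSelectionParser()
    parser.feed(tagged_html)
    parser.close()
    grouped = defaultdict(list)
    for selection in parser.selections:
        grouped[selection["label"]].append(selection["content"])
    return dict(grouped)


# Fragment of the tagged HTML shown above.
sample = (
    '<address>'
    '<rpa-selection data-content="Costco Wholesale" data-order="0" '
    'data-type="Name">Costco Wholesale</rpa-selection><br />'
    '<rpa-selection data-content="999 Lake Drive" data-order="0" '
    'data-type="Addresses">999 Lake Drive</rpa-selection><br />'
    '<rpa-selection data-content="Issaquah, WA 98027" data-order="0" '
    'data-type="Addresses">Issaquah, WA 98027</rpa-selection>'
    '</address>'
)
fields = extract_selections(sample)
```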

spaCy IE Models

The platform uses the spaCy NLP library internally for data processing in the following models:

  • ml_ie_spacy2_model
  • ml_ie_spacy3_model
  • ml_iehtml_spacy2_model
  • ml_iehtml_spacy3_model

Information Extraction as a Pipeline

The Information Extraction process is implemented in EasyRPA as a pipeline. The pipeline involves more than ML models: the platform also includes several options for extending ML with rules and dictionaries.

Let's take a closer look at the two processes, model training and model execution, and the stages that make up each.

Model Training Process

Model training

This step trains the ML model on the provided training set. The system automatically shuffles the set, runs training for the specified number of iterations, and selects the best model.
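
The shuffle, iterate, and select-the-best loop described above can be sketched as follows. This is not the EasyRPA trainer: train_step and evaluate are hypothetical callables supplied by the caller, and "best" is assumed to mean the highest evaluation score.

```python
import random


def train_best(examples, iterations, train_step, evaluate, seed=0):
    """Run `iterations` training passes and return (best_model, best_score).

    train_step(model, data) is assumed to return a new model object (or a
    snapshot), so that keeping a reference to the best one is safe.
    evaluate(model) is assumed to return a number where higher is better.
    """
    rng = random.Random(seed)
    data = list(examples)
    model = None
    best_model, best_score = None, float("-inf")
    for _ in range(iterations):
        rng.shuffle(data)                 # reshuffle the training set each pass
        model = train_step(model, data)   # one training pass
        score = evaluate(model)
        if score > best_score:            # keep the best model seen so far
            best_model, best_score = model, score
    return best_model, best_score


# Toy stand-ins: the "model" is just a pass counter and the score peaks at pass 3.
def toy_train_step(model, data):
    return (model or 0) + 1


def toy_evaluate(model):
    return -(model - 3) ** 2


best_model, best_score = train_best(["doc1", "doc2", "doc3"], 5,
                                    toy_train_step, toy_evaluate)
```

In the toy run the score drops after the third pass, so the model from pass 3 is returned rather than the last one, which is the point of selecting the best model instead of the final one.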

The process developer can specify the model type, the number of training iterations, and other parameters in a configuration JSON file.

Package creation

The trained model is packaged with its configuration files and uploaded to the Nexus repository.

Information Extraction Process

Model execution

The model is run once for each document.

Model Training Configuration File

To train a spaCy Information Extraction model, you need to provide a JSON file that defines the configuration parameters for the training process.

Let's take a closer look at these configuration settings.

  • trainer_name(string)(required) - a python artifact that produces model packages for processing with a specific model type. There are two modules in it: a module for training on tagged data and generating a trained model package, and a module for downloading the trained model from the Nexus or from the cache and running it on the input data. Please, refer to Out of the box IE models and Out of the box IEHTML models for more details.
  • trainer_version(string)(required) - a trainer version. Please, refer to Out of the box IE models and Out of the box IEHTML models for more details.
  • trainer_description(string)(required) - a trainer description.
  • lang(string)(optional) - the language of input data. The default value is 'en'.
  • iterations(number)(optional) - number of iterations of model training on a given training set. The default value is '30'.
  • concat_single_entities(boolean)(optional) - spaCy knows nothing about single or multiple entities; it always treats them as multiple. This flag uses the labels configuration (from train_config.json) to concatenate spaCy entities with the same name into one string.
  • post_processing_rules(list of objects)(optional) - after NER extraction, the model uses EntityMatcher with the rules defined in post_processing_rules.json. The configuration JSON should contain a list of label names with regular expressions for searching for entities.
  • base_model_patterns(list of objects)(optional) - used to configure EntityRuler for labeling datum elements. It runs before data extraction and gives the model additional information about the document structure, which increases the accuracy of data extraction.


{
    "trainer_name": "easyrpaml_ie_spacy3_model",
    "trainer_version": "3.3.0",
    "trainer_description": "Information Extraction",
    "train_config": {
        "bucket": "mld",
        "lang": "en",
        "iterations": 5,
        "base_model_patterns": [{
                "label": "kwDebitNoteID",
                "id": "kwDebitNoteID",
                "pattern": [{
                        "TEXT": "Debit"
                    }, {
                        "TEXT": {
                            "REGEX": "^Note.*$"
                        }

                    }, {
                        "TEXT": "#"
                    }, {
                        "TEXT": ":"
                    }
                ]
            }, {
                "label": "kwDebitNoteID",
                "id": "kwDebitNoteID",
                "pattern": [{
                        "TEXT": "Debit"
                    }, {
                        "TEXT": {
                            "REGEX": "^Note.*$"
                        }
                    }, {
                        "TEXT": ":"
                    }
                ]
            }, {
                "label": "kwDebitNoteDate",
                "id": "kwDebitNoteDate",
                "pattern": [{
                        "TEXT": "Debit"
                    }, {
                        "TEXT": "Note"
                    }, {
                        "TEXT": "Date"
                    }, {
                        "TEXT": {
                            "REGEX": "^[:|;]$"
                        }
                    }
                ]
            }
        ]
    },
    "process_config": {
        "concat_single_entities": true,
        "post_processing_rules": [{
                "label": "kwDebitNoteID",
                "regex": [
                    "Debit",
                    "^Note.*$",
                    "#",
                    ":"
                ]
            }
        ]
    }

}
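As a rough illustration of how the required keys and documented defaults could be applied to such a file, here is a minimal sketch. The function name `load_training_config` is hypothetical, and reading `lang` and `iterations` from the `train_config` object is an assumption based on the sample above.

```python
import json

# Keys marked as required in the parameter list above.
REQUIRED_KEYS = ("trainer_name", "trainer_version", "trainer_description")
# Defaults documented above: lang is 'en', iterations is 30.
TRAIN_DEFAULTS = {"lang": "en", "iterations": 30}


def load_training_config(raw: str) -> dict:
    """Parse the configuration JSON, check required keys, and fill in defaults."""
    config = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("missing required keys: " + ", ".join(missing))
    # Values given in the file override the defaults.
    train_config = {**TRAIN_DEFAULTS, **config.get("train_config", {})}
    return {**config, "train_config": train_config}
```

With the sample configuration above, `iterations` would stay at 5 and `lang` at "en". A file that omits both would fall back to 30 and "en".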

Model Training Data File

To train a spaCy Information Extraction model, the system provides train_data.json.

{
    "data": [{
            "documentId": "c7ac32c6-41dd-40d7-b025-914b48ada78e",
            "text": ["INVOICE ay DATE INVOICE NO Park City Group DC 087 Jackson Drive 06 Oct, 2020 8203044765 Washington, 86-723 +86 (824) 519-7851 citizens@corp.com INVOICE TO Mechel PAO 6 lowa Avenue Turkmenbasy, 90451 +358 (589) 364-9582 ulopemi@deviantart.com SALESPERSON JOB PAYMENT TERMS DUE DATE Due on Receipt 06 Dec, 2020 QUANTITY DESCRIPTION UNIT PRICE LINE TOTAL 10.00 Wine $60.00 $600.00 Bliss tasty winter organic Subtotal $ 600.00 Discount 10.00% $ 64.20 Sales Tax 7.00% $ 42.00 Total $ 577.80 TRANSFER DETAILS Bank Transfer Account Number Routing Number BANK Hostess Brands, Inc. 7040654064 1335 1487660076 "],
            "documentEntities": [{
                    "name": "Invoice Date",
                    "index": 0,
                    "words": [{
                            "id": "page0_area6_paragraph6_line6_word12",
                            "bbox": [0.08224543080939947, 0.1432896064581231, 0.10574412532637076, 0.15590312815338042],
                            "page": 0,
                            "content": "06"
                        }, {
                            "id": "page0_area6_paragraph6_line6_word13",
                            "bbox": [0.11488250652741515, 0.1432896064581231, 0.15274151436031333, 0.1579212916246216],
                            "page": 0,
                            "content": "Oct,"
                        }, {
                            "id": "page0_area6_paragraph6_line6_word14",
                            "bbox": [0.1612271540469974, 0.1432896064581231, 0.21018276762402088, 0.15590312815338042],
                            "page": 0,
                            "content": "2020"
                        }
                    ],
                    "content": "06 Oct, 2020",
                    "score": 1.0
                }
            ],
            "ocr": ["<!--?xml version="1.0" encoding="UTF-8"?--> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head>  <title></title>  <meta http-equiv="Content-Type" content="text/html;charset=utf-8">  <meta name="ocr-system" content="tesseract 5.3.0">  <meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf"> </head> <body>  <div class="ocr_page" id="page_1" title="image &quot;c7ac32c6-41dd-40d7-b025-914b48ada78e.pdf_000.jpg&quot;; bbox 0 0 1532 1982; ppageno 0; scan_res 180 180">   <div class="ocr_carea" id="block_1_1" title="bbox 133 88 361 133">    <p class="ocr_par" id="par_1_1" lang="eng" title="bbox 133 88 361 133"><span class="ocr_line" id="line_1_1" title="bbox 133 88 361 133; baseline 0.004 -1; x_size 58.5; x_descenders 14.625; x_ascenders 14.625"> <span class="ocrx_word" id="word_1_1" title="bbox 133 88 361 133; x_wconf 95">INVOICE</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_2" title="bbox 1053 74 1180 204">    <p class="ocr_par" id="par_1_2" lang="eng" title="bbox 1053 74 1180 204"><span class="ocr_line" id="line_1_2" title="bbox 1053 74 1180 204; baseline 0 0; x_size 174.66667; x_descenders 43.666668; x_ascenders 43.666668"> <span class="ocrx_word" id="word_1_2" title="bbox 1053 74 1180 204; x_wconf 61">ay</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_3" title="bbox 118 230 196 254">    <p class="ocr_par" id="par_1_3" lang="eng" title="bbox 118 230 196 254"><span class="ocr_line" id="line_1_3" title="bbox 118 230 196 254; baseline 0 0; x_size 32.666668; x_descenders 8.166667; x_ascenders 8.166667"> <span class="ocrx_word" id="word_1_3" title="bbox 118 230 196 254; x_wconf 96">DATE</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_4" title="bbox 512 230 689 254">    <p class="ocr_par" id="par_1_4" lang="eng" 
title="bbox 512 230 689 254"><span class="ocr_line" id="line_1_4" title="bbox 512 230 689 254; baseline 0 0; x_size 32.666668; x_descenders 8.166667; x_ascenders 8.166667"> <span class="ocrx_word" id="word_1_4" title="bbox 512 230 636 254; x_wconf 96">INVOICE</span> <span class="ocrx_word" id="word_1_5" title="bbox 648 230 689 254; x_wconf 96">NO</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_5" title="bbox 1059 230 1308 255">    <p class="ocr_par" id="par_1_5" lang="eng" title="bbox 1059 230 1308 255"><span class="ocr_line" id="line_1_5" title="bbox 1059 230 1308 255; baseline 0 -5; x_size 25; x_descenders 5; x_ascenders 5"> <span class="ocrx_word" id="word_1_6" title="bbox 1059 230 1118 250; x_wconf 96">Park</span> <span class="ocrx_word" id="word_1_7" title="bbox 1127 230 1176 255; x_wconf 96">City</span> <span class="ocrx_word" id="word_1_8" title="bbox 1185 230 1261 255; x_wconf 96">Group</span> <span class="ocrx_word" id="word_1_9" title="bbox 1272 230 1308 250; x_wconf 96">DC</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_6" title="bbox 1059 275 1302 297">    <p class="ocr_par" id="par_1_6" lang="eng" title="bbox 1059 275 1302 297"><span class="ocr_line" id="line_1_6" title="bbox 1059 275 1302 297; baseline 0 0; x_size 27.32258; x_descenders 5.3225803; x_ascenders 7"> <span class="ocrx_word" id="word_1_10" title="bbox 1059 275 1106 297; x_wconf 96">087</span> <span class="ocrx_word" id="word_1_11" title="bbox 1117 275 1224 297; x_wconf 95">Jackson</span> <span class="ocrx_word" id="word_1_12" title="bbox 1236 275 1302 297; x_wconf 95">Drive</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_7" title="bbox 126 284 322 313">    <p class="ocr_par" id="par_1_7" lang="eng" title="bbox 126 284 322 313"><span class="ocr_line" id="line_1_7" title="bbox 126 284 322 313; baseline 0 -4; x_size 33.809525; x_descenders 8.4523811; x_ascenders 8.4523811"> <span class="ocrx_word" id="word_1_13" title="bbox 126 284 162 309; 
x_wconf 96">06</span> <span class="ocrx_word" id="word_1_14" title="bbox 176 284 234 313; x_wconf 96">Oct,</span> <span class="ocrx_word" id="word_1_15" title="bbox 247 284 322 309; x_wconf 96">2020</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_8" title="bbox 520 284 712 309">    <p class="ocr_par" id="par_1_8" lang="eng" title="bbox 520 284 712 309"><span class="ocr_line" id="line_1_8" title="bbox 520 284 712 309; baseline 0 0; x_size 34; x_descenders 8.5; x_ascenders 8.5"> <span class="ocrx_word" id="word_1_16" title="bbox 520 284 712 309; x_wconf 96">8203044765</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_9" title="bbox 1058 320 1333 349">    <p class="ocr_par" id="par_1_9" lang="eng" title="bbox 1058 320 1333 349"><span class="ocr_line" id="line_1_9" title="bbox 1058 320 1333 349; baseline 0 -7; x_size 26; x_descenders 4; x_ascenders 7"> <span class="ocrx_word" id="word_1_17" title="bbox 1058 320 1222 349; x_wconf 96">Washington,</span> <span class="ocrx_word" id="word_1_18" title="bbox 1242 320 1333 342; x_wconf 96">86-723</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_10" title="bbox 1059 364 1316 393">    <p class="ocr_par" id="par_1_10" lang="eng" title="bbox 1059 364 1316 393"><span class="ocr_line" id="line_1_10" title="bbox 1059 364 1316 393; baseline 0 -7; x_size 29.952381; x_descenders 7.4880953; x_ascenders 7.4880953"> <span class="ocrx_word" id="word_1_19" title="bbox 1059 364 1107 386; x_wconf 96">+86</span> <span class="ocrx_word" id="word_1_20" title="bbox 1119 364 1185 393; x_wconf 96">(824)</span> <span class="ocrx_word" id="word_1_21" title="bbox 1196 364 1316 386; x_wconf 93">519-7851</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_11" title="bbox 1059 408 1308 437">    <p class="ocr_par" id="par_1_11" lang="eng" title="bbox 1059 408 1308 437"><span class="ocr_line" id="line_1_11" title="bbox 1059 408 1308 437; baseline 0 -7; x_size 29; x_descenders 7; x_ascenders 7"> 
<span class="ocrx_word" id="word_1_22" title="bbox 1059 408 1308 437; x_wconf 92">citizens@corp.com</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_12" title="bbox 118 451 265 471">    <p class="ocr_par" id="par_1_12" lang="eng" title="bbox 118 451 265 471"><span class="ocr_line" id="line_1_12" title="bbox 118 451 265 471; baseline 0 0; x_size 27.333334; x_descenders 6.8333335; x_ascenders 6.8333335"> <span class="ocrx_word" id="word_1_23" title="bbox 118 451 224 471; x_wconf 94">INVOICE</span> <span class="ocrx_word" id="word_1_24" title="bbox 232 451 265 471; x_wconf 94">TO</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_13" title="bbox 125 495 283 515">    <p class="ocr_par" id="par_1_13" lang="eng" title="bbox 125 495 283 515"><span class="ocr_line" id="line_1_13" title="bbox 125 495 283 515; baseline 0 0; x_size 25.32258; x_descenders 5.3225803; x_ascenders 5"> <span class="ocrx_word" id="word_1_25" title="bbox 125 495 219 515; x_wconf 96">Mechel</span> <span class="ocrx_word" id="word_1_26" title="bbox 230 495 283 515; x_wconf 96">PAO</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_14" title="bbox 125 540 320 562">    <p class="ocr_par" id="par_1_14" lang="eng" title="bbox 125 540 320 562"><span class="ocr_line" id="line_1_14" title="bbox 125 540 320 562; baseline 0 0; x_size 27.32258; x_descenders 5.3225803; x_ascenders 7"> <span class="ocrx_word" id="word_1_27" title="bbox 125 540 139 562; x_wconf 91">6</span> <span class="ocrx_word" id="word_1_28" title="bbox 151 540 209 562; x_wconf 92">lowa</span> <span class="ocrx_word" id="word_1_29" title="bbox 220 540 320 562; x_wconf 96">Avenue</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_15" title="bbox 124 584 407 613">    <p class="ocr_par" id="par_1_15" lang="eng" title="bbox 124 584 407 613"><span class="ocr_line" id="line_1_15" title="bbox 124 584 407 613; baseline 0 -7; x_size 26; x_descenders 4; x_ascenders 7"> <span class="ocrx_word" 
id="word_1_30" title="bbox 124 584 311 613; x_wconf 90">Turkmenbasy,</span> <span class="ocrx_word" id="word_1_31" title="bbox 331 584 407 606; x_wconf 96">90451</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_16" title="bbox 125 629 403 658">    <p class="ocr_par" id="par_1_16" lang="eng" title="bbox 125 629 403 658"><span class="ocr_line" id="line_1_16" title="bbox 125 629 403 658; baseline 0 -7; x_size 29.955555; x_descenders 7.4888887; x_ascenders 7.4888887"> <span class="ocrx_word" id="word_1_32" title="bbox 125 629 190 651; x_wconf 96">+358</span> <span class="ocrx_word" id="word_1_33" title="bbox 202 629 268 658; x_wconf 96">(589)</span> <span class="ocrx_word" id="word_1_34" title="bbox 278 629 403 651; x_wconf 96">364-9582</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_17" title="bbox 126 673 452 702">    <p class="ocr_par" id="par_1_17" lang="eng" title="bbox 126 673 452 702"><span class="ocr_line" id="line_1_17" title="bbox 126 673 452 702; baseline 0 -7; x_size 29; x_descenders 7; x_ascenders 7"> <span class="ocrx_word" id="word_1_35" title="bbox 126 673 452 702; x_wconf 90">ulopemi@deviantart.com</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_18" title="bbox 123 793 354 817">    <p class="ocr_par" id="par_1_18" lang="eng" title="bbox 123 793 354 817"><span class="ocr_line" id="line_1_18" title="bbox 123 793 354 817; baseline 0 0; x_size 32.666668; x_descenders 8.166667; x_ascenders 8.166667"> <span class="ocrx_word" id="word_1_36" title="bbox 123 793 354 817; x_wconf 95">SALESPERSON</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_19" title="bbox 441 793 500 817">    <p class="ocr_par" id="par_1_19" lang="eng" title="bbox 441 793 500 817"><span class="ocr_line" id="line_1_19" title="bbox 441 793 500 817; baseline 0 0; x_size 32.666668; x_descenders 8.166667; x_ascenders 8.166667"> <span class="ocrx_word" id="word_1_37" title="bbox 441 793 500 817; x_wconf 96">JOB</span> </span></p>   
</div>   <div class="ocr_carea" id="block_1_20" title="bbox 795 793 1068 817">    <p class="ocr_par" id="par_1_20" lang="eng" title="bbox 795 793 1068 817"><span class="ocr_line" id="line_1_20" title="bbox 795 793 1068 817; baseline 0 0; x_size 32.666668; x_descenders 8.166667; x_ascenders 8.166667"> <span class="ocrx_word" id="word_1_38" title="bbox 795 793 947 817; x_wconf 95">PAYMENT</span> <span class="ocrx_word" id="word_1_39" title="bbox 957 793 1068 817; x_wconf 95">TERMS</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_21" title="bbox 1193 793 1348 817">    <p class="ocr_par" id="par_1_21" lang="eng" title="bbox 1193 793 1348 817"><span class="ocr_line" id="line_1_21" title="bbox 1193 793 1348 817; baseline 0 0; x_size 32.666668; x_descenders 8.166667; x_ascenders 8.166667"> <span class="ocrx_word" id="word_1_40" title="bbox 1193 793 1256 817; x_wconf 96">DUE</span> <span class="ocrx_word" id="word_1_41" title="bbox 1269 793 1348 817; x_wconf 96">DATE</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_22" title="bbox 795 851 947 872">    <p class="ocr_par" id="par_1_22" lang="eng" title="bbox 795 851 947 872"><span class="ocr_line" id="line_1_22" title="bbox 795 851 947 872; baseline 0 -5; x_size 21; x_descenders 5; x_ascenders 5"> <span class="ocrx_word" id="word_1_42" title="bbox 795 851 832 867; x_wconf 96">Due</span> <span class="ocrx_word" id="word_1_43" title="bbox 842 856 863 867; x_wconf 95">on</span> <span class="ocrx_word" id="word_1_44" title="bbox 874 851 947 872; x_wconf 95">Receipt</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_23" title="bbox 1198 851 1329 870">    <p class="ocr_par" id="par_1_23" lang="eng" title="bbox 1198 851 1329 870"><span class="ocr_line" id="line_1_23" title="bbox 1198 851 1329 870; baseline 0 -3; x_size 21.809525; x_descenders 5.4523811; x_ascenders 5.4523811"> <span class="ocrx_word" id="word_1_45" title="bbox 1198 851 1220 867; x_wconf 96">06</span> <span 
class="ocrx_word" id="word_1_46" title="bbox 1230 851 1272 870; x_wconf 96">Dec,</span> <span class="ocrx_word" id="word_1_47" title="bbox 1281 851 1329 867; x_wconf 96">2020</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_24" title="bbox 124 956 539 985">    <p class="ocr_par" id="par_1_24" lang="eng" title="bbox 124 956 539 985"><span class="ocr_line" id="line_1_24" title="bbox 124 956 539 985; baseline 0 -5; x_size 32.666668; x_descenders 8.166667; x_ascenders 8.166667"> <span class="ocrx_word" id="word_1_48" title="bbox 124 956 277 985; x_wconf 94">QUANTITY</span> <span class="ocrx_word" id="word_1_49" title="bbox 327 956 539 980; x_wconf 94">DESCRIPTION</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_25" title="bbox 957 956 1136 980">    <p class="ocr_par" id="par_1_25" lang="eng" title="bbox 957 956 1136 980"><span class="ocr_line" id="line_1_25" title="bbox 957 956 1136 980; baseline 0 0; x_size 32.666668; x_descenders 8.166667; x_ascenders 8.166667"> <span class="ocrx_word" id="word_1_50" title="bbox 957 956 1029 980; x_wconf 95">UNIT</span> <span class="ocrx_word" id="word_1_51" title="bbox 1041 956 1136 980; x_wconf 96">PRICE</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_26" title="bbox 1205 956 1382 980">    <p class="ocr_par" id="par_1_26" lang="eng" title="bbox 1205 956 1382 980"><span class="ocr_line" id="line_1_26" title="bbox 1205 956 1382 980; baseline 0 0; x_size 32.666668; x_descenders 8.166667; x_ascenders 8.166667"> <span class="ocrx_word" id="word_1_52" title="bbox 1205 956 1274 980; x_wconf 96">LINE</span> <span class="ocrx_word" id="word_1_53" title="bbox 1284 956 1382 980; x_wconf 96">TOTAL</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_27" title="bbox 130 1055 184 1071">    <p class="ocr_par" id="par_1_27" lang="eng" title="bbox 130 1055 184 1071"><span class="ocr_line" id="line_1_27" title="bbox 130 1055 184 1071; baseline 0 0; x_size 22; x_descenders 5.5; x_ascenders 
5.5"> <span class="ocrx_word" id="word_1_54" title="bbox 130 1055 184 1071; x_wconf 96">10.00</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_28" title="bbox 332 1047 383 1063">    <p class="ocr_par" id="par_1_28" lang="eng" title="bbox 332 1047 383 1063"><span class="ocr_line" id="line_1_28" title="bbox 332 1047 383 1063; baseline 0 0; x_size 21.279999; x_descenders 5.2799997; x_ascenders 4"> <span class="ocrx_word" id="word_1_55" title="bbox 332 1047 383 1063; x_wconf 96">Wine</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_29" title="bbox 962 1053 1030 1074">    <p class="ocr_par" id="par_1_29" lang="eng" title="bbox 962 1053 1030 1074"><span class="ocr_line" id="line_1_29" title="bbox 962 1053 1030 1074; baseline 0 -3; x_size 22.166666; x_descenders 5.5416665; x_ascenders 5.5416665"> <span class="ocrx_word" id="word_1_56" title="bbox 962 1053 1030 1074; x_wconf 96">$60.00</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_30" title="bbox 1209 1053 1290 1074">    <p class="ocr_par" id="par_1_30" lang="eng" title="bbox 1209 1053 1290 1074"><span class="ocr_line" id="line_1_30" title="bbox 1209 1053 1290 1074; baseline 0 -3; x_size 22.133333; x_descenders 5.5333333; x_ascenders 5.5333333"> <span class="ocrx_word" id="word_1_57" title="bbox 1209 1053 1290 1074; x_wconf 96">$600.00</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_31" title="bbox 333 1079 582 1101">    <p class="ocr_par" id="par_1_31" lang="eng" title="bbox 333 1079 582 1101"><span class="ocr_line" id="line_1_31" title="bbox 333 1079 582 1101; baseline 0 -5; x_size 21; x_descenders 5; x_ascenders 4"> <span class="ocrx_word" id="word_1_58" title="bbox 333 1079 375 1096; x_wconf 96">Bliss</span> <span class="ocrx_word" id="word_1_59" title="bbox 381 1082 430 1100; x_wconf 96">tasty</span> <span class="ocrx_word" id="word_1_60" title="bbox 437 1080 502 1096; x_wconf 96">winter</span> <span class="ocrx_word" id="word_1_61" title="bbox 509 
1080 582 1101; x_wconf 96">organic</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_32" title="bbox 1010 1441 1117 1463">    <p class="ocr_par" id="par_1_32" lang="eng" title="bbox 1010 1441 1117 1463"><span class="ocr_line" id="line_1_32" title="bbox 1010 1441 1117 1463; baseline 0 0; x_size 27.32258; x_descenders 5.3225803; x_ascenders 7"> <span class="ocrx_word" id="word_1_62" title="bbox 1010 1441 1117 1463; x_wconf 96">Subtotal</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_33" title="bbox 1322 1442 1410 1462">    <p class="ocr_par" id="par_1_33" lang="eng" title="bbox 1322 1442 1410 1462"><span class="ocr_line" id="line_1_33" title="bbox 1322 1442 1410 1462; baseline -0.023 0; x_size 22.133333; x_descenders 5.5333333; x_ascenders 5.5333333"> <span class="ocrx_word" id="word_1_63" title="bbox 1322 1442 1332 1462; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_64" title="bbox 1342 1444 1410 1460; x_wconf 96">600.00</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_34" title="bbox 886 1477 1117 1499">    <p class="ocr_par" id="par_1_34" lang="eng" title="bbox 886 1477 1117 1499"><span class="ocr_line" id="line_1_34" title="bbox 886 1477 1117 1499; baseline 0 0; x_size 27.32258; x_descenders 5.3225803; x_ascenders 7"> <span class="ocrx_word" id="word_1_65" title="bbox 886 1477 1001 1499; x_wconf 96">Discount</span> <span class="ocrx_word" id="word_1_66" title="bbox 1020 1477 1117 1499; x_wconf 94">10.00%</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_35" title="bbox 1335 1478 1410 1498">    <p class="ocr_par" id="par_1_35" lang="eng" title="bbox 1335 1478 1410 1498"><span class="ocr_line" id="line_1_35" title="bbox 1335 1478 1410 1498; baseline -0.027 0; x_size 22.166666; x_descenders 5.5416665; x_ascenders 5.5416665"> <span class="ocrx_word" id="word_1_67" title="bbox 1335 1478 1345 1498; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_68" title="bbox 1355 1480 1410 1496; x_wconf 
96">64.20</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_36" title="bbox 886 1513 1117 1535">    <p class="ocr_par" id="par_1_36" lang="eng" title="bbox 886 1513 1117 1535"><span class="ocr_line" id="line_1_36" title="bbox 886 1513 1117 1535; baseline 0 0; x_size 27.32258; x_descenders 5.3225803; x_ascenders 7"> <span class="ocrx_word" id="word_1_69" title="bbox 886 1513 959 1535; x_wconf 96">Sales</span> <span class="ocrx_word" id="word_1_70" title="bbox 968 1513 1018 1535; x_wconf 95">Tax</span> <span class="ocrx_word" id="word_1_71" title="bbox 1035 1513 1117 1535; x_wconf 95">7.00%</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_37" title="bbox 1335 1514 1410 1534">    <p class="ocr_par" id="par_1_37" lang="eng" title="bbox 1335 1514 1410 1534"><span class="ocr_line" id="line_1_37" title="bbox 1335 1514 1410 1534; baseline -0.027 0; x_size 22.166666; x_descenders 5.5416665; x_ascenders 5.5416665"> <span class="ocrx_word" id="word_1_72" title="bbox 1335 1514 1345 1534; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_73" title="bbox 1354 1516 1410 1532; x_wconf 96">42.00</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_38" title="bbox 1052 1561 1117 1583">    <p class="ocr_par" id="par_1_38" lang="eng" title="bbox 1052 1561 1117 1583"><span class="ocr_line" id="line_1_38" title="bbox 1052 1561 1117 1583; baseline 0 0; x_size 27.32258; x_descenders 5.3225803; x_ascenders 7"> <span class="ocrx_word" id="word_1_74" title="bbox 1052 1561 1117 1583; x_wconf 96">Total</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_39" title="bbox 1321 1561 1411 1582">    <p class="ocr_par" id="par_1_39" lang="eng" title="bbox 1321 1561 1411 1582"><span class="ocr_line" id="line_1_39" title="bbox 1321 1561 1411 1582; baseline -0.022 0; x_size 23.466667; x_descenders 5.8666668; x_ascenders 5.8666668"> <span class="ocrx_word" id="word_1_75" title="bbox 1321 1561 1333 1582; x_wconf 93">$</span> <span 
class="ocrx_word" id="word_1_76" title="bbox 1341 1563 1411 1580; x_wconf 96">577.80</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_40" title="bbox 123 1672 426 1696">    <p class="ocr_par" id="par_1_40" lang="eng" title="bbox 123 1672 426 1696"><span class="ocr_line" id="line_1_40" title="bbox 123 1672 426 1696; baseline 0 0; x_size 32.666668; x_descenders 8.166667; x_ascenders 8.166667"> <span class="ocrx_word" id="word_1_77" title="bbox 123 1672 287 1696; x_wconf 95">TRANSFER</span> <span class="ocrx_word" id="word_1_78" title="bbox 300 1672 426 1696; x_wconf 96">DETAILS</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_41" title="bbox 125 1741 295 1761">    <p class="ocr_par" id="par_1_41" lang="eng" title="bbox 125 1741 295 1761"><span class="ocr_line" id="line_1_41" title="bbox 125 1741 295 1761; baseline 0 0; x_size 25.32258; x_descenders 5.3225803; x_ascenders 5"> <span class="ocrx_word" id="word_1_79" title="bbox 125 1741 185 1761; x_wconf 96">Bank</span> <span class="ocrx_word" id="word_1_80" title="bbox 192 1741 295 1761; x_wconf 96">Transfer</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_42" title="bbox 556 1741 760 1761">    <p class="ocr_par" id="par_1_42" lang="eng" title="bbox 556 1741 760 1761"><span class="ocr_line" id="line_1_42" title="bbox 556 1741 760 1761; baseline 0 0; x_size 25.32258; x_descenders 5.3225803; x_ascenders 5"> <span class="ocrx_word" id="word_1_81" title="bbox 556 1741 655 1761; x_wconf 96">Account</span> <span class="ocrx_word" id="word_1_82" title="bbox 664 1741 760 1761; x_wconf 96">Number</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_43" title="bbox 990 1741 1188 1767">    <p class="ocr_par" id="par_1_43" lang="eng" title="bbox 990 1741 1188 1767"><span class="ocr_line" id="line_1_43" title="bbox 990 1741 1188 1767; baseline 0 -6; x_size 26; x_descenders 6; x_ascenders 5"> <span class="ocrx_word" id="word_1_83" title="bbox 990 1741 1082 1767; x_wconf 
95">Routing</span> <span class="ocrx_word" id="word_1_84" title="bbox 1092 1741 1188 1761; x_wconf 96">Number</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_44" title="bbox 125 1773 463 1797">    <p class="ocr_par" id="par_1_44" lang="eng" title="bbox 125 1773 463 1797"><span class="ocr_line" id="line_1_44" title="bbox 125 1773 463 1797; baseline 0 -4; x_size 24; x_descenders 4; x_ascenders 5"> <span class="ocrx_word" id="word_1_85" title="bbox 125 1773 197 1793; x_wconf 94">BANK</span> <span class="ocrx_word" id="word_1_86" title="bbox 214 1773 311 1793; x_wconf 94">Hostess</span> <span class="ocrx_word" id="word_1_87" title="bbox 321 1773 411 1797; x_wconf 95">Brands,</span> <span class="ocrx_word" id="word_1_88" title="bbox 423 1773 463 1793; x_wconf 96">Inc.</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_45" title="bbox 564 1773 777 1793">    <p class="ocr_par" id="par_1_45" lang="eng" title="bbox 564 1773 777 1793"><span class="ocr_line" id="line_1_45" title="bbox 564 1773 777 1793; baseline 0 0; x_size 27.333334; x_descenders 6.8333335; x_ascenders 6.8333335"> <span class="ocrx_word" id="word_1_89" title="bbox 564 1773 715 1793; x_wconf 90">7040654064</span> <span class="ocrx_word" id="word_1_90" title="bbox 720 1773 777 1793; x_wconf 90">1335</span> </span></p>   </div>   <div class="ocr_carea" id="block_1_46" title="bbox 1000 1773 1148 1793">    <p class="ocr_par" id="par_1_46" lang="eng" title="bbox 1000 1773 1148 1793"><span class="ocr_line" id="line_1_46" title="bbox 1000 1773 1148 1793; baseline 0 0; x_size 27.333334; x_descenders 6.8333335; x_ascenders 6.8333335"> <span class="ocrx_word" id="word_1_91" title="bbox 1000 1773 1148 1793; x_wconf 96">1487660076</span> </span></p>   </div>  </div> </body> </html>"]
        }
    ],
    "labels": {
        "Invoice Number": false,
        "Invoice Date": false,
        "Due Date": false,
        "Company Name": false,
        "Street Address": false,
        "City": false,
        "Zip Code": false,
        "Phone Number": false,
        "E-mail": false,
        "Product Name": true,
        "Product Description": true,
        "Quantity": true,
        "Price": true,
        "Tax Rate": false,
        "Discount Rate": false,
        "Total Discount": false,
        "Total Amount": false
    }
}

Where:

  • data - the list of documents with tagged entities, provided by the Information Extraction HTT
  • labels(object)(optional) - a map of label names to a multiplicity flag. The labels are added to the NER pipe at the training stage. If this configuration is empty, all labels found in the training dataset are added to the model automatically and the output dimension is inferred automatically, which is an expensive operation. The multiplicity flag affects how the entity index is calculated at the processing stage: for labels whose flag is true, the index increments through the whole document, while for labels whose flag is false, the index is always zero.
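The index rule for the multiplicity flag can be sketched as follows. This is a simplified illustration, not the platform's code: `assign_indices` is a hypothetical helper, and keeping a separate counter per label is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def assign_indices(
    entities: List[Tuple[str, str]],  # (label name, extracted text) in document order
    labels: Dict[str, bool],          # the "labels" map from train_data.json
) -> List[dict]:
    """Give each entity an index based on its label's multiplicity flag."""
    counters: Dict[str, int] = defaultdict(int)
    result = []
    for name, content in entities:
        if labels.get(name, False):
            # Multiple-value label: the index grows through the document.
            index = counters[name]
            counters[name] += 1
        else:
            # Single-value label: the index is always zero.
            index = 0
        result.append({"name": name, "index": index, "content": content})
    return result
```

For example, two "Product Name" entities would receive indexes 0 and 1, while every "Invoice Date" entity would receive index 0.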

OpenAI IE Models

The OpenAI model uses the OpenAI API to call an LLM for request processing. Such a model first minifies the input document (HTML or hOCR), depending on the renderer specified in the model configuration, then sends a request to OpenAI, which extracts the fields (or other data) and returns the result in CSV format.
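To give a feel for what minification of hOCR might involve, the sketch below keeps only the word `id` attributes and the line grouping, dropping bounding boxes and confidence values. The platform's actual renderers are not shown in this document, so the class `HocrMinifier` and the function `minify_hocr` are hypothetical, and the output shape is an assumption modeled on typical hOCR markup.

```python
import html
from html.parser import HTMLParser
from typing import List


class HocrMinifier(HTMLParser):
    """Keep ocrx_word spans (id only) and turn ocr_line spans into <div> wrappers."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._stack: List[str] = []  # kind of each open <span>: line, word, or other

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "ocr_line" in classes:
            self._stack.append("line")
            self.parts.append("<div>")
        elif "ocrx_word" in classes:
            self._stack.append("word")
            word_id = html.escape(attributes.get("id") or "", quote=True)
            self.parts.append(f'<span id="{word_id}">')
        else:
            self._stack.append("other")

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        kind = self._stack.pop()
        if kind == "line":
            self.parts.append("</div>")
        elif kind == "word":
            self.parts.append("</span>")

    def handle_data(self, data):
        # Keep text only when it sits directly inside a word span.
        if self._stack and self._stack[-1] == "word":
            self.parts.append(html.escape(data))


def minify_hocr(hocr: str) -> str:
    parser = HocrMinifier()
    parser.feed(hocr)
    parser.close()
    return "".join(parser.parts)
```

A smaller document means fewer tokens in the request, while the surviving `id` values still let the answer point back to the original words.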

Currently, the platform has the following OpenAI IE models:

  • ml_ie_openai_model - uses hOCR source base 
  • ml_iehtml_openai_model - uses HTML source base

The models do not need training and therefore do not require training data, but they do support a train operation. The result of the training is a trained model with the default prompts specified during training, which serves as a kind of prompt versioning.

ml_ie_openai_model

This model minifies the hOCR HTML (depending on the hocr2html rendering selected in the model configuration), then sends OpenAI a request like this:

You are a good expert of extracting data from invoice documents. You receive HTML document as the result of OCR processing of scanned invoice, and the list of fields you should extract.
As an output you have to provide csv file with two columns: field tag and list of HTML tags "id" property. Pay attention that one extracted field may have several tags.
For table items provide a separate line for each row.
For example:
###BEGIN OF EXAMPLE
User ask you to extract:
```
Find all accounts in the balance sheet and for each item found extract:
	- company name with tag COMPANY
	- account with tag ACCOUNT
	- balance with tag BALANCE
Do not tag table headers.
```
Your input HTML is:
```html
<html>
	<body>
		<p>
			<div><span id="word_0_1">Remittance</span><span id="word_0_2">Advice</span></div>
			<div><span id="word_0_3">Company:</span> <span id="word_0_4">IBA</span><span id="word_0_5">Group</span></div>
			<div><span id="word_0_6">Income</span><span id="word_0_7">Fund</span></div>
		</p>
		<p>
			<div><span id="word_1_1">ACCOUNTS</span><span id="word_1_2">BALANCE</span></div>
			<div><span id="word_1_3">12341234</span><span id="word_1_4">$5000</span></div>
			<div><span id="word_1_5">22354123</span><span id="word_1_6">$1000</span></div>
		</p>
</body>
</html>
```
Your answer always should be a only valid csv file without any comment, do not ommit headers, always use " for values escaping:
```csv
"field_name","tag_id"
"COMPANY","word_0_4,word_0_5"
"ACCOUNT","word_1_3"
"BALANCE","word_1_4"
"ACCOUNT","word_1_5"
"BALANCE","word_1_6"
```
Note, that you 
###END OF EXAMPLE

Now your task is the following:
```
Find all items in the invoice and for each item found extract:
	- item name with tag PRODUCT. If there is no item in the invoice, split the description into item and description: where item it is the first sentence in the description
	- description with tag DESCRIPTION
	- unit price with tag PRICE.
	- quantity with tag QUANTITY
Do not tag table headers. Combine multiple lines of description tag into one tag if possible.
Also extract invoice information:
	- Company name of the client with tag CLIENT
	- Client address with tag ADDRESS
	- Invoice number with tag INVOICENUMBER
	- Date of issue with tag ISSUED
	- Due Date with tag DUE_DATE
	- Total amount, TOTAL
```
Your input HTML is:
```html
{html}
```

The OpenAI request is customizable; how to do this is explained below.
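The CSV answer requested by the prompt above can be turned back into a field-to-word-id mapping. The following is a minimal sketch, not the model's actual code: the function name parse_tag_csv is hypothetical, and it assumes the answer may arrive wrapped in a csv fence that has to be dropped.

```python
import csv
import io
from collections import defaultdict

FENCE = "`" * 3  # marker of a fenced block the LLM may wrap its answer in (assumption)


def parse_tag_csv(response_text: str) -> dict:
    """Map each field tag to a list of occurrences, each a list of word ids."""
    lines = [
        line for line in response_text.strip().splitlines()
        if not line.strip().startswith(FENCE)
    ]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    fields = defaultdict(list)
    for row in reader:
        # One field may span several words: "word_0_4,word_0_5"
        ids = [tag_id.strip() for tag_id in row["tag_id"].split(",") if tag_id.strip()]
        fields[row["field_name"]].append(ids)
    return dict(fields)


sample = '''"field_name","tag_id"
"COMPANY","word_0_4,word_0_5"
"ACCOUNT","word_1_3"
"BALANCE","word_1_4"
"ACCOUNT","word_1_5"
"BALANCE","word_1_6"
'''
parsed = parse_tag_csv(sample)
```

Each repeated tag (ACCOUNT, BALANCE) keeps one entry per CSV line, so table rows stay separate.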

Model Training

The training process creates a new model with the default prompts configuration. The trainer does not use training data; only the training configuration is used. Here is a sample model training configuration:

{
	"trainer_name": "ml_ie_openai_model",
	"trainer_version": "3.3.0",
	"trainer_description": "HOCR Information Extraction with OpenAI",
	"trainer_data_required": false,
	"prompts_config": {
		"debug": false,
		"messages": [
			{
				"role": "system",
				"content": "{systemRolePrompt}"
			},
			{
				"role": "user",
				"content": "{userRolePrompt}"
			}
		],
		"systemRolePrompt": "You are a good expert of extracting data from invoice documents. You receive HTML document as the result of OCR processing of scanned invoice, and the list of fields you should extract.\nAs an output you have to provide csv file with two columns: field tag and list of HTML tags \"id\" property. Pay attention that one extracted field may have several tags.\nFor table items provide a separate line for each row.\nFor example:\n###BEGIN OF EXAMPLE\nUser ask you to extract:\n```\nFind all accounts in the balance sheet and for each item found extract:\n\t- company name with tag COMPANY\n\t- account with tag ACCOUNT\n\t- balance with tag BALANCE\nDo not tag table headers.\n```\nYour input HTML is:\n```html\n<html>\n\t<body>\n\t\t<p>\n\t\t\t<div><span id=\"word_0_1\">Remittance</span><span id=\"word_0_2\">Advice</span></div>\n\t\t\t<div><span id=\"word_0_3\">Company:</span> <span id=\"word_0_4\">IBA</span><span id=\"word_0_5\">Group</span></div>\n\t\t\t<div><span id=\"word_0_6\">Income</span><span id=\"word_0_7\">Fund</span></div>\n\t\t</p>\n\t\t<p>\n\t\t\t<div><span id=\"word_1_1\">ACCOUNTS</span><span id=\"word_1_2\">BALANCE</span></div>\n\t\t\t<div><span id=\"word_1_3\">12341234</span><span id=\"word_1_4\">$5000</span></div>\n\t\t\t<div><span id=\"word_1_5\">22354123</span><span id=\"word_1_6\">$1000</span></div>\n\t\t</p>\n</body>\n</html>\n```\nYour answer always should be a only valid csv file without any comment, do not ommit headers, always use \" for values escaping:\n```csv\n\"field_name\",\"tag_id\"\n\"COMPANY\",\"word_0_4,word_0_5\"\n\"ACCOUNT\",\"word_1_3\"\n\"BALANCE\",\"word_1_4\"\n\"ACCOUNT\",\"word_1_5\"\n\"BALANCE\",\"word_1_6\"\n```\nNote, that you \n###END OF EXAMPLE",
		"userRolePrompt": "Now your task is the following:\n```\nFind all items in the invoice and for each item found extract:\n\t- item name with tag PRODUCT. If there is no item in the invoice, split the description into item and description: where item it is the first sentence in the description\n\t- description with tag DESCRIPTION\n\t- unit price with tag PRICE.\n\t- quantity with tag QUANTITY\nDo not tag table headers. Combine multiple lines of description tag into one tag if possible.\nAlso extract invoice information:\n\t- Company name of the client with tag CLIENT\n\t- Client address with tag ADDRESS\n\t- Invoice number with tag INVOICENUMBER\n\t- Date of issue with tag ISSUED\n\t- Due Date with tag DUE_DATE\n\t- Total amount, TOTAL\n```\nYour input HTML is:\n```html\n{html}\n```",
		"html": "",
		"environment": "OpenAI",
		"temperature": 0,
		"open_ai_model": "gpt-4o",
		"track_into_langfuse": false,
		"hocr2html": {
			"type": "table",
			"bbox_to_cell_tolerance_x": 10,
			"bbox_to_cell_tolerance_y": 10,
			"cell_to_row_tolerance": 20,
			"row_to_table_tolerance": 10
		},
		"concat_single_entities": true,
		"entities": [
			{
				"tag": "PRODUCT",
				"name": "Product Name"
			},
			{
				"tag": "DESCRIPTION",
				"name": "Product Description"
			},
			{
				"tag": "QUANTITY",
				"name": "Quantity"
			},
			{
				"tag": "PRICE",
				"name": "Price"
			},
			{
				"tag": "CLIENT",
				"name": "Company Name",
				"single": true
			},
			{
				"tag": "ADDRESS",
				"name": "Street Address",
				"single": true
			},
			{
				"tag": "INVOICENUMBER",
				"name": "Invoice Number",
				"single": true
			},
			{
				"tag": "ISSUED",
				"name": "Invoice Date",
				"single": true
			},
			{
				"tag": "DUE_DATE",
				"name": "Due Date",
				"single": true
			},
			{
				"tag": "TOTAL",
				"name": "Total Amount",
				"single": true
			}
		]
	}
}

where:

  • prompts_config - the default prompts configuration saved into the trained model
  • messages - the prompt messages structure to send to the OpenAI API
  • html - the simplified HTML of the document, which the model creates and injects into the prompt context
  • environment - the alias of a secret vault record that stores JSON with the environment variables to set before calling the LLM API
  • temperature - the request temperature; suitable values depend on the LLM model and the task, and are usually graded like: Coding / Math - 0.0; Data Cleaning / Data Analysis - 1.0; Creative Writing / Poetry - 1.5
  • open_ai_model - the OpenAI model to use, required
  • track_into_langfuse - tracks the OpenAI conversation in Langfuse if true
  • entities - the entity-to-response-tag mapping used to map the document tagged by OpenAI to document entities. The single flag is used when processing concat_single_entities.
  • debug - a boolean that switches debug messages on
  • concat_single_entities (boolean, optional) - uses the entities configuration to concatenate entities with the same name into one string
  • hocr2html - the HOCR to HTML rendering configuration
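To illustrate how the single flag and concat_single_entities might interact, here is a rough sketch. It is an assumption about the behavior, not the model's code: the function concat_single is hypothetical, the input is taken to be (tag, text) pairs, and joining with a space is a guess.

```python
def concat_single(extracted, entities, separator=" "):
    """Join values of entities flagged as single; keep the others as lists."""
    single_tags = {entity["tag"] for entity in entities if entity.get("single")}
    grouped = {}
    for tag, text in extracted:
        grouped.setdefault(tag, []).append(text)
    return {
        tag: separator.join(values) if tag in single_tags else values
        for tag, values in grouped.items()
    }


entities = [
    {"tag": "PRODUCT", "name": "Product Name"},
    {"tag": "ADDRESS", "name": "Street Address", "single": True},
]
extracted = [
    ("ADDRESS", "087 Jackson Drive"),
    ("PRODUCT", "crab meat"),
    ("ADDRESS", "Washington, 86-723"),
    ("PRODUCT", "mountain food"),
]
result = concat_single(extracted, entities)
```

Under these assumptions the two ADDRESS fragments become one string, while PRODUCT keeps one value per table row.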

OpenAI models environment

To use the OpenAI API you need to specify the LLM provider URL and access token in environment variables (OPENAI_BASE_URL, OPENAI_API_KEY).

The environment parameter of the model configuration is the alias of a Secret Vault record that contains the JSON with the environment variables to set before using the OpenAI API.

Here is the JSON template for it:

{
	"OPENAI_BASE_URL": "https://a_llm_host.org",
	"OPENAI_API_KEY": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
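A minimal sketch of how such a JSON record could be applied before the client is created. The function apply_environment is hypothetical and the secret is passed in as a plain string; how the platform actually reads the Secret Vault is not shown here.

```python
import json
import os


def apply_environment(secret_json: str) -> None:
    """Copy every key of the JSON record into the process environment."""
    for name, value in json.loads(secret_json).items():
        os.environ[name] = str(value)


# Placeholder values only; a real record comes from the Secret Vault.
secret_json = '{"OPENAI_BASE_URL": "https://a_llm_host.org", "OPENAI_API_KEY": "sk-placeholder"}'
apply_environment(secret_json)
```

An OpenAI() client created without arguments afterwards reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment.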

Prompts configuration

The prompts_config is a map of parameters that the model uses to create an OpenAI request. The model gets it from:

  • configuration parameter of the MlTask call
  • model default configuration

The MlTask configuration parameter overrides the existing model default configuration, i.e. you can pass only the changed keys in the MlTask and keep the rest from the defaults.
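The override rule can be pictured as a key-by-key merge. This sketch assumes a shallow merge of top-level keys; the function name effective_prompts_config is hypothetical, and whether the model merges nested maps such as hocr2html more deeply is not stated here.

```python
def effective_prompts_config(model_defaults: dict, ml_task_config: dict) -> dict:
    """Start from the model defaults and let MlTask keys win."""
    merged = dict(model_defaults)
    merged.update(ml_task_config)
    return merged


defaults = {
    "open_ai_model": "gpt-4o",
    "temperature": 0,
    "userRolePrompt": "default task description",
}
overrides = {"userRolePrompt": "Extract the invoice number with tag INVOICENUMBER"}
config = effective_prompts_config(defaults, overrides)
```

Only userRolePrompt changes; open_ai_model and temperature are kept from the defaults.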

Here is the platform task code that prepares the ML call:

MlTaskData mlTaskData = new MlTaskData(modelName, modelVersion);
mlTaskData.getConfiguration().putAll(documentContext.getMlConfiguration());
. . . . .
default Map getMlConfiguration() {
	 return (Map) getSettings().getOrDefault("mlConfiguration", new HashMap());
}

So, to pass the prompts to the model, specify the mlConfiguration map in the document set settings of the document processor configuration, or in the configuration parameter of the AP that uses the datastore document context.

The messages parameter defines the prompt structure for the OpenAI request. Here is the model's Python code that calls OpenAI:

from openai import OpenAI

# The API key and base URL are read from the environment variables
client = OpenAI()

prompt_completion = client.chat.completions.create(
	model=openai_model,
	messages=messages,
	temperature=0,
)
openai_response = str(prompt_completion.choices[0].message.content)

The messages structure is required. Here is the default structure:

{
	"messages": [{
			"role": "system",
			"content": "{systemRolePrompt}"
		}, {
			"role": "user",
			"content": "{userRolePrompt}"
		}
	]
}

It sends a request with the system ({systemRolePrompt}) and user ({userRolePrompt}) roles. {systemRolePrompt} and {userRolePrompt} refer to keys from the prompts configuration.

Only one-level key references are allowed in the prompts configuration.

The html key is injected by the model and contains the minified document.

You can completely change the default messages structure, or redefine the systemRolePrompt and userRolePrompt.

The userRolePrompt always needs to be changed according to your document set and the fields you need to extract. It contains the description of the fields that OpenAI should extract.
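The placeholder mechanism can be sketched as follows. This is an illustration under assumptions, not the model's substitution code: the names substitute and render_messages are hypothetical, placeholders are taken to name top-level string keys of the prompts configuration, and the {html} reference inside userRolePrompt is resolved in a second pass.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def substitute(text: str, config: dict) -> str:
    """Replace {key} with config[key] when that key holds a string."""
    def replace(match):
        value = config.get(match.group(1))
        return value if isinstance(value, str) else match.group(0)
    return PLACEHOLDER.sub(replace, text)


def render_messages(config: dict) -> list:
    """Build the final messages list from the messages template."""
    rendered = []
    for message in config["messages"]:
        content = substitute(message["content"], config)  # {userRolePrompt} -> prompt text
        content = substitute(content, config)             # {html} inside the prompt text
        rendered.append({"role": message["role"], "content": content})
    return rendered


config = {
    "messages": [
        {"role": "system", "content": "{systemRolePrompt}"},
        {"role": "user", "content": "{userRolePrompt}"},
    ],
    "systemRolePrompt": "You extract fields.",
    "userRolePrompt": "Extract TOTAL from:\n{html}",
    "html": "<p>42</p>",
}
messages = render_messages(config)
```

Unknown placeholders and non-string keys are left untouched, so a stray {word} in a prompt does not raise an error.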

HOCR to HTML rendering configuration

The hocr2html parameters specify a simplified text rendering algorithm, which is selected by the hocr2html.type key. The following renderings exist:

  • default - puts words in the order in which they appear in the HOCR
  • table - puts words according to the recognized table layout
  • table-rows - uses the table renderer to obtain the table layout and puts words according to the row flow, without cell separation

Default HOCR to HTML rendering (default)

The default rendering follows the normal order of the HOCR tags and produces output according to the following rules:

  • <div class="ocr_page"> → <p>
  • <span class="ocr_line"> → <div>
  • <span class="ocrx_word"> → <span id="word_[Page  index]_[Word index on page]">[Word]</span>

Here is a typical rendered html:

<html>
	<body>
		<p>
			<div><span id="word_0_1">Remittance</span><span id="word_0_2">Advice</span></div>
			<div><span id="word_0_3">Company:</span> <span id="word_0_4">IBA</span><span id="word_0_5">Group</span></div>
			<div><span id="word_0_6">Income</span><span id="word_0_7">Fund</span></div>
		</p>
		<p>
			<div><span id="word_1_1">ACCOUNTS</span><span id="word_1_2">BALANCE</span></div>
			<div><span id="word_1_3">12341234</span><span id="word_1_4">$5000</span></div>
			<div><span id="word_1_5">22354123</span><span id="word_1_6">$1000</span></div>
		</p>
	</body>
</html>
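The mapping rules above can be sketched with the standard library XML parser. This is a simplified illustration, not the platform renderer: the function render_default is hypothetical, the input is assumed to be well-formed XHTML without HTML entities, and word numbering is assumed to start at 1 on every page, as in the sample.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_default(hocr: str) -> str:
    """Render hOCR pages, lines and words as <p>, <div> and <span id=...>."""
    root = ET.fromstring(hocr)
    pages = [el for el in root.iter() if el.get("class") == "ocr_page"]
    out = ["<html><body>"]
    for page_index, page in enumerate(pages):
        out.append("<p>")
        word_index = 0
        for line in page.iter():
            if line.get("class") != "ocr_line":
                continue
            spans = []
            for word in line.iter():
                if word.get("class") == "ocrx_word":
                    word_index += 1
                    text = "".join(word.itertext()).strip()
                    spans.append(
                        f'<span id="word_{page_index}_{word_index}">{escape(text)}</span>'
                    )
            out.append("<div>" + "".join(spans) + "</div>")
        out.append("</p>")
    out.append("</body></html>")
    return "".join(out)


hocr_sample = (
    '<html><body>'
    '<div class="ocr_page"><span class="ocr_line">'
    '<span class="ocrx_word">Remittance</span> <span class="ocrx_word">Advice</span>'
    '</span></div>'
    '<div class="ocr_page"><span class="ocr_line">'
    '<span class="ocrx_word">ACCOUNTS</span>'
    '</span></div>'
    '</body></html>'
)
rendered = render_default(hocr_sample)
```

The bbox coordinates and other hOCR attributes are dropped, which is what keeps the HTML sent to the LLM small.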

Table HOCR to HTML rendering (table)

This renderer groups HOCR bboxes into cells, rows and tables.

The renderer uses the following settings:

"hocr2html": {
	"type": "table",
	"bbox_to_cell_tolerance_x": 10,
	"bbox_to_cell_tolerance_y": 10,
	"cell_to_row_tolerance": 20,
	"row_to_table_tolerance": 10
}

Where:

  • bbox_to_cell_tolerance_x - the maximum width in pixels of the gap between two bboxes that belong to the same table cell
  • bbox_to_cell_tolerance_y - the maximum height in pixels of the gap between two bboxes that belong to the same table cell
  • cell_to_row_tolerance - the maximum height in pixels of the gap between two cells that belong to the same row
  • row_to_table_tolerance - the maximum height in pixels of the gap between two rows that belong to the same table

The renderer does the following:

  • tries to combine bboxes into cells using bbox_to_cell_tolerance_x and bbox_to_cell_tolerance_y
  • then combines cells into rows using cell_to_row_tolerance
  • then combines rows into tables using row_to_table_tolerance
  • renders words according to cell order
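The first of these steps, combining bboxes into cells, can be approximated by a simple gap-based clustering. This is only a sketch of the idea under assumptions: the Box class and group_into_cells function are hypothetical, and the real renderer may measure distances differently. Rows and tables could be built analogously from cell_to_row_tolerance and row_to_table_tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: int
    y0: int
    x1: int
    y1: int


def gap(a0: int, a1: int, b0: int, b1: int) -> int:
    """Distance between two 1-D intervals; 0 if they overlap."""
    return max(0, max(a0, b0) - min(a1, b1))


def group_into_cells(boxes, tol_x=10, tol_y=10):
    """Cluster boxes whose horizontal and vertical gaps are within tolerance."""
    cells = []  # each cell is a list of boxes
    for box in boxes:
        merged, rest = [box], []
        for cell in cells:
            close = any(
                gap(b.x0, b.x1, box.x0, box.x1) <= tol_x
                and gap(b.y0, b.y1, box.y0, box.y1) <= tol_y
                for b in cell
            )
            if close:
                merged.extend(cell)  # the new box links this cell to the merged one
            else:
                rest.append(cell)
        cells = rest + [merged]
    return cells


words = [Box(0, 0, 40, 10), Box(45, 0, 80, 10), Box(200, 0, 240, 10)]
cells = group_into_cells(words)
```

The first two boxes are 5 pixels apart and end up in one cell; the third is 120 pixels away and forms its own cell.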

With debug=true, the renderer uploads a debug JPG with the table layout to storage.

The renderer preserves page elements:

  • <div class="ocr_page"> → <p>

Here is a typical rendered html:

<html>
	<body>
		<p>
			<table>
				<tr>
					<td>
						<span id="word_0_3">DATE</span>
					</td>
					<td>
						<span id="word_0_13">08</span>
						<span id="word_0_14">Mar,</span>
						<span id="word_0_15">2020</span>
					</td>
					<td>
						<span id="word_0_4">INVOICE</span>
						<span id="word_0_5">NO</span>
					</td>
					<td>
						<span id="word_0_16">4453074013</span>
					</td>
					<td>
						<span id="word_0_6">Park</span>
						<span id="word_0_7">City</span>
						<span id="word_0_8">Group</span>
						<span id="word_0_9">DC</span>
						<span id="word_0_10">087</span>
						<span id="word_0_11">Jackson</span>
						<span id="word_0_12">Drive</span>
						<span id="word_0_17">Washington,</span>
						<span id="word_0_18">86-723</span>
						<span id="word_0_19">+86</span>
						<span id="word_0_20">(824)</span>
						<span id="word_0_21">519-7851</span>
						<span id="word_0_22">citizens@corp.com</span>
					</td>
				</tr>
			</table>
		</p>
	</body>
</html>

Table-Rows HOCR to HTML rendering (table-rows)

This renderer uses the same table page grouping mechanism as the table rendering, but instead of putting <table> into the resulting HTML, it outputs only rows, without cell grouping:

  • <div class="ocr_page"> → <p>
  • row → <div>
  • <span class="ocrx_word"> → <span id="word_[Page  index]_[Word index on page]">[Word]</span>

Here is a typical rendered html:

<html>
	<body>
		<p>
			<div>
				<span id="word_0_1">INVOICE</span>
			</div>
			<div>
				<span id="word_0_48">QUANTITY</span>
				<span id="word_0_49">DESCRIPTION</span>
				<span id="word_0_50">UNIT</span>
				<span id="word_0_51">PRICE</span>
				<span id="word_0_52">LINE</span>
				<span id="word_0_53">TOTAL</span>
			</div>
			<div>
				<span id="word_0_54">19.00</span>
				<span id="word_0_55">Initation</span>
				<span id="word_0_56">crab</span>
				<span id="word_0_57">meat</span>
				<span id="word_0_60">Mountain</span>
				<span id="word_0_61">food</span>
				<span id="word_0_62">magic</span>
				<span id="word_0_63">healthy</span>
				<span id="word_0_64">yummy</span>
				<span id="word_0_65">food</span>
				<span id="word_0_58">$150.00</span>
				<span id="word_0_59">$2850.00</span>
			</div>
		</p>
	</body>
</html>

Langfuse integration

The OpenAI models also support Langfuse integration, which lets you track your LLM requests and pricing.

For this, specify the LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY environment variables.

Here is the JSON template for it:

{
	"OPENAI_BASE_URL": "https://a_llm_host.org",
	"OPENAI_API_KEY": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
	"LANGFUSE_HOST": "http://10.25.64.83:3000",
	"LANGFUSE_PUBLIC_KEY": "pk-lf-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
	"LANGFUSE_SECRET_KEY": "sk-lf-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}

Sample

The Intelligent Document Processing (IDP) contains the document set IDP_SAMPLE_INVOICE_OPENAI, which is configured to work with the ml_ie_openai_model.

ml_iehtml_openai_model

The model has functionality very similar to the ml_ie_openai_model. The differences are in the HTML to minified HTML rendering and the default prompts. Here is the default request that the model sends to OpenAI:

You are a good expert of extracting data from HTML documents. You receive HTML document and the list of fields you should to extract.
The provided document contains tags with id attributes, you should find the requested fields as text and provide a csv csv file with tree columns: 
field_name - the field name 
area_id - the id attribute value of a closes tag where the requested field has been found, the only one id should be specified, noticed that not all tags contains the id attribute, the closes parent with id should be used
text - the exact text value of the requested field, the column value could contain less text than the closes tag 
Pay attention that one extracted field could has been found in many places.
For table items provide a separate line for each row.
For example:
###BEGIN OF EXAMPLE
User ask you to extract:
```
Find all accounts in the balance sheet and for each item found extract:
	- company name with tag COMPANY
	- account with tag ACCOUNT
	- balance with tag BALANCE
Do not tag table headers.
```
Your input HTML is:
```html
<html>
	<body>
		<p id="6">
			<span id="0">Remittance Advice</span>
			<span id="1">Company: IBA Group</span>
			<span id="2">Income Fund</span>
		</p>
		<p id="7">
			<span id="3">ACCOUNTS BALANCE</span>
			<span id="4">12341234 $5000</span>
			<span id="5">22354123 $1000</span>
			<span>34567890 $200</span>
		</p>
</body>
</html>
```
Your answer always should be a only valid csv file without any comment, do not ommit headers, always use " for values escaping:
```csv
"field_name","area_id","text"
"COMPANY","1","IBA Group"
"ACCOUNT","4","12341234"
"BALANCE","4","$5000"
"ACCOUNT","5","22354123"
"BALANCE","6","$1000"
"ACCOUNT","7","34567890"
"BALANCE","7","$200"
```
###END OF EXAMPLE


Now your task is the following:
```
Find all items in the invoice and for each item found extract:
	- item name with tag PRODUCT. If there is no item in the invoice, split the description into item and description: where item it is the first sentence in the description
	- description with tag DESCRIPTION
	- unit price with tag PRICE.
	- quantity with tag QUANTITY
Do not tag table headers. Combine multiple lines of description tag into one tag if possible.
Also extract invoice information:
	- Company name of the client with tag CLIENT
	- Client address with tag ADDRESS
	- Invoice number with tag INVOICENUMBER
	- Date of issue with tag ISSUED
	- Due Date with tag DUE_DATE
	- Total amount, TOTAL
```
Your input HTML is:
```html
{html}
```
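Because this model's answer carries both an area_id and the exact text, each CSV line can be cross-checked against the document. The sketch below is an assumption about how a consumer might do that, not part of the model: the function check_rows is hypothetical and the input HTML is assumed to be well-formed enough for an XML parser.

```python
import csv
import io
import xml.etree.ElementTree as ET


def check_rows(html: str, csv_text: str) -> list:
    """Return (field_name, area_id, text, found) for every CSV line."""
    root = ET.fromstring(html)
    text_by_id = {
        el.get("id"): "".join(el.itertext())
        for el in root.iter()
        if el.get("id") is not None
    }
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        area_text = text_by_id.get(row["area_id"], "")
        found = row["text"] in area_text
        results.append((row["field_name"], row["area_id"], row["text"], found))
    return results


html_doc = (
    '<html><body><p id="2">'
    '<span id="0">Company: IBA Group</span>'
    '<span>34567890 $200</span>'
    '</p></body></html>'
)
answer = '''"field_name","area_id","text"
"COMPANY","0","IBA Group"
"ACCOUNT","2","34567890"
"BALANCE","0","$200"
'''
checked = check_rows(html_doc, answer)
```

The last line is flagged because "$200" does not occur inside the element with id 0; the span without an id is reachable only through its parent with id 2.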

Model Training

Here is the model's default training configuration:

{
	<span style="color: rgb(135,16,148);">"trainer_name"</span>: <span style="color: rgb(6,125,23);">"ml_iehtml_openai_model"</span>,
	<span style="color: rgb(135,16,148);">"trainer_version"</span>: <span style="color: rgb(6,125,23);">"3.3.0"</span>,
	<span style="color: rgb(135,16,148);">"trainer_description"</span>: <span style="color: rgb(6,125,23);">"HTML Information Extraction with OpenAI"</span>,
	<span style="color: rgb(135,16,148);">"trainer_data_required"</span>: <span style="color: rgb(0,51,179);">false</span>,
	<span style="color: rgb(135,16,148);">"prompts_config"</span>: {
		<span style="color: rgb(135,16,148);">"debug"</span>: <span style="color: rgb(0,51,179);">true</span>,
		<span style="color: rgb(135,16,148);">"messages"</span>: [
			{
				<span style="color: rgb(135,16,148);">"role"</span>: <span style="color: rgb(6,125,23);">"system"</span>,
				<span style="color: rgb(135,16,148);">"content"</span>: <span style="color: rgb(6,125,23);">"{systemRolePrompt}"
</span><span style="color: rgb(6,125,23);">			</span>},
			{
				<span style="color: rgb(135,16,148);">"role"</span>: <span style="color: rgb(6,125,23);">"user"</span>,
				<span style="color: rgb(135,16,148);">"content"</span>: <span style="color: rgb(6,125,23);">"{userRolePrompt}"
</span><span style="color: rgb(6,125,23);">			</span>}
		],
		<span style="color: rgb(135,16,148);">"systemRolePrompt"</span>: <span style="color: rgb(6,125,23);">"You are a good expert of extracting data from HTML documents. You receive HTML document and the list of fields you should to extract.</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">The provided document contains tags with id attributes, you should find the requested fields as text and provide a csv csv file with tree columns: </span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">field_name - the field name </span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">area_id - the id attribute value of a closes tag where the requested field has been found, the only one id should be specified, noticed that not all tags contains the id attribute, the closes parent with id should be used</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">text - the exact text value of the requested field, the column value could contain less text than the closes tag </span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">Pay attention that one extracted field could has been found in many places, and if you find an item that belogng to different areas, you should split it into lines with their ids.</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">For table items provide a separate line for each row.</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">For example:</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">###BEGIN OF EXAMPLE</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">User ask you to extract:</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">```</span><span style="color: rgb(0,55,166);">\n</span><span style="color: 
rgb(6,125,23);">Find all accounts in the balance sheet and for each item found extract:</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- company name with tag COMPANY</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- account with tag ACCOUNT</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- balance with tag BALANCE</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">Do not tag table headers.</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">```</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">Your input HTML is:</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">```html</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);"><html></span><span style="color: rgb(0,55,166);">\n\t</span><span style="color: rgb(6,125,23);"><body></span><span style="color: rgb(0,55,166);">\n\t\t</span><span style="color: rgb(6,125,23);"><p id=</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">6</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">></span><span style="color: rgb(0,55,166);">\n\t\t\t</span><span style="color: rgb(6,125,23);"><span id=</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">0</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">>Remittance Advice</span></span><span style="color: rgb(0,55,166);">\n\t\t\t</span><span style="color: rgb(6,125,23);"><span id=</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">1</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">>Company: IBA Group</span></span><span style="color: rgb(0,55,166);">\n\t\t\t</span><span style="color: 
rgb(6,125,23);"><span id=</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">2</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">>Income Fund</span></span><span style="color: rgb(0,55,166);">\n\t\t</span><span style="color: rgb(6,125,23);"></p></span><span style="color: rgb(0,55,166);">\n\t\t</span><span style="color: rgb(6,125,23);"><p id=</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">7</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">></span><span style="color: rgb(0,55,166);">\n\t\t\t</span><span style="color: rgb(6,125,23);"><span id=</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">3</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">>ACCOUNTS BALANCE</span></span><span style="color: rgb(0,55,166);">\n\t\t\t</span><span style="color: rgb(6,125,23);"><span id=</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">4</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">>12341234 $5000</span></span><span style="color: rgb(0,55,166);">\n\t\t\t</span><span style="color: rgb(6,125,23);"><span id=</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">5</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">>22354123 $1000</span></span><span style="color: rgb(0,55,166);">\n\t\t\t</span><span style="color: rgb(6,125,23);"><span>34567890 $200</span></span><span style="color: rgb(0,55,166);">\n\t\t</span><span style="color: rgb(6,125,23);"></p></span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);"></body></span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);"></html></span><span style="color: rgb(0,55,166);">\n</span><span style="color: 
rgb(6,125,23);">```</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">Your answer always should be a only valid csv file without any comment, do not ommit headers, always use </span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);"> for values escaping:</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">```csv</span><span style="color: rgb(0,55,166);">\n\"</span><span style="color: rgb(6,125,23);">field_name</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">area_id</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">text</span><span style="color: rgb(0,55,166);">\"\n\"</span><span style="color: rgb(6,125,23);">COMPANY</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">1</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">IBA Group</span><span style="color: rgb(0,55,166);">\"\n\"</span><span style="color: rgb(6,125,23);">ACCOUNT</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">4</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">12341234</span><span style="color: rgb(0,55,166);">\"\n\"</span><span style="color: rgb(6,125,23);">BALANCE</span><span style="color: rgb(0,55,166);">\"</span><span style="color: 
rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">4</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">$5000</span><span style="color: rgb(0,55,166);">\"\n\"</span><span style="color: rgb(6,125,23);">ACCOUNT</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">5</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">22354123</span><span style="color: rgb(0,55,166);">\"\n\"</span><span style="color: rgb(6,125,23);">BALANCE</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">5</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">$1000</span><span style="color: rgb(0,55,166);">\"\n\"</span><span style="color: rgb(6,125,23);">ACCOUNT</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">7</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">34567890</span><span style="color: rgb(0,55,166);">\"\n\"</span><span style="color: rgb(6,125,23);">BALANCE</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">7</span><span style="color: 
rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">,</span><span style="color: rgb(0,55,166);">\"</span><span style="color: rgb(6,125,23);">$200</span><span style="color: rgb(0,55,166);">\"\n</span><span style="color: rgb(6,125,23);">```</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">###END OF EXAMPLE"</span>,
		<span style="color: rgb(135,16,148);">"userRolePrompt"</span>: <span style="color: rgb(6,125,23);">"Now your task is the following:</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">```</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">Find all items in the invoice and for each item found extract:</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- item name with tag PRODUCT. If there is no item in the invoice, split the description into item and description: where item it is the first sentence in the description</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- description with tag DESCRIPTION</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- unit price with tag PRICE.</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- quantity with tag QUANTITY</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">Do not tag table headers. 
Combine multiple lines of description tag into one tag if possible.</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">Also extract invoice information:</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- Company name of the client with tag CLIENT</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- Client address with tag ADDRESS</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- Invoice number with tag INVOICENUMBER</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- Date of issue with tag ISSUED</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- Due Date with tag DUE_DATE</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">	- Total amount, TOTAL</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">```</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">Your input HTML is:</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">```html</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">{html}</span><span style="color: rgb(0,55,166);">\n</span><span style="color: rgb(6,125,23);">```"</span>,
		<span style="color: rgb(135,16,148);">"html"</span>: <span style="color: rgb(6,125,23);">""</span>,
		<span style="color: rgb(135,16,148);">"environment"</span>: <span style="color: rgb(6,125,23);">"OpenAI"</span>,
		<span style="color: rgb(135,16,148);">"open_ai_model"</span>: <span style="color: rgb(6,125,23);">"gpt-4o"</span>,
		<span style="color: rgb(135,16,148);">"temperature"</span>: <span style="color: rgb(23,80,235);">0</span>,
		<span style="color: rgb(135,16,148);">"track_into_langfuse"</span>: <span style="color: rgb(0,51,179);">false</span>,
		<span style="color: rgb(135,16,148);">"elements_with_id"</span>: <span style="color: rgb(6,125,23);">"p"</span>,
		<span style="color: rgb(135,16,148);">"elements_to_delete"</span>: <span style="color: rgb(6,125,23);">"head, style, img"</span>,
		<span style="color: rgb(135,16,148);">"elements_to_unwrap"</span>: <span style="color: rgb(6,125,23);">"tbody, div, span, b, i, a"</span>,
		<span style="color: rgb(135,16,148);">"concat_single_entities"</span>: <span style="color: rgb(0,51,179);">true</span>,
		<span style="color: rgb(135,16,148);">"entities"</span>: [
			{
				<span style="color: rgb(135,16,148);">"tag"</span>: <span style="color: rgb(6,125,23);">"PRODUCT"</span>,
				<span style="color: rgb(135,16,148);">"name"</span>: <span style="color: rgb(6,125,23);">"Product Name"
</span><span style="color: rgb(6,125,23);">			</span>},
			{
				<span style="color: rgb(135,16,148);">"tag"</span>: <span style="color: rgb(6,125,23);">"DESCRIPTION"</span>,
				<span style="color: rgb(135,16,148);">"name"</span>: <span style="color: rgb(6,125,23);">"Product Description"
</span><span style="color: rgb(6,125,23);">			</span>},
			{
				<span style="color: rgb(135,16,148);">"tag"</span>: <span style="color: rgb(6,125,23);">"QUANTITY"</span>,
				<span style="color: rgb(135,16,148);">"name"</span>: <span style="color: rgb(6,125,23);">"Quantity"
</span><span style="color: rgb(6,125,23);">			</span>},
			{
				<span style="color: rgb(135,16,148);">"tag"</span>: <span style="color: rgb(6,125,23);">"PRICE"</span>,
				<span style="color: rgb(135,16,148);">"name"</span>: <span style="color: rgb(6,125,23);">"Price"
</span><span style="color: rgb(6,125,23);">			</span>},
			{
				<span style="color: rgb(135,16,148);">"tag"</span>: <span style="color: rgb(6,125,23);">"CLIENT"</span>,
				<span style="color: rgb(135,16,148);">"name"</span>: <span style="color: rgb(6,125,23);">"Company Name"</span>,
				<span style="color: rgb(135,16,148);">"single"</span>: <span style="color: rgb(0,51,179);">true
</span><span style="color: rgb(0,51,179);">			</span>},
			{
				<span style="color: rgb(135,16,148);">"tag"</span>: <span style="color: rgb(6,125,23);">"ADDRESS"</span>,
				<span style="color: rgb(135,16,148);">"name"</span>: <span style="color: rgb(6,125,23);">"Street Address"</span>,
				<span style="color: rgb(135,16,148);">"single"</span>: <span style="color: rgb(0,51,179);">true
</span><span style="color: rgb(0,51,179);">			</span>},
			{
				<span style="color: rgb(135,16,148);">"tag"</span>: <span style="color: rgb(6,125,23);">"INVOICENUMBER"</span>,
				<span style="color: rgb(135,16,148);">"name"</span>: <span style="color: rgb(6,125,23);">"Invoice Number"</span>,
				<span style="color: rgb(135,16,148);">"single"</span>: <span style="color: rgb(0,51,179);">true
</span><span style="color: rgb(0,51,179);">			</span>},
			{
				<span style="color: rgb(135,16,148);">"tag"</span>: <span style="color: rgb(6,125,23);">"ISSUED"</span>,
				<span style="color: rgb(135,16,148);">"name"</span>: <span style="color: rgb(6,125,23);">"Invoice Date"</span>,
				<span style="color: rgb(135,16,148);">"single"</span>: <span style="color: rgb(0,51,179);">true
</span><span style="color: rgb(0,51,179);">			</span>},
			{
				<span style="color: rgb(135,16,148);">"tag"</span>: <span style="color: rgb(6,125,23);">"DUE_DATE"</span>,
				<span style="color: rgb(135,16,148);">"name"</span>: <span style="color: rgb(6,125,23);">"Due Date"</span>,
				<span style="color: rgb(135,16,148);">"single"</span>: <span style="color: rgb(0,51,179);">true
</span><span style="color: rgb(0,51,179);">			</span>},
			{
				<span style="color: rgb(135,16,148);">"tag"</span>: <span style="color: rgb(6,125,23);">"TOTAL"</span>,
				<span style="color: rgb(135,16,148);">"name"</span>: <span style="color: rgb(6,125,23);">"Total Amount"</span>,
				<span style="color: rgb(135,16,148);">"single"</span>: <span style="color: rgb(0,51,179);">true
</span><span style="color: rgb(0,51,179);">			</span>}
		]
	}
}

where:

  • prompts_config - the default prompt configuration saved in the trained model
  • messages - the prompt message structure sent to the OpenAI API
  • html - the simplified HTML of the document, which the model creates and injects into the prompt context
  • environment - a secret vault alias that stores a JSON object with the environment variables to set before calling the LLM API
  • temperature - the request temperature. Suitable values depend on the LLM; a common gradation is: Coding / Math - 0.0; Data Cleaning / Data Analysis - 1.0; Creative Writing / Poetry - 1.5
  • open_ai_model - the OpenAI model to use, required
  • track_into_langfuse - if true, the OpenAI conversation is tracked in Langfuse
  • entities - a mapping from response tags to entities, used to convert the OpenAI-tagged document into the document's entities. The single flag is used when processing concat_single_entities.
  • debug (boolean) - switches debug messages on
  • concat_single_entities (boolean)(optional) - uses the entities configuration to concatenate entities with the same name into one string
  • elements_with_id - selector for the HTML elements to mark with an id
  • elements_to_delete - selector for the elements to delete from the document
  • elements_to_unwrap - selector for the elements to unwrap
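
To illustrate how the entities mapping and the concat_single_entities flag could interact with the CSV answer requested in the prompt, here is a minimal Python sketch. It is not the platform's implementation: the function names parse_llm_csv and map_entities, the output dictionary layout, and the space separator used for concatenation are all assumptions made for the example.

```python
import csv
import io


def parse_llm_csv(answer: str) -> list[dict]:
    """Parse a CSV answer with columns field_name, area_id, text.

    Lines that start with a Markdown fence are dropped, in case the
    model wraps its answer in one (an assumption about the response).
    """
    lines = [ln for ln in answer.strip().splitlines() if not ln.startswith("```")]
    return list(csv.DictReader(io.StringIO("\n".join(lines))))


def map_entities(rows, entities, concat_single_entities=True, separator=" "):
    """Map response tags to entity names (hypothetical output layout).

    Rows whose entity is marked "single" are merged into one item when
    concat_single_entities is true; the separator is an assumption.
    """
    by_tag = {e["tag"]: e for e in entities}
    result = []
    single_index = {}  # entity name -> position of its item in result
    for row in rows:
        entity = by_tag.get(row["field_name"])
        if entity is None:
            continue  # tag is not configured in "entities", skip it
        if concat_single_entities and entity.get("single"):
            pos = single_index.get(entity["name"])
            if pos is not None:
                # Same single entity found again: extend the existing item.
                result[pos]["text"] += separator + row["text"]
                result[pos]["area_ids"].append(row["area_id"])
                continue
            single_index[entity["name"]] = len(result)
        result.append(
            {"name": entity["name"], "area_ids": [row["area_id"]], "text": row["text"]}
        )
    return result


if __name__ == "__main__":
    answer = (
        '"field_name","area_id","text"\n'
        '"CLIENT","1","IBA"\n'
        '"CLIENT","2","Group"\n'
        '"PRODUCT","3","Widget"\n'
    )
    entities = [
        {"tag": "CLIENT", "name": "Company Name", "single": True},
        {"tag": "PRODUCT", "name": "Product Name"},
    ]
    for item in map_entities(parse_llm_csv(answer), entities):
        print(item)
```

In this sketch, entities without the single flag (such as PRODUCT) always produce one item per CSV row, so table rows stay separate.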

HTML to minified HTML rendering

Here is a sample of a document minification:

The rendering algorithm is as follows:

  • add an id attribute to all elements specified by the elements_with_id selector
  • delete all elements specified by the elements_to_delete selector
  • unwrap all elements specified by the elements_to_unwrap selector
  • remove all empty elements
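
The four steps above can be sketched in Python with the standard library only. This is an illustration under stated assumptions, not the model's actual code: the input is assumed to be well-formed XHTML (so xml.etree.ElementTree can parse it), each selector is treated as a plain comma-separated list of tag names, ids are assumed to be sequential integers in document order, and the function name minify and its helpers are hypothetical.

```python
import xml.etree.ElementTree as ET


def _tags(selector: str) -> set[str]:
    """Split a selector such as "head, style, img" into tag names."""
    return {t.strip() for t in selector.split(",") if t.strip()}


def _append_text(parent, index, text):
    """Put text just before child position `index` of parent."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def _delete(parent, delete):
    """Remove matching elements with their content, keeping trailing text."""
    for child in list(parent):
        if child.tag in delete:
            i = list(parent).index(child)
            _append_text(parent, i, child.tail)
            parent.remove(child)
        else:
            _delete(child, delete)


def _unwrap(parent, unwrap):
    """Replace matching elements by their own text and children."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag in unwrap:
            kids = list(child)
            del parent[i]
            _append_text(parent, i, child.text)
            for k, kid in enumerate(kids):
                parent.insert(i + k, kid)
            _append_text(parent, i + len(kids), child.tail)
            # Do not advance: position i now holds a promoted child, if any.
        else:
            _unwrap(child, unwrap)
            i += 1


def _prune_empty(parent):
    """Remove elements that have no children and no visible text."""
    for child in list(parent):
        _prune_empty(child)
        if len(child) == 0 and not (child.text or "").strip():
            i = list(parent).index(child)
            _append_text(parent, i, child.tail)
            parent.remove(child)


def minify(root, elements_with_id, elements_to_delete, elements_to_unwrap):
    """Apply the four rendering steps in order; the root itself is kept."""
    with_id = _tags(elements_with_id)
    marked = [el for el in root.iter() if el.tag in with_id]
    for n, el in enumerate(marked):
        el.set("id", str(n))  # assumed id scheme: 0, 1, 2, ...
    _delete(root, _tags(elements_to_delete))
    _unwrap(root, _tags(elements_to_unwrap))
    _prune_empty(root)
    return root


if __name__ == "__main__":
    src = (
        "<html><head><title>t</title></head><body><div>"
        "<p><b>Invoice</b> <span>42</span></p><p> </p><img/>"
        "</div></body></html>"
    )
    root = minify(ET.fromstring(src), "p", "head, style, img", "tbody, div, span, b, i, a")
    print(ET.tostring(root, encoding="unicode"))
```

The selector values in the usage example are the ones from the configuration shown earlier. Real-world HTML is often not well-formed XML, so a production implementation would need a tolerant HTML parser instead of ElementTree.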

Sample

The Information Extraction HTML Sample contains the document sets IE_HTML_OPENAI SAMPLE and IE_HTML_LOAN_OPENAI_SAMPLE, which are configured to work with the ml_iehtml_openai_model.

The Information Extraction TEXT Sample contains the document set IE_TEXT_OPENAI SAMPLE, which is configured to work with the ml_iehtml_openai_model.