Information Extraction Models

Overview

Information extraction (IE) is the automated retrieval of specific information related to a selected topic from input data. Information extraction tools make it possible to pull information from text documents, databases, websites or multiple sources.

EasyRPA provides infrastructure to create and run machine learning models that extract information from PDF, images, TXT and HTML documents.

Currently the platform provides the following sets of IE models:

  • hOCR source based
  • HTML source based

hOCR source

The input documents are PDFs and images that are converted into hOCR using the platform's OCR.

{

	"images": [
	{
		"text_src": "https://dev.rpaplatform.org/api/v1/s3/proxy/data/document_set/670bed93-6492-4d35-8f6f-e8ae76e80b39/17697330-2f97-4e97-98e2-6021f2b65bfe.pdf.ocr-page000.txt",
		"json_src": "https://dev.rpaplatform.org/api/v1/s3/proxy/data/document_set/670bed93-6492-4d35-8f6f-e8ae76e80b39/17697330-2f97-4e97-98e2-6021f2b65bfe.pdf.ocr-page000.json",
		"hocr_src": "https://dev.rpaplatform.org/api/v1/s3/proxy/data/document_set/670bed93-6492-4d35-8f6f-e8ae76e80b39/17697330-2f97-4e97-98e2-6021f2b65bfe.pdf.ocr-page000.html",
		"content": "https://dev.rpaplatform.org/api/v1/s3/proxy/data/document_set/670bed93-6492-4d35-8f6f-e8ae76e80b39/17697330-2f97-4e97-98e2-6021f2b65bfe.pdf.ocr-page000.jpg",
		"dimensions": {
		"width": "1532",
		"height": "1982"
		}
	}
	]
}

Here images[] lists the pages of the source document. Each page has:

  • text_src - the plain text from OCR
  • hocr_src - the OCR result in hOCR format
  • json_src - the hOCR file represented as JSON for the IE Human Task Type
  • content - the source page image
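
The page list above can be walked with a few lines of Python. This is only a minimal sketch: the `example.org` URLs are shortened placeholders rather than real platform addresses, and `page_summaries` is a hypothetical helper name, not part of EasyRPA.

```python
import json

# Trimmed copy of the input structure; the URLs are placeholders.
model_input = json.loads("""
{
  "images": [
    {
      "text_src": "https://example.org/doc.pdf.ocr-page000.txt",
      "json_src": "https://example.org/doc.pdf.ocr-page000.json",
      "hocr_src": "https://example.org/doc.pdf.ocr-page000.html",
      "content": "https://example.org/doc.pdf.ocr-page000.jpg",
      "dimensions": {"width": "1532", "height": "1982"}
    }
  ]
}
""")


def page_summaries(payload):
    """Return one (page_number, hocr_url, width, height) tuple per page."""
    pages = []
    for page_number, page in enumerate(payload["images"]):
        dims = page["dimensions"]
        # The example stores dimensions as strings, so convert them to int.
        pages.append(
            (page_number, page["hocr_src"], int(dims["width"]), int(dims["height"]))
        )
    return pages


summaries = page_summaries(model_input)
```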
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
	<title></title>
	<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
	<meta name="ocr-system" content="tesseract 5.3.0">
	<meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf">
 </head>
 <body>
	<div class="ocr_page" id="page_1" title="image &quot;17697330-2f97-4e97-98e2-6021f2b65bfe.pdf_000.jpg&quot;; bbox 0 0 1532 1982; ppageno 0; scan_res 180 180">
	 <div class="ocr_carea" id="block_1_1" title="bbox 1078 114 1460 148">
	<p class="ocr_par" id="par_1_1" lang="eng" title="bbox 1078 114 1460 148"><span class="ocr_line" id="line_1_1" title="bbox 1078 114 1460 148; baseline 0 -8; x_size 34; x_descenders 8; x_ascenders 8"> <span class="ocrx_word" id="word_1_1" title="bbox 1078 115 1189 140; x_wconf 95">Boston</span> <span class="ocrx_word" id="word_1_2" title="bbox 1204 114 1275 140; x_wconf 95">Park</span> <span class="ocrx_word" id="word_1_3" title="bbox 1285 114 1349 148; x_wconf 96">City</span> <span class="ocrx_word" id="word_1_4" title="bbox 1359 115 1460 148; x_wconf 95">Group</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_2" title="bbox 1207 157 1459 178">
	<p class="ocr_par" id="par_1_2" lang="eng" title="bbox 1207 157 1459 178"><span class="ocr_line" id="line_1_2" title="bbox 1207 157 1459 178; baseline 0 0; x_size 26.310345; x_descenders 5.3103447; x_ascenders 7"> <span class="ocrx_word" id="word_1_5" title="bbox 1207 158 1278 178; x_wconf 96">03093</span> <span class="ocrx_word" id="word_1_6" title="bbox 1289 158 1413 178; x_wconf 95">Tennessee</span> <span class="ocrx_word" id="word_1_7" title="bbox 1425 157 1459 178; x_wconf 95">Hill</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_3" title="bbox 1274 195 1460 219">
	<p class="ocr_par" id="par_1_3" lang="eng" title="bbox 1274 195 1460 219"><span class="ocr_line" id="line_1_3" title="bbox 1274 195 1460 219; baseline 0 -4; x_size 24; x_descenders 4; x_ascenders 6"> <span class="ocrx_word" id="word_1_8" title="bbox 1274 195 1360 219; x_wconf 95">Boston,</span> <span class="ocrx_word" id="word_1_9" title="bbox 1378 195 1460 215; x_wconf 96">38-512</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_4" title="bbox 1240 231 1461 255">
	<p class="ocr_par" id="par_1_4" lang="eng" title="bbox 1240 231 1461 255"><span class="ocr_line" id="line_1_4" title="bbox 1240 231 1461 255; baseline 0 -4; x_size 27.282051; x_descenders 6.8205128; x_ascenders 6.8205128"> <span class="ocrx_word" id="word_1_10" title="bbox 1240 231 1270 251; x_wconf 96">+7</span> <span class="ocrx_word" id="word_1_11" title="bbox 1280 231 1337 255; x_wconf 96">(752)</span> <span class="ocrx_word" id="word_1_12" title="bbox 1349 231 1461 251; x_wconf 96">199-8334</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_5" title="bbox 1237 267 1460 294">
	<p class="ocr_par" id="par_1_5" lang="eng" title="bbox 1237 267 1460 294"><span class="ocr_line" id="line_1_5" title="bbox 1237 267 1460 294; baseline 0 -6; x_size 27; x_descenders 6; x_ascenders 7"> <span class="ocrx_word" id="word_1_13" title="bbox 1237 267 1460 294; x_wconf 91">allstate@corp.com</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_6" title="bbox 931 413 1116 432">
	<p class="ocr_par" id="par_1_6" lang="eng" title="bbox 931 413 1116 432"><span class="ocr_line" id="line_1_6" title="bbox 931 413 1116 432; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_14" title="bbox 931 413 1013 432; x_wconf 95">Invoice</span> <span class="ocrx_word" id="word_1_15" title="bbox 1022 413 1116 432; x_wconf 95">Number:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_7" title="bbox 1159 414 1291 432">
	<p class="ocr_par" id="par_1_7" lang="eng" title="bbox 1159 414 1291 432"><span class="ocr_line" id="line_1_7" title="bbox 1159 414 1291 432; baseline 0 0; x_size 24.666666; x_descenders 6.1666665; x_ascenders 6.1666665"> <span class="ocrx_word" id="word_1_16" title="bbox 1159 414 1208 432; x_wconf 86">8603</span> <span class="ocrx_word" id="word_1_17" title="bbox 1215 414 1291 432; x_wconf 86">163534</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_8" title="bbox 121 412 349 465">
	<p class="ocr_par" id="par_1_8" lang="eng" title="bbox 121 412 349 465"><span class="ocr_line" id="line_1_8" title="bbox 121 412 349 465; baseline 0.004 -1; x_size 58.426666; x_descenders 5.4266663; x_ascenders 16"> <span class="ocrx_word" id="word_1_18" title="bbox 121 412 349 465; x_wconf 94">Invoice</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_9" title="bbox 972 463 1117 482">
	<p class="ocr_par" id="par_1_9" lang="eng" title="bbox 972 463 1117 482"><span class="ocr_line" id="line_1_9" title="bbox 972 463 1117 482; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_19" title="bbox 972 463 1055 482; x_wconf 96">Invoice</span> <span class="ocrx_word" id="word_1_20" title="bbox 1064 464 1117 482; x_wconf 89">Date:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_10" title="bbox 1168 464 1290 485">
	<p class="ocr_par" id="par_1_10" lang="eng" title="bbox 1168 464 1290 485"><span class="ocr_line" id="line_1_10" title="bbox 1168 464 1290 485; baseline 0 -3; x_size 24.666666; x_descenders 6.1666665; x_ascenders 6.1666665"> <span class="ocrx_word" id="word_1_21" title="bbox 1168 464 1290 485; x_wconf 96">12/02/2020</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_11" title="bbox 961 514 1117 538">
	<p class="ocr_par" id="par_1_11" lang="eng" title="bbox 961 514 1117 538"><span class="ocr_line" id="line_1_11" title="bbox 961 514 1117 538; baseline 0 -6; x_size 24; x_descenders 6; x_ascenders 5"> <span class="ocrx_word" id="word_1_22" title="bbox 961 514 1063 538; x_wconf 96">Payment</span> <span class="ocrx_word" id="word_1_23" title="bbox 1072 514 1117 532; x_wconf 87">Due:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_12" title="bbox 1168 514 1290 535">
	<p class="ocr_par" id="par_1_12" lang="eng" title="bbox 1168 514 1290 535"><span class="ocr_line" id="line_1_12" title="bbox 1168 514 1290 535; baseline 0 -3; x_size 24.666666; x_descenders 6.1666665; x_ascenders 6.1666665"> <span class="ocrx_word" id="word_1_24" title="bbox 1168 514 1290 535; x_wconf 96">12/04/2020</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_13" title="bbox 124 556 390 581">
	<p class="ocr_par" id="par_1_13" lang="eng" title="bbox 124 556 390 581"><span class="ocr_line" id="line_1_13" title="bbox 124 556 390 581; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_25" title="bbox 124 556 178 575; x_wconf 96">Duke</span> <span class="ocrx_word" id="word_1_26" title="bbox 187 556 253 581; x_wconf 96">Realty</span> <span class="ocrx_word" id="word_1_27" title="bbox 261 557 390 581; x_wconf 96">Corporation</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_14" title="bbox 123 589 257 608">
	<p class="ocr_par" id="par_1_14" lang="eng" title="bbox 123 589 257 608"><span class="ocr_line" id="line_1_14" title="bbox 123 589 257 608; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_28" title="bbox 123 590 147 608; x_wconf 96">86</span> <span class="ocrx_word" id="word_1_29" title="bbox 156 590 204 608; x_wconf 96">Vera</span> <span class="ocrx_word" id="word_1_30" title="bbox 213 589 257 608; x_wconf 96">Trail</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_15" title="bbox 122 623 256 644">
	<p class="ocr_par" id="par_1_15" lang="eng" title="bbox 122 623 256 644"><span class="ocr_line" id="line_1_15" title="bbox 122 623 256 644; baseline 0 -3; x_size 21; x_descenders 3; x_ascenders 5"> <span class="ocrx_word" id="word_1_31" title="bbox 122 623 176 644; x_wconf 95">Vera,</span> <span class="ocrx_word" id="word_1_32" title="bbox 194 623 256 641; x_wconf 94">11305</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_16" title="bbox 125 657 352 679">
	<p class="ocr_par" id="par_1_16" lang="eng" title="bbox 125 657 352 679"><span class="ocr_line" id="line_1_16" title="bbox 125 657 352 679; baseline 0 -4; x_size 24.622223; x_descenders 6.1555557; x_ascenders 6.1555557"> <span class="ocrx_word" id="word_1_33" title="bbox 125 657 178 675; x_wconf 96">+506</span> <span class="ocrx_word" id="word_1_34" title="bbox 188 657 240 679; x_wconf 96">(796)</span> <span class="ocrx_word" id="word_1_35" title="bbox 250 657 352 675; x_wconf 96">883-5554</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_17" title="bbox 124 689 360 714">
	<p class="ocr_par" id="par_1_17" lang="eng" title="bbox 124 689 360 714"><span class="ocr_line" id="line_1_17" title="bbox 124 689 360 714; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_36" title="bbox 124 689 360 714; x_wconf 72">Icossans8i@dmoz.org</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_18" title="bbox 967 713 1117 731">
	<p class="ocr_par" id="par_1_18" lang="eng" title="bbox 967 713 1117 731"><span class="ocr_line" id="line_1_18" title="bbox 967 713 1117 731; baseline 0 0; x_size 23.296295; x_descenders 5.2962961; x_ascenders 5"> <span class="ocrx_word" id="word_1_37" title="bbox 967 713 1063 731; x_wconf 96">Amount</span> <span class="ocrx_word" id="word_1_38" title="bbox 1072 713 1117 731; x_wconf 96">Due:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_19" title="bbox 1153 711 1249 734">
	<p class="ocr_par" id="par_1_19" lang="eng" title="bbox 1153 711 1249 734"><span class="ocr_line" id="line_1_19" title="bbox 1153 711 1249 734; baseline 0 -3; x_size 24.799999; x_descenders 6.1999998; x_ascenders 6.1999998"> <span class="ocrx_word" id="word_1_39" title="bbox 1153 711 1164 734; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_40" title="bbox 1174 713 1249 731; x_wconf 96">504.90</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_20" title="bbox 74 821 124 839">
	<p class="ocr_par" id="par_1_20" lang="eng" title="bbox 74 821 124 839"><span class="ocr_line" id="line_1_20" title="bbox 74 821 124 839; baseline 0 0; x_size 23.296295; x_descenders 5.2962961; x_ascenders 5"> <span class="ocrx_word" id="word_1_41" title="bbox 74 821 124 839; x_wconf 95">Item</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_21" title="bbox 355 820 492 845">
	<p class="ocr_par" id="par_1_21" lang="eng" title="bbox 355 820 492 845"><span class="ocr_line" id="line_1_21" title="bbox 355 820 492 845; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_42" title="bbox 355 820 492 845; x_wconf 96">Description</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_22" title="bbox 950 820 1055 845">
	<p class="ocr_par" id="par_1_22" lang="eng" title="bbox 950 820 1055 845"><span class="ocr_line" id="line_1_22" title="bbox 950 820 1055 845; baseline -0.019 -4; x_size 26; x_descenders 6; x_ascenders 7"> <span class="ocrx_word" id="word_1_43" title="bbox 950 820 1055 845; x_wconf 96">Quantity</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_23" title="bbox 1165 820 1224 839">
	<p class="ocr_par" id="par_1_23" lang="eng" title="bbox 1165 820 1224 839"><span class="ocr_line" id="line_1_23" title="bbox 1165 820 1224 839; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_44" title="bbox 1165 820 1224 839; x_wconf 96">Price</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_24" title="bbox 1348 820 1409 839">
	<p class="ocr_par" id="par_1_24" lang="eng" title="bbox 1348 820 1409 839"><span class="ocr_line" id="line_1_24" title="bbox 1348 820 1409 839; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_45" title="bbox 1348 820 1409 839; x_wconf 96">Total</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_25" title="bbox 81 892 265 911">
	<p class="ocr_par" id="par_1_25" lang="eng" title="bbox 81 892 265 911"><span class="ocr_line" id="line_1_25" title="bbox 81 892 265 911; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_46" title="bbox 81 892 129 911; x_wconf 96">Beef</span> <span class="ocrx_word" id="word_1_47" title="bbox 138 892 202 911; x_wconf 96">cheek</span> <span class="ocrx_word" id="word_1_48" title="bbox 212 892 265 911; x_wconf 96">fresh</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_26" title="bbox 361 892 754 917">
	<p class="ocr_par" id="par_1_26" lang="eng" title="bbox 361 892 754 917"><span class="ocr_line" id="line_1_26" title="bbox 361 892 754 917; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_49" title="bbox 361 893 430 917; x_wconf 96">Scoop</span> <span class="ocrx_word" id="word_1_50" title="bbox 440 893 524 917; x_wconf 93">organic</span> <span class="ocrx_word" id="word_1_51" title="bbox 533 892 629 917; x_wconf 92">tastyzilla</span> <span class="ocrx_word" id="word_1_52" title="bbox 640 892 691 917; x_wconf 96">flaky</span> <span class="ocrx_word" id="word_1_53" title="bbox 700 892 754 911; x_wconf 96">fresh</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_27" title="bbox 978 893 1035 911">
	<p class="ocr_par" id="par_1_27" lang="eng" title="bbox 978 893 1035 911"><span class="ocr_line" id="line_1_27" title="bbox 978 893 1035 911; baseline 0 0; x_size 24.666666; x_descenders 6.1666665; x_ascenders 6.1666665"> <span class="ocrx_word" id="word_1_54" title="bbox 978 893 1035 911; x_wconf 96">19.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_28" title="bbox 1128 891 1199 914">
	<p class="ocr_par" id="par_1_28" lang="eng" title="bbox 1128 891 1199 914"><span class="ocr_line" id="line_1_28" title="bbox 1128 891 1199 914; baseline 0 -3; x_size 24.833334; x_descenders 6.2083335; x_ascenders 6.2083335"> <span class="ocrx_word" id="word_1_55" title="bbox 1128 891 1199 914; x_wconf 95">$18.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_29" title="bbox 1308 891 1393 914">
	<p class="ocr_par" id="par_1_29" lang="eng" title="bbox 1308 891 1393 914"><span class="ocr_line" id="line_1_29" title="bbox 1308 891 1393 914; baseline 0 -3; x_size 24.799999; x_descenders 6.1999998; x_ascenders 6.1999998"> <span class="ocrx_word" id="word_1_56" title="bbox 1308 891 1393 914; x_wconf 96">$342.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_30" title="bbox 81 985 210 1004">
	<p class="ocr_par" id="par_1_30" lang="eng" title="bbox 81 985 210 1004"><span class="ocr_line" id="line_1_30" title="bbox 81 985 210 1004; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_57" title="bbox 81 985 145 1004; x_wconf 93">Island</span> <span class="ocrx_word" id="word_1_58" title="bbox 155 986 210 1004; x_wconf 96">oasis</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_31" title="bbox 361 985 817 1010">
	<p class="ocr_par" id="par_1_31" lang="eng" title="bbox 361 985 817 1010"><span class="ocr_line" id="line_1_31" title="bbox 361 985 817 1010; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_59" title="bbox 361 985 478 1004; x_wconf 96">Treehouse</span> <span class="ocrx_word" id="word_1_60" title="bbox 489 985 522 1004; x_wconf 96">red</span> <span class="ocrx_word" id="word_1_61" title="bbox 534 991 601 1004; x_wconf 96">renew</span> <span class="ocrx_word" id="word_1_62" title="bbox 609 985 660 1004; x_wconf 96">food</span> <span class="ocrx_word" id="word_1_63" title="bbox 672 985 725 1004; x_wconf 96">brew</span> <span class="ocrx_word" id="word_1_64" title="bbox 735 985 817 1010; x_wconf 95">healthy</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_32" title="bbox 984 986 1028 1004">
	<p class="ocr_par" id="par_1_32" lang="eng" title="bbox 984 986 1028 1004"><span class="ocr_line" id="line_1_32" title="bbox 984 986 1028 1004; baseline 0 0; x_size 24.666666; x_descenders 6.1666665; x_ascenders 6.1666665"> <span class="ocrx_word" id="word_1_65" title="bbox 984 986 1028 1004; x_wconf 95">3.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_33" title="bbox 1128 984 1199 1007">
	<p class="ocr_par" id="par_1_33" lang="eng" title="bbox 1128 984 1199 1007"><span class="ocr_line" id="line_1_33" title="bbox 1128 984 1199 1007; baseline 0 -3; x_size 24.833334; x_descenders 6.2083335; x_ascenders 6.2083335"> <span class="ocrx_word" id="word_1_66" title="bbox 1128 984 1199 1007; x_wconf 96">$39.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_34" title="bbox 1308 984 1393 1007">
	<p class="ocr_par" id="par_1_34" lang="eng" title="bbox 1308 984 1393 1007"><span class="ocr_line" id="line_1_34" title="bbox 1308 984 1393 1007; baseline 0 -3; x_size 24.799999; x_descenders 6.1999998; x_ascenders 6.1999998"> <span class="ocrx_word" id="word_1_67" title="bbox 1308 984 1393 1007; x_wconf 96">$117.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_35" title="bbox 1040 1373 1138 1392">
	<p class="ocr_par" id="par_1_35" lang="eng" title="bbox 1040 1373 1138 1392"><span class="ocr_line" id="line_1_35" title="bbox 1040 1373 1138 1392; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_68" title="bbox 1040 1373 1138 1392; x_wconf 96">Subtotal:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_36" title="bbox 1255 1372 1348 1395">
	<p class="ocr_par" id="par_1_36" lang="eng" title="bbox 1255 1372 1348 1395"><span class="ocr_line" id="line_1_36" title="bbox 1255 1372 1348 1395; baseline 0 -3; x_size 24.799999; x_descenders 6.1999998; x_ascenders 6.1999998"> <span class="ocrx_word" id="word_1_69" title="bbox 1255 1372 1265 1395; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_70" title="bbox 1275 1374 1348 1392; x_wconf 93">459.00</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_37" title="bbox 1040 1431 1173 1449">
	<p class="ocr_par" id="par_1_37" lang="eng" title="bbox 1040 1431 1173 1449"><span class="ocr_line" id="line_1_37" title="bbox 1040 1431 1173 1449; baseline 0 0; x_size 24.444445; x_descenders 6.1111112; x_ascenders 6.1111112"> <span class="ocrx_word" id="word_1_71" title="bbox 1040 1431 1077 1449; x_wconf 94">Tax</span> <span class="ocrx_word" id="word_1_72" title="bbox 1096 1431 1173 1449; x_wconf 94">10.00%</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_38" title="bbox 1255 1429 1334 1452">
	<p class="ocr_par" id="par_1_38" lang="eng" title="bbox 1255 1429 1334 1452"><span class="ocr_line" id="line_1_38" title="bbox 1255 1429 1334 1452; baseline 0 -3; x_size 24.833334; x_descenders 6.2083335; x_ascenders 6.2083335"> <span class="ocrx_word" id="word_1_73" title="bbox 1255 1429 1265 1452; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_74" title="bbox 1275 1431 1334 1449; x_wconf 95">45.90</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_39" title="bbox 1040 1486 1100 1505">
	<p class="ocr_par" id="par_1_39" lang="eng" title="bbox 1040 1486 1100 1505"><span class="ocr_line" id="line_1_39" title="bbox 1040 1486 1100 1505; baseline 0 0; x_size 24.296295; x_descenders 5.2962961; x_ascenders 6"> <span class="ocrx_word" id="word_1_75" title="bbox 1040 1486 1100 1505; x_wconf 96">Total:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_40" title="bbox 1255 1485 1348 1508">
	<p class="ocr_par" id="par_1_40" lang="eng" title="bbox 1255 1485 1348 1508"><span class="ocr_line" id="line_1_40" title="bbox 1255 1485 1348 1508; baseline 0 -3; x_size 24.799999; x_descenders 6.1999998; x_ascenders 6.1999998"> <span class="ocrx_word" id="word_1_76" title="bbox 1255 1485 1265 1508; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_77" title="bbox 1277 1487 1348 1505; x_wconf 83">504.90</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_41" title="bbox 1039 1544 1200 1562">
	<p class="ocr_par" id="par_1_41" lang="eng" title="bbox 1039 1544 1200 1562"><span class="ocr_line" id="line_1_41" title="bbox 1039 1544 1200 1562; baseline 0 0; x_size 23.296295; x_descenders 5.2962961; x_ascenders 5"> <span class="ocrx_word" id="word_1_78" title="bbox 1039 1544 1138 1562; x_wconf 96">Amount</span> <span class="ocrx_word" id="word_1_79" title="bbox 1147 1544 1200 1562; x_wconf 96">Due:</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_42" title="bbox 1255 1542 1355 1565">
	<p class="ocr_par" id="par_1_42" lang="eng" title="bbox 1255 1542 1355 1565"><span class="ocr_line" id="line_1_42" title="bbox 1255 1542 1355 1565; baseline 0 -3; x_size 24.799999; x_descenders 6.1999998; x_ascenders 6.1999998"> <span class="ocrx_word" id="word_1_80" title="bbox 1255 1542 1266 1565; x_wconf 93">$</span> <span class="ocrx_word" id="word_1_81" title="bbox 1278 1544 1355 1562; x_wconf 96">504.90</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_43" title="bbox 1024 1524 1477 1581">
	<p class="ocr_par" id="par_1_43" lang="eng" title="bbox 1024 1524 1477 1581"><span class="ocr_line" id="line_1_43" title="bbox 1024 1524 1477 1581; textangle 90; x_size 605.33331; x_descenders 151.33333; x_ascenders 151.33333"> <span class="ocrx_word" id="word_1_82" title="bbox 1024 1524 1477 1581; x_wconf 90">|</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_44" title="bbox 116 1660 697 1685">
	<p class="ocr_par" id="par_1_44" lang="eng" title="bbox 116 1660 697 1685"><span class="ocr_line" id="line_1_44" title="bbox 116 1660 697 1685; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_83" title="bbox 116 1660 173 1679; x_wconf 96">Make</span> <span class="ocrx_word" id="word_1_84" title="bbox 182 1660 204 1679; x_wconf 96">all</span> <span class="ocrx_word" id="word_1_85" title="bbox 213 1660 285 1679; x_wconf 96">checks</span> <span class="ocrx_word" id="word_1_86" title="bbox 294 1660 378 1685; x_wconf 96">payable</span> <span class="ocrx_word" id="word_1_87" title="bbox 386 1662 408 1679; x_wconf 96">to</span> <span class="ocrx_word" id="word_1_88" title="bbox 424 1660 478 1679; x_wconf 96">Duke</span> <span class="ocrx_word" id="word_1_89" title="bbox 488 1660 553 1685; x_wconf 96">Realty</span> <span class="ocrx_word" id="word_1_90" title="bbox 561 1661 697 1685; x_wconf 96">Corporation.</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_45" title="bbox 116 1696 1289 1721">
	<p class="ocr_par" id="par_1_45" lang="eng" title="bbox 116 1696 1289 1721"><span class="ocr_line" id="line_1_45" title="bbox 116 1696 1289 1721; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_91" title="bbox 116 1696 128 1715; x_wconf 89">If</span> <span class="ocrx_word" id="word_1_92" title="bbox 135 1702 173 1721; x_wconf 96">you</span> <span class="ocrx_word" id="word_1_93" title="bbox 184 1696 233 1715; x_wconf 96">have</span> <span class="ocrx_word" id="word_1_94" title="bbox 242 1702 280 1721; x_wconf 96">any</span> <span class="ocrx_word" id="word_1_95" title="bbox 288 1697 393 1721; x_wconf 96">questions</span> <span class="ocrx_word" id="word_1_96" title="bbox 401 1697 521 1721; x_wconf 96">concerning</span> <span class="ocrx_word" id="word_1_97" title="bbox 531 1696 568 1715; x_wconf 95">this</span> <span class="ocrx_word" id="word_1_98" title="bbox 577 1697 653 1715; x_wconf 95">Invoice</span> <span class="ocrx_word" id="word_1_99" title="bbox 663 1696 730 1721; x_wconf 96">please</span> <span class="ocrx_word" id="word_1_100" title="bbox 739 1698 819 1715; x_wconf 96">contact</span> <span class="ocrx_word" id="word_1_101" title="bbox 835 1697 876 1715; x_wconf 96">Lina</span> <span class="ocrx_word" id="word_1_102" title="bbox 887 1696 989 1721; x_wconf 96">Upchurch</span> <span class="ocrx_word" id="word_1_103" title="bbox 999 1702 1025 1715; x_wconf 94">on</span> <span class="ocrx_word" id="word_1_104" title="bbox 1044 1697 1083 1715; x_wconf 94">+52</span> <span class="ocrx_word" id="word_1_105" title="bbox 1093 1697 1145 1719; x_wconf 96">(915)</span> <span class="ocrx_word" id="word_1_106" title="bbox 1155 1697 1256 1715; x_wconf 96">649-1513</span> <span class="ocrx_word" id="word_1_107" title="bbox 1266 1702 1289 1715; x_wconf 96">or</span> </span></p>
	 </div>
	 <div class="ocr_carea" id="block_1_46" title="bbox 116 1732 375 1757">
	<p class="ocr_par" id="par_1_46" lang="eng" title="bbox 116 1732 375 1757"><span class="ocr_line" id="line_1_46" title="bbox 116 1732 375 1757; baseline 0 -6; x_size 25; x_descenders 6; x_ascenders 6"> <span class="ocrx_word" id="word_1_108" title="bbox 116 1732 375 1757; x_wconf 91">lupchurchgf@lycos.com</span> </span></p>
	 </div>
	</div>
 </body>
</html>
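
To illustrate how the word-level data in such an hOCR file can be read, here is a minimal sketch that uses only Python's standard `html.parser`. The class name `HocrWordParser` and the output dictionary layout are assumptions made for illustration; the platform's own hOCR-to-JSON conversion may work differently.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            props = {}
            # The title looks like "bbox 1078 115 1189 140; x_wconf 95".
            for part in attrs.get("title", "").split(";"):
                key, _, value = part.strip().partition(" ")
                props[key] = value
            self._current = {
                "id": attrs.get("id"),
                "bbox": [int(v) for v in props.get("bbox", "").split()],
                "conf": float(props.get("x_wconf", 0)),
                "text": "",
            }

    def handle_data(self, data):
        # Only text inside a word span is kept; whitespace between words is skipped.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


# Two words copied from the first line of the example page.
sample = (
    '<span class="ocr_line" id="line_1_1" title="bbox 1078 114 1460 148">'
    '<span class="ocrx_word" id="word_1_1" title="bbox 1078 115 1189 140; x_wconf 95">Boston</span> '
    '<span class="ocrx_word" id="word_1_2" title="bbox 1204 114 1275 140; x_wconf 95">Park</span>'
    "</span>"
)
parser = HocrWordParser()
parser.feed(sample)
parser.close()
words = parser.words
```

Each bounding box is in page pixels in the order left, top, right, bottom, as defined by the hOCR `bbox` property.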

The result of model execution is a list of entities. Each entity consists of a label name (name), an occurrence index (index), the label content (content), a score, and the OCR words (words) that match the entity region.

{
	"entities": [
		{
			"name": "DebitNoteId",
			"words": [
				{
					"bbox": [
						0.8227450980392157,
						0.1696969696969697,
						0.9168627450980392,
						0.18262626262626264
					],
					"id": "page0_area2_paragraph2_line3_word15",
					"page": 0,
					"content": "DNT6268231"
				}
			],
			"index": 0,
			"score": 1,
			"content": "DNT6268231"
		},
		{
			"name": "DebitNoteDate",
			"words": [
				{
					"bbox": [
						0.8227450980392157,
						0.19818181818181818,
						0.92,
						0.21151515151515152
					],
					"id": "page0_area2_paragraph2_line4_word23",
					"page": 0,
					"content": "2018-06-30"
				}
			],
			"index": 0,
			"score": 0.97,
			"content": "2018-06-30"
		}
	]
}
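
A consumer of this output might group the extracted values by label and drop low-scoring entities. The sketch below shows one way to do that. It assumes that `score` is a confidence value where higher is better, and that the `bbox` values are fractions of the page width and height in the order left, top, right, bottom. Neither assumption is stated in this document, and both function names are hypothetical.

```python
import json

# Abbreviated copy of the result: the words arrays are omitted.
result = json.loads("""
{
  "entities": [
    {"name": "DebitNoteId", "index": 0, "score": 1, "content": "DNT6268231"},
    {"name": "DebitNoteDate", "index": 0, "score": 0.97, "content": "2018-06-30"}
  ]
}
""")


def entities_by_label(payload, min_score=0.0):
    """Group entity contents by label, skipping entities below min_score."""
    grouped = {}
    for entity in payload["entities"]:
        if entity["score"] >= min_score:
            grouped.setdefault(entity["name"], []).append(entity["content"])
    return grouped


def to_pixels(bbox, width, height):
    """Scale a bbox to pixels, ASSUMING it is normalized to the page size."""
    left, top, right, bottom = bbox
    return [
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    ]


all_values = entities_by_label(result)
confident_values = entities_by_label(result, min_score=0.98)
```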

HTML source

The input documents are HTML files (TXT files are converted into HTML).

{
  "html_src": "https://dev.rpaplatform.org/api/v1/s3/proxy/data/document_set/cb8b87a4-d08a-4307-913c-0243ec6f684d/1eda92d9-5027-4526-ae40-bcf985c4c4f7_minified.html"
}

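The result of model execution is tagged HTML: the source document with rpa-selection tags added. Each tag carries the label (data-type), the matched content (data-content), and the order (data-order), which applies when a label occurs multiple times.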

<html>
	<head>
	</head>
	<body>
		<div class="grid-body">
			<div class="invoice-title">
				<div class="row">
					<div class="col-xs-12"></div>
				</div><br />
				<div class="row">
					<div class="col-xs-12">
						<h2>invoice<br /><span class="small">order #<rpa-selection data-content="1097"
									data-order="0" data-type="Order Number">1097</rpa-selection></span></h2>
					</div>
				</div>
			</div>
			<hr />
			<div class="row">
				<div class="col-xs-6">
					<address><strong>Billed To:</strong><br />
						<rpa-selection data-content="Costco Wholesale" data-order="0" data-type="Name">
							Costco Wholesale</rpa-selection><br />
						<rpa-selection data-content="999 Lake Drive" data-order="0" data-type="Addresses">
							999 Lake Drive</rpa-selection><br />
						<rpa-selection data-content="Issaquah, WA 98027" data-order="0"
							data-type="Addresses">Issaquah, WA 98027</rpa-selection><br /><abbr
							title="Phone">P:</abbr>
						<rpa-selection data-content="(222) 417-0141" data-order="0"
							data-type="Phone Numbers">(222) 417-0141</rpa-selection>
					</address>
				</div>
			</div>
		</div>
	</body>
</html>
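
The tagged values can be pulled back out of such a document with Python's standard `html.parser`, as in the following sketch. `SelectionParser` and the keys of the output dictionaries are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class SelectionParser(HTMLParser):
    """Collect label, order and content from every rpa-selection tag."""

    def __init__(self):
        super().__init__()
        self.selections = []

    def handle_starttag(self, tag, attrs):
        if tag == "rpa-selection":
            attrs = dict(attrs)
            self.selections.append(
                {
                    "label": attrs.get("data-type"),
                    "order": int(attrs.get("data-order", 0)),
                    "content": attrs.get("data-content"),
                }
            )


# Fragment taken from the tagged example.
tagged = (
    "<address><strong>Billed To:</strong><br />"
    '<rpa-selection data-content="Costco Wholesale" data-order="0" data-type="Name">'
    "Costco Wholesale</rpa-selection><br />"
    '<rpa-selection data-content="999 Lake Drive" data-order="0" data-type="Addresses">'
    "999 Lake Drive</rpa-selection></address>"
)
parser = SelectionParser()
parser.feed(tagged)
parser.close()
selections = parser.selections
```

Reading `data-content` rather than the element text avoids the surrounding whitespace and line breaks that appear inside some tags.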

spaCy IE Models

The platform uses the spaCy NLP library internally for data processing in the following models:

  • ml_ie_spacy2_model
  • ml_ie_spacy3_model
  • ml_iehtml_spacy2_model
  • ml_iehtml_spacy3_model

Information Extraction as a Pipeline

The Information Extraction process is implemented in EasyRPA as a pipeline. There is more to this pipeline than ML models: the platform also includes several options for extending ML with rules and dictionaries.

There are two processes, model training and model execution. Let's look at the stages that make up each of them.

Model Training Process

Model training

This step of EasyRPA involves training the ML model using the provided training set. The system automatically shuffles the provided set, runs training for a specified number of iterations, and selects the best model.

The process developer can specify the model type, the number of training iterations, and other settings in a configuration JSON file.
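
The shuffle, train, and select-best sequence described above can be sketched generically as follows. This is not the EasyRPA trainer: `train_step` and `evaluate` are hypothetical callbacks standing in for a real model update and a real quality metric, and reshuffling on every iteration is an assumption, since the text does not say whether the set is shuffled once or per iteration.

```python
import random


def train_and_select(examples, iterations, train_step, evaluate, seed=0):
    """Shuffle, train for a fixed number of iterations, keep the best state."""
    rng = random.Random(seed)
    data = list(examples)
    best_score, best_state, state = float("-inf"), None, None
    for _ in range(iterations):
        rng.shuffle(data)                # assumption: reshuffle every iteration
        state = train_step(state, data)  # hypothetical: returns the updated model
        score = evaluate(state)          # hypothetical: higher means better
        if score > best_score:
            best_score, best_state = score, state
    return best_state, best_score


# Toy stand-ins: the "model" is just an iteration counter that scores best at 3.
def toy_train_step(state, data):
    return (state or 0) + 1


def toy_evaluate(state):
    return -abs(state - 3)


best_state, best_score = train_and_select(
    ["doc-a", "doc-b", "doc-c"], 5, toy_train_step, toy_evaluate
)
```

Keeping the best state rather than the last one means a later iteration that scores worse does not replace an earlier, better model.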

Package creation

The trained model is packaged with its configuration files and uploaded to the Nexus repository.

Information Extraction Process

Model execution

The model is run once for each document.

Model Training Configuration File

To train a spaCy Information Extraction model, you need to provide a JSON file that defines the configuration parameters for the training process.

Let's take a closer look at these configuration settings.

  • ocr_fixes(list of objects)(optional) - values that should be replaced with other values. In the example below, the value "G4LD" is replaced with the value "64LD".
  • trainer_name(string)(required) - the Python artifact that produces model packages for a specific model type. It contains two modules: one trains on tagged data and generates a trained model package; the other downloads the trained model from Nexus or from the cache and runs it on the input data. Refer to Out of the box IE models and Out of the box IEHTML models for more details.
  • trainer_version(string)(required) - the trainer version. Refer to Out of the box IE models and Out of the box IEHTML models for more details.
  • trainer_description(string)(required) - the trainer description.
  • lang(string)(optional) - the language of the input data. The default value is 'en'.
  • iterations(number)(optional) - the number of training iterations over the given training set. The default value is '30'.
  • concat_single_entities(boolean)(optional) - The default value is true.
  • post_processing_rules(list of objects)(optional) - after NER extraction, the model applies EntityMatcher with the rules defined in post_processing_rules.json. The configuration JSON should contain a list of label names with the regular expressions used to search for entities.
  • base_model_patterns(list of objects)(optional) - configures EntityRuler for labeling datum elements. It runs before fetching data and gives the model additional information about the document structure, which increases the accuracy of data extraction.
  • labels(list of objects)(optional) - labels are added to the NER pipe at the training stage. If this configuration is empty, all labels found in the training dataset are added to the model automatically, and the output dimension is inferred automatically (an expensive operation). The multiplicity flag affects how the entity index is calculated at the processing stage: for labels with multiplicity True the index increments through the whole document, while for labels with multiplicity False the index is always zero.
{
	"ocr_fixes": {
		"G4LD": "64LD"
	},
	"model": {
		"trainer_name": "easyrpaml_ie_spacy3_model",
		"trainer_version": "2.4.1",
		"trainer_description": "Information Extraction",
		"train_config": {
			"lang": "en",
			"iterations": 5
		},
		"process_config": {
			"concat_single_entities": true
		}
	},
	"post_processing_rules": [
		{
			"label": "kwDebitNoteID",
			"regex": [
				"Debit",
				"^Note.*$",
				"#",
				":"
			]
		}
	],
	"base_model_patterns": [
		{
			"label": "kwDebitNoteID",
			"id": "kwDebitNoteID",
			"pattern": [
				{
					"TEXT": "Debit"
				},
				{
					"TEXT": {
						"REGEX": "^Note.*$"
					}
				},
				{
					"TEXT": "#"
				},
				{
					"TEXT": ":"
				}
			]
		},
		{
			"label": "kwDebitNoteID",
			"id": "kwDebitNoteID",
			"pattern": [
				{
					"TEXT": "Debit"
				},
				{
					"TEXT": {
						"REGEX": "^Note.*$"
					}
				},
				{
					"TEXT": ":"
				}
			]
		},
		{
			"label": "kwDebitNoteDate",
			"id": "kwDebitNoteDate",
			"pattern": [
				{
					"TEXT": "Debit"
				},
				{
					"TEXT": "Note"
				},
				{
					"TEXT": "Date"
				},
				{
					"TEXT": {
						"REGEX": "^[:|;]$"
					}
				}
			]
		}
	],
	"labels": {
		"DebitNoteDate": false,
		"DebitNoteId": false,
		"VendorSupportProgram": false,
		"Percent": false,
		"TotalAmount": false,
		"SKU": true,
		"SKUDescription": true,
		"Units": true,
		"Quota": true,
		"kwDebitNoteID": false,
		"kwDebitNoteDate": false,
		"BillingFrequency": false
	}
}
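
The effect of the multiplicity flag in labels can be illustrated with a short Python sketch. It is only one plausible reading of the rule described above, assuming that the index is a separate counter per label; the function name assign_entity_indexes and the sample entity values are hypothetical and are not part of the platform.

```python
from collections import defaultdict


def assign_entity_indexes(entities, labels):
    """Attach an index to each (label, text) pair given in document order.

    Assumption: labels flagged true get a per-label counter that grows
    through the whole document; all other labels always get index 0.
    """
    counters = defaultdict(int)
    indexed = []
    for label, text in entities:
        if labels.get(label, False):
            index = counters[label]
            counters[label] += 1
        else:
            index = 0
        indexed.append((label, index, text))
    return indexed


# Multiplicity flags taken from the "labels" section of the sample configuration.
labels = {"DebitNoteId": False, "SKU": True, "Units": True}

# Illustrative extraction result (made-up values).
entities = [
    ("DebitNoteId", "DN-1"),
    ("SKU", "A100"),
    ("Units", "5"),
    ("SKU", "B200"),
    ("Units", "7"),
]

result = assign_entity_indexes(entities, labels)
```

Under this reading, SKU and Units entities that share an index belong to the same table row, while single-value fields such as DebitNoteId stay at index zero.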

ml_ie_openai_model

The model uses the OpenAI API to call LLMs for request processing.

To work with the ml_ie_openai_model, specify the OPENAI_API_KEY during installation or update it in the .env file of your CS installation.

You can also change OPENAI_BASE_URL to switch to another LLM provider. To do this, define the environment variable for the ml container on the CS installation machine, for example:

  . . . . . .
  ml:
    image: "ci.rpaplatform.org:8080/rpaplatform/easy-rpa-ml:3.2.0"
    restart: "always"
    environment:
      . . . . . .
      - "OPENAI_API_KEY=${OPENAI_API_KEY}"
      - "OPENAI_BASE_URL=https://api.deepseek.com"
  . . . . . .

The model minifies the hOCR HTML (depending on the hocr2text rendering selected in the model configuration) and then sends OpenAI a request like this:

You are a good expert of extracting data from invoice documents. You receive HTML document as the result of OCR processing of scanned invoice, and the list of fields you should extract.
As an output you have to provide csv file with two columns: field tag and list of HTML tags "id" property. Pay attention that one extracted field may have several tags.
For table items provide a separate line for each row.
For example:
###BEGIN OF EXAMPLE
User ask you to extract:
```
Find all accounts in the balance sheet and for each item found extract:
	- company name with tag COMPANY
	- account with tag ACCOUNT
	- balance with tag BALANCE
Do not tag table headers.
```
Your input HTML is:
```html
<html>
	<body>
		<p>
			<div><span id="word_0_1">Remittance</span><span id="word_0_2">Advice</span></div>
			<div><span id="word_0_3">Company:</span> <span id="word_0_4">IBA</span><span id="word_0_5">Group</span></div>
			<div><span id="word_0_6">Income</span><span id="word_0_7">Fund</span></div>
		</p>
		<p>
			<div><span id="word_1_1">ACCOUNTS</span><span id="word_1_2">BALANCE</span></div>
			<div><span id="word_1_3">12341234</span><span id="word_1_4">$5000</span></div>
			<div><span id="word_1_5">22354123</span><span id="word_1_6">$1000</span></div>
		</p>
</body>
</html>
```
Your answer should be:
```
"field_name","tag_id"
"COMPANY","word_0_4,word_0_5"
"ACCOUNT","word_1_3"
"BALANCE","word_1_4"
"ACCOUNT","word_1_5"
"BALANCE","word_1_6"
```
###END OF EXAMPLE

Now your task is the following:
```
Find all items in the invoice and for each item found extract:
	- item name with tag PRODUCT
	- description with tag DESCRIPTION
	- unit price with tag PRICE.
	- quantity with tag QUANTITY
Do not tag table headers. Combine multiple lines of description tag into one tag if possible.
Also extract invoice information:
	- Company name of the client with tag CLIENT
	- Client address with tag ADDRESS
	- Invoice number with tag INVOICENUMBER
	- Date of issue with tag ISSUED
	- Due Date with tag DUE_DATE
	- Total amount, TOTAL
```
Your input HTML is:
```html
<html><body><p><div><span id="word_0_1">INVOICE</span><span id="word_0_2">a</span></div><div><span id="word_0_3">DATE</span><span id="word_0_13">08</span><span id="word_0_14">Mar,</span><span id="word_0_15">2020</span><span id="word_0_4">INVOICE</span><span id="word_0_5">NO</span><span id="word_0_16">4453074013</span><span id="word_0_6">Park</span><span id="word_0_7">City</span><span id="word_0_8">Group</span><span id="word_0_9">DC</span><span id="word_0_10">087</span><span id="word_0_11">Jackson</span><span id="word_0_12">Drive</span><span id="word_0_17">Washington,</span><span id="word_0_18">86-723</span><span id="word_0_19">+86</span><span id="word_0_20">(824)</span><span id="word_0_21">519-7851</span><span id="word_0_22">citizens@corp.com</span></div><div><span id="word_0_23">INVOICE</span><span id="word_0_24">TO</span></div><div><span id="word_0_25">Truett-Hurst,</span><span id="word_0_26">Inc.</span><span id="word_0_27">869</span><span id="word_0_28">Summerview</span><span id="word_0_29">Center</span><span id="word_0_30">Balchik,</span><span id="word_0_31">62021</span><span id="word_0_32">+92</span><span id="word_0_33">(538)</span><span id="word_0_34">622-2228</span><span id="word_0_35">gspeddin12@eepurl.com</span></div><div><span id="word_0_36">SALESPERSON</span><span id="word_0_37">JOB</span><span id="word_0_38">PAYMENT</span><span id="word_0_39">TERMS</span><span id="word_0_40">DUE</span><span id="word_0_41">DATE</span></div><div><span id="word_0_42">Due</span><span id="word_0_43">on</span><span id="word_0_44">Receipt</span><span id="word_0_45">08</span><span id="word_0_46">May,</span><span id="word_0_47">2020</span></div><div><span id="word_0_48">QUANTITY</span><span id="word_0_49">DESCRIPTION</span><span id="word_0_50">UNIT</span><span id="word_0_51">PRICE</span><span id="word_0_52">LINE</span><span id="word_0_53">TOTAL</span></div><div><span id="word_0_54">19.00</span><span id="word_0_55">Initation</span><span id="word_0_56">crab</span><span 
id="word_0_57">meat</span><span id="word_0_60">Mountain</span><span id="word_0_61">food</span><span id="word_0_62">magic</span><span id="word_0_63">healthy</span><span id="word_0_64">yummy</span><span id="word_0_65">food</span><span id="word_0_58">$150.00</span><span id="word_0_59">$2850.00</span></div><div><span id="word_0_66">11.00</span><span id="word_0_67">Tomato</span><span id="word_0_68">Devine</span><span id="word_0_69">healthy</span><span id="word_0_70">desire</span><span id="word_0_71">organic</span><span id="word_0_72">crimson</span><span id="word_0_73">fresh</span><span id="word_0_74">$192.00</span><span id="word_0_75">$2112.00</span></div><div><span id="word_0_76">Subtotal</span><span id="word_0_79">Discount</span><span id="word_0_80">15.00%</span><span id="word_0_85">Sales</span><span id="word_0_86">Tax</span><span id="word_0_87">20.00%</span><span id="word_0_77">$</span><span id="word_0_78">4962.00</span><span id="word_0_81">$</span><span id="word_0_82">893.16</span><span id="word_0_83">$</span><span id="word_0_84">992.40</span></div><div><span id="word_0_88">Total</span><span id="word_0_89">$5061.24</span></div><div><span id="word_0_90">TRANSFER</span><span id="word_0_91">DETAILS</span></div><div><span id="word_0_92">Bank</span><span id="word_0_93">Transfer</span><span id="word_0_94">BANK</span><span id="word_0_100">Income</span><span id="word_0_101">II</span><span id="word_0_96">Convertible</span><span id="word_0_97">&</span><span id="word_0_103">Number</span><span id="word_0_98">Routing</span><span id="word_0_99">Number</span><span id="word_0_104">8284352178</span></div></p></body></html>
```

The OpenAI request is customizable, as explained below.
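
The CSV answer format requested in the prompt above can be read with the standard csv module. The following is a minimal sketch and not the model's actual parsing code: the function name parse_answer is hypothetical, and the sketch assumes that the response contains only the CSV text (no surrounding code fence or commentary).

```python
import csv
import io


def parse_answer(answer):
    """Return a list of (field_name, [tag ids]) pairs from the CSV answer."""
    reader = csv.DictReader(io.StringIO(answer.strip()))
    rows = []
    for row in reader:
        # One extracted field may reference several word ids, comma separated.
        tag_ids = [tag.strip() for tag in row["tag_id"].split(",")]
        rows.append((row["field_name"], tag_ids))
    return rows


answer = '''"field_name","tag_id"
"COMPANY","word_0_4,word_0_5"
"ACCOUNT","word_1_3"
"BALANCE","word_1_4"
'''

rows = parse_answer(answer)
```

Each word id refers to a span in the minified HTML, so the pairs can be traced back to the OCR words of the source document.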

Model Training

The training process creates a new model with the default prompts configuration. You can use any document set with existing training data to train a model. The trainer does not use the training data; only the training configuration is used. Here is a sample model training configuration:

{
	"debug": false,
	"messages": [
		{
			"role": "system",
			"content": "{systemRolePrompt}"
		},
		{
			"role": "user",
			"content": "{userRolePrompt}"
		}
	],
	"systemRolePrompt": "You are a good expert of extracting data from invoice documents. You receive HTML document as the result of OCR processing of scanned invoice, and the list of fields you should extract.
As an output you have to provide csv file with two columns: field tag and list of HTML tags "id" property. Pay attention that one extracted field may have several tags.
For table items provide a separate line for each row.
For example:
###BEGIN OF EXAMPLE
User ask you to extract:
```
Find all accounts in the balance sheet and for each item found extract:
- company name with tag COMPANY
- account with tag ACCOUNT
- balance with tag BALANCE
Do not tag table headers.
```
Your input HTML is:
```html
<html>
	<body>
		<p>
			<div><span id="word_0_1">Remittance</span><span id="word_0_2">Advice</span></div>
			<div><span id="word_0_3">Company:</span> <span id="word_0_4">IBA</span><span id="word_0_5">Group</span></div>
			<div><span id="word_0_6">Income</span><span id="word_0_7">Fund</span></div>
		</p>
		<p>
			<div><span id="word_1_1">ACCOUNTS</span><span id="word_1_2">BALANCE</span></div>
			<div><span id="word_1_3">12341234</span><span id="word_1_4">$5000</span></div>
			<div><span id="word_1_5">22354123</span><span id="word_1_6">$1000</span></div>
		</p>
</body>
</html>
```
Your answer should be:
```
"field_name","tag_id"
"COMPANY","word_0_4,word_0_5"
"ACCOUNT","word_1_3"
"BALANCE","word_1_4"
"ACCOUNT","word_1_5"
"BALANCE","word_1_6"
```
###END OF EXAMPLE",
	"userRolePrompt": "Now your task is the following:
```
Find all items in the invoice and for each item found extract:
- item name with tag PRODUCT
- description with tag DESCRIPTION
- unit price with tag PRICE.
- quantity with tag QUANTITY
Do not tag table headers. Combine multiple lines of description tag into one tag if possible.
Also extract invoice information:
- Company name of the client with tag CLIENT
- Client address with tag ADDRESS
- Invoice number with tag INVOICENUMBER
- Date of issue with tag ISSUED
- Due Date with tag DUE_DATE
- Total amount, TOTAL
```
Your input HTML is:
```html
{html}
```",
	"html": "",
	"open_ai_model": "gpt-4o",
	"hocr2html": {
		"type": "table",
		"bbox_to_cell_tolerance_x": 10,
		"bbox_to_cell_tolerance_y": 10,
		"cell_to_row_tolerance": 20,
		"row_to_table_tolerance": 10
	},
	"tag_to_entity": {
		"PRODUCT": "Product Name",
		"DESCRIPTION": "Product Description",
		"QUANTITY": "Quantity",
		"PRICE": "Price",
		"CLIENT": "Company Name",
		"ADDRESS": "Street Address",
		"INVOICENUMBER": "Invoice Number",
		"ISSUED": "Invoice Date",
		"DUE_DATE": "Due Date",
		"TOTAL": "Total Amount"
	}
}

where:

  • prompts_config - the default prompts configuration saved into the trained model
  • messages - the prompt message structure sent to the OpenAI API
  • html - the simplified HTML of the document, which the model creates and injects into the prompt context
  • open_ai_model - the OpenAI model to use; the default is gpt-4o
  • tag_to_entity - a mapping from response tags to entities, used to map the document tagged by OpenAI to document entities
  • debug - a boolean that switches debug messages on
  • hocr2html - the HOCR to HTML rendering configuration
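
As an illustration of the tag_to_entity mapping, the sketch below translates response tags into entity names. It is a simplified assumption of how the mapping could be applied: the function name to_entities is hypothetical, and skipping tags that have no mapping is a choice made for this example only.

```python
def to_entities(rows, tag_to_entity):
    """Translate (tag, word ids) pairs into (entity name, word ids) pairs.

    Assumption: rows whose tag has no mapping are skipped.
    """
    return [(tag_to_entity[tag], ids) for tag, ids in rows if tag in tag_to_entity]


# A subset of the mapping from the sample configuration.
tag_to_entity = {"PRODUCT": "Product Name", "TOTAL": "Total Amount"}

# Tagged rows as they might be parsed from the OpenAI response.
rows = [
    ("PRODUCT", ["word_0_55", "word_0_56", "word_0_57"]),
    ("TOTAL", ["word_0_89"]),
    ("UNKNOWN", ["word_0_1"]),
]

entities = to_entities(rows, tag_to_entity)
```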

Prompts configuration

The prompts_config is a map of parameters that the model uses to create an OpenAI request. The model gets it from:

  • configuration parameter of the MlTask call
  • model default configuration

The MlTask configuration parameter overrides the model default configuration, i.e. you can put only the changed keys into the MlTask configuration and keep the rest from the defaults.
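
The override behaviour can be pictured as a dictionary merge. This is a sketch under the assumption that the merge is shallow (key by key at the top level); the function name effective_config and the sample values are hypothetical.

```python
def effective_config(model_defaults, ml_task_config):
    """Return the defaults with every key from the MlTask configuration applied on top."""
    merged = dict(model_defaults)
    merged.update(ml_task_config)
    return merged


model_defaults = {
    "open_ai_model": "gpt-4o",
    "debug": False,
    "userRolePrompt": "default user prompt",
}

# Only the changed key is passed with the MlTask.
ml_task_config = {"userRolePrompt": "Extract the invoice number with tag INVOICENUMBER"}

config = effective_config(model_defaults, ml_task_config)
```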

Here is the platform task code that prepares the ML call:

MlTaskData mlTaskData = new MlTaskData(modelName, modelVersion);
mlTaskData.getConfiguration().putAll(documentContext.getMlConfiguration());
. . . . .
default Map getMlConfiguration() {
	 return (Map) getSettings().getOrDefault("mlConfiguration", new HashMap());
}

So, to pass the prompts into the model, you need to specify the mlConfiguration map in the document set settings of the document processor configuration.

Alternatively, specify it in the configuration parameter of the AP that uses the datastore document context.

The messages parameter defines the prompt structure for the OpenAI request. Here is the model's Python code that calls OpenAI:

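from openai import OpenAI

# The client reads OPENAI_API_KEY (and OPENAI_BASE_URL, if set) from the environment.
client = OpenAI()

# openai_model and messages are prepared by the model from its configuration.
prompt_completion = client.chat.completions.create(
	model=openai_model,
	messages=messages,
	temperature=0,
)
# The text of the first choice is the answer.
openai_response = str(prompt_completion.choices[0].message.content)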

The messages structure is required. Here is the default structure:

{
	"messages": [{
			"role": "system",
			"content": "{systemRolePrompt}"
		}, {
			"role": "user",
			"content": "{userRolePrompt}"
		}
	]
}

It sends a request with the system ({systemRolePrompt}) and user ({userRolePrompt}) roles. The {systemRolePrompt} and {userRolePrompt} placeholders refer to keys from the prompts configuration.

Only single-level key references are allowed in the prompts configuration.

The html key is injected by the model and contains the minified document.
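
The placeholder mechanism can be sketched in Python as follows. This document does not show how the model actually performs the substitution, so the sketch is an assumption: placeholders are taken to be flat {key} names that refer to top-level string values, and two passes are applied because the default userRolePrompt itself contains {html}. The names expand and render_messages are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def expand(text, values):
    """Replace each {key} that refers to a string value; leave anything else untouched."""
    def repl(match):
        value = values.get(match.group(1))
        return value if isinstance(value, str) else match.group(0)
    return PLACEHOLDER.sub(repl, text)


def render_messages(config, html):
    """Build the messages list for the request from the prompts configuration."""
    values = dict(config, html=html)  # the html key is injected, not configured
    messages = []
    for message in config["messages"]:
        content = expand(message["content"], values)  # {userRolePrompt} -> prompt text
        content = expand(content, values)             # {html} inside the prompt -> document
        messages.append({"role": message["role"], "content": content})
    return messages


config = {
    "messages": [
        {"role": "system", "content": "{systemRolePrompt}"},
        {"role": "user", "content": "{userRolePrompt}"},
    ],
    "systemRolePrompt": "You extract fields from OCR output.",
    "userRolePrompt": "Your input HTML is:\n{html}",
}

messages = render_messages(config, "<html><body></body></html>")
```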

You can completely change the default messages structure, or redefine only the systemRolePrompt and userRolePrompt.

The userRolePrompt always needs to be changed according to your document set and the fields you need to extract. It contains the description of the fields that OpenAI should extract.

The Intelligent Document Processing (IDP) contains the document set IDP_SAMPLE_INVOICE_OPENAI, which is configured to work with the ml_ie_openai_model.

HOCR to HTML rendering configuration

The hocr2text parameters specify a simplified text rendering algorithm, which is selected by the hocr2text.type key. The following renderings exist:

  • default - puts words in the order in which they appear in the HOCR
  • table - puts words according to the recognized table layout
  • table-rows - uses the table renderer to obtain the table layout and puts words according to the row flow, without cell separation

Default HOCR to HTML rendering (default)

The default rendering follows the normal order of the HOCR tags and produces output according to the following rules:

  • <div class="ocr_page"> → <p>
  • <span class="ocr_line"> → <div>
  • <span class="ocrx_word"> → <span id="word_[Page  index]_[Word index on page]">[Word]</span>

Here is a typical rendered html:

<html>
	<body>
		<p>
			<div><span id="word_0_1">Remittance</span><span id="word_0_2">Advice</span></div>
			<div><span id="word_0_3">Company:</span> <span id="word_0_4">IBA</span><span id="word_0_5">Group</span></div>
			<div><span id="word_0_6">Income</span><span id="word_0_7">Fund</span></div>
		</p>
		<p>
			<div><span id="word_1_1">ACCOUNTS</span><span id="word_1_2">BALANCE</span></div>
			<div><span id="word_1_3">12341234</span><span id="word_1_4">$5000</span></div>
			<div><span id="word_1_5">22354123</span><span id="word_1_6">$1000</span></div>
		</p>
	</body>
</html>
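
The three mapping rules of the default rendering can be sketched with the standard html.parser module. This is not the platform's renderer: the names DefaultRenderer and render_default are hypothetical, and the sketch assumes well-formed hOCR in which page indexes start at 0 and word indexes restart at 1 on every page, as in the example above.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"meta", "link", "br", "hr", "img", "input"}
CLOSING = {"page": "</p>", "line": "</div>", "word": "</span>"}


class DefaultRenderer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []      # (tag, kind) for every open element
        self.page = -1       # becomes 0 on the first ocr_page
        self.word = 0        # word counter, reset on every page
        self.in_word = False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        kind = None
        if "ocr_page" in classes:
            kind, self.page, self.word = "page", self.page + 1, 0
            self.out.append("<p>")
        elif "ocr_line" in classes:
            kind = "line"
            self.out.append("<div>")
        elif "ocrx_word" in classes:
            kind, self.word, self.in_word = "word", self.word + 1, True
            self.out.append(f'<span id="word_{self.page}_{self.word}">')
        self.stack.append((tag, kind))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not any(t == tag for t, _ in self.stack):
            return
        while True:  # close everything up to the matching open tag
            open_tag, kind = self.stack.pop()
            if kind == "word":
                self.in_word = False
            self.out.append(CLOSING.get(kind, ""))
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.in_word:  # text outside of words is dropped
            self.out.append(escape(data, quote=False))


def render_default(hocr):
    parser = DefaultRenderer()
    parser.feed(hocr)
    parser.close()
    return "<html><body>" + "".join(parser.out) + "</body></html>"


hocr = ("<div class='ocr_page' id='page_1'><span class='ocr_line'>"
        "<span class='ocrx_word'>Remittance</span> <span class='ocrx_word'>Advice</span>"
        "</span></div>")
rendered = render_default(hocr)
```

All other hOCR elements, such as areas and paragraphs, are dropped, which is what makes the output compact enough to place into a prompt.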

Table HOCR to HTML rendering (table)

This renderer groups HOCR bboxes into cells, rows, and tables.

The renderer uses the following settings:

"hocr2text": {
	"type": "table",
	"bbox_to_cell_tolerance_x": 10,
	"bbox_to_cell_tolerance_y": 10,
	"cell_to_row_tolerance": 20,
	"row_to_table_tolerance": 10
},

Where:

  • bbox_to_cell_tolerance_x - the maximum horizontal distance in pixels between two bboxes that belong to the same table cell
  • bbox_to_cell_tolerance_y - the maximum vertical distance in pixels between two bboxes that belong to the same table cell
  • cell_to_row_tolerance - the maximum vertical distance in pixels between two cells that belong to the same row
  • row_to_table_tolerance - the maximum vertical distance in pixels between two rows that belong to the same table

The renderer does the following:

  • tries to combine bboxes into cells using bbox_to_cell_tolerance_x and bbox_to_cell_tolerance_y
  • then combines cells into rows using cell_to_row_tolerance
  • then combines rows into tables using row_to_table_tolerance
  • renders words according to cell order
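
The first of these steps, combining bboxes into cells, can be sketched as follows. The renderer's actual algorithm is not shown in this document, so the sketch rests on assumptions: bboxes are (x0, y0, x1, y1) pixel tuples, the distance between two bboxes is measured edge to edge along each axis, and neighbours are merged transitively with a union-find structure. The names gap and group_into_cells are hypothetical.

```python
def gap(a_lo, a_hi, b_lo, b_hi):
    """Distance between two 1-D intervals; 0 when they overlap."""
    return max(0, max(a_lo, b_lo) - min(a_hi, b_hi))


def group_into_cells(bboxes, tol_x=10, tol_y=10):
    """Group bbox indexes into cells; bboxes are (x0, y0, x1, y1) tuples."""
    parent = list(range(len(bboxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, a in enumerate(bboxes):
        for j in range(i + 1, len(bboxes)):
            b = bboxes[j]
            close_x = gap(a[0], a[2], b[0], b[2]) <= tol_x
            close_y = gap(a[1], a[3], b[1], b[3]) <= tol_y
            if close_x and close_y:
                parent[find(i)] = find(j)  # same cell

    cells = {}
    for i in range(len(bboxes)):
        cells.setdefault(find(i), []).append(i)
    return sorted(cells.values())


# Two words 5 px apart and a third one far to the right (made-up coordinates).
bboxes = [(0, 0, 40, 10), (45, 0, 80, 10), (200, 0, 240, 10)]
cells = group_into_cells(bboxes, tol_x=10, tol_y=10)
```

The same idea could be repeated with cell_to_row_tolerance and row_to_table_tolerance to build rows from cells and tables from rows.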

With debug=true, a debug JPG with the table layout is uploaded to storage.

The renderer preserves page elements:

  • <div class="ocr_page"> → <p>

Here is a typical rendered html:

<html>
	<body>
		<p>
			<table>
				<tr>
					<td>
						<span id="word_0_3">DATE</span>
					</td>
					<td>
						<span id="word_0_13">08</span>
						<span id="word_0_14">Mar,</span>
						<span id="word_0_15">2020</span>
					</td>
					<td>
						<span id="word_0_4">INVOICE</span>
						<span id="word_0_5">NO</span>
					</td>
					<td>
						<span id="word_0_16">4453074013</span>
					</td>
					<td>
						<span id="word_0_6">Park</span>
						<span id="word_0_7">City</span>
						<span id="word_0_8">Group</span>
						<span id="word_0_9">DC</span>
						<span id="word_0_10">087</span>
						<span id="word_0_11">Jackson</span>
						<span id="word_0_12">Drive</span>
						<span id="word_0_17">Washington,</span>
						<span id="word_0_18">86-723</span>
						<span id="word_0_19">+86</span>
						<span id="word_0_20">(824)</span>
						<span id="word_0_21">519-7851</span>
						<span id="word_0_22">citizens@corp.com</span>
					</td>
				</tr>
			</table>
		</p>
	</body>
</html>

Table-Rows HOCR to HTML rendering (table-rows)

This renderer uses the same page grouping mechanism as the table rendering, but instead of putting <table> into the resulting HTML, it outputs only rows without cell grouping:

  • <div class="ocr_page"> → <p>
  • row → <div>
  • <span class="ocrx_word"> → <span id="word_[Page  index]_[Word index on page]">[Word]</span>

Here is a typical rendered html:

<html>
	<body>
		<p>
			<div>
				<span id="word_0_1">INVOICE</span>
			</div>
			<div>
				<span id="word_0_48">QUANTITY</span>
				<span id="word_0_49">DESCRIPTION</span>
				<span id="word_0_50">UNIT</span>
				<span id="word_0_51">PRICE</span>
				<span id="word_0_52">LINE</span>
				<span id="word_0_53">TOTAL</span>
			</div>
			<div>
				<span id="word_0_54">19.00</span>
				<span id="word_0_55">Initation</span>
				<span id="word_0_56">crab</span>
				<span id="word_0_57">meat</span>
				<span id="word_0_60">Mountain</span>
				<span id="word_0_61">food</span>
				<span id="word_0_62">magic</span>
				<span id="word_0_63">healthy</span>
				<span id="word_0_64">yummy</span>
				<span id="word_0_65">food</span>
				<span id="word_0_58">$150.00</span>
				<span id="word_0_59">$2850.00</span>
			</div>
		</p>
	</body>
</html>