Information extraction (IE) is the automated retrieval of specific information related to a selected topic from input data. Information extraction tools make it possible to pull information from text documents, databases, websites, or multiple sources.
EasyRPA provides infrastructure to create and run machine learning models that extract information from PDF, images, TXT and HTML documents.
Currently, the platform provides the following sets of IE models:
hOCR source based
HTML source based
hOCR source
The input documents are PDFs and images that are converted into hOCR by the platform's OCR.
A list of entities is the result of the model execution. An entity consists of a label name, count index, label content, and OCR words that match the entity region.
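The entity structure described above can be pictured with a small data class. This is only an illustrative sketch: the class and field names (`Entity`, `OcrWord`, `word_id`) are assumptions made for the example and are not the platform's actual types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OcrWord:
    """One OCR word that falls inside the entity region (hypothetical shape)."""
    word_id: str
    text: str


@dataclass
class Entity:
    """Illustrative container for one extracted entity."""
    label: str    # label name, e.g. "Invoice Number"
    index: int    # count index of the entity for this label
    content: str  # label content (the extracted text)
    words: List[OcrWord] = field(default_factory=list)  # matching OCR words


# Example value, using data from the invoice sample shown later on this page.
entity = Entity(
    label="Invoice Number",
    index=0,
    content="4453074013",
    words=[OcrWord(word_id="word_0_16", text="4453074013")],
)
```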
The platform uses the spaCy NLP library internally for data processing in the following models:
ml_ie_spacy2_model
ml_ie_spacy3_model
ml_iehtml_spacy2_model
ml_iehtml_spacy3_model
Information Extraction as a Pipeline
The Information Extraction process is implemented in EasyRPA as a pipeline. There is more to this pipeline than ML models: the platform also includes several options for extending ML with rules and dictionaries.
Let's take a closer look at both processes, model training and execution, and at the stages that make up each of them.
Model Training Process
Model training
This step of EasyRPA involves training the ML model using the provided training set. The system automatically shuffles the provided set, runs training for a specified number of iterations, and selects the best model.
A process developer can specify the model type, the number of training iterations, and other settings in a configuration JSON file.
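The training step described above (shuffle the set, train for a given number of iterations, keep the best model) can be sketched generically as follows. This is not the platform's trainer: `train_one_iteration` and `score` are hypothetical callables, and reshuffling before every iteration is an assumption, since the text only says that the set is shuffled automatically.

```python
import random
from typing import Any, Callable, List, Optional, Sequence, Tuple


def train_and_select_best(
    samples: Sequence[Any],
    iterations: int,
    train_one_iteration: Callable[[Optional[Any], List[Any]], Any],
    score: Callable[[Any], float],
    seed: int = 0,
) -> Tuple[Any, float]:
    """Run `iterations` training passes and return the best-scoring model."""
    rng = random.Random(seed)
    data = list(samples)
    model: Optional[Any] = None
    best_model, best_score = None, float("-inf")
    for _ in range(iterations):
        rng.shuffle(data)                       # assumption: reshuffle each pass
        model = train_one_iteration(model, data)
        current = score(model)
        if current > best_score:                # keep the best model seen so far
            best_model, best_score = model, current
    return best_model, best_score


# Toy usage: the "model" is just a pass counter and the score peaks at pass 3.
best_model, best_score = train_and_select_best(
    samples=["doc1", "doc2", "doc3"],
    iterations=5,
    train_one_iteration=lambda model, data: (model or 0) + 1,
    score=lambda model: -((model - 3) ** 2),
)
```

In the toy run the later passes score worse than pass 3, so the model from pass 3 is the one returned, which is the point of selecting the best model instead of the last one.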
Package creation
The trained model is packaged with its configuration files and uploaded to the Nexus repository.
Information Extraction Process
Model execution
The model is run once for each document.
Model Training Configuration File
To train a spaCy Information Extraction model, you need to provide a JSON file that defines the configuration parameters for the training process.
Let's take a closer look at these configuration settings.
ocr_fixes(list of objects)(optional) - values that should be replaced with other values are defined here. For example, the value "G4LD" can be replaced with the value "64LD".
trainer_name(string)(required) - a Python artifact that produces model packages for processing with a specific model type. It contains two modules: a module for training on tagged data and generating a trained model package, and a module for downloading the trained model from the Nexus or from the cache and running it on the input data. Please refer to Out of the box IE models and Out of the box IEHTML models for more details.
trainer_description(string)(required) - a trainer description.
lang(string)(optional) - the language of input data. The default value is 'en'.
iterations(number)(optional) - number of iterations of model training on a given training set. The default value is '30'.
concat_single_entities(boolean)(optional) - The default value is true.
post_processing_rules(list of objects)(optional) - after NER extraction, the model uses EntityMatcher with the rules defined in post_processing_rules.json. The configuration JSON should contain a list of label names with regular expressions for searching for entities.
base_model_patterns(list of objects)(optional) - used to configure EntityRuler for labeling datum elements. It runs before fetching data and provides the model with additional information on the document structure, increasing the accuracy of data extraction.
labels(list of objects)(optional) - labels are added to the NER pipe at the training stage. If the configuration is empty, all labels found in the training dataset are automatically added to the model, and the output dimension is inferred automatically (an expensive operation). The multiplicity flag affects how the entity index is calculated at the processing stage. For labels with multiplicity set to True, the index increments through the whole document, while for labels with multiplicity set to False the index is always zero.
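The effect of the multiplicity flag on the entity index can be illustrated with a short sketch. It assumes that the index starts at zero and is counted per label; the function name `assign_indexes` and the input shapes are hypothetical and do not reflect the platform's internal code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def assign_indexes(
    extracted: Iterable[Tuple[str, str]],
    multiplicity: Dict[str, bool],
) -> List[Tuple[str, int, str]]:
    """Attach an index to each (label, content) pair in document order."""
    counters: Dict[str, int] = defaultdict(int)
    result = []
    for label, content in extracted:
        if multiplicity.get(label, False):
            index = counters[label]   # increments through the whole document
            counters[label] += 1
        else:
            index = 0                 # labels without multiplicity always get 0
        result.append((label, index, content))
    return result


indexed = assign_indexes(
    [("Invoice Number", "4453074013"), ("Product Name", "Tomato"), ("Product Name", "Crab")],
    {"Product Name": True, "Invoice Number": False},
)
```

Under these assumptions the two "Product Name" entities receive indexes 0 and 1, while "Invoice Number" keeps index 0.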
The ml_ie_openai_model uses the OpenAI API to call LLMs for request processing.
To work with the ml_ie_openai_model, you should specify the OPENAI_API_KEY during installation or update it in the .env file of your CS installation.
You can also change OPENAI_BASE_URL to switch to another LLM provider. To do this, you need to define the environment variable for the ml container on the CS installation machine, for example:
The model minifies the hOCR HTML (depending on the hocr2text rendering selected in the model configuration) and then sends OpenAI a request like this:
You are a good expert of extracting data from invoice documents. You receive HTML document as the result of OCR processing of scanned invoice, and the list of fields you should extract.
As an output you have to provide csv file with two columns: field tag and list of HTML tags "id" property. Pay attention that one extracted field may have several tags.
For table items provide a separate line for each row.
For example:
###BEGIN OF EXAMPLE
User ask you to extract:
```
Find all accounts in the balance sheet and for each item found extract:
- company name with tag COMPANY
- account with tag ACCOUNT
- balance with tag BALANCE
Do not tag table headers.
```
Your input HTML is:
```html
<html>
<body>
<p>
<div><span id="word_0_1">Remittance</span><span id="word_0_2">Advice</span></div>
<div><span id="word_0_3">Company:</span> <span id="word_0_4">IBA</span><span id="word_0_5">Group</span></div>
<div><span id="word_0_6">Income</span><span id="word_0_7">Fund</span></div>
</p>
<p>
<div><span id="word_1_1">ACCOUNTS</span><span id="word_1_2">BALANCE</span></div>
<div><span id="word_1_3">12341234</span><span id="word_1_4">$5000</span></div>
<div><span id="word_1_5">22354123</span><span id="word_1_6">$1000</span></div>
</p>
</body>
</html>
```
Your answer should be:
```
"field_name","tag_id"
"COMPANY","word_0_4,word_0_5"
"ACCOUNT","word_1_3"
"BALANCE","word_1_4"
"ACCOUNT","word_1_5"
"BALANCE","word_1_6"
```
###END OF EXAMPLE
Now your task is the following:
```
Find all items in the invoice and for each item found extract:
- item name with tag PRODUCT
- description with tag DESCRIPTION
- unit price with tag PRICE.
- quantity with tag QUANTITY
Do not tag table headers. Combine multiple lines of description tag into one tag if possible.
Also extract invoice information:
- Company name of the client with tag CLIENT
- Client address with tag ADDRESS
- Invoice number with tag INVOICENUMBER
- Date of issue with tag ISSUED
- Due Date with tag DUE_DATE
- Total amount, TOTAL
```
Your input HTML is:
```html
<html><body><p><div><span id="word_0_1">INVOICE</span><span id="word_0_2">a</span></div><div><span id="word_0_3">DATE</span><span id="word_0_13">08</span><span id="word_0_14">Mar,</span><span id="word_0_15">2020</span><span id="word_0_4">INVOICE</span><span id="word_0_5">NO</span><span id="word_0_16">4453074013</span><span id="word_0_6">Park</span><span id="word_0_7">City</span><span id="word_0_8">Group</span><span id="word_0_9">DC</span><span id="word_0_10">087</span><span id="word_0_11">Jackson</span><span id="word_0_12">Drive</span><span id="word_0_17">Washington,</span><span id="word_0_18">86-723</span><span id="word_0_19">+86</span><span id="word_0_20">(824)</span><span id="word_0_21">519-7851</span><span id="word_0_22">citizens@corp.com</span></div><div><span id="word_0_23">INVOICE</span><span id="word_0_24">TO</span></div><div><span id="word_0_25">Truett-Hurst,</span><span id="word_0_26">Inc.</span><span id="word_0_27">869</span><span id="word_0_28">Summerview</span><span id="word_0_29">Center</span><span id="word_0_30">Balchik,</span><span id="word_0_31">62021</span><span id="word_0_32">+92</span><span id="word_0_33">(538)</span><span id="word_0_34">622-2228</span><span id="word_0_35">gspeddin12@eepurl.com</span></div><div><span id="word_0_36">SALESPERSON</span><span id="word_0_37">JOB</span><span id="word_0_38">PAYMENT</span><span id="word_0_39">TERMS</span><span id="word_0_40">DUE</span><span id="word_0_41">DATE</span></div><div><span id="word_0_42">Due</span><span id="word_0_43">on</span><span id="word_0_44">Receipt</span><span id="word_0_45">08</span><span id="word_0_46">May,</span><span id="word_0_47">2020</span></div><div><span id="word_0_48">QUANTITY</span><span id="word_0_49">DESCRIPTION</span><span id="word_0_50">UNIT</span><span id="word_0_51">PRICE</span><span id="word_0_52">LINE</span><span id="word_0_53">TOTAL</span></div><div><span id="word_0_54">19.00</span><span id="word_0_55">Initation</span><span id="word_0_56">crab</span><span 
id="word_0_57">meat</span><span id="word_0_60">Mountain</span><span id="word_0_61">food</span><span id="word_0_62">magic</span><span id="word_0_63">healthy</span><span id="word_0_64">yummy</span><span id="word_0_65">food</span><span id="word_0_58">$150.00</span><span id="word_0_59">$2850.00</span></div><div><span id="word_0_66">11.00</span><span id="word_0_67">Tomato</span><span id="word_0_68">Devine</span><span id="word_0_69">healthy</span><span id="word_0_70">desire</span><span id="word_0_71">organic</span><span id="word_0_72">crimson</span><span id="word_0_73">fresh</span><span id="word_0_74">$192.00</span><span id="word_0_75">$2112.00</span></div><div><span id="word_0_76">Subtotal</span><span id="word_0_79">Discount</span><span id="word_0_80">15.00%</span><span id="word_0_85">Sales</span><span id="word_0_86">Tax</span><span id="word_0_87">20.00%</span><span id="word_0_77">$</span><span id="word_0_78">4962.00</span><span id="word_0_81">$</span><span id="word_0_82">893.16</span><span id="word_0_83">$</span><span id="word_0_84">992.40</span></div><div><span id="word_0_88">Total</span><span id="word_0_89">$5061.24</span></div><div><span id="word_0_90">TRANSFER</span><span id="word_0_91">DETAILS</span></div><div><span id="word_0_92">Bank</span><span id="word_0_93">Transfer</span><span id="word_0_94">BANK</span><span id="word_0_100">Income</span><span id="word_0_101">II</span><span id="word_0_96">Convertible</span><span id="word_0_97">&</span><span id="word_0_103">Number</span><span id="word_0_98">Routing</span><span id="word_0_99">Number</span><span id="word_0_104">8284352178</span></div></p></body></html>
```
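The prompt above asks for a CSV answer with a field tag and a comma-separated list of span ids per line. A minimal sketch of how such an answer could be parsed is shown below; the function name `parse_tagged_response` is hypothetical, and skipping malformed rows is an assumption of this example, not documented model behavior.

```python
import csv
import io
from typing import List, Tuple


def parse_tagged_response(csv_text: str) -> List[Tuple[str, List[str]]]:
    """Turn the CSV answer into (field tag, [span ids]) pairs, in order."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    # Drop the header row if the answer includes it.
    if rows and [cell.strip().lower() for cell in rows[0]] == ["field_name", "tag_id"]:
        rows = rows[1:]
    parsed = []
    for row in rows:
        if len(row) != 2:
            continue  # assumption: ignore blank or malformed lines
        field_tag, ids = row
        parsed.append((field_tag.strip(), [i.strip() for i in ids.split(",") if i.strip()]))
    return parsed


answer = '''"field_name","tag_id"
"COMPANY","word_0_4,word_0_5"
"ACCOUNT","word_1_3"
"BALANCE","word_1_4"
'''
tagged = parse_tagged_response(answer)
```

Because one field may span several words, the second column is split on commas after the CSV quoting has been removed.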
The OpenAI request is customizable; the sections below explain how to do this.
Model Training
The training process creates a new model with the default prompts configuration. You can use any document set with existing training data to train a model. The trainer does not use the training data; only the training configuration is used. Here is a sample model training configuration:
{
"debug": false,
"messages": [
{
"role": "system",
"content": "{systemRolePrompt}"
},
{
"role": "user",
"content": "{userRolePrompt}"
}
],
"systemRolePrompt": "You are a good expert of extracting data from invoice documents. You receive HTML document as the result of OCR processing of scanned invoice, and the list of fields you should extract.
As an output you have to provide csv file with two columns: field tag and list of HTML tags "id" property. Pay attention that one extracted field may have several tags.
For table items provide a separate line for each row.
For example:
###BEGIN OF EXAMPLE
User ask you to extract:
```
Find all accounts in the balance sheet and for each item found extract:
- company name with tag COMPANY
- account with tag ACCOUNT
- balance with tag BALANCE
Do not tag table headers.
```
Your input HTML is:
```html
<html>
<body>
<p>
<div><span id="word_0_1">Remittance</span><span id="word_0_2">Advice</span></div>
<div><span id="word_0_3">Company:</span> <span id="word_0_4">IBA</span><span id="word_0_5">Group</span></div>
<div><span id="word_0_6">Income</span><span id="word_0_7">Fund</span></div>
</p>
<p>
<div><span id="word_1_1">ACCOUNTS</span><span id="word_1_2">BALANCE</span></div>
<div><span id="word_1_3">12341234</span><span id="word_1_4">$5000</span></div>
<div><span id="word_1_5">22354123</span><span id="word_1_6">$1000</span></div>
</p>
</body>
</html>
```
Your answer should be:
```
"field_name","tag_id"
"COMPANY","word_0_4,word_0_5"
"ACCOUNT","word_1_3"
"BALANCE","word_1_4"
"ACCOUNT","word_1_5"
"BALANCE","word_1_6"
```
###END OF EXAMPLE",
"userRolePrompt": "Now your task is the following:
```
Find all items in the invoice and for each item found extract:
- item name with tag PRODUCT
- description with tag DESCRIPTION
- unit price with tag PRICE.
- quantity with tag QUANTITY
Do not tag table headers. Combine multiple lines of description tag into one tag if possible.
Also extract invoice information:
- Company name of the client with tag CLIENT
- Client address with tag ADDRESS
- Invoice number with tag INVOICENUMBER
- Date of issue with tag ISSUED
- Due Date with tag DUE_DATE
- Total amount, TOTAL
```
Your input HTML is:
```html
{html}
```",
"html": "",
"open_ai_model": "gpt-4o",
"hocr2html": {
"type": "table",
"bbox_to_cell_tolerance_x": 10,
"bbox_to_cell_tolerance_y": 10,
"cell_to_row_tolerance": 20,
"row_to_table_tolerance": 10
},
"tag_to_entity": {
"PRODUCT": "Product Name",
"DESCRIPTION": "Product Description",
"QUANTITY": "Quantity",
"PRICE": "Price",
"CLIENT": "Company Name",
"ADDRESS": "Street Address",
"INVOICENUMBER": "Invoice Number",
"ISSUED": "Invoice Date",
"DUE_DATE": "Due Date",
"TOTAL": "Total Amount"
}
}
where:
prompts_config - the default prompts configuration saved in the trained model
messages - the prompt message structure that is sent to the OpenAI API
html - the simplified HTML of the document, which the model creates and injects into the prompt context
open_ai_model - the OpenAI model to use; the default is gpt-4o
tag_to_entity - a mapping from response tags to entity names, used to convert the OpenAI-tagged document into document entities
debug - a boolean that switches debug messages on
hocr2html - the hOCR-to-HTML rendering configuration
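To illustrate the role of tag_to_entity, the sketch below converts tagged span ids into entities named according to the mapping. It is a simplified illustration: the function `tags_to_entities`, the `words_by_id` lookup, the output dictionary shape, and the choice to ignore unmapped tags are all assumptions of this example.

```python
from typing import Dict, List, Tuple


def tags_to_entities(
    tagged: List[Tuple[str, List[str]]],
    tag_to_entity: Dict[str, str],
    words_by_id: Dict[str, str],
) -> List[dict]:
    """Map (tag, span ids) pairs to entities with readable label names."""
    entities = []
    for tag, ids in tagged:
        name = tag_to_entity.get(tag)
        if name is None:
            continue  # assumption: tags missing from the mapping are ignored
        content = " ".join(words_by_id[i] for i in ids if i in words_by_id)
        entities.append({"label": name, "content": content, "word_ids": ids})
    return entities


entities = tags_to_entities(
    tagged=[("CLIENT", ["word_0_25", "word_0_26"]), ("TOTAL", ["word_0_89"])],
    tag_to_entity={"CLIENT": "Company Name", "TOTAL": "Total Amount"},
    words_by_id={"word_0_25": "Truett-Hurst,", "word_0_26": "Inc.", "word_0_89": "$5061.24"},
)
```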
Prompts configuration
The prompts_config is a map of parameters that the model uses to create an OpenAI request. The model gets it from:
the configuration parameter of the MlTask call
the model's default configuration
The MlTask configuration parameter overrides the existing model default configuration, i.e. you can pass only the changed keys in the MlTask and keep the rest from the default.
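The override behavior described above can be pictured as a key-by-key merge. The sketch below assumes a shallow merge on top-level keys; whether the platform merges nested objects (such as tag_to_entity) key by key is not stated here, and the function name `effective_prompts_config` is hypothetical.

```python
from typing import Any, Dict, Optional


def effective_prompts_config(
    model_default: Dict[str, Any],
    task_override: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Return the default configuration with MlTask keys taking precedence."""
    merged = dict(model_default)         # start from the model's defaults
    merged.update(task_override or {})   # keys from the MlTask call win
    return merged


default_config = {"open_ai_model": "gpt-4o", "debug": False, "userRolePrompt": "default task"}
config = effective_prompts_config(default_config, {"userRolePrompt": "Extract the invoice number."})
```

Only userRolePrompt changes in this example; open_ai_model and debug are kept from the default configuration.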
Here is platform task code that prepares the ML call:
It sends a request with the system ( {systemRolePrompt} ) and user ( {userRolePrompt} ) roles. {systemRolePrompt} and {userRolePrompt} refer to keys from the prompts configuration.
Only one-level key references are allowed in the prompts configuration.
The html key is injected by the model and contains the minified document.
You can completely change the default messages structure, or redefine the systemRolePrompt and userRolePrompt.
The userRolePrompt always needs to be changed according to your document set and the fields you need to extract. It contains the description of the fields that OpenAI should extract.
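The way key references such as {systemRolePrompt} and {html} turn into the final messages can be sketched as a two-step substitution. This is an assumed reading of the rules above (references name top-level keys of the prompts configuration, and html is filled in last); `resolve` and `build_messages` are hypothetical helpers, not the model's code.

```python
import re
from typing import Any, Dict, List

_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def resolve(text: str, config: Dict[str, Any]) -> str:
    """Replace {key} with config[key] when that key holds a string."""
    def substitute(match: "re.Match[str]") -> str:
        value = config.get(match.group(1))
        return value if isinstance(value, str) else match.group(0)
    return _PLACEHOLDER.sub(substitute, text)


def build_messages(config: Dict[str, Any], html: str) -> List[Dict[str, str]]:
    """Expand prompt references, then inject the minified document."""
    prompts = {k: v for k, v in config.items() if k != "html"}
    messages = []
    for message in config["messages"]:
        content = resolve(message["content"], prompts)  # e.g. {userRolePrompt}
        content = content.replace("{html}", html)       # html is injected last
        messages.append({"role": message["role"], "content": content})
    return messages


sample_config = {
    "messages": [
        {"role": "system", "content": "{systemRolePrompt}"},
        {"role": "user", "content": "{userRolePrompt}"},
    ],
    "systemRolePrompt": "You extract fields.",
    "userRolePrompt": "Your input HTML is:\n{html}",
    "html": "",
}
messages = build_messages(sample_config, "<p>doc</p>")
```

Injecting html in a separate final step keeps braces that happen to appear inside the document text from being treated as key references.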
This renderer uses the same table page grouping mechanism as table rendering, but instead of putting a <table> into the resulting HTML, it fills in only rows, without cell grouping:
<div class="ocr_page"> → <p>
row → <div>
<span class="ocrx_word"> → <span id="word_[Page index]_[Word index on page]">[Word]</span>
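The mapping above can be sketched as a small rendering function. The input shape (pages as lists of rows, each word as a pair of its index on the page and its text) and the function name `render_minified` are assumptions of this example; grouping words into rows is taken as already done, and escaping the word text is a choice made in this sketch.

```python
from html import escape
from typing import List, Tuple

Word = Tuple[int, str]   # (word index on page, word text)
Row = List[Word]
Page = List[Row]


def render_minified(pages: List[Page]) -> str:
    """Render pages as <p>, rows as <div>, and words as <span id=...>."""
    parts = ["<html><body>"]
    for page_index, rows in enumerate(pages):
        parts.append("<p>")                       # one <p> per OCR page
        for row in rows:
            parts.append("<div>")                 # one <div> per row
            for word_index, text in row:
                parts.append(
                    f'<span id="word_{page_index}_{word_index}">{escape(text)}</span>'
                )
            parts.append("</div>")
        parts.append("</p>")
    parts.append("</body></html>")
    return "".join(parts)


minified = render_minified([
    [[(1, "Remittance"), (2, "Advice")]],
    [[(1, "ACCOUNTS"), (2, "BALANCE")]],
])
```

The word index is carried with each word instead of being recounted during rendering, because in the invoice sample on this page the ids inside a row are not always in ascending order.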