Preparing for Tagging

This stage is critically important because the Data Analyst establishes the foundation for the entire data collection process. Any issues overlooked or unresolved at this point can lead to significant problems in later stages. The main aim of this stage is to prepare everything for tagging to start.

Key activities in this stage include:

  • Defining the tagging logic to ensure consistent and accurate tagging criteria,
  • Creating clear, comprehensive tagging instructions to guide SMEs who will tag the documents,
  • Splitting documents into manageable batches, ideally grouped by layout or content type,
  • Designing the human task and sending the batches of documents into Workspace ready for tagging.

At the end of this stage, the Data Analyst assigns specific human tasks to each SME and provides each of them with a batch and its instructions. As a result, the batches and human tasks are prepared and assigned to SMEs, and all the preparation needed for the first tagging iteration is complete.

  • The Human Task should be configured according to the business logic of all the documents. If a mistake is discovered during tagging, it will take extra time and effort to reconfigure the Human Task, re-tag documents, and even retrain the model (if training iterations start in parallel with tagging).
  • Inconsistent or incorrect tagging can result in a poor-quality data set or in extra time spent re-tagging and verifying documents.

The following sections will guide you through each step in detail.

Start Working with EasyRPA

To log in to the system for the first time, follow the link to EasyRPA.

If you are an LDAP user, enter your LDAP username and password on the login screen. Upon successful login, your user account will be automatically created in the system.

If you are an LDAP user and unable to log in, contact your System Administrator to verify whether you belong to the appropriate LDAP group.
If you are not an LDAP user, contact your System Administrator to create a user account for you in EasyRPA.

Human Task Design

The human task can be created for one of the following most common use cases:

  • Information Extraction (IE)
  • Classification

For model training, we need to find and tag entities (fields) in a document. They will become values for the model to train on.

Human Task Design Best Practices

To create a good human task for each case, consider the business logic of all the documents, the document types, and their layouts. In Information Extraction use cases, pay special attention to the logic and appearance of the fields.

Split into batches

When working with a large training set - especially if it contains over 100 documents or includes multiple distinct layouts - it can be challenging for a single SME to tag all documents in one session effectively. To improve focus and maintain high-quality annotations, it is recommended to divide the dataset into smaller batches of 30 to 50 documents, ideally grouped by similar layout or format.
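The batching recommendation above can be sketched in a few lines of Python. This is only an illustration, not part of EasyRPA: the layout labels, file names, and the helper names `split_evenly` and `batch_by_layout` are hypothetical, and the documents are assumed to be already grouped by layout.

```python
import math
from typing import Dict, List


def split_evenly(docs: List[str], max_size: int = 50) -> List[List[str]]:
    """Split one layout group into the fewest near-equal batches of at most max_size."""
    if not docs:
        return []
    n_batches = math.ceil(len(docs) / max_size)
    base, extra = divmod(len(docs), n_batches)
    batches, start = [], 0
    for index in range(n_batches):
        # The first `extra` batches take one additional document.
        size = base + (1 if index < extra else 0)
        batches.append(docs[start:start + size])
        start += size
    return batches


def batch_by_layout(docs_by_layout: Dict[str, List[str]], max_size: int = 50) -> List[List[str]]:
    """Keep layouts separate so that every batch contains a single layout."""
    batches = []
    for layout in sorted(docs_by_layout):
        batches.extend(split_evenly(docs_by_layout[layout], max_size))
    return batches


# Hypothetical training set: two layouts with 120 and 45 documents.
docs_by_layout = {
    "layout_a": [f"a_{i:03}.pdf" for i in range(120)],
    "layout_b": [f"b_{i:03}.pdf" for i in range(45)],
}
batches = batch_by_layout(docs_by_layout)
sizes = [len(batch) for batch in batches]
print(sizes)
```

Splitting each layout evenly avoids a very small leftover batch at the end of a group, although a small layout group can still produce a batch below 30 documents.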

Single or Multiple

The Data Analyst reviews the set of fields and determines which fields have only one value per document and which have several; in other words, whether each field is a single value or a multiple value field.

  • Single value field - a field that has exactly one result value in each document. Typical examples are Invoice Number, Invoice Date, Total.
  • Multiple value field - a field that can have several result values in a document. Typical examples are Product Name, Product Description, Quantity, Price.

Decide clearly in advance which fields are single value and which are multiple value, and cover all the cases for each. If a case is missed and only discovered at a later stage, extra effort will be required to redesign the human task and retrain the model.

Required or Optional

  • Required field - a field that is present in all the documents in all the batches. For example, Invoice Number, Invoice Date, etc.
  • Optional field - a field that is present only in some documents or in some batches. These fields usually have special additional logic.

Required and optional fields need to be agreed on together with SMEs. However, if the number of fields is 10 or higher, it's strongly recommended to configure the task so that it cannot be submitted until all the necessary values have been tagged.
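One way to keep the single/multiple and required/optional decisions in one place is to record them per field and generate the field entries from that record. The sketch below is an assumption about workflow, not an EasyRPA feature: `FieldSpec` and `to_categories` are hypothetical names, while the `name`, `multiple`, and `required` keys mirror the Document Type settings example later in this guide.

```python
import json
from typing import Dict, List, NamedTuple


class FieldSpec(NamedTuple):
    name: str
    multiple: bool   # True if the field can have several values per document
    required: bool   # True if the field is present in every document


def to_categories(fields: List[FieldSpec]) -> List[Dict[str, object]]:
    """Turn the agreed field list into entries shaped like the "categories" setting."""
    return [
        {"name": field.name, "multiple": field.multiple, "required": field.required}
        for field in fields
    ]


fields = [
    FieldSpec("Invoice Number", multiple=False, required=True),
    FieldSpec("Invoice Date", multiple=False, required=True),
    FieldSpec("Product Name", multiple=True, required=True),
    FieldSpec("Product Description", multiple=True, required=False),
]
categories = to_categories(fields)
print(json.dumps(categories, indent=2))
```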

Assign Hot Keys

Hot keys are really helpful for quick tagging. To increase speed, assign each field a hot key and teach SMEs how to use them. There are two approaches to assigning convenient hot keys:

  1. Assign intuitively understandable and memorable letters - for example, first letters of fields: 
    d = date, 
    p = price, 
    n = invoice number, etc.
  2. Assign letters and figures that are close to each other on the keyboard:
    1, 2, 3 = first three fields
    q, w, e = second three fields
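The two approaches can also be combined. The following sketch, with the hypothetical helper `assign_hotkeys`, tries the first letter of each field name and falls back to the digit row when that letter is already taken, so that no two fields share a hot key. It assumes that field names are non-empty.

```python
from typing import Dict, List


def assign_hotkeys(field_names: List[str]) -> Dict[str, str]:
    """Give every field a unique hot key: first letter if free, otherwise a digit."""
    fallback = "1234567890"
    used = set()
    hotkeys = {}
    for name in field_names:
        first = name[0].lower()
        if first not in used:
            key = first
        else:
            # First letter is taken: pick the next unused digit.
            key = next((digit for digit in fallback if digit not in used), None)
            if key is None:
                raise ValueError(f"No free hot key left for {name!r}")
        used.add(key)
        hotkeys[name] = key
    return hotkeys


hotkeys = assign_hotkeys(["Invoice Number", "Invoice Date", "Due Date", "Price"])
print(hotkeys)
```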

Split long entities

Check whether the value can be split into independent parts. For example, a whole address line can be very difficult to extract, because address formats usually differ from country to country, or even between document templates. It's recommended to split such a value into shorter entities that are located in one document (Street Address, City, State, Zip code, etc.).

Be careful with value splitting. Values like Company Name, however long they are, should not be split: their parts cannot be considered independent, and the value only makes sense as the whole name.

Enable additional fields

Additional fields can make the human task output more informative (but only if necessary):

  • notes - a textarea field. Notes make it quicker to select documents later: while tagging, SMEs can describe what is wrong with a document, and such documents can then be picked out. Provide an optional comment field in the human task where SMEs can mention a missing field, document structure, etc.
  • radio button to classify documents - If you have, for example, a combined data set of four different invoices, it is useful to add a special field that lists all four invoices as a radio button group. This simplifies further analysis of the data set and its grouping.
  • checkbox to mark an invalid document - Even after OCR tuning, there still can be documents with bad OCR. To avoid tagging such documents and easily find them later, it's recommended to include a special checkbox.

Don't make your Human Task too big

It is recommended to include 7–10 fields in one human task. If the use case requires extracting a larger number of fields, it's better to create several tasks. For example, if you need to extract 30 fields, create three separate tasks, each with its own set of fields. This way, your data set will be processed three times through three different human tasks. Otherwise, too many fields in one task may lead to lower concentration and, consequently, lower quality of the data set.

Provide clear Tagging Instructions

Make sure to provide clear instructions for SMEs who will tag the documents. ML models learn patterns directly from the tagged data. Inconsistent or ambiguous tags teach the model conflicting patterns.  Clear instructions result in clean data, allowing the model to learn correct patterns for better information extraction accuracy. 

How to Design IE Human Task

 To create an IE Human Task, follow these steps:

  1. Create a Document Type - define the entities to be extracted and their properties.
  2. Create a Document Set - establish a structured collection of related documents.
  3. Upload the Training Set - import the prepared training set into the created document set.
  4. Run the Preprocess action - apply transformations to optimize document format and readability.
  5. Run the Send to Workspace action - make the processed documents available for tagging in Workspace.

Before proceeding, ensure that your dataset is split into a training set (80%) and a test set (20%). Only the training set should be uploaded and tagged for model training in Information Extraction.
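A minimal sketch of such an 80/20 split is shown below. It is not an EasyRPA function: `train_test_split` and the file names are hypothetical. The split is done outside the platform, before upload, and a fixed seed makes it repeatable.

```python
import random
from typing import List, Tuple


def train_test_split(documents: List[str], train_share: float = 0.8,
                     seed: int = 42) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of the document list and cut it into train and test parts."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


documents = [f"invoice_{i:03}.pdf" for i in range(100)]
train_set, test_set = train_test_split(documents)
print(len(train_set), len(test_set))
```

If the data set contains several layouts, it may be worth applying the same split to each layout separately so that both parts cover all layouts.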

Create a Document Type

A Document Type determines the structure and presentation of documents in a Human Task, specifying which output fields should be extracted during tagging and used for Machine Learning training.

To create a new Document Type for the Information Extraction use case:

  • Navigate to Administration → Document Types page.

  • Create a new Document Type (Refer to the Create a new Document Type guide for detailed instructions).
  • During creation, ensure that you:
    • Select Information Extraction Task as Human Task Type.
    • Define the Fields to be extracted, provide the Task Title and Tagging Instructions, and add any additional configuration in the JSON structure under the Settings section. Key JSON settings include:
      • taskInstructionText (string, optional) - instruction text displayed in the popup window.

      • taskTypeLabel (string, optional) - configures the task title (default: "Information Extraction").

      • categories (list of objects) (required) - specifies the fields to extract and their parameters.

      • Additional options: 

        • Extra fields can be added under the "More" tab (e.g., notes).

        • Postprocessors can be defined to normalize extracted fields after tagging.

Here is an example of Information Extraction Document Type JSON Settings:

Information Extraction Document Type JSON Structure example
{
	"appLanguage": "en",
	"taskTypeLabel": "IDP Sample Invoice Document Information Extraction",
	"taskInstructionText": "Please extract fields from provided document.",
	"allowCustomValue": true,
	"excludeUndefinedEntities": true,
	"flowDocType": "IDP_INVOICE",
	"categories": [
		{
			"name": "Invoice Number",
			"multiple": false,
			"required": true,
			"validationRegExp": "\\d{10}",
			"errorMessage": "Invoice Number should be a 10 digit number.",
			"hotkey": [
				"n"
			]
		},
		{
			"name": "Invoice Date",
			"multiple": false,
			"required": true,
			"hotkey": [
				"i"
			]
		},
		{
			"name": "Due Date",
			"multiple": false,
			"required": true,
			"hotkey": [
				"d"
			]
		},
		{
			"name": "Company Name",
			"multiple": false,
			"required": true,
			"hotkey": [
				"c"
			]
		},
		{
			"name": "Street Address",
			"multiple": false,
			"required": false,
			"hotkey": [
				"1"
			]
		},
		{
			"name": "City",
			"multiple": false,
			"required": false,
			"hotkey": [
				"2"
			]
		},
		{
			"name": "Zip Code",
			"multiple": false,
			"required": false,
			"hotkey": [
				"3"
			]
		},
		{
			"name": "Phone Number",
			"multiple": false,
			"required": false,
			"hotkey": [
				"4"
			]
		},
		{
			"name": "E-mail",
			"multiple": false,
			"required": false,
			"hotkey": [
				"5"
			]
		},
		{
			"name": "Product Name",
			"multiple": true,
			"required": true,
			"group": "products",
			"hotkey": [
				"6"
			]
		},
		{
			"name": "Product Description",
			"multiple": true,
			"required": false,
			"group": "products",
			"hotkey": [
				"7"
			]
		},
		{
			"name": "Quantity",
			"multiple": true,
			"required": false,
			"group": "products",
			"hotkey": [
				"8"
			]
		},
		{
			"name": "Price",
			"multiple": true,
			"required": true,
			"group": "products",
			"hotkey": [
				"9"
			]
		},
		{
			"name": "Tax Rate",
			"multiple": false,
			"required": false,
			"hotkey": [
				"t"
			]
		},
		{
			"name": "Discount Rate",
			"multiple": false,
			"required": false,
			"hotkey": [
				"r"
			]
		},
		{
			"name": "Total Discount",
			"multiple": false,
			"required": false,
			"hotkey": [
				"s"
			]
		},
		{
			"name": "Total Amount",
			"multiple": false,
			"required": true,
			"hotkey": [
				"a"
			]
		}
	],
	"metadata": [
		{
			"name": "isInvalid",
			"markLabel": "INVALID Document",
			"description": "Select, if you have problem with the document"
		},
		{
			"name": "error_message",
			"label": "Problem explanation",
			"type": "textarea",
			"required": false
		},
		{
			"name": "notes",
			"label": "Document notes",
			"type": "textarea",
			"required": false
		},
		{
			"name": "document_type",
			"label": "Document type",
			"type": "radio_group",
			"required": false,
			"items": [
				{
					"value": "INVOICE 1",
					"disabled": false
				},
				{
					"value": "INVOICE 2",
					"disabled": false
				},
				{
					"value": "INVOICE 3",
					"disabled": false
				},
				{
					"value": "INVOICE 4",
					"disabled": false
				}
			]
		}
	],
	"preprocessPostProcessors": [
		{
			"name": "removeWordIfConfidenceLessThan",
			"confidence": "50.0"
		}
	],
	"mlPostProcessors": [
		{
			"entityName": "Invoice Number",
			"name": "regexReplacement",
			"rules": {
				"o|O|e|c|C|Q|p|P": "0",
				"I|i|j": "1",
				"b|G": "6",
				"B": "8",
				"q": "9"
			}
		},
		{
			"entityName": "Quantity",
			"name": "regexReplacement",
			"rules": {
				"o|O|e|c|C|Q|p|P": "0",
				"I|i|j": "1",
				"b|G": "6",
				"B": "8",
				"q": "9"
			}
		},
		{
			"entityName": "Price",
			"name": "regexReplacement",
			"rules": {
				"o|O|e|c|C|Q|p|P": "0",
				"I|i|j": "1",
				"b|G": "6",
				"B": "8",
				"q": "9",
				"\\D": " "
			}
		},
		{
			"entityName": "Tax Rate",
			"name": "regexReplacement",
			"rules": {
				"o|O|e|c|C|Q|p|P": "0",
				"I|i|j": "1",
				"b|G": "6",
				"B": "8",
				"q": "9"
			}
		},
		{
			"entityName": "Discount Rate",
			"name": "regexReplacement",
			"rules": {
				"o|O|e|c|C|Q|p|P": "0",
				"I|i|j": "1",
				"b|G": "6",
				"B": "8",
				"q": "9"
			}
		},
		{
			"entityName": "Total Discount",
			"name": "regexReplacement",
			"rules": {
				"o|O|e|c|C|Q|p|P": "0",
				"I|i|j": "1",
				"b|G": "6",
				"B": "8",
				"q": "9"
			}
		},
		{
			"entityName": "Total Amount",
			"name": "regexReplacement",
			"rules": {
				"o|O|e|c|C|Q|p|P": "0",
				"I|i|j": "1",
				"b|G": "6",
				"B": "8",
				"q": "9"
			}
		},
		{
			"entityName": "Invoice Number",
			"name": "trim"
		},
		{
			"entityName": "Product Name",
			"name": "mergeCloseEntities",
			"width": 8,
			"height": 42
		},
		{
			"entityName": "Product Description",
			"name": "mergeCloseEntities",
			"width": 8,
			"height": 42
		},
		{
			"name": "ocrPositionBasedGrouping"
		}
	],
	"validators": [
		{
			"entityName": "Quantity",
			"name": "isBigDecimal",
			"strict": true,
			"message": {
				"severity": "error",
				"text": "Quantity should be a number."
			}
		},
		{
			"entityName": "Price",
			"name": "isAmount",
			"message": {
				"severity": "error",
				"text": "Price should be an amount."
			}
		},
		{
			"entityName": "Tax Rate",
			"name": "isBigDecimal",
			"message": {
				"severity": "error",
				"text": "Tax Rate should be number."
			}
		},
		{
			"entityName": "Discount Rate",
			"name": "isBigDecimal",
			"message": {
				"severity": "error",
				"text": "Discount Rate should be number."
			}
		},
		{
			"entityName": "Total Discount",
			"name": "isAmount",
			"message": {
				"severity": "error",
				"text": "Total Discount should be an amount."
			}
		},
		{
			"entityName": "Total Amount",
			"name": "isAmount",
			"message": {
				"severity": "error",
				"text": "Total Amount should be an amount."
			}
		},
		{
			"name": "idpSampleValidateInvoiceAmounts"
		}
	]
}

For a detailed reference on Information Extraction Document Type JSON settings, see Document Type Settings JSON Structure.
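To get a feel for what the `regexReplacement` rules and the `validationRegExp` in the example are meant to achieve, the sketch below imitates them with Python's `re` module. It rests on two assumptions that the platform documentation should confirm: that rules are applied one after another as plain regular-expression substitutions, and that the validation expression must match the whole value. `apply_regex_replacement` is a hypothetical helper, not an EasyRPA API.

```python
import re
from typing import Dict


def apply_regex_replacement(value: str, rules: Dict[str, str]) -> str:
    """Apply each pattern -> replacement rule in order (assumed semantics)."""
    for pattern, replacement in rules.items():
        value = re.sub(pattern, replacement, value)
    return value


# Rules copied from the Invoice Number post-processor in the example above.
rules = {
    "o|O|e|c|C|Q|p|P": "0",
    "I|i|j": "1",
    "b|G": "6",
    "B": "8",
    "q": "9",
}

raw_value = " 1O0234567B "  # hypothetical OCR output with typical letter/digit confusions
cleaned = apply_regex_replacement(raw_value, rules).strip()  # strip() stands in for "trim"
is_valid = re.fullmatch(r"\d{10}", cleaned) is not None
print(cleaned, is_valid)
```

Rules of this kind only make sense for fields that are expected to be purely numeric; applied to a text field, they would corrupt ordinary letters.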

To explore the Document Types module UI, refer to Document Types.
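Because the settings JSON is edited by hand, a quick consistency check before saving can catch common slips such as a hot key assigned twice or a validator that points at a field that is not defined. The sketch below is a hypothetical helper (`check_settings`) that relies only on the key names visible in the example above; it does not replace any validation the platform performs.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def check_settings(settings: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a Document Type settings dict."""
    problems = []
    categories = settings.get("categories", [])
    names = {category["name"] for category in categories}

    # Every hot key should belong to exactly one field.
    hotkeys = Counter(key for category in categories for key in category.get("hotkey", []))
    for key, count in hotkeys.items():
        if count > 1:
            problems.append(f"hot key {key!r} is assigned to {count} fields")

    # Post-processors and validators should reference defined fields only.
    for section in ("mlPostProcessors", "validators"):
        for item in settings.get(section, []):
            entity = item.get("entityName")
            if entity is not None and entity not in names:
                problems.append(f"{section}: unknown field {entity!r}")
    return problems


# Deliberately broken sample: duplicate hot key and an undefined field.
settings = json.loads("""
{
  "categories": [
    {"name": "Invoice Number", "multiple": false, "required": true, "hotkey": ["n"]},
    {"name": "Invoice Date", "multiple": false, "required": true, "hotkey": ["n"]}
  ],
  "validators": [
    {"entityName": "Quantity", "name": "isBigDecimal"}
  ]
}
""")
problems = check_settings(settings)
for problem in problems:
    print(problem)
```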

Create a Document Set

The Document Sets module enables users to create, view, and manage document sets used for training machine learning models.

To create a new Document Set for the Information Extraction use case:

  • Navigate to Machine Learning → Document Sets page.

  • Create a new Document Set (Refer to the Create a new Document Set guide for detailed instructions).
  • During creation, ensure that you:
    • Select the Document Type you created in the previous step in the dropdown,
    • Select IE Document Processor in the Document Processor dropdown,
    • Configure OCR settings as a JSON structure under the Settings section, specifying the Storage bucket name, Tesseract options, Image Magick settings, and HocrFixWords parameters.

Here is an example of Information Extraction Document Set JSON Settings:

Document Set Settings example
{
	"bucket": "data",
	"ocrType": "tesseract",
	"imageMagickOptions": [
		"-units",
		"PixelsPerInch",
		"-density",
		"180",
		"${source}",
		"-background",
		"white",
		"-alpha",
		"remove",
		"-deskew",
		"40%",
		"-normalize",
		"-quality",
		"100",
		"-resample",
		"180"
	],
	"tesseractOptions": [
		"-l",
		"eng",
		"--psm",
		"12",
		"--oem",
		"3",
		"--dpi",
		"180"
	],
	"paddleOcrOptions": [
		"--lang",
		"en"
	],
	"debug": [
		"images"
	],
	"autoTraining": {
		"re_tag": false,
		"test_to_train_percentage": 0.3,
		"train_set": {
			"tags": [
				"ALL_MATCHED"
			],
			"min": 20,
			"max": 500
		},
		"test_set": {
			"tags": [
				"ALL_MATCHED"
			],
			"min": 6,
			"max": 165
		},
		"switch_best_model": {
			"enable": true,
			"re_generate_best_model_report": false,
			"assessment_rule": {
				"type": "perDocument",
				"exclude_keys": []
			},
			"average_group_keys_assessment": true
		},
		"cleanup": {
			"train_set": {
				"max": 3000
			},
			"test_set": {
				"max": 900
			}
		},
		"config": {
			"trainer_name": "ml_ie_spacy2_model",
			"trainer_version": "3.3.0",
			"trainer_description": "Auto training IE",
			"train_config": {
				"lang": "en",
				"iterations": 30
			},
			"process_config": {
				"concat_single_entities": true
			}
		}
	},
	"task": "eu.ibagroup.sample.ml.idp.tasks.AddInvoiceTask"
}

To explore the Document Sets module UI, refer to Document Sets.

To explore the OCR settings, refer to OCR Tuning Guide.

Upload the Training Set

You can upload input documents individually or as a zip archive. Documents can be added during the creation of a new Document Set or uploaded later if an empty Document Set has already been created.

Uploading documents while creating a New Document Set

  1. Click the Add button to upload a .zip file with documents or individual documents to a new Document Set. A file explorer window is displayed.
  2. Select the .zip file or the individual documents to be uploaded into the document set.

Uploading Documents to an Existing Document Set

  1. Navigate to the Document Set you have created.
  2. Click Upload Documents.
  3. Click Add.
  4. Select either a zip archive or individual files to upload.
  5. Click Upload to complete the process.
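If a batch is kept as a list of files, it can be packed into a single zip archive before upload with Python's standard `zipfile` module. The sketch below is a local convenience only; `zip_batch` and the file names are hypothetical, and the demo writes placeholder files to a temporary directory so that it runs on its own.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Iterable


def zip_batch(files: Iterable[Path], archive_path: Path) -> Path:
    """Pack the given files into one flat zip archive (no sub-folders)."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file in files:
            archive.write(file, arcname=Path(file).name)
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    files = []
    for i in range(3):
        path = tmp_dir / f"invoice_{i}.pdf"
        path.write_bytes(b"placeholder content")  # stands in for a real document
        files.append(path)

    archive = zip_batch(files, tmp_dir / "batch_01.zip")
    with zipfile.ZipFile(archive) as packed:
        names = packed.namelist()

print(names)
```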

Preprocess and Send to Workspace

After uploading the training set and configuring all necessary settings, run the IE Document Processor actions to:

  • Optimize document format and readability.

  • Perform OCR processing.

  • Make the processed documents available for tagging in Workspace.

Steps to run IE Document Processor on the documents:

  1. Navigate to the Document Set you have created.
  2. Click Process Documents.
  3. Select Preprocess and Send to Workspace checkboxes.
  4. Click Process to start the automation process.

Once completed, the documents will be available in Workspace for tagging.

To explore the Documents module UI, refer to Documents.

Assigning Human Tasks to Workspace Groups

The Workspace Groups functionality provides a flexible framework for assigning human tasks for document tagging (reviewing) to specific groups of users. This allows data analysts and administrators to tailor task assignment and user permissions to individual use cases, optimizing workflow efficiency and data security.

A key benefit of using Workspace Groups is the ability to separate rights and responsibilities among different worker groups. For example, you can create distinct groups for:

  • General Taggers: Workers with broad access to accept any available Human Task.
  • Specialized Taggers: Teams restricted to working on specific Document Types.
  • Quality Auditors: Users with read-only access to monitor progress and metrics without making changes.

Procedure: How to Configure a Workspace Group for Document Tagging

Prerequisites:

  • A Document Set must be created, configured, and have documents uploaded.
  • Users must be created in the system.

Configuration Steps:

  1. Create a Workspace Group:
    1. Navigate to Administration > Workspace Groups.
    2. Create a new Workspace Group (e.g., WG_Invoice).
  2. Create and Configure a User Group:
    1. Navigate to Administration > Group Management.
    2. Create a new User Group (e.g., UG_Invoice_Taggers).
    3. Assign the necessary permissions for the Workspace Group's Context ID (see "Permission Scenarios" below for details).
    4. Add the relevant users to this group.
  3. Assign the Workspace Group to a Document Set:
    1. Navigate to the Document Sets module.
    2. In the JSON settings for the Document Set, specify the workspaceGroup parameter:
      {
      	"workspaceGroup": "WG_Invoice"
      }
    3. Send the documents to the Workspace.
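The workspaceGroup parameter sits alongside the other Document Set settings, so it should be added to the existing JSON rather than replacing it. A minimal sketch of that merge, done outside the platform with a hypothetical helper `with_workspace_group`, could look like this:

```python
import json


def with_workspace_group(settings_json: str, group_name: str) -> str:
    """Return the settings JSON with the workspaceGroup key added or updated."""
    settings = json.loads(settings_json)
    settings["workspaceGroup"] = group_name
    return json.dumps(settings, indent="\t")


# Shortened stand-in for the current Document Set settings.
current = '{"bucket": "data", "ocrType": "tesseract"}'
updated = with_workspace_group(current, "WG_Invoice")
print(updated)
```

Parsing and re-serializing the settings also confirms that the JSON is still valid before it is pasted back into the Settings section.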

Permission Scenarios for Human Tasks

Assign the following permissions to the User Group for the specific Workspace Group Context ID to achieve the desired behavior:

| Scenario | Required Permissions | User Experience |
| 1. Full-Function Taggers (Users can choose any task from any document type) | READ, UPDATE | Users can freely accept, save, complete, and choose any available Human Task from the Workspace. |
| 2. Document Type Specialists (Users are limited to tasks from a specific document type) | READ, CREATE | Users can only start available tasks from within a specific Document Type row. The system assigns a random available task from that type; users cannot choose individual documents. They can accept, complete, and save these assigned tasks. |
| 3. Randomized Task Assignment (Users receive random tasks from any type to prevent "cherry-picking") | READ, ACTION | Users must use the global "Start Available Task" button. The system assigns a random task from any available document type. Users cannot choose document types or specific documents. |
| 4. Ability to Skip Tasks | Add DELETE to any scenario above | Users have the additional ability to skip an accepted Human Task. |
| 5. Read-Only Observers (For managers and quality auditors) | READ | Users can view the Workspace Group, its details, performance metrics, and notifications. They can preview available Human Tasks but cannot accept or interact with them. |
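When several User Groups are planned, the scenarios above can be written down as a small lookup so that the intended permission set for each group is easy to review. The sketch below only restates the scenarios in code; `SCENARIOS` and `permissions_for` are hypothetical names and nothing here calls EasyRPA.

```python
from typing import List

# Permission sets per scenario, as listed above.
SCENARIOS = {
    "Full-Function Taggers": {"READ", "UPDATE"},
    "Document Type Specialists": {"READ", "CREATE"},
    "Randomized Task Assignment": {"READ", "ACTION"},
    "Read-Only Observers": {"READ"},
}


def permissions_for(scenario: str, allow_skip: bool = False) -> List[str]:
    """Return the sorted permissions to assign for a scenario."""
    permissions = set(SCENARIOS[scenario])
    if allow_skip:
        # Skipping is described as an add-on to the three tagging scenarios only.
        if scenario == "Read-Only Observers":
            raise ValueError("Read-only observers do not accept tasks, so they cannot skip them")
        permissions.add("DELETE")
    return sorted(permissions)


specialists = permissions_for("Document Type Specialists", allow_skip=True)
observers = permissions_for("Read-Only Observers")
print(specialists, observers)
```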

To explore the Workspace Groups module UI, refer to Workspace Groups.

To explore the Group Role Permissions, refer to Role Permissions.