Follow Tagging Rules

Consistent

During manual review of a document, a specialist may find the needed value in several places: in the body of the document, in the header, and so on. From a business perspective, it doesn't matter which occurrence is used as long as the value is correct. From the ML model's perspective, the place and context of the value matter a great deal. The Data Analyst helps to select the best place to tag the value and reflects it in the instructions. Therefore, once the place is chosen, everyone should tag the value there.

A correctly tagged batch of documents is tagged in a uniform manner. Each field in the data set should be tagged consistently for ML training, i.e. fields must be tagged in the same place across a layout. The occurrence of the field in the document has to be chosen and adhered to across all documents in the batch. Incorrectly tagged documents contain the same values tagged in different places.

SCREENSHOT

This rule influences model confidence: when a value is tagged inconsistently, the probability that the model will extract it in production decreases.

Important: Each field should be tagged consistently - at the same place (context) across all the documents of one layout (vendor).
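
One way to check this rule on a tagged batch is sketched below. It assumes a hypothetical export in which every tag is a record with `layout`, `field`, and `context` keys, where `context` is a label for the place the value was tagged (for example `header` or `body`). These names are illustrative assumptions, not part of any product API.

```python
from collections import defaultdict


def find_inconsistent_fields(tags):
    """Return {(layout, field): contexts} for fields tagged in more than one context."""
    contexts = defaultdict(set)
    for tag in tags:
        contexts[(tag["layout"], tag["field"])].add(tag["context"])
    # Keep only the fields that were tagged in different places within one layout.
    return {key: ctx for key, ctx in contexts.items() if len(ctx) > 1}


# Hypothetical sample batch: invoice_number is tagged in two different places.
tags = [
    {"doc": "inv_001", "layout": "vendor_a", "field": "invoice_number", "context": "header"},
    {"doc": "inv_002", "layout": "vendor_a", "field": "invoice_number", "context": "body"},
    {"doc": "inv_001", "layout": "vendor_a", "field": "total", "context": "body"},
    {"doc": "inv_002", "layout": "vendor_a", "field": "total", "context": "body"},
]
inconsistent = find_inconsistent_fields(tags)
```

Any field reported here would need manual review against the tagging instructions to decide which documents to retag.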

Complete

If the field is available in the document, it should be tagged. If it's OK not to tag certain values under certain conditions, the Data Analyst will specify this in each particular case. But in most cases, the value should be tagged. In 80% of cases where values are omitted, it's due to inattentiveness. Omissions may worsen model training for two reasons:

  1. The model will have fewer samples to train on for each field, so statistics for these fields will be lower.
  2. The automation rate will be lower, since the data set for model training will contain the same kind of value both tagged and untagged.

Also, the value should be tagged correctly. All words that make up a string with the value should be included. Any additional words should not be tagged. The shape of the same value should be the same across all the documents.

Completeness has two different aspects:

  1. If a correct value, in the correct form and with the correct context, is given in the document, it must be tagged.
  2. If a field value contains multiple words, we should always tag the whole value with all the words belonging to the field, not only a part of it.
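
The first aspect can be checked mechanically, at least in part. The sketch below assumes a hypothetical mapping from document id to the set of field names tagged in that document, plus a list of fields the instructions require; both structures are illustrative.

```python
def find_missing_fields(documents, required_fields):
    """Map each document id to the sorted list of required fields that have no tag."""
    missing = {}
    for doc_id, tagged_fields in documents.items():
        absent = sorted(set(required_fields) - set(tagged_fields))
        if absent:
            missing[doc_id] = absent
    return missing


# Hypothetical batch: inv_002 has no tag for due_date.
documents = {
    "inv_001": {"invoice_number", "total", "due_date"},
    "inv_002": {"invoice_number", "total"},
}
missing = find_missing_fields(documents, ["invoice_number", "total", "due_date"])
```

A reported field is not necessarily an error, because the value may be genuinely absent from the document. The output is a list of documents worth a second look.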

Let's consider an example:

SCREENSHOT

Important: The entity for the tagged string should correspond to it and be correct.

If some part of a value has to be excluded (for example, there is a requirement that Customer Name shouldn't contain legal endings), it's recommended to tag the full value, correct the data value, and apply post-processing to the model results, so that you have full control over what is changed.
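
A post-processing step of this kind might look like the sketch below. The list of legal endings is a hypothetical example; the real list would come from the project requirements.

```python
import re

# Hypothetical list of legal endings; replace with the one from the requirements.
LEGAL_ENDINGS = ("Inc", "LLC", "Ltd", "GmbH", "Corp")

# Matches a trailing legal ending preceded by whitespace or a comma,
# with an optional final period.
_ENDING_RE = re.compile(
    r"[\s,]+(?:" + "|".join(re.escape(e) for e in LEGAL_ENDINGS) + r")\.?\s*$",
    re.IGNORECASE,
)


def strip_legal_ending(name):
    """Remove one trailing legal ending from an extracted customer name."""
    return _ENDING_RE.sub("", name).strip()
```

For example, `strip_legal_ending("Acme Trading, Inc.")` returns `"Acme Trading"`, while a name without a legal ending is returned unchanged. Because the full value stays tagged in the training data, the rule can be adjusted later without retagging.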

Correct

An entity is the business value that the final user will receive. Let's distinguish between the tagged value and the entity. The tagged value is the string highlighted in the document. The entity is the data corresponding to the highlighted text, as displayed in the fields panel. Ideally, in documents with high-quality OCR, these values coincide. However, the value may differ due to OCR misprints, in which case it should be corrected. This is a special case, and how to handle it should be specified by the Data Analyst.

Correctness of data values means that in each document, every configured field should contain the correct data value. In other words, the values should be exactly what we want the ML model to extract from the original document. Usually, mistakes are caused by:

  • OCR, like iOO8.4O instead of 1008.40 (the data value should be corrected manually in the task),
  • The user (if the value is wrongly corrected manually or the wrong value is selected from the list).

SCREENSHOT
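
OCR misprints in amounts such as the one above can be flagged for the reviewer automatically. The sketch below is only an aid for manual correction, not a replacement for it; the table of confusable characters is an assumption and would need tuning for a real OCR engine.

```python
import re

# Assumed letter-to-digit confusions; adjust for the OCR engine in use.
OCR_CONFUSIONS = str.maketrans({"O": "0", "o": "0", "i": "1", "I": "1", "l": "1"})
_AMOUNT_RE = re.compile(r"^\d+(\.\d+)?$")


def suggest_amount_fix(value):
    """Return a suggested correction, or None if the value already looks
    numeric or cannot be repaired by the substitution table."""
    if _AMOUNT_RE.match(value):
        return None
    candidate = value.translate(OCR_CONFUSIONS)
    return candidate if _AMOUNT_RE.match(candidate) else None
```

With this table, `suggest_amount_fix("iOO8.4O")` returns `"1008.40"`. The suggestion should still be compared with the original document before the data value is changed.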

Normalized

For some values in the Human Task, normalization is applied by default. This applies in particular to dates and amounts: various formats are converted into a single one. In this case, the data value should not be corrected manually. The format of the value is specified by the Data Analyst.

Entities can be given in many different formats across documents. They should be normalized to a single format in all cases, for several reasons. For some fields (for example, dates or amounts), the values are normalized in the Human Task, so they should be normalized in the same way after ML extraction; otherwise, it will be difficult to calculate the statistics correctly. Another reason is that the customer may need to upload this data to SAP or some database, where all the values must be in a particular format.

Values | Normalized
01/18/2017, January 18, 2017, Jan-18-2017 | 18/01/17
10,000,000.00 or 10,000,000 or 10 000 000.00 or 10 000 000 | 10000000.00
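
A minimal sketch of such normalization for the sample values above is shown below. It assumes US-style month-first input dates, English month names, and amounts in which commas and spaces are thousands separators and the period is the decimal separator. The actual target formats are the ones specified by the Data Analyst.

```python
from datetime import datetime
from decimal import Decimal

# Input date formats assumed from the sample values; extend as needed.
DATE_FORMATS = ("%m/%d/%Y", "%B %d, %Y", "%b-%d-%Y")


def normalize_date(text):
    """Convert a date in any known input format to DD/MM/YY."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%d/%m/%y")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def normalize_amount(text):
    """Drop thousands separators and keep exactly two decimal places."""
    cleaned = text.strip().replace(",", "").replace(" ", "")
    return str(Decimal(cleaned).quantize(Decimal("0.01")))
```

Here `normalize_date("January 18, 2017")` returns `"18/01/17"` and `normalize_amount("10 000 000")` returns `"10000000.00"`. Applying the same functions to both the tagged data and the ML output keeps the two comparable when statistics are calculated.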

Diverse

The data set should have good representation of all layouts and fields in order to train the ML model well. Objects from poorly represented or unrepresented layouts may have feature values that make the model treat them as outliers.
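
Layout representation can be checked by counting documents per layout, as in the sketch below. The `min_docs` threshold and the layout names are assumptions for illustration; the source does not prescribe a minimum number of documents per layout.

```python
from collections import Counter


def underrepresented_layouts(doc_layouts, expected_layouts, min_docs):
    """Return {layout: document_count} for expected layouts below min_docs.

    Layouts with no documents at all are reported with a count of 0.
    """
    counts = Counter(doc_layouts)
    return {
        layout: counts[layout]
        for layout in expected_layouts
        if counts[layout] < min_docs
    }


# Hypothetical data set: one layout label per document.
doc_layouts = ["vendor_a"] * 40 + ["vendor_b"] * 3
report = underrepresented_layouts(
    doc_layouts, ["vendor_a", "vendor_b", "vendor_c"], min_docs=10
)
```

In this sample, `vendor_b` is reported with 3 documents and `vendor_c` with 0, so more documents of those layouts would be collected before training.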

Comprehensive

The value, and especially its context, should be comprehensive and accurate so as to provide correct features and weights for extraction by the model. In this regard, attention should be paid to the OCR quality of each field and its context.

OCR Issues

Usually, the Data Analyst estimates OCR quality on a sample of the OCRed documents (10%). All the documents are later reviewed during the tagging process. By that time, the best OCR parameters have been selected and some quality checks have been conducted, but some bad-quality or random documents may still appear in the Manual Task while tagging. The model trains well only on good documents, but once trained it can process even low-quality documents in production. So for data set collection, it's important to identify bad-quality documents and exclude them. It's not obligatory, but Data Analysts usually add a "Bad OCR quality" checkbox to the field panel in the Human Task. It is used to mark totally corrupted documents so that they can be filtered out.
There are some criteria that will help you distinguish high- and low-quality documents even before OCR. Let's consider them.

The main rule for marking a document as "Bad OCR quality" is that most of the text is either lost or cannot be made out. Sometimes, for rare layouts, we include even low-quality documents if higher-quality ones are not available, so that the model can learn at least the fields that are not corrupted during OCR.
Some criteria of a bad OCR quality document:

  • Most of the fields that need to be tagged are corrupted (more than half or any other number specified by the Data Analyst)
  • There are too many misprints in the surrounding context, so the content of the document could hardly be reproduced without consulting the original document
  • There are some stamps, lines, handwritten text or any other data apart from the necessary text that should not normally be there
  • Quite often, original documents have a particular format: for example, data is organized in tables or lines. Sometimes this structure is not preserved after OCR, and you can see that a table is corrupted or, vice versa, data that is supposed to be free text is organized as a table.

SCREENSHOT
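
The first criterion in the list above can be expressed as a simple rule. The sketch below assumes a hypothetical per-document mapping from field name to a flag saying whether the reviewer found the field corrupted. The default threshold of one half follows the criterion and can be replaced by the number the Data Analyst specifies.

```python
def is_bad_ocr(field_corrupted, threshold=0.5):
    """Return True when the share of corrupted fields exceeds the threshold.

    field_corrupted maps each field that needs tagging to True (corrupted)
    or False (readable).
    """
    if not field_corrupted:
        return False
    share = sum(field_corrupted.values()) / len(field_corrupted)
    return share > threshold


# Hypothetical document: three of four fields are corrupted.
doc_fields = {"invoice_number": True, "total": True, "due_date": True, "vendor": False}
bad = is_bad_ocr(doc_fields)
```

The other criteria (misprints in the context, stamps or handwriting, broken table structure) rely on visual judgment and are not covered by this check.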

Please refer to Evaluating and Enhancing Image Quality for OCR for more information on OCR quality estimation.