
Training and Test Sets Split

Once the data set has been collected and checked, the Data Analyst splits it into two parts: a training set and a test set. The training set is used to train the ML model on gold values. For Information Extraction, the gold values are the set of fields that should be extracted; for Classification, each document is accompanied by its target class. The training set makes up 80% of the whole data set. The test set consists of unseen documents used to evaluate the model's performance and to check how it handles possible exception cases in production. The test set makes up the remaining 20% of the whole data set and is not used for model training.

Once the data set check is finished, the Data Analyst should assign each document to one of the following categories:

Category: Good
Description: Absolutely correct and consistently tagged documents: no values are missed, everything is tagged correctly, and the OCR is good.

Category: For Test (only)
Description: Corner cases (Bad OCR, Rare Template): documents that are not eligible for training but that occur in the real-life document flow and can be processed (at least partially). These are documents in which the entities for ALL THE FIELDS present in the original document (except handwritten ones) are correct, but:
  • they belong to a rare template: there are fewer than 10 such documents and they differ significantly from the others (so the model cannot be trained well on so few examples)
  • or there are issues after OCR: values with OCR mistakes, broken structure (so a field can be tagged only with the "Append Mode" option), BAD OCR in the context of fields, etc.

Category: For Training (only)
Description: The positions of the tags that are present are correct and the tags are complete, but some tags are missing (usually due to an OCR issue). Documents with wrong data values or with tags missed because of a tagging mistake can also be used for training, but only if there is no time or capacity for re-tagging. Normally it is better to re-tag such documents.

Category: Re-tag
Description: The document has issues with both data values and tags, but it can still become "Good", "For Training" or "For Test" after re-tagging. For example:
  • a value was not tagged although it is present in the document
  • a field is tagged in the wrong place (inconsistency)
  • a field is not fully tagged (incompleteness)
  • BAD OCR is tagged and the extracted entity is not correct

Category: Exclude
Description: None of the above. The document has a considerable issue that makes it unsuitable for both the training set and the test set. Usually these are documents with very poor original quality or with considerable OCR issues. For example:
  • there are serious OCR problems in all fields and nothing is recognized correctly
  • there are only handwritten fields in the document
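
The category rules above can be summarized as a small lookup from category to the sets a document may enter. This is only an illustrative sketch: the `Category` enum, the exact label strings and the `allowed_sets` helper are assumptions made for the example, not part of any existing tool.

```python
from enum import Enum


class Category(Enum):
    # Label strings are assumed; adjust them to match your "Category" column.
    GOOD = "Good"
    FOR_TEST = "For Test"
    FOR_TRAINING = "For Training"
    RETAG = "Re-tag"
    EXCLUDE = "Exclude"


# Which sets a document of each category is eligible for.
ALLOWED_SETS = {
    Category.GOOD: {"training", "test"},
    Category.FOR_TEST: {"test"},
    Category.FOR_TRAINING: {"training"},
    Category.RETAG: set(),    # must be re-tagged and re-checked first
    Category.EXCLUDE: set(),  # unsuitable for both sets
}


def allowed_sets(label):
    """Return the set names a document with this category label may enter."""
    return ALLOWED_SETS[Category(label.strip())]
```

For example, `allowed_sets("Good")` returns both sets, while `allowed_sets("Re-tag")` returns an empty set until the document is re-tagged and assigned a new category.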

Training set

The training set is the set of documents used for training the ML model. It makes up 80% of the whole data set and contains the gold values for model training. For Information Extraction models, the gold values are the set of fields that should be extracted; for Classification, each document is accompanied by its target class.

The training set should consist only of:

Category: Good
Definition: Absolutely correct and consistently tagged documents: no values are missed, everything is tagged correctly, and the OCR is good.
Quantity: 80% of documents from each layout

Category: For Training (only)
Definition: The positions of the tags that are present are correct and the tags are complete, but some tags are missing (usually due to an OCR issue). Documents with wrong data values or with tags missed because of a tagging mistake can also be used for training, but only if there is no time or capacity for re-tagging. Normally it is better to re-tag such documents.
Quantity: all documents of this category

Test set

The test set is a set of unseen documents used for evaluating an ML model. It makes up 20% of the whole data set and is not used for model training. Its first aim is to run extraction on documents that have not participated in training. Its second aim is to check the model's performance on possible exception cases in production.

The test set should consist of:

Category: Good
Definition: Absolutely correct and consistently tagged documents: no values are missed, everything is tagged correctly, and the OCR is good.
Quantity: 20% of documents from each layout

Category: For Test
Definition: Corner cases (Bad OCR, Rare Template): documents that are not eligible for training but that occur in the real-life document flow and can be processed (at least partially). These are documents in which the data values for ALL THE FIELDS present in the original document (except handwritten ones) are correct, but:
  • they belong to a rare template: there are fewer than 10 such documents and they differ significantly from the others (so the model cannot be trained well on so few examples)
  • or there are issues after OCR: values with OCR mistakes, broken structure (so a field can be tagged only with the "Append Mode" option), BAD OCR in the context of fields, etc.
Quantity: same quantity as in production flow

Documents with OCR issues must be marked (keep the marks in the "Category" column and, if possible, in "Comments" in the input file for test extraction), because you need to be able to identify them in the final statistics report and explain their effect on the final results.

If it's not possible to count the number of bad documents in the production flow, the results of OCR testing can be used.
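
One possible reading of "same quantity as in production flow" is that corner cases should make up the same share of the test set as they do in production. Under that assumption, the number of corner-case documents to add can be estimated as in the sketch below. The function name and the way the production share is obtained (a count from production or an OCR testing result) are hypothetical.

```python
def corner_case_quota(n_good_test, production_share):
    """Estimate how many corner-case documents to add to the test set.

    n_good_test: number of "Good" documents already selected for the test set.
    production_share: assumed fraction of corner cases in the production
        flow, with 0 <= production_share < 1.
    """
    if not 0 <= production_share < 1:
        raise ValueError("production_share must be in [0, 1)")
    # Solve k / (n_good_test + k) = production_share for k.
    return round(production_share * n_good_test / (1 - production_share))
```

With 95 "Good" test documents and an assumed 5% share of corner cases in production, this gives 5 corner-case documents, so they make up 5 of the 100 test documents.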

How to split

  • Work out the required number of "Good" documents for the training and test sets based on the 80% / 20% split. A pivot table makes this easier.
  • Copy the required number of documents for the training set and for the test set into separate spreadsheets. Filtering makes navigation easier.
  • Save the training and test spreadsheets separately in CSV format.
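
For larger data sets, the same steps could be scripted instead of done by hand in a spreadsheet. The sketch below is one way to do it, not the prescribed procedure: the column names "Document", "Layout" and "Category", the category labels, and the random selection within each layout are all assumptions made for the example.

```python
import csv
import random
from collections import defaultdict


def split_documents(rows, train_share=0.8, seed=0):
    """Split categorized documents into training and test lists.

    rows: list of dicts with the (assumed) keys "Document", "Layout"
    and "Category".
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    good_by_layout = defaultdict(list)
    train, test = [], []
    for row in rows:
        category = row["Category"]
        if category == "Good":
            good_by_layout[row["Layout"]].append(row)
        elif category == "For Training":
            train.append(row)  # all documents of this category
        elif category == "For Test":
            test.append(row)
        # "Re-tag" and "Exclude" documents go to neither set.
    # Split the "Good" documents 80% / 20% within each layout.
    for layout in sorted(good_by_layout):
        docs = good_by_layout[layout]
        rng.shuffle(docs)
        n_train = round(len(docs) * train_share)
        train.extend(docs[:n_train])
        test.extend(docs[n_train:])
    return train, test


def save_csv(path, rows, fieldnames=("Document", "Layout", "Category")):
    """Write one of the resulting sets to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

Calling `split_documents(rows)` and then `save_csv` once per returned list produces the two separate CSV files. The number of "For Test" documents still has to be chosen separately so that it reflects the production flow.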

Unequally distributed layouts across the test and training sets, or documents of the wrong category in either set, will result in a model that is trained on wrong data or not trained at all for some fields or layouts.
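
Both problems can be checked automatically before training. The following is a minimal sketch, assuming each document is a dict with "Document", "Layout" and "Category" keys; the `check_split` name and the `tolerance` parameter are illustrative choices, not an established rule.

```python
from collections import Counter


def check_split(train, test, train_share=0.8, tolerance=0.1):
    """Return a list of human-readable problems found in the split."""
    problems = []
    # 1. Wrong categories in either set.
    for row in train:
        if row["Category"] not in {"Good", "For Training"}:
            problems.append(
                f"{row['Document']}: category {row['Category']!r} in training set"
            )
    for row in test:
        if row["Category"] not in {"Good", "For Test"}:
            problems.append(
                f"{row['Document']}: category {row['Category']!r} in test set"
            )
    # 2. Layout distribution of "Good" documents.
    train_good = Counter(r["Layout"] for r in train if r["Category"] == "Good")
    test_good = Counter(r["Layout"] for r in test if r["Category"] == "Good")
    for layout in sorted(set(train_good) | set(test_good)):
        total = train_good[layout] + test_good[layout]
        share = train_good[layout] / total
        if abs(share - train_share) > tolerance:
            problems.append(
                f"layout {layout}: {share:.0%} of Good documents are in training"
            )
    return problems
```

An empty result means no issue was detected. For layouts with only a few documents an exact 80% / 20% split is impossible, so the tolerance may need to be loosened for them.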

The data values from the test set are called "gold values", and they are compared with the results of the ML model extraction. To get a realistic and reliable estimate of the model's performance for each field in every document of the test set, the data values must be 100% correct and all fields must be represented. Anything missed or incorrect will distort the evaluation of the model. In other words, the test set should contain exactly the values that we expect the ML model to extract.
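
To illustrate why the gold values must be exact, here is a minimal sketch of a per-field comparison. It assumes exact string matching and a simple nested-dict layout (document ID, then field name, then value). The actual evaluation tooling may compare values differently.

```python
def field_accuracy(gold, predicted):
    """Share of test documents in which each field was extracted correctly.

    gold, predicted: dicts mapping document ID -> {field name: value}.
    A field counts as correct only if the extracted value equals the
    gold value exactly (an assumption made for this sketch).
    """
    correct = {}
    total = {}
    for doc_id, gold_fields in gold.items():
        predicted_fields = predicted.get(doc_id, {})
        for field, gold_value in gold_fields.items():
            total[field] = total.get(field, 0) + 1
            if predicted_fields.get(field) == gold_value:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / total[field] for field in total}
```

Under this scheme a wrong gold value counts a correct extraction as an error, and a field missing from the gold values is never evaluated at all, which is why a single tagging mistake in the test set shifts the reported accuracy for that field.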