Autoretrain settings

This section describes how to create Settings JSON for autoretraining the following models:

Information Extraction
- Tag rules
Classification

Information Extraction

Autoretrain settings for Information Extraction models contain:

re_tag (boolean) - toggles the retagging process during autoretraining. When “re_tag” is true, tags assigned to the documents during previous autoretraining are overridden. When “re_tag” is false, tags are assigned only to newly added documents.
test_to_train_percentage (decimal) - the fraction of all documents that is allocated to the test set; the remaining documents go to the training set. For example, a value of 0.3 distributes the documents as follows: 30% of the total number of documents go to the test set and 70% go to the training set.
train_set - training set settings:
- tags (string) - the rule or rules applied to the autoretraining process. For more details on rules, see Tag rules.
- min (number) - the minimum number of documents that the training set must contain for the autoretraining process to start. Once this value is reached, documents are included in the test set.
- max (number) - the maximum number of documents that can be included in the training set. Once this value is reached, documents newly added to the document set replace old documents in the training set.
test_set - test set settings:
- tags (string) - the rule or rules applied to the autoretraining process. For more details on rules, see Tag rules.
- min (number) - the minimum number of documents that the test set must contain for the autoretraining process to start. Once this value is reached, documents are included in the training set.
- max (number) - the maximum number of documents that can be included in the test set. Once this value is reached, documents newly added to the document set replace old documents in the test set.
switch_best_model - options for switching to the best model during the autoretraining process:
- enable (boolean) - enables switching to the best model trained during the autoretraining process. For example, if the model trained in the last round of autoretraining has better metrics, it is marked with a star in the Models module as the highest-quality model. The best model can also be selected manually in the Models module.
- re_generate_best_model_report (boolean) - if the value is true, the report for the best model is generated on every launch; if the value is false, the report is generated only if it is not present.
assessment_rule - specifies options for choosing the best model:
- type (string) - supports only two values: perDocument and perField. When perDocument is set, the criterion for selecting the best model is the average metrics. When perField is set, the criterion is the average of each field's metrics. Accordingly, if the extraction improves for all fields in the current model, the current model is selected as the best one. If the extraction quality improves for one field and worsens for all others, the current model is not selected as the best one.
- exclude_keys (string) - fields that are excluded during the autoretraining process.
cleanup - sets the maximum number of documents in a document set. Once the maximum number is reached, documents are removed from the document set, starting with the oldest one.
- train_set - training set settings:
  - max (number) - the maximum number of documents in a training set.
- test_set - test set settings:
  - max (number) - the maximum number of documents in a test set.
config - configuration settings for the model trained during the autoretraining process. For more details, refer to Training Configuration File for Information Extraction Models.
average_group_keys_assessment (boolean) - when the value is true, the average metrics are used to evaluate the quality of the model; when the value is false, each group field's metrics are evaluated separately.
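
As an illustration, the settings listed above could be assembled into a Settings JSON roughly as follows. This is a minimal sketch, not a confirmed schema: the nesting is assumed from the list hierarchy above, all numeric values and the field name "invoice_number" are hypothetical, and config is left empty because its content is described in Training Configuration File for Information Extraction Models.

```python
import json

# Hypothetical example values. The nesting mirrors the list above and is an
# assumption about the expected Settings JSON layout.
ie_settings = {
    "re_tag": False,
    "test_to_train_percentage": 0.3,  # 30% test set, 70% training set
    "train_set": {"tags": "ALL_MATCHED", "min": 50, "max": 1000},
    "test_set": {"tags": "ALL_MATCHED", "min": 20, "max": 400},
    "switch_best_model": {
        "enable": True,
        "re_generate_best_model_report": False,
    },
    "assessment_rule": {
        "type": "perField",                # or "perDocument"
        "exclude_keys": "invoice_number",  # hypothetical field name
    },
    "cleanup": {
        "train_set": {"max": 1000},
        "test_set": {"max": 400},
    },
    "config": {},  # placeholder; see the training configuration file reference
    "average_group_keys_assessment": True,
}

# Serialize the dictionary to a JSON string that can be pasted as Settings JSON.
settings_json = json.dumps(ie_settings, indent=2)
print(settings_json)
```

Building the settings as a dictionary and serializing it with json.dumps guarantees syntactically valid JSON (lowercase true/false, correct quoting), which is easy to get wrong when editing by hand.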

Tag rules

ALL_MATCHED - all fields in the Extracted Data column are filled in and match the Document tags;
REQUIRED_MATCHED - all required fields in the Extracted Data column are filled in and match the Document tags;
ANY_MATCHED - at least one field (required or not required) is filled in and matches the Document tag;
ALL_CONTAINS - all fields in the Extracted Data column are filled in but do not exactly match the Document tags (e.g. an extra symbol due to an OCR mistake);
REQUIRED_CONTAINS - all required fields in the Extracted Data column are filled in but do not exactly match the Document tags;
ANY_CONTAINS - at least one field (required or not required) is filled in but does not exactly match the Document tag;
ALL - all fields in the Extracted Data column are filled in; the Document tags can be missing, contain, or match the extracted fields;
REQUIRED - all required fields in the Extracted Data column are filled in; the Document tags can be missing, contain, or match the extracted fields;
ANY - at least one field (required or not required) is filled in; the Document tag can be missing, contain, or match the extracted field.
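
The rule names above combine a scope (ALL, REQUIRED, ANY) with an optional relation (MATCHED, CONTAINS). The following sketch shows one way to read them. It is not the product's implementation: the Field class and both functions are hypothetical, and "contains" is assumed to mean that the values differ while one is a substring of the other, as in the OCR example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    extracted: str            # value in the Extracted Data column ("" if empty)
    tag: Optional[str]        # Document tag value, None if the tag is missing
    required: bool = False


def relation_holds(field: Field, relation: str) -> bool:
    """Check one field against FILLED, MATCHED, or CONTAINS."""
    if not field.extracted:
        return False          # every rule requires the field to be filled in
    if relation == "FILLED":
        return True           # the tag may be missing, contain, or match
    if not field.tag:
        return False          # MATCHED and CONTAINS need a tag to compare with
    if relation == "MATCHED":
        return field.extracted == field.tag
    # CONTAINS (assumption): inexact match where one value contains the other
    return field.extracted != field.tag and (
        field.tag in field.extracted or field.extracted in field.tag
    )


def rule_applies(rule: str, fields: List[Field]) -> bool:
    """Evaluate a rule name such as 'REQUIRED_MATCHED' or 'ANY'."""
    scope, _, relation = rule.partition("_")
    relation = relation or "FILLED"   # plain ALL / REQUIRED / ANY
    if scope == "ANY":
        return any(relation_holds(f, relation) for f in fields)
    if scope == "REQUIRED":
        selected = [f for f in fields if f.required]
    elif scope == "ALL":
        selected = fields
    else:
        raise ValueError(f"unknown rule: {rule}")
    # An empty selection counts as satisfied here; that choice is an assumption.
    return all(relation_holds(f, relation) for f in selected)


fields = [
    Field(extracted="INV-001", tag="INV-001", required=True),
    Field(extracted="12.50.", tag="12.50"),   # extra symbol, e.g. an OCR mistake
]
print(rule_applies("REQUIRED_MATCHED", fields))
print(rule_applies("ALL_MATCHED", fields))
```

In this example the required field matches its tag exactly, while the optional field only contains it, so REQUIRED_MATCHED holds and ALL_MATCHED does not.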

Classification

Autoretrain settings for Classification models contain:

re_tag (boolean) - toggles the retagging process during autoretraining. When “re_tag” is true, tags assigned to the documents during previous autoretraining are overridden. When “re_tag” is false, tags are assigned only to newly added documents.
test_to_train_percentage (decimal) - the fraction of all documents that is allocated to the test set; the remaining documents go to the training set. For example, a value of 0.3 distributes the documents as follows: 30% of the total number of documents go to the test set and 70% go to the training set.
train_set - training set settings:
- min (number) - the minimum number of documents that the training set must contain for the autoretraining process to start. Once this value is reached, documents are included in the test set.
- max (number) - the maximum number of documents that can be included in the training set. Once this value is reached, documents newly added to the document set replace old documents in the training set.
test_set - test set settings:
- min (number) - the minimum number of documents that the test set must contain for the autoretraining process to start. Once this value is reached, documents are included in the training set.
- max (number) - the maximum number of documents that can be included in the test set. Once this value is reached, documents newly added to the document set replace old documents in the test set.
switch_best_model - options for switching to the best model during the autoretraining process:
- enable (boolean) - enables switching to the best model trained during the autoretraining process. For example, if the model trained in the last round of autoretraining has better metrics, it is marked with a star in the Models module as the highest-quality model. The best model can also be selected manually in the Models module.
- re_generate_best_model_report (boolean) - if the value is true, the report for the best model is generated on every launch; if the value is false, the report is generated only if it is not present.
assessment_rule - specifies options for choosing the best model:
- type (string) - supports only two values: perDocument and perField. When perDocument is set, the criterion for selecting the best model is the average metrics. When perField is set, the criterion is the average of each field's metrics. Accordingly, if the extraction improves for all fields in the current model, the current model is selected as the best one. If the extraction quality improves for one field and worsens for all others, the current model is not selected as the best one.
- exclude_keys (string) - fields that are excluded during the autoretraining process.
cleanup - sets the maximum number of documents in a document set. Once the maximum number is reached, documents are removed from the document set, starting with the oldest one.
- train_set - training set settings:
  - max (number) - the maximum number of documents in a training set.
- test_set - test set settings:
  - max (number) - the maximum number of documents in a test set.
config - configuration settings for the model trained during the autoretraining process. For more details, refer to Training Configuration File for Classification Models.
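
As an illustration, a Settings JSON for a Classification model and the effect of test_to_train_percentage could look roughly like this. It is a minimal sketch under stated assumptions: the nesting is assumed from the list hierarchy above, all values (including the field name "comment") are hypothetical, and split_counts is a hypothetical helper that rounds to the nearest whole document, which may differ from how the product rounds.

```python
import json
from typing import Tuple

# Hypothetical example values. Unlike Information Extraction, the settings
# listed for Classification have no tags and no average_group_keys_assessment.
classification_settings = {
    "re_tag": True,
    "test_to_train_percentage": 0.3,
    "train_set": {"min": 70, "max": 700},
    "test_set": {"min": 30, "max": 300},
    "switch_best_model": {
        "enable": True,
        "re_generate_best_model_report": True,
    },
    "assessment_rule": {"type": "perDocument", "exclude_keys": "comment"},
    "cleanup": {"train_set": {"max": 700}, "test_set": {"max": 300}},
    "config": {},  # placeholder; see the training configuration file reference
}


def split_counts(total_documents: int, test_share: float) -> Tuple[int, int]:
    """Return (test_count, train_count) for a test share between 0 and 1."""
    if not 0 <= test_share <= 1:
        raise ValueError("test share must be between 0 and 1")
    # Rounding to the nearest document is an assumption of this sketch.
    test_count = round(total_documents * test_share)
    return test_count, total_documents - test_count


test_count, train_count = split_counts(
    200, classification_settings["test_to_train_percentage"]
)
print(json.dumps(classification_settings, indent=2))
print(test_count, train_count)  # 60 documents for testing, 140 for training
```

With 200 documents and a value of 0.3, 60 documents (30%) go to the test set and 140 (70%) go to the training set, which matches the distribution described for test_to_train_percentage.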