Document Sets

The Document Sets module lets you create and view the document sets used to train ML models, along with their names, descriptions, creation and update times, data types, and the automation processes used. Here you can edit document sets, populate them with the necessary documents, assign labels and mark entities, prepare training data, and train models.

The system also provides the ability to import document sets as a single package with all the data, labels, and models included. Document Sets along with their content can also be deleted.

You can access the module by clicking Machine Learning → Document Sets. Required Permission: DocumentSet-READ. See Role Permissions.

Document Sets Navigation

Manage existing Document Sets

Columns Description

  • Name - the document set name. Click the name to see the documents included in the document set.
  • Document Type - defines how documents are displayed in the Human Task for tagging and which output fields are extracted during tagging and Machine Learning training.
  • Document Processor - the name of the automation process that controls the life cycle of the document set and includes the preparation steps needed to display documents in the Human Task. Click the process name to see its process runs.
  • Description - a short description of the document set.
  • Last Update - the date and time when the documents were last changed.

Control icons

  • Refresh - to pull the latest updates from the server.
  • Train Model - to train an ML model on the documents contained in the Document Set. Note that a training set must be prepared first. Required Permission: MlModel-CREATE.
  • Delete - to delete the Document Set. Required Permission: DocumentSet-DELETE.

Table Settings

Table settings let you manage the table view. Click the table settings icon to open them. The table settings are managed with the following buttons:

  • Advanced filter - to switch the advanced filters for the columns.
  • Columns Display - to select the columns that will be displayed in the table.
  • Apply - to apply the changes made to the table settings.
  • Cancel - to cancel the last actions with the table settings.

Filter by text 

Text filtering lets you search the Document Sets by Name, Document Type, Document Processor, and Description.

Advanced filters by columns

The advanced filter lets you extract a list of rows from the table that match predefined criteria. Click the advanced filter icon to start working with it. The advanced filters are managed with the following buttons:

  • Clear filter - to reset all the proposed advanced filter criteria for the column.
  • Cancel - to cancel the last actions with the proposed criteria for the column.
  • Apply - to filter the table according to the proposed criteria for the column.

Sorting

Ascending and descending sorting is available for the Name, Document Type, Document Processor, Model, Created By, Creation Date, and Last Update columns.

Create Document Set

To create a new Document Set, you need to:

  • Navigate to Document Sets. Required Permissions to get there: AutomationProcess-READ and DocumentType-READ.
  • Click the CREATE NEW button. Required Permission: DocumentSet-CREATE. See Role Permissions.


Create a new Document Set

  • The New Document Set panel is displayed on the right.

Create a new Document Set Panel

  • Enter a document set Name.
  • Optionally enter a short document set Description.
  • Select a Document Type from the drop-down list.

The necessary document type must be created before you create a document set. A document type defines how documents are displayed in the Human Task for tagging and which output fields are extracted during tagging and Machine Learning training. If the required document type is missing from the drop-down list, navigate to Administration → Document Types to create a new document type. For more details, see Create a new Document Type.

  • Select a Document Processor from the drop-down list.

A document processor is an automation process that controls the document set workflow and includes the preparation steps needed to display documents in the Human Task. The necessary document processor must be created before you create a document set. See the list of Out of the box Document Processors. If the required document processor is missing from the drop-down list, navigate to Automation Processes to create a new document processor. For more details, see Create a new Automation Process.

  • Optionally select an ML model from the drop-down list to process the document set. More information about the model is displayed in the "Model Options" window after you click the icon next to the list.

  • Click the ADD button to upload a .zip file with documents for the new document set. A file explorer window is displayed. Select the .zip file to be uploaded into the document set.
  • Provide the OCR configuration, including the Storage bucket name, Tesseract options, ImageMagick options, and HocrFixWords options, in the Settings field below the ADD button.
    • document_bucket - the name of the Storage bucket where OCR results will be saved;
    • tesseractOptions - Tesseract OCR command line options. See the external documentation Tesseract Command Line Usage.
    • imageMagickOptions - ImageMagick command line options. ImageMagick is used to split a PDF into pages and render them as images. See the Image Magick Command Line Documentation.
    • hocrFixWords - a map of string key-value pairs. Each key is a regex that is applied to each word separately to find mistakes; every match is replaced with the value string.
  • Click the CREATE button to save the document set. A new document set is created.
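
The per-word behavior described for hocrFixWords can be illustrated with a minimal Python sketch. It assumes search-and-replace semantics (every match inside a word is replaced) and uses Python's `re` syntax; the regex engine and exact matching rules used by the product are not stated on this page. The function name `fix_words` and the sample patterns are hypothetical.

```python
import re


def fix_words(text: str, fix_map: dict[str, str]) -> str:
    """Apply each regex in fix_map to every word separately (illustration only)."""
    fixed = []
    for word in text.split():
        for pattern, replacement in fix_map.items():
            # Replace all matches of the pattern inside this single word.
            word = re.sub(pattern, replacement, word)
        fixed.append(word)
    return " ".join(fixed)


# Hypothetical map: keys are regexes, values are replacement strings.
sample_map = {"^lnvoice$": "Invoice", "0(?=tal)": "o"}
print(fix_words("lnvoice t0tal 100", sample_map))
```

Because each word is processed on its own, a pattern in this sketch can never match across a space between two words.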

Create a new Doc Set button

To create a new document set, you need to be granted DocumentSet-CREATE permission (with AutomationProcess-READ and DocumentType-READ). See Role Permissions.

Edit Document Set

To edit key information about an existing document set, click the corresponding row. The Edit Document Set panel is displayed on the right. Required Permissions: DocumentSet-UPDATE (with AutomationProcess-READ and MlModel-READ). See Role Permissions.

You can edit the following information about a document set on the displayed panel:

  • Name – the document set name;
  • Description – the document set short description;
  • Document Type – a Human Task document type selected from the drop-down list. A document type defines how documents are displayed in the Human Task for tagging and which output fields are extracted during tagging and Machine Learning training;
  • Document Processor – an automation process for the document set workflow. It includes the preparation steps needed to display documents in the Human Task;
  • Model – an ML model used to process the document set.
  • Settings - the OCR configuration, including the Storage bucket name, Tesseract options, ImageMagick options, and HocrFixWords options:
    • document_bucket – the name of the Storage bucket where OCR results will be saved;
    • tesseractOptions - Tesseract OCR command line options. See the external documentation Tesseract Command Line Usage.
    • imageMagickOptions - ImageMagick command line options. ImageMagick is used to split a PDF into pages and render them as images. It can also be used to improve image quality for a better OCR result. See the Image Magick Command Line Documentation. If the settings contain the ${source} parameter, it is replaced with the image source; if there is no ${source} in the settings, the image source is added after all parameters. This lets you control which parameters are applied to the input image and which to the output image.
    • hocrFixWords - a map of string key-value pairs. Each key is a regex that is applied to each word separately to find mistakes; every match is replaced with the value string.
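
The ${source} placement rule for imageMagickOptions can be sketched in Python as follows. This only illustrates the rule as described above; how the product actually tokenizes the options and invokes ImageMagick is not documented here. The function name `build_magick_args` and the sample file name are hypothetical.

```python
import shlex

PLACEHOLDER = "${source}"


def build_magick_args(options: str, source: str) -> list[str]:
    """Place the image source according to the ${source} rule (illustration only)."""
    tokens = shlex.split(options)
    if any(PLACEHOLDER in token for token in tokens):
        # Placeholder present: substitute the image source in place.
        return [token.replace(PLACEHOLDER, source) for token in tokens]
    # No placeholder: the image source goes after all parameters.
    return tokens + [source]


# Options before ${source} apply to the input image, options after it to the output.
print(build_magick_args("-density 300 ${source} -colorspace Gray", "scan.pdf"))
print(build_magick_args("-density 300", "scan.pdf"))
```

In the first call the source lands between the two option groups; in the second it is appended at the end.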

To manage documents within an existing document set, click Details and proceed to the Documents page.

To save your changes, click the UPDATE button.


To edit a document set, you need to be granted DocumentSet-UPDATE permission. See Role Permissions.

Delete Document Set

There are two ways to delete a document set:

  • Click the Delete control icon.
  • Select a document set or all document sets and click the Delete icon.

To delete a document set, you need to be granted DocumentSet-DELETE permission. See Role Permissions.

Train Model

To train a model on a prepared training set, you need to:

  • Click the Train Model icon.

  • A dialog box appears.

  • In the dialog box, provide the following details:
    • Name - the name of the model to be trained.
    • Description - a short description of the model.
    • Version - the model version.
    • Training Configuration - the settings for model training. Default settings are generated in accordance with the Document Type and Human Task Type used for the Document Set.
  • Click TRAIN to start ML model training.

To train a model, you need to be granted MlModel-CREATE permission. See Role Permissions.

Model training can also be launched from the Documents page. See Documents.
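
The permissions named throughout this page can be collected into a small lookup, shown here as a Python sketch. The permission names come from this page; the action keys (`view`, `create`, and so on) and the helper `missing_permissions` are hypothetical labels for illustration and are not part of any product API.

```python
# Permissions per action, as listed on this page.
REQUIRED_PERMISSIONS = {
    "view": {"DocumentSet-READ"},
    "create": {"DocumentSet-CREATE", "AutomationProcess-READ", "DocumentType-READ"},
    "edit": {"DocumentSet-UPDATE", "AutomationProcess-READ", "MlModel-READ"},
    "delete": {"DocumentSet-DELETE"},
    "train": {"MlModel-CREATE"},
}


def missing_permissions(action: str, granted: set[str]) -> list[str]:
    """Return the permissions still needed for an action, sorted by name."""
    return sorted(REQUIRED_PERMISSIONS[action] - granted)


# A role that can only read document sets cannot yet create one.
print(missing_permissions("create", {"DocumentSet-READ"}))
```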