Data Analyst Responsibilities

Preparatory Work
Main Development

Preparatory Work

As a prerequisite to model training, the Data Analyst (DA) investigates the customer's documents: their types and formats, frequency of use, number of fields and their requirements, and possible answer types. The DA may also receive information about the number of SMEs and their workload, and uses it to plan how to organize data set collection. The DA has to make sure the documents are suitable for OCR. The DA should gather all the necessary information from the customer as early as possible and think through possible work scenarios, so as to minimize the time required for planning and consulting at the customer's site.

Main Development

The main goal of the Data Analyst is data set collection. The higher the quality of the data set, the better the results of the ML model.

The data set should meet several requirements. The DA should know as much as possible about the logic and workflow of the documents, the document layouts, and the production distribution. When the production distribution is known, the data set should follow it. An unknown distribution is the more common case and requires more effort, because the DA must define the distribution manually.
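As an illustration of following a known production distribution, the sketch below turns production document counts per layout into target counts for a data set of a chosen size. The function name, the layout names, and the figures are hypothetical; rounding uses the largest-remainder method so that the targets always add up to the requested size.

```python
def allocate_by_distribution(production_counts, target_size):
    """Split target_size across layouts in proportion to production counts.

    production_counts: dict mapping layout name -> number of documents
    seen in production (hypothetical input, not defined by the source).
    """
    total = sum(production_counts.values())
    if total <= 0:
        raise ValueError("production_counts must contain at least one document")

    # Integer share for every layout, rounded down.
    allocation = {
        layout: count * target_size // total
        for layout, count in production_counts.items()
    }
    # Remainders decide which layouts receive the documents lost to rounding.
    remainders = {
        layout: count * target_size % total
        for layout, count in production_counts.items()
    }
    leftover = target_size - sum(allocation.values())
    by_remainder = sorted(remainders, key=lambda name: remainders[name], reverse=True)
    for layout in by_remainder[:leftover]:
        allocation[layout] += 1
    return allocation


if __name__ == "__main__":
    counts = {"invoice": 5000, "receipt": 3000, "contract": 2000}
    print(allocate_by_distribution(counts, 200))
```

When the distribution is unknown, the same function could be fed with counts estimated manually by the DA from a document sample.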

The DA studies the layout distribution in order to split documents into batches. The aim is to increase tagging efficiency, reduce mistakes, and improve consistency. Each batch of layouts should be labelled by subject matter experts (SMEs). One possible way to split documents into layouts is to sort them by keywords.
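Sorting by keywords can be sketched roughly as follows. This is a minimal example that assumes the OCR text of each document is already available as a string; the keyword lists, the layout names, and the "unknown" fallback are hypothetical and would have to be chosen per project.

```python
from collections import defaultdict

# Hypothetical keyword lists; real ones depend on the customer's documents.
LAYOUT_KEYWORDS = {
    "invoice": ["invoice number", "bill to"],
    "purchase_order": ["purchase order", "po number"],
}


def assign_layout(text, layout_keywords):
    """Return the layout whose keywords occur most often in the text."""
    lowered = text.lower()
    best_layout, best_hits = "unknown", 0
    for layout, keywords in layout_keywords.items():
        hits = sum(1 for keyword in keywords if keyword in lowered)
        if hits > best_hits:
            best_layout, best_hits = layout, hits
    return best_layout


def split_into_batches(documents, layout_keywords):
    """Group document ids into batches, one batch per detected layout."""
    batches = defaultdict(list)
    for doc_id, text in documents.items():
        batches[assign_layout(text, layout_keywords)].append(doc_id)
    return dict(batches)


if __name__ == "__main__":
    docs = {
        "d1": "Invoice Number: 17\nBill to: ACME",
        "d2": "PURCHASE ORDER 55",
        "d3": "Meeting notes",
    }
    print(split_into_batches(docs, LAYOUT_KEYWORDS))
```

Documents that land in the "unknown" batch are candidates for manual review or for extending the keyword lists.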

A reasonable balance should be kept when splitting: the number of layouts should be limited, and the number of documents per layout should be substantial (for example, 50 or more). The Data Analyst should check whether there is sufficient information about all fields or whether the number of documents needs to be increased.
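Both checks can be automated with a few lines of code. The sketch below assumes a mapping from document id to layout and a list of tagged documents represented as field dictionaries; these structures and the function names are hypothetical. The threshold of 50 is the example figure from the text, not a fixed rule.

```python
from collections import Counter


def layouts_needing_more_documents(layout_of_document, min_per_layout=50):
    """Return {layout: number of documents still missing} for small layouts."""
    counts = Counter(layout_of_document.values())
    return {
        layout: min_per_layout - count
        for layout, count in counts.items()
        if count < min_per_layout
    }


def fields_with_few_examples(tagged_documents, expected_fields, min_examples):
    """Return {field: filled count} for fields tagged in too few documents."""
    filled = Counter()
    for fields in tagged_documents:
        for name, value in fields.items():
            if value:  # empty or missing values do not count as examples
                filled[name] += 1
    return {
        name: filled[name]
        for name in expected_fields
        if filled[name] < min_examples
    }


if __name__ == "__main__":
    layouts = {f"inv{i}": "invoice" for i in range(60)}
    layouts.update({f"rec{i}": "receipt" for i in range(20)})
    print(layouts_needing_more_documents(layouts))
```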

To collect the data set, the Data Analyst works with SMEs. The DA provides training for SMEs, explaining the tagging rules and how to work with the system. As soon as documents are available and their logic is clarified, the DA creates a Human task, produces gold data, and starts qualification to teach SMEs what to tag. Then they work in parallel: SMEs supply labelled documents, while the DA checks quality and corrects any mistakes.
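One way to check SME tagging against gold data is a field-by-field comparison. The sketch below is only an illustration: it assumes that both the gold data and the SME submissions can be exported as dictionaries of the form {document id: {field: value}}, which is a hypothetical format and not the system's actual interface.

```python
def compare_with_gold(gold, submitted):
    """Compare SME tags with gold data.

    Returns (accuracy, mismatches), where mismatches is a list of
    (doc_id, field, expected_value, submitted_value) tuples.
    """
    mismatches = []
    total = 0
    for doc_id, gold_fields in gold.items():
        sme_fields = submitted.get(doc_id, {})
        for field, expected in gold_fields.items():
            total += 1
            actual = sme_fields.get(field)  # None when the SME skipped the field
            if actual != expected:
                mismatches.append((doc_id, field, expected, actual))
    accuracy = (total - len(mismatches)) / total if total else 0.0
    return accuracy, mismatches


if __name__ == "__main__":
    gold = {"d1": {"total": "100", "date": "2020-01-01"}}
    sme = {"d1": {"total": "100", "date": "2020-01-02"}}
    accuracy, mismatches = compare_with_gold(gold, sme)
    print(accuracy, mismatches)
```

The list of mismatches shows which fields and rules need to be explained again during SME training.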

When the data set is collected and checked, the DA splits it into two parts: a training set and a test set. The Machine Learning Engineer starts model training and, together with the Data Analyst, monitors model performance and results. If the statistics fall below the required criteria, it is necessary to investigate the reasons and make improvements. The reasons may be incorrect tagging, an insufficient data set, or OCR mistakes. Possible solutions are retagging documents followed by model retraining, and/or applying post-processing. This step can be repeated until the statistics meet the required goal.
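The split and the comparison against the required criteria can be sketched as follows. The text does not prescribe a split ratio or method; the 20% test fraction and the per-layout (stratified) split used here are assumptions, as are the function names and the per-field metric format.

```python
import random
from collections import defaultdict


def split_train_test(layout_of_document, test_fraction=0.2, seed=0):
    """Split document ids into train and test lists, layout by layout."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_layout = defaultdict(list)
    for doc_id, layout in layout_of_document.items():
        by_layout[layout].append(doc_id)

    train, test = [], []
    for layout in sorted(by_layout):
        doc_ids = sorted(by_layout[layout])
        rng.shuffle(doc_ids)
        n_test = round(len(doc_ids) * test_fraction)
        test.extend(doc_ids[:n_test])
        train.extend(doc_ids[n_test:])
    return train, test


def fields_below_target(metrics, targets):
    """Return {field: (achieved, required)} for fields missing their target."""
    return {
        field: (metrics.get(field, 0.0), required)
        for field, required in targets.items()
        if metrics.get(field, 0.0) < required
    }


if __name__ == "__main__":
    layouts = {f"inv{i}": "invoice" for i in range(50)}
    layouts.update({f"rec{i}": "receipt" for i in range(10)})
    train, test = split_train_test(layouts)
    print(len(train), len(test))
    print(fields_below_target({"total": 0.95, "date": 0.80}, {"total": 0.9, "date": 0.9}))
```

Fields returned by `fields_below_target` are the ones to investigate first for incorrect tagging, too few examples, or OCR mistakes.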

The Data Analyst prepares a final statistical report that provides the customer with the results of model training and the final figures showing model performance. This step closes the development period, and the ML model is ready for production.

Regardless of the type of Use Case, all Data Analyst responsibilities can be divided into three main stages:

Data set collection: This is the largest and most important stage and makes up more than half of all DA work. Mistakes made and not corrected here may entail serious risks for timelines and for meeting the success criteria. As a prerequisite to this stage, DAs should become familiar with the different aspects of the project and clearly understand the business logic of the use case: its type (Information Extraction or Classification), the fields or classes that will be used, the document types and formats, and so on.
Model training and analysis of results: This stage is carried out in close cooperation between the DA and the ML Engineer.
Preparation of the final report: At the end of the project, the final results should be presented to the customer.