Input Document Formats

Understanding the document format is a crucial step in choosing the right approach to document processing. Customers’ documents come from various sources and differ in template, style, formatting, and sometimes language, so choosing the right technique to extract data from them can be challenging. Both rule-based and model-based (machine-learning) approaches can be used to handle data from different document structures. The choice between them depends on the type of document and the way it is structured.

Structured Documents.

In this type of document, the layouts and templates are usually fixed and well structured. Structured data adheres to a data model that defines how data can be stored, processed, and accessed. It conforms to a tabular format with relationships between the different rows and columns. A predefined data model makes structured data very powerful, since the structure is straightforward to analyze and the data can be quickly aggregated.

Common examples of structured data are Excel files and SQL databases. Each of these has structured rows and columns that can be sorted.

The recommended approach is to use a rule-based RPA engine to extract the information from structured documents. Such an engine can rely on regular expressions, simple position mapping, and OCR. To integrate software robots and automate information extraction, either pre-existing templates can be used or specific rules can be created for the customer’s structured data. The disadvantage of the rule-based approach is that it relies on fixed parts, so even minor changes in form structure can cause the rules to break down.
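As an illustration of regular-expression rules and of how fragile they are, here is a minimal sketch in Python. The field names, the patterns, and the sample invoice text are assumptions made up for this example; they do not come from any particular RPA engine.

```python
import re

# Hypothetical rules for one fixed invoice template: field name -> pattern.
RULES = {
    "invoice_number": re.compile(r"Invoice\s+No\.?:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total:\s*([\d.,]+)"),
}


def extract_fields(text):
    """Apply each rule to the text; a field whose rule does not match is None."""
    result = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


# Text that follows the template the rules were written for.
sample = "Invoice No: INV-1042\nDate: 2023-05-17\nTotal: 1,250.00"
# Same content, but the label of the first field was changed slightly.
changed = "Invoice #: INV-1042\nDate: 2023-05-17\nTotal: 1,250.00"

print(extract_fields(sample))
print(extract_fields(changed))
```

With the second input the invoice number rule no longer matches, although only the label changed. This is the kind of breakage described above.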

A machine-learning approach can also be used if the customer has many different templates of Excel documents. It is best to split huge Excel files into several documents of 50-100 rows each before converting them to the right format to be processed in a human task.

Excel files need to be converted to an appropriate format before the documents are sent to Human Task.
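The splitting and conversion steps can be sketched as follows. This is only an illustration under stated assumptions: the rows are assumed to have already been read from the Excel file into Python lists (for instance with a spreadsheet library), and CSV is used as a stand-in for the target format, which the text does not specify. The function names are hypothetical.

```python
import csv
import io


def split_rows(header, rows, chunk_size=50):
    """Yield (header, chunk) pairs with at most chunk_size data rows each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(rows), chunk_size):
        yield header, rows[start:start + chunk_size]


def to_csv(header, rows):
    """Serialize one chunk as CSV text, repeating the header in every chunk."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up data standing in for a large spreadsheet: 120 data rows.
header = ["id", "amount"]
rows = [[i, i * 10] for i in range(1, 121)]

documents = [to_csv(h, chunk) for h, chunk in split_rows(header, rows, chunk_size=50)]
print(len(documents))  # number of smaller documents produced
```

Repeating the header in every chunk keeps each smaller document readable on its own, which matters when the chunks are reviewed independently.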

Semi-Structured Documents.

Semi-structured data does not have the formal structure of the data models associated with data tables, but it contains tags or other markers that separate semantic elements and enforce hierarchies of records and fields within the data. These documents can hold the same information, but the information can be arranged in different positions. For example, invoices can contain identical fields, yet in some invoices the seller’s address is located at the top, and in others it is found at the bottom.

Examples of semi-structured data include HTML, JSON and XML files.
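To show how markers make a field recoverable even when its position changes, here is a small Python sketch that looks up a key anywhere in a parsed JSON document. The two invoice layouts and the `find_key` helper are invented for this example.

```python
import json


def find_key(node, key):
    """Depth-first search for the first non-null value stored under `key`."""
    if isinstance(node, dict):
        if key in node and node[key] is not None:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Two hypothetical invoices carrying the same field in different places.
invoice_a = json.loads('{"seller": {"address": "1 Main St"}, "total": 40}')
invoice_b = json.loads(
    '{"total": 40, "parties": [{"role": "seller", "address": "1 Main St"}]}'
)

print(find_key(invoice_a, "address"))
print(find_key(invoice_b, "address"))
```

The sketch returns the first match only, so a real document with several addresses would need an additional rule to tell them apart.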

With semi-structured data, both rule-based and machine-learning approaches can be used. If the customer has a well-structured HTML document, the best solution is probably a rule-based approach to extract data from the documents.
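For a well-structured HTML document, a rule can be as simple as "read the text of the element with a known id". The sketch below uses Python's standard `html.parser` module; the HTML snippet, the ids, and the `FieldExtractor` class are assumptions made for this example.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the text of elements whose id attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.fields = {}
        self._current = None  # id of the element being read, if any

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.wanted:
            self._current = element_id
            self.fields[element_id] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field,
        # so nested markup inside a field is not supported.
        self._current = None


# Hypothetical auto-generated invoice page with stable ids.
html_doc = """
<table>
  <tr><td>Invoice</td><td id="invoice-number">INV-1042</td></tr>
  <tr><td>Seller</td><td id="seller-address">1 Main St</td></tr>
</table>
"""

parser = FieldExtractor(["invoice-number", "seller-address"])
parser.feed(html_doc)
fields = {name: text.strip() for name, text in parser.fields.items()}
print(fields)
```

The rule works only as long as the ids stay stable, which is a reasonable expectation for auto-generated pages but not for hand-written ones.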

If the structure of the HTML document is unknown, a rule-based approach will not give high accuracy, and machine-learning models need to be brought in for information extraction. This is the case, for example, when you need to extract data from emails that were not auto-generated but written by people without any predefined template.

Unstructured Documents.

Unlike structured and semi-structured documents, unstructured data has no well-defined repetitive structure. It does not always have key-value pairs. For example, in invoices the seller’s address can appear without any key name, and the same can be observed for other fields such as the date, the invoice number, etc. Most business documents are unstructured.

Common examples of unstructured data are:

Multimedia files: images, audio, and video files are all unstructured. In addition, multimedia can come in multiple file formats, produced through various means. For example, an image can be TIFF, JPEG, GIF, PNG, or RAW, each with its own characteristics.
Text files: almost all traditional business files, including word-processing documents, presentations, notes, and PDFs, are unstructured data.

PDF documents can be categorized into three types, depending on how they originated:

Digitally created PDFs. These PDFs are created using software such as Microsoft Word or Excel, or via the “print” function within a software application. They consist of text and images and can be easily searched and edited, similar to other editable formats such as Microsoft Word documents.
Scanned PDFs. This type of PDF is “image-only”. It is created by scanning a document on an office scanner or by converting an image (such as a JPG or TIFF file) or a screenshot into a PDF, so the content is “locked” in a snapshot-like image. Such PDFs are not searchable, and their text cannot be modified or marked up. An “image-only” PDF can be made searchable by applying OCR, which adds a text layer.
Searchable PDFs. Searchable PDFs usually result from applying Optical Character Recognition (OCR) to scanned PDFs or other image-based documents. During the text recognition process, the characters and the document structure are analyzed. A text layer is added to the image layer, usually placed underneath it. Such PDF files are fully searchable, and their text can be selected, copied, and marked up.

A rule-based approach is not recommended for managing unstructured data in most cases. For ML models to process unstructured data accurately, the data needs to be converted to structured data using OCR (Optical Character Recognition), and the written text needs to be translated into actionable data, such as an email address, phone number, or postal address, using data annotation. The model then learns which values should be extracted as phone numbers, email addresses, etc.
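To make the idea of data annotation more concrete, the following Python sketch shows one possible shape of an annotated example: the recognized text plus labeled character spans. The `Span` structure, the label names, and the sample text are hypothetical; actual annotation tools use their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # e.g. "EMAIL" or "PHONE" (hypothetical label names)


def annotate(text, value, label):
    """Stand-in for a human annotator marking the first occurrence of `value`."""
    start = text.index(value)
    return Span(start, start + len(value), label)


def labeled_values(text, spans):
    """Return (label, value) pairs by slicing the text at each annotated span."""
    return [(s.label, text[s.start:s.end]) for s in spans]


# Made-up text as it might come out of the OCR step.
text = "Contact: jane@example.com, tel. 555-0100"
spans = [
    annotate(text, "jane@example.com", "EMAIL"),
    annotate(text, "555-0100", "PHONE"),
]

print(labeled_values(text, spans))
```

Examples of this kind form the training data from which a model can learn which values to extract under each label.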

The document processing workflow varies depending on the type of unstructured data. Image files and scanned PDFs need to go through an OCR step first, whereas digitally created and searchable PDFs can be processed with Apache PDFBox for data extraction. Plain text documents are the most difficult type to process, since they do not contain any additional metadata to assist in data interpretation. Although a machine-learning approach is generally recommended for plain text files, a rule-based approach can sometimes be used when the order in which the data appears is known, for example when the first word in an invoice is known to be the invoice number.
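The routing described above can be sketched as a small dispatch function. The document kinds and step names below are hypothetical labels chosen for this example; "pdf_text_extraction" stands for a tool such as Apache PDFBox, which is not called here. The second function illustrates the positional rule for plain text.

```python
def plan_steps(kind):
    """Return the ordered processing steps for a document kind."""
    if kind in {"image", "scanned_pdf"}:
        # No text layer yet, so OCR has to run first.
        return ["ocr", "extract"]
    if kind in {"digital_pdf", "searchable_pdf"}:
        # A text layer exists and can be read directly.
        return ["pdf_text_extraction", "extract"]
    if kind == "plain_text":
        return ["extract"]
    raise ValueError(f"unknown document kind: {kind!r}")


def first_token(text):
    """Positional rule: treat the first whitespace-separated token as the value."""
    tokens = text.split()
    return tokens[0] if tokens else None


print(plan_steps("scanned_pdf"))
print(plan_steps("searchable_pdf"))
# Assumes the invoice number is known to come first in this plain text file.
print(first_token("INV-1042 2023-05-17 ACME Ltd"))
```

The positional rule is only as reliable as the assumption behind it: if the order of the data changes, the function silently returns the wrong value.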