Classification/Information Extraction Problem Types


EasyRPA is used for, but is not limited to, the following use cases:

  • Classification 
    • Classification is the arrangement of items into groups or categories according to established criteria. Both binary and multi-class classification are supported.
    • Examples are email classification, document identification, classification of pages within documents, and approval processes.
  • Information Extraction
    • Information Extraction is a process of extracting structured information (or key facts) from unstructured and/or semi-structured documents.
    • Examples are the extraction of details from invoices, claims, and financial documents.

Classification Problem Types

Document classification or document categorization is a problem in library science, information science and computer science.

A Classification use case applies when it is necessary to determine the class of an item (document). By class, we usually mean a document type. For example, invoices, purchase orders and claims may arrive in one workflow while each document type is handled differently. That means we need to classify these documents first, before applying further automation. Classification ML algorithms automatically classify texts by analyzing their parts (tokens) and the combinations of those parts (features).

From a modeling perspective, classification requires a training dataset with many examples of inputs and outputs from which to learn. A model will use the training dataset and will calculate how to best map examples of input data to specific class labels. As such, the training dataset must be sufficiently representative of the problem and have many examples of each class label.
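To make the idea of learning a mapping from inputs to class labels concrete, here is a minimal sketch of a token-based classifier. It uses a small multinomial Naive Bayes model written with the Python standard library only. This is an illustration, not the algorithm EasyRPA uses; the function names (`tokenize`, `train`, `predict`) and the six training texts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split a text into lowercase whitespace-separated tokens."""
    return text.lower().split()


def train(examples):
    """Count tokens per class label from (text, label) pairs."""
    token_counts = defaultdict(Counter)  # label -> token frequencies
    label_counts = Counter()             # label -> number of documents
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        token_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return token_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    token_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # class prior
        total_tokens = sum(token_counts[label].values())
        for tok in tokenize(text):
            # Laplace (add-one) smoothing so unseen tokens do not zero out a class
            score += math.log(
                (token_counts[label][tok] + 1) / (total_tokens + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set: two examples per document type.
TRAINING = [
    ("invoice number total amount due", "invoice"),
    ("invoice payment due date amount", "invoice"),
    ("purchase order quantity delivery date", "purchase_order"),
    ("purchase order items requested quantity", "purchase_order"),
    ("claim damage insurance policy", "claim"),
    ("claim incident report policy holder", "claim"),
]
model = train(TRAINING)
print(predict(model, "amount due on this invoice"))
```

With only two examples per class the model is far from representative; a real training dataset needs many examples of each class label, as noted above.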

There are four main types of classification tasks that you may encounter:

  • Binary Classification
  • Multi-Class Classification
  • Multi-Label Classification
  • Imbalanced Classification

EasyRPA includes implementation of both Single-label Classification (binary or multi-class) and Multi-label Classification.

Binary classification

Binary classification is the task of classifying the elements of a given set into two groups, that is, predicting which of the two groups each element belongs to.

Examples of Binary Classification:

  • Dividend Announcements (is given text related or not related to dividends)
  • Sentiment Analysis (positive or negative tweet analysis)
  • Title Classification (compare two titles of a person and tell whether they match or not)
  • Email spam detection (spam or not)
  • Conversion prediction (buy or not)

Typically, binary classification tasks involve one class that is the normal state and another class that is the abnormal state.

For example, “not spam” is the normal state and “spam” is the abnormal state. Similarly, in a task that involves a medical test, “cancer not detected” is the normal state and “cancer detected” is the abnormal state.

Multi-class classification

Multi-class classification is very similar to binary classification. The only difference is that the elements of a given set are classified into more than two groups.

Examples of Multi-class Classification:

  • Product Description (which product does the given text describe: Computers, Food, Clothes, Books)
  • Text Style Classification (identify the style of the given text: Romance, Thriller, Adventure, etc.)
  • Company News (match given news to one of the companies in the list: Apple, Microsoft, or Intel)
  • Face classification
  • Optical character recognition

Unlike binary classification, multi-class classification does not have the notion of normal and abnormal outcomes. Instead, examples are classified as belonging to one among a range of known classes. The number of class labels may be very large on some problems. For example, a model may predict a photo as belonging to one among thousands or tens of thousands of faces in a face recognition system.

Algorithms that are designed for binary classification can be adapted to multi-class problems. This is done by fitting several binary classification models: either one model for each class vs. all other classes (called one-vs-rest) or one model for each pair of classes (called one-vs-one).

  • One-vs-Rest: Fit one binary classification model for each class vs. all other classes
  • One-vs-One: Fit one binary classification model for each pair of classes
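The two strategies above differ in how the multi-class dataset is split into binary problems. The sketch below builds those binary datasets without training any model; the helper names and the sample data are hypothetical. For K classes, one-vs-rest yields K binary problems and one-vs-one yields K(K-1)/2.

```python
from itertools import combinations


def one_vs_rest_problems(examples):
    """One binary dataset per class: that class (1) vs. all other classes (0)."""
    classes = sorted({label for _, label in examples})
    return {
        cls: [(item, 1 if label == cls else 0) for item, label in examples]
        for cls in classes
    }


def one_vs_one_problems(examples):
    """One binary dataset per pair of classes, keeping only those two classes."""
    classes = sorted({label for _, label in examples})
    problems = {}
    for first, second in combinations(classes, 2):
        problems[(first, second)] = [
            (item, 1 if label == first else 0)
            for item, label in examples
            if label in (first, second)
        ]
    return problems


# Hypothetical product descriptions with four classes.
EXAMPLES = [
    ("laptop with 16 GB RAM", "Computers"),
    ("organic apple juice", "Food"),
    ("cotton t-shirt", "Clothes"),
    ("paperback thriller novel", "Books"),
]
ovr = one_vs_rest_problems(EXAMPLES)
ovo = one_vs_one_problems(EXAMPLES)
print(len(ovr), len(ovo))
```

One-vs-rest trains fewer models, each on the full dataset. One-vs-one trains more models, each on a smaller subset of the data.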

Multi-label classification

Multi-label classification refers to those classification tasks that have two or more class labels, where one or more class labels may be predicted for each example.

Consider the example of photo classification, where a given photo may have multiple objects in the scene and a model may predict the presence of multiple known objects in the photo, such as “bicycle,” “apple,” “person,” etc. This is unlike binary classification and multi-class classification, where a single class label is predicted for each example.
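One common way to represent multi-label targets is a binary indicator vector with one position per known label. The sketch below encodes a set of labels into such a vector and decodes per-label scores back into labels with a threshold. The label list, the function names and the 0.5 threshold are assumptions chosen for illustration.

```python
# Hypothetical fixed label set, in a fixed order.
LABELS = ["bicycle", "apple", "person"]


def to_indicator(tags, labels=LABELS):
    """Encode a set of labels as a 0/1 vector aligned with `labels`."""
    return [1 if label in tags else 0 for label in labels]


def decode(scores, labels=LABELS, threshold=0.5):
    """Keep every label whose score reaches the threshold (zero, one or many)."""
    return [label for label, score in zip(labels, scores) if score >= threshold]


print(to_indicator({"bicycle", "person"}))
print(decode([0.9, 0.2, 0.7]))
```

Because each label is decided independently, the output can contain any number of labels, which is what separates this setup from the single-label cases.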

Imbalanced classification

Imbalanced classification refers to classification tasks where the number of examples in each class is unequally distributed.

Typically, imbalanced classification tasks are binary classification tasks where the majority of examples in the training dataset belong to the normal class and a minority of examples belong to the abnormal class.

Examples include:

  • Fraud detection
  • Outlier detection
  • Medical diagnostic tests

These problems are modeled as binary classification tasks, although they may require specialized techniques.

Specialized techniques may be used to change the composition of samples in the training dataset by undersampling the majority class or oversampling the minority class.
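The resampling idea can be sketched as follows, assuming simple random resampling (more elaborate techniques exist). The function names and the 95/5 split are hypothetical; a fixed seed keeps the result reproducible.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(examples):
    """Group (item, label) pairs by their label."""
    groups = defaultdict(list)
    for item, label in examples:
        groups[label].append((item, label))
    return groups


def oversample_minority(examples, seed=0):
    """Duplicate random examples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_label(examples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def undersample_majority(examples, seed=0):
    """Randomly drop examples of larger classes down to the smallest class size."""
    rng = random.Random(seed)
    groups = _group_by_label(examples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical imbalanced dataset: 95 normal transactions, 5 fraudulent ones.
DATA = [(f"tx{i}", "normal") for i in range(95)] + [(f"tx{i}", "fraud") for i in range(95, 100)]
over = oversample_minority(DATA)
under = undersample_majority(DATA)
print(Counter(label for _, label in over))
print(Counter(label for _, label in under))
```

Resampling should be applied to the training dataset only, so that evaluation data keeps the real class distribution.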

Information Extraction Problem Types

Information Extraction is a process of extracting structured information (or key facts) from unstructured and/or semi-structured documents (invoices, claims, dividend news, etc.).

An Information Extraction (IE) use case applies when data defined by business logic is extracted from documents and processed according to business rules. In Information Extraction terms, each data point of the use case is referred to as a "field". For example, suppose the invoice number, supplier name and quantity of products have to be extracted from all invoices. This use case then has three fields (invoice number, supplier name and quantity) to be extracted from the documents.
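As an illustration of what "fields" look like in practice, the sketch below extracts the three fields from a plain-text invoice using regular expressions. This is a rule-based stand-in chosen for brevity, not EasyRPA's extraction method; the patterns, the field names and the sample invoice text are all hypothetical.

```python
import re

# Hypothetical patterns, one per field. Real invoices vary far more in layout.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "supplier_name": re.compile(r"Supplier\s*:\s*(.+)", re.IGNORECASE),
    "quantity": re.compile(r"(?:Quantity|Qty)\s*:\s*(\d+)", re.IGNORECASE),
}


def extract_fields(text):
    """Return a dict of field name -> extracted string, or None if not found."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result


SAMPLE = """Invoice No: INV-2024-001
Supplier: Acme Supplies Ltd
Quantity: 12
"""
fields = extract_fields(SAMPLE)
print(fields)
```

Fixed patterns like these break as soon as the layout changes, which is one reason to use trained models for unstructured and semi-structured documents.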

Information Extraction solves a variety of problems. Some examples include:

  • Converting scanned documents into digital text documents
  • Extracting structured content from these documents (account ID, amount, currency, etc.)
  • Auto-filling web forms (for example, using data from a passport scan)