Leveraging Clustering for Document Layout Analysis in Machine Learning Projects

Image by Matthew Henry

Informed uses machine learning and AI to help lenders streamline their lending process, lower the cost of credit, and ensure that applications are fairly evaluated, all while making loans more accessible. The underpinning technology includes a wide array of AI models, in particular deep learning models trained on large quantities of document images, text, or both.

These models solve many different types of problems, including optical character recognition (OCR), document classification, field-value extraction, detection of signatures and checkboxes, and other semantic analysis. In machine learning and AI projects, diverse document layouts pose a significant challenge. By leveraging clustering techniques, we can identify documents with similar layouts and selectively augment the training data, improving model performance.

Below, we explore three concepts: document layout analysis, the importance of targeted data augmentation, and how clustering reveals underperforming documents for augmentation.

Document Layout Analysis

Document layout analysis involves understanding the structure and organization of different document types. It plays a crucial role in OCR, information extraction, and document classification. Varied document layouts make machine learning models harder to train, because different layouts can require different preprocessing and feature extraction techniques.

Targeted Data Augmentation

When a trained model underperforms, adding random documents for additional training is inefficient. Targeted data augmentation instead adds samples that specifically address the model’s weaknesses. By augmenting only the relevant documents, we improve the model’s performance without introducing unnecessary noise.

Utilizing Clustering for Document Layout Analysis

Clustering techniques provide a valuable approach for grouping similar documents based on their layout characteristics. These techniques enable us to identify clusters of documents with similar structures, formatting, or visual features. By applying clustering algorithms to the existing dataset, we can automatically group documents into distinct clusters, each representing a specific layout type.

Process Overview

Here is an overview of the process followed in the project:

  1. Image Feature Extraction:
    The VGG16 model extracted meaningful features from the document images. VGG16 is a popular deep learning model pre-trained on a large dataset that effectively extracts high-level image features.
  2. Clustering Using K-means:
    Those image features were the input for the K-means clustering algorithm. K-means is an unsupervised learning algorithm that groups similar data points into clusters, based on feature similarity. By applying K-means clustering to the image features, documents with similar layouts were grouped together, forming distinct clusters.
  3. Determining Optimal Number of Clusters:
    An elbow plot analysis highlighted the optimal number of clusters. The elbow plot identifies the number of clusters providing the most significant improvement in within-cluster similarity, while avoiding excessive fragmentation. The elbow plot typically displays the number of clusters on the x-axis and a measure of within-cluster variance (i.e. the sum of squared distances) on the y-axis. The “elbow” point on the plot indicates the number of clusters where the additional benefit of adding more clusters becomes marginal.
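
The three steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the article does not say which framework, VGG16 layer, or K-means implementation was used, so the Keras calls, the 224x224 input size, the average pooling, and all function names here are assumptions. K-means is written in plain NumPy to keep the sketch dependency-free; a library implementation such as scikit-learn's `KMeans` (which reports the same quantity as `inertia_`) would be the usual choice in practice. Synthetic 2-D points stand in for real image features.

```python
import numpy as np


def extract_vgg16_features(image_paths):
    """Embed document images with an ImageNet-pretrained VGG16 (assumed Keras setup)."""
    # Imported lazily so the clustering helpers below run without TensorFlow installed.
    from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
    from tensorflow.keras.utils import img_to_array, load_img

    # include_top=False drops the classifier head; pooling="avg" gives one vector per image.
    model = VGG16(weights="imagenet", include_top=False, pooling="avg")
    batch = np.stack(
        [img_to_array(load_img(p, target_size=(224, 224))) for p in image_paths]
    )
    return model.predict(preprocess_input(batch), verbose=0)


def kmeans(features, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means. Returns (labels, centroids, within-cluster sum of squares)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it in place if it has none.
        new_centroids = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists[np.arange(len(features)), labels].sum()
    return labels, centroids, float(inertia)


def elbow_curve(features, k_values, n_restarts=5):
    """Best within-cluster sum of squares per k: the y-axis of an elbow plot."""
    return [
        min(kmeans(features, k, seed=s)[2] for s in range(n_restarts))
        for k in k_values
    ]


# Synthetic stand-in for image features: three well-separated groups of points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
demo_features = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

k_values = [1, 2, 3, 4, 5, 6]
curve = elbow_curve(demo_features, k_values)
for k, wcss in zip(k_values, curve):
    print(f"k={k}: within-cluster sum of squares = {wcss:.1f}")
```

With real data, `demo_features` would be replaced by the output of `extract_vgg16_features`, and the printed values would be plotted against `k` to look for the elbow.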

Selective Data Augmentation Process

Once the documents are clustered, we focus on the clusters where the model underperforms. By analyzing misclassified or low-performing samples, we gain insights into the document layouts that challenge the model. Then we design targeted data augmentation strategies to address the weaknesses identified in the underperforming clusters. The below image shows the documents from a single cluster.

Image: documents from a single cluster
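
One way to find the clusters where the model underperforms is to break an evaluation metric down by cluster. The sketch below assumes a hypothetical per-document pass/fail flag (`is_correct`) from an evaluation run; the article does not state which metric was used, and the threshold and minimum sample count are illustrative values only.

```python
import numpy as np


def per_cluster_accuracy(cluster_ids, is_correct):
    """Return {cluster_id: (accuracy, n_samples)} for an evaluation set."""
    cluster_ids = np.asarray(cluster_ids)
    is_correct = np.asarray(is_correct, dtype=bool)
    stats = {}
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        stats[int(c)] = (float(is_correct[mask].mean()), int(mask.sum()))
    return stats


def underperforming_clusters(cluster_ids, is_correct, threshold=0.8, min_samples=5):
    """Clusters with enough samples and accuracy below the threshold, worst first."""
    stats = per_cluster_accuracy(cluster_ids, is_correct)
    flagged = [c for c, (acc, n) in stats.items() if n >= min_samples and acc < threshold]
    return sorted(flagged, key=lambda c: stats[c][0])


# Toy evaluation set: three clusters of ten documents each.
demo_clusters = [0] * 10 + [1] * 10 + [2] * 10
demo_correct = [True] * 9 + [False] + [True] * 4 + [False] * 6 + [True] * 10

print(per_cluster_accuracy(demo_clusters, demo_correct))
print(underperforming_clusters(demo_clusters, demo_correct))
```

The `min_samples` guard keeps very small clusters from being flagged on the strength of one or two errors.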

Techniques for Data Augmentation

Depending on the specific requirements and characteristics of the underperforming clusters, we can employ various data augmentation techniques, such as geometric transformations, text perturbation, image manipulation, or layout modification. By augmenting the data within the relevant clusters, we provide the model with additional training samples that resemble the challenging documents.
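
As a rough illustration of augmenting only the relevant clusters, the sketch below applies a small geometric transformation (a translation with white padding) and simple image manipulation (brightness jitter and pixel noise) to grayscale page images. The specific transforms, parameter values, and function names are assumptions, not the ones used in the project. For an object detection task, bounding-box annotations would need the same shift applied, which is omitted here.

```python
import numpy as np


def shift_with_padding(image, dx, dy, fill=255):
    """Translate an image by (dx, dy) pixels, filling the exposed area with white."""
    h, w = image.shape[:2]
    out = np.full_like(image, fill)
    if abs(dx) >= w or abs(dy) >= h:
        return out  # shifted entirely out of frame
    src_x0, src_x1 = max(0, -dx), min(w, w - dx)
    src_y0, src_y1 = max(0, -dy), min(h, h - dy)
    dst_x0, dst_y0 = max(0, dx), max(0, dy)
    out[dst_y0:dst_y0 + (src_y1 - src_y0), dst_x0:dst_x0 + (src_x1 - src_x0)] = (
        image[src_y0:src_y1, src_x0:src_x1]
    )
    return out


def augment(image, rng, max_shift=8, noise_std=4.0, brightness=(0.9, 1.1)):
    """Random small shift, brightness scaling, and Gaussian pixel noise."""
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = shift_with_padding(image, int(dx), int(dy))
    scaled = shifted.astype(np.float32) * rng.uniform(*brightness)
    noisy = scaled + rng.normal(scale=noise_std, size=scaled.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def augment_cluster(images, cluster_ids, target_cluster, copies=2, seed=0):
    """Create `copies` augmented variants of every image in one cluster only."""
    rng = np.random.default_rng(seed)
    return [
        augment(img, rng)
        for img, c in zip(images, cluster_ids)
        if c == target_cluster
        for _ in range(copies)
    ]


# Toy data: five blank "pages", three of which belong to cluster 6.
demo_pages = [np.full((32, 24), 255, dtype=np.uint8) for _ in range(5)]
demo_page_clusters = [6, 1, 6, 2, 6]
extra_samples = augment_cluster(demo_pages, demo_page_clusters, target_cluster=6)
print(len(extra_samples), "augmented samples for cluster 6")
```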

Iterative Improvement

Repeating this cycle allows the model’s performance to improve continuously. By retraining the model on the augmented data and re-evaluating it, we refine its ability to handle diverse document layouts, making it more robust and accurate.

Experiments

I applied a clustering approach to group similar document types for a single-class object detection task. To evaluate its effectiveness, I conducted two experiments comparing the model’s performance with and without documents from a specific cluster (cluster 6) in the training data. The results showed a significant difference in performance. The model trained with cluster 6 documents outperformed the other model, highlighting the importance of layout similarity in training.

It’s challenging to predict incoming traffic and varied document types in production. So, I employed a systematic approach of binning the documents based on layout similarity. By systematically selecting documents from different clusters, the model becomes more robust and adapts to various document layouts. This enables better performance even with unpredictable traffic. This clustering and document selection strategy provides a practical solution for handling diverse document layouts and ensures the model’s reliability.
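
The experiment setup and the cluster-based selection described above come down to index bookkeeping over the cluster assignments. The sketch below is one possible way to do it; the function names and the fixed per-cluster quota are assumptions, since the article does not describe how documents were selected from each cluster.

```python
import numpy as np


def with_and_without_cluster(cluster_ids, cluster):
    """Index sets for two training runs: all documents, and all except one cluster."""
    cluster_ids = np.asarray(cluster_ids)
    all_idx = np.arange(len(cluster_ids))
    return all_idx, all_idx[cluster_ids != cluster]


def sample_across_clusters(cluster_ids, per_cluster, seed=0):
    """Pick up to `per_cluster` document indices from every cluster."""
    rng = np.random.default_rng(seed)
    cluster_ids = np.asarray(cluster_ids)
    chosen = []
    for c in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == c)
        take = min(per_cluster, len(members))
        # Sample without replacement so no document is selected twice.
        chosen.extend(rng.choice(members, size=take, replace=False).tolist())
    return sorted(chosen)


# Toy assignments: a large cluster, a small one (6), and a medium one.
demo_assignments = [0] * 50 + [6] * 5 + [3] * 20

full_idx, ablated_idx = with_and_without_cluster(demo_assignments, cluster=6)
balanced_idx = sample_across_clusters(demo_assignments, per_cluster=10)
print(len(full_idx), len(ablated_idx), len(balanced_idx))
```

A fixed quota keeps large clusters from dominating the training set while still including every layout type, including small clusters that a random sample could miss.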

Conclusion

Targeted data augmentation is a powerful technique for enhancing model performance in ML projects with diverse document layouts. By leveraging clustering algorithms, we identify clusters of similar documents, enabling us to selectively augment the training data. 

This approach significantly improves the model’s handling of various document structures and leads to better overall performance. In a future study, we’ll see whether the approach becomes more effective when the “bag of words” on a page is included along with the image features. We’ll obtain the words by running optical character recognition beforehand.

Document layout analysis combined with targeted data augmentation provides a practical and efficient strategy for addressing underperforming models in specific tasks. By iteratively refining the model using the augmented data, we can achieve higher accuracy in handling diverse document layouts.

We hope you’ve gained valuable insights into the importance of document layout analysis, targeted data augmentation, and the role of clustering in identifying underperforming documents. Stay tuned for more on machine learning and data analysis.

Nikhil Kumar Marepally ML/AI Engineer
Nikhil Kumar Marepally is an ML/AI Engineer at Informed.IQ with over 5 years in the dynamic field of Machine Learning and Artificial Intelligence. He specializes in Computer Vision and Natural Language Processing (NLP).
