Architecting AI for Documents: A Deep Dive into LayoutLM

AI-generated image from Midjourney

Informed’s document intelligence and lending solutions help lenders board more loans faster, with less risk and fraud. Tools like real-time income calculation, fraud detection, and document verification enable lenders to increase their application volume and drive more revenue. Our engineers work hard to build this robust, unbiased vertical document AI solution.

At the heart of our document AI is an open-source model called LayoutLM. It’s a transformer-based model that parses layout and extracts key information from various documents, especially scanned forms. In this blog, we’ll dive into the workings of LayoutLM, including the model’s architecture and how it is pre-trained on both text and image inputs for richer document understanding. We’ll look specifically at LayoutLMv3, the latest iteration in the series.


LayoutLMv3 is a multimodal pre-trained model developed by Microsoft Research and built on the Transformer architecture. It has a BERT-like encoder and doesn’t rely on a pre-trained CNN or Faster R-CNN backbone to extract visual features. The pre-training objectives (discussed later) ensure that the text and image representations of a document’s page are learned jointly, facilitating cross-modal alignment.

Comparison with existing multimodal models. Unlike DocFormer and SelfDoc, LayoutLMv3 doesn’t use a CNN to extract image features, avoiding that computational bottleneck.

Model Architecture

  • Encoder-only vs. decoder-only models: Although the model is inspired by the Transformer, there is a key difference between their architectures. The original Transformer has both an encoder and a decoder, whereas LayoutLMv3 is an encoder-only model. For Natural Language Understanding (NLU) tasks, which LayoutLMv3 is primarily designed for, a decoder is not necessary. Natural Language Generation (NLG) tasks, on the other hand, are handled well by decoder-only models; most popular LLMs (ChatGPT, Llama, Mistral, etc.) fall into this category.
  • It’s a multilayer transformer with each layer consisting of multi-head self-attention and position-wise, fully connected feed-forward networks. The last layer outputs text-and-image contextual representations.
  • Text Embeddings: Every document passes through an OCR engine to extract the textual content, along with the bounding-box coordinates of each word. The input text embeddings have three components: word embeddings, a 1-D position embedding (the position of the word in the sequence), and a 2-D position embedding (the location of the word’s bounding box in the layout).
  • Image Embeddings: This is where LayoutLMv3 differs from its rivals, DocFormer, SelfDoc, etc. Those models use either CNN or Faster R-CNN features for image embeddings, which becomes a computational bottleneck. LayoutLMv3, like ViT, splits the page image into patches and linearly projects each patch into an embedding.
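As a concrete example of the 2-D position input: the LayoutLM family expects each word’s bounding box normalized to a 0–1000 coordinate grid, independent of page resolution. A minimal sketch (the function name and the example coordinates are ours):

```python
def normalize_box(box, page_width, page_height):
    """Scale OCR pixel coordinates to LayoutLM's 0-1000 layout grid."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# Example: a word box on an 850x1100-pixel scanned page
print(normalize_box((85, 110, 170, 132), 850, 1100))  # [100, 100, 200, 120]
```

The normalized box feeds the 2-D position embedding lookup, so the model sees the same coordinates whether the page was scanned at 150 or 600 DPI.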
Model architecture and pre-training objectives for LayoutLMv3. 1-D and 2-D position embeddings are added to each word token to encode layout information. Word and image tokens are masked randomly during pre-training. The word-patch alignment objective aligns the two modalities for richer document understanding.
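To make the patch idea concrete, here is a dependency-free sketch of ViT-style patchification; real implementations use tensor reshapes and a learned linear projection, but the bookkeeping is the same. The 224×224 image size and 16×16 patch size below follow ViT’s common defaults:

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patch
    vectors, ViT-style: each patch becomes one 'image token'."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patch = []
            for r in range(top, top + patch_size):
                for c in range(left, left + patch_size):
                    patch.extend(image[r][c])  # append channel values
            patches.append(patch)
    return patches

# A 224x224 RGB page with 16x16 patches yields 14*14 = 196 image tokens,
# each a flat vector of 16*16*3 = 768 values
image = [[[0, 0, 0]] * 224 for _ in range(224)]
tokens = patchify(image, 16)
print(len(tokens), len(tokens[0]))  # 196 768
```

Each flattened patch is then linearly projected to the model’s hidden size before entering the encoder alongside the text tokens.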


Pre-training objectives

The authors use several pre-training objectives to facilitate seamless cross-modality learning.

  • Masked Language Modeling (MLM): A common objective for training language models. 30% of the text tokens are masked during training by replacing them with [MASK]. The objective is to predict the masked tokens using the remaining tokens (image and text) as context.
  • Masked Image Modeling (MIM): Similar to MLM, 40% of the image tokens are masked and the model has to reconstruct them based on the available image and text tokens.
  • Word-Patch Alignment (WPA): Each text token in the input has corresponding image tokens (the patches its bounding box falls in). The objective of WPA is to predict, for each text token, whether its corresponding image tokens are masked. This helps align the two modalities.
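The MLM corruption step can be sketched as follows. Note this is a simplification: the paper masks spans of tokens, while this version masks individual tokens independently, and the token list is purely illustrative:

```python
import random

MASK_RATIO = 0.3  # LayoutLMv3 masks 30% of text tokens

def mask_tokens(tokens, mask_token="[MASK]", ratio=MASK_RATIO, seed=0):
    """Replace a random subset of tokens with [MASK]; return the
    corrupted sequence and the positions the model must predict."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for i in positions:
        masked[i] = mask_token
    return masked, positions

words = ["total", "amount", "due", ":", "$1,200", ".", "pay", "by", "June", "30"]
masked, targets = mask_tokens(words)
print(masked, targets)
```

During pre-training, the loss is computed only at the masked positions, with the unmasked text and the image tokens serving as context.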

Model configuration

There are two versions of the model: base and large. Base has 12 transformer encoder layers, each with 12 attention heads, a hidden size of 768, and a feed-forward intermediate size of 3072. Large has 24 transformer encoder layers, each with 16 attention heads, a hidden size of 1024, and a feed-forward intermediate size of 4096.
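Those hyperparameters determine most of the model’s size. A back-of-the-envelope estimate of the encoder-only parameter counts (excluding embeddings, biases, and LayerNorms, so smaller than the published totals):

```python
def encoder_layer_params(hidden, intermediate):
    """Rough per-layer parameter count for a transformer encoder layer:
    the four attention projections plus the two feed-forward matrices
    (biases and LayerNorms omitted for simplicity)."""
    attention = 4 * hidden * hidden  # W_Q, W_K, W_V, W_O
    ffn = 2 * hidden * intermediate  # up- and down-projection
    return attention + ffn

base = 12 * encoder_layer_params(768, 3072)
large = 24 * encoder_layer_params(1024, 4096)
print(f"base ~{base/1e6:.0f}M, large ~{large/1e6:.0f}M encoder params")
# base ~85M, large ~302M encoder params
```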

Comparison of LayoutLMv3’s performance on various fine-tuning datasets against its competitors. Numbers for both the base and the large models are presented; LayoutLMv3 outperforms the other models on almost all tasks.

Fine-tuning for token classification

LayoutLMv3 can be fine-tuned for multiple downstream tasks: form understanding, document image classification, and visual question answering. At Informed, we use the model for form understanding to extract critical information from a document. Each token is classified into one of the pre-defined classes (applicant name, address, etc.) by passing its representation through a classification layer and then a softmax layer. All parameters are fine-tuned end to end.
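A stripped-down sketch of that token-classification head, in pure Python. The label set, the toy weights, and the two-dimensional “representations” are hypothetical stand-ins for the model’s learned head and its 768-dimensional per-token outputs:

```python
import math

LABELS = ["O", "APPLICANT_NAME", "ADDRESS"]  # hypothetical label set

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_tokens(token_reprs, weights, bias):
    """Linear layer + softmax over each token's final-layer
    representation; pick the highest-probability label."""
    preds = []
    for rep in token_reprs:
        logits = [sum(w * x for w, x in zip(row, rep)) + b
                  for row, b in zip(weights, bias)]
        probs = softmax(logits)
        preds.append(LABELS[probs.index(max(probs))])
    return preds

# Toy 2-dim "representations" and hand-picked weights for illustration
reprs = [[2.0, -1.0], [-1.0, 3.0]]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.0]
print(classify_tokens(reprs, W, b))  # ['O', 'APPLICANT_NAME']
```

During fine-tuning, the cross-entropy loss on these per-token predictions is backpropagated through the head and the entire encoder.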

We use our customers’ real-world, permissioned data to enhance the capability of the models. Depending on the complexity of the document, we annotate enough pages with the information we want to extract by labeling relevant fields and marking their bounding boxes. We train the model until the training loss plateaus, then analyze the precision and recall scores. This is discussed in further detail in “Accuracy and Automation SLAs are Coming Your Way.”

Like most transformer models, LayoutLMv3 is limited to 512 tokens per inference. We work around this limitation with a few techniques, such as token filtering and a sliding window.
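The sliding-window idea can be sketched as follows; the window and stride values here are illustrative, and predictions in the overlapping regions would still need to be reconciled across windows:

```python
MAX_TOKENS = 512

def sliding_windows(tokens, window=MAX_TOKENS, stride=384):
    """Split a long token sequence into overlapping windows so each
    chunk fits the model's 512-token limit; the overlap lets tokens
    near a chunk boundary keep context from both sides."""
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return chunks

# A 1000-token page splits into 3 overlapping chunks, the last 232 long
chunks = sliding_windows(list(range(1000)))
print(len(chunks), len(chunks[-1]))  # 3 232
```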

Overall, LayoutLMv3 has advanced the field of document AI tremendously and is crucial to Informed’s products. Open-source models like this democratize AI and ensure a level playing field for smaller players.

Shikhar Gupta
Shikhar is a Machine Learning engineer at Informed with over 5 years of experience in tech. His expertise lies in creating cutting-edge AI solutions that drive efficiency and intelligence in Informed's products.
