LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
- URL: http://arxiv.org/abs/2204.08387v2
- Date: Tue, 19 Apr 2022 15:55:02 GMT
- Title: LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
- Authors: Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei
- Abstract summary: We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised pre-training techniques have achieved remarkable progress in
Document AI. Most multimodal pre-trained models use a masked language modeling
objective to learn bidirectional representations on the text modality, but they
differ in pre-training objectives for the image modality. This discrepancy adds
difficulty to multimodal representation learning. In this paper, we propose
LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified
text and image masking. Additionally, LayoutLMv3 is pre-trained with a
word-patch alignment objective to learn cross-modal alignment by predicting
whether the corresponding image patch of a text word is masked. The simple
unified architecture and training objectives make LayoutLMv3 a general-purpose
pre-trained model for both text-centric and image-centric Document AI tasks.
Experimental results show that LayoutLMv3 achieves state-of-the-art performance
not only in text-centric tasks, including form understanding, receipt
understanding, and document visual question answering, but also in
image-centric tasks such as document image classification and document layout
analysis. The code and models are publicly available at
https://aka.ms/layoutlmv3.
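The abstract names three pre-training objectives: masked language modeling (MLM) over text tokens, masked image modeling (MIM) over image patch tokens, and word-patch alignment (WPA), which predicts whether the image patch corresponding to an unmasked word is masked. The following is a minimal PyTorch sketch of how such a combined objective could look; the head shapes, vocabulary sizes, and masking conventions are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedMaskingObjective(nn.Module):
    # Hypothetical prediction heads on top of a multimodal Transformer's
    # output states; sizes are placeholder assumptions.
    def __init__(self, hidden_size=768, word_vocab=50265, patch_vocab=8192):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, word_vocab)   # masked language modeling
        self.mim_head = nn.Linear(hidden_size, patch_vocab)  # masked image modeling
        self.wpa_head = nn.Linear(hidden_size, 2)            # word-patch alignment

    def forward(self, word_states, patch_states,
                word_labels, patch_labels, wpa_labels,
                word_masked, patch_masked):
        # MLM: predict the original ids of masked text tokens only.
        mlm = F.cross_entropy(self.mlm_head(word_states)[word_masked],
                              word_labels[word_masked])
        # MIM: predict discrete visual tokens for masked image patches.
        mim = F.cross_entropy(self.mim_head(patch_states)[patch_masked],
                              patch_labels[patch_masked])
        # WPA: for words that are NOT masked, classify whether the image
        # patch covering that word is masked (binary alignment signal).
        wpa = F.cross_entropy(self.wpa_head(word_states)[~word_masked],
                              wpa_labels[~word_masked])
        return mlm + mim + wpa

Here word_states and patch_states would be the shared encoder's output vectors for text and image positions, and the *_masked tensors are boolean masks over those positions. The point of the sketch is the paper's central idea: one Transformer and three cross-entropy losses of the same form, rather than a separate objective per modality.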
Related papers
- DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents
We propose DocumentCLIP to enforce vision-language pretraining models to comprehend the interaction between images and longer text within documents.
Our model benefits real-world multimodal document understanding, such as news articles, magazines, and product descriptions, which contain linguistically and visually rich content.
arXiv Detail & Related papers (2023-06-09T23:51:11Z)
- LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding
This paper proposes a novel multi-modal pre-training model, LayoutMask.
It can enhance the interactions between text and layout modalities in a unified model.
It achieves state-of-the-art results on a wide variety of visually rich document understanding (VrDU) problems.
arXiv Detail & Related papers (2023-05-30T03:56:07Z)
- StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training
StrucTexTv2 is an effective document image pre-training framework.
It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling.
It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Unifying Vision, Text, and Layout for Universal Document Processing
We propose a Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation.
Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites.
arXiv Detail & Related papers (2022-12-05T22:14:49Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
We show that recent advances in each modality, namely CLIP image representations and the scaling of language models, do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model; a minimal sketch of this mechanism appears after this list.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- SelfDoc: Self-Supervised Document Representation Learning
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)
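As noted in the ALBEF entry above, momentum distillation trains the online model against soft pseudo-targets produced by an exponential-moving-average (EMA) copy of itself. Below is a minimal sketch of that mechanism, assuming generic classification logits; the momentum coefficient and mixing weight are illustrative defaults, not the paper's exact values.

import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(momentum_model, model, m=0.995):
    # The momentum model is an exponential moving average of the online
    # model's parameters and is never updated by gradients.
    for p_m, p in zip(momentum_model.parameters(), model.parameters()):
        p_m.mul_(m).add_(p, alpha=1.0 - m)

def momentum_distillation_loss(student_logits, momentum_logits, labels, alpha=0.4):
    # Mix the ordinary hard-label cross-entropy with a KL term toward the
    # momentum model's soft pseudo-targets.
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(momentum_logits, dim=-1),
                    reduction="batchmean")
    return (1.0 - alpha) * hard + alpha * soft

# Typical setup: momentum_model = copy.deepcopy(model), then call
# ema_update(momentum_model, model) once per step after the optimizer update.

The EMA copy changes more slowly than the online model, so its predictions act as a stabilizing, denoised training signal, which is what makes it useful as a source of pseudo-targets on noisy web data.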