LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training
for Document Understanding
- URL: http://arxiv.org/abs/2305.18721v2
- Date: Fri, 9 Jun 2023 03:29:43 GMT
- Title: LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training
for Document Understanding
- Authors: Yi Tu, Ya Guo, Huan Chen, Jinyang Tang
- Abstract summary: This paper proposes a novel multi-modal pre-training model, LayoutMask.
It can enhance the interactions between text and layout modalities in a unified model.
It can achieve state-of-the-art results on a wide variety of VrDU problems.
- Score: 7.7514466231699455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visually-rich Document Understanding (VrDU) has attracted much research
attention over the past years. Pre-trained models on a large number of document
images with transformer-based backbones have led to significant performance
gains in this field. The major challenge is how to fusion the different
modalities (text, layout, and image) of the documents in a unified model with
different pre-training tasks. This paper focuses on improving text-layout
interactions and proposes a novel multi-modal pre-training model, LayoutMask.
LayoutMask uses local 1D position, instead of global 1D position, as layout
input and has two pre-training objectives: (1) Masked Language Modeling:
predicting masked tokens with two novel masking strategies; (2) Masked Position
Modeling: predicting masked 2D positions to improve layout representation
learning. LayoutMask can enhance the interactions between text and layout
modalities in a unified model and produce adaptive and robust multi-modal
representations for downstream tasks. Experimental results show that our
proposed method can achieve state-of-the-art results on a wide variety of VrDU
problems, including form understanding, receipt understanding, and document
image classification.
Related papers
- StrucTexTv2: Masked Visual-Textual Prediction for Document Image
Pre-training [64.37272287179661]
StrucTexTv2 is an effective document image pre-training framework.
It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling.
It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z) - Towards Unifying Medical Vision-and-Language Pre-training via Soft
Prompts [63.84720380390935]
There exist two typical types, textiti.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z) - Unifying Vision, Text, and Layout for Universal Document Processing [105.36490575974028]
We propose a Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation.
Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites.
arXiv Detail & Related papers (2022-12-05T22:14:49Z) - ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich
Document Understanding [52.3895498789521]
We propose ERNIE, a novel document pre-training solution with layout knowledge enhancement.
We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, and learn the proper reading order of documents.
Experimental results show ERNIE achieves superior performance on various downstream tasks, setting new state-of-the-art on key information, and document question answering.
arXiv Detail & Related papers (2022-10-12T12:59:24Z) - Knowing Where and What: Unified Word Block Pretraining for Document
Understanding [11.46378901674016]
We propose UTel, a language model with Unified TExt and layout pre-training.
Specifically, we propose two pre-training tasks: Surrounding Word Prediction (SWP) for the layout learning, and Contrastive learning of Word Embeddings (CWE) for identifying different word blocks.
In this way, the joint training of Masked Layout-Language Modeling (MLLM) and two newly proposed tasks enables the interaction between semantic and spatial features in a unified way.
arXiv Detail & Related papers (2022-07-28T09:43:06Z) - LayoutLMv3: Pre-training for Document AI with Unified Text and Image
Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
arXiv Detail & Related papers (2022-04-18T16:19:52Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z) - ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition [38.08486689940946]
Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention.
It is difficult to model such interactions as image and text representations are trained separately on the data of their respective modality.
In this paper, we propose bf Image-bf text bf Alignments (ITA) to align image features into the textual space.
arXiv Detail & Related papers (2021-12-13T08:29:43Z) - ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised
Image-Text Data [9.3935916515127]
We introduce a new vision-supervised pre-trained model -- ImageBERT -- for image-text joint embedding.
Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them.
arXiv Detail & Related papers (2020-01-22T11:35:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.