LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document
Understanding
- URL: http://arxiv.org/abs/2012.14740v1
- Date: Tue, 29 Dec 2020 13:01:52 GMT
- Title: LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document
Understanding
- Authors: Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang,
Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou
- Abstract summary: Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework.
- Score: 49.941806975280045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training of text and layout has proved effective in a variety of
visually-rich document understanding tasks due to its effective model
architecture and the advantage of large-scale unlabeled scanned/digital-born
documents. In this paper, we present \textbf{LayoutLMv2} by pre-training text,
layout and image in a multi-modal framework, where new model architectures and
pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the
existing masked visual-language modeling task but also the new text-image
alignment and text-image matching tasks in the pre-training stage, where
cross-modality interaction is better learned. Meanwhile, it also integrates a
spatial-aware self-attention mechanism into the Transformer architecture, so
that the model can fully understand the relative positional relationship among
different text blocks. Experiment results show that LayoutLMv2 outperforms
strong baselines and achieves new state-of-the-art results on a wide variety of
downstream visually-rich document understanding tasks, including FUNSD (0.7895
-> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA
(0.834 -> 0.852), RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672).
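The spatial-aware self-attention described in the abstract adds learned relative-position biases, covering both 1-D token order and 2-D x/y distances derived from bounding boxes, to the attention logits. The following is a minimal single-head sketch of that idea, not the released implementation: it assumes clipped relative distances instead of the paper's bucketing scheme, and names such as SpatialAwareSelfAttention and max_rel_2d_pos are illustrative only.

```python
# Simplified, single-head sketch of spatial-aware self-attention:
# learned biases for relative token order and relative x/y box positions
# are added to the attention logits before the softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAwareSelfAttention(nn.Module):
    def __init__(self, hidden, max_rel_pos=128, max_rel_2d_pos=64):
        super().__init__()
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        self.v = nn.Linear(hidden, hidden)
        self.scale = hidden ** -0.5
        # Bias tables indexed by clipped, shifted relative distances.
        self.rel_1d = nn.Embedding(2 * max_rel_pos + 1, 1)
        self.rel_x = nn.Embedding(2 * max_rel_2d_pos + 1, 1)
        self.rel_y = nn.Embedding(2 * max_rel_2d_pos + 1, 1)
        self.max_rel_pos = max_rel_pos
        self.max_rel_2d_pos = max_rel_2d_pos

    @staticmethod
    def _rel(coord, clip):
        # Pairwise relative distance, clipped and shifted to a valid index.
        return (coord.unsqueeze(-1) - coord.unsqueeze(-2)).clamp(-clip, clip) + clip

    def forward(self, hidden_states, positions, box_x, box_y):
        # hidden_states: (B, T, H); positions, box_x, box_y: (B, T) long tensors
        # holding token indices and quantized bounding-box centre coordinates.
        q, k, v = self.q(hidden_states), self.k(hidden_states), self.v(hidden_states)
        scores = torch.matmul(q, k.transpose(-1, -2)) * self.scale  # (B, T, T)
        scores = scores + self.rel_1d(self._rel(positions, self.max_rel_pos)).squeeze(-1)
        scores = scores + self.rel_x(self._rel(box_x, self.max_rel_2d_pos)).squeeze(-1)
        scores = scores + self.rel_y(self._rel(box_y, self.max_rel_2d_pos)).squeeze(-1)
        return torch.matmul(F.softmax(scores, dim=-1), v)
```

In the multi-head case the biases are typically kept per head; a single table per axis keeps this sketch short.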
Related papers
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information.
Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding [0.0]
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents.
Existing methods have been developed to enhance document comprehension by incorporating awareness of images, text, and layout structure during pre-training.
Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
arXiv Detail & Related papers (2024-03-21T09:25:24Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding [7.7514466231699455]
This paper proposes a novel multi-modal pre-training model, LayoutMask.
It can enhance the interactions between text and layout modalities in a unified model.
It can achieve state-of-the-art results on a wide variety of VrDU problems.
arXiv Detail & Related papers (2023-05-30T03:56:07Z)
- DUBLIN -- Document Understanding By Language-Image Network [37.42637168606938]
We propose DUBLIN, which is pretrained on web pages using three novel objectives.
We show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset.
We also achieve competitive performance on RVL-CDIP document classification.
arXiv Detail & Related papers (2023-05-23T16:34:09Z)
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
arXiv Detail & Related papers (2022-04-18T16:19:52Z)
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding [108.12766816023783]
We propose LayoutLM to jointly model interactions between text and layout information across scanned document images.
This is the first time that text and layout are jointly learned in a single framework for document-level pre-training.
It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42); a minimal sketch of this text-plus-layout embedding idea follows the list below.
arXiv Detail & Related papers (2019-12-31T14:31:29Z)
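As referenced in the LayoutLM entry above, the core idea is to fuse text with page layout at the embedding level. Below is a minimal sketch of that idea, assuming integer bounding-box coordinates normalized to a fixed grid; the class and argument names (TextLayoutEmbedding, max_coord) are hypothetical and do not come from the released code.

```python
# Minimal sketch of LayoutLM-style input embeddings: word embeddings are
# summed with 2-D position embeddings looked up from each word's bounding box.
import torch.nn as nn

class TextLayoutEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_coord=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.x_emb = nn.Embedding(max_coord, hidden)  # shared for x0 and x1
        self.y_emb = nn.Embedding(max_coord, hidden)  # shared for y0 and y1

    def forward(self, input_ids, bboxes):
        # input_ids: (B, T); bboxes: (B, T, 4) integer (x0, y0, x1, y1) in [0, max_coord).
        x0, y0, x1, y1 = bboxes.unbind(-1)
        return (self.tok(input_ids)
                + self.x_emb(x0) + self.y_emb(y0)
                + self.x_emb(x1) + self.y_emb(y1))
```

The summed embeddings then feed a standard Transformer encoder, which is how layout information can influence every attention layer even though it enters only at the input.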
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.