Knowing Where and What: Unified Word Block Pretraining for Document
Understanding
- URL: http://arxiv.org/abs/2207.13979v2
- Date: Fri, 29 Jul 2022 12:35:37 GMT
- Title: Knowing Where and What: Unified Word Block Pretraining for Document
Understanding
- Authors: Song Tao, Zijian Wang, Tiantian Fan, Canjie Luo, Can Huang
- Abstract summary: We propose UTel, a language model with Unified TExt and layout pre-training.
Specifically, we propose two pre-training tasks: Surrounding Word Prediction (SWP) for layout learning, and Contrastive learning of Word Embeddings (CWE) for identifying different word blocks.
In this way, the joint training of Masked Layout-Language Modeling (MLLM) and two newly proposed tasks enables the interaction between semantic and spatial features in a unified way.
- Score: 11.46378901674016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the complex layouts of documents, it is challenging to extract
information from documents. Most previous studies develop multimodal pre-trained
models in a self-supervised way. In this paper, we focus on the embedding
learning of word blocks containing text and layout information, and propose
UTel, a language model with Unified TExt and Layout pre-training. Specifically,
we propose two pre-training tasks: Surrounding Word Prediction (SWP) for
layout learning, and Contrastive learning of Word Embeddings (CWE) for
identifying different word blocks. Moreover, we replace the commonly used 1D
position embedding with a 1D clipped relative position embedding. In this way,
the joint training of Masked Layout-Language Modeling (MLLM) and two newly
proposed tasks enables the interaction between semantic and spatial features in
a unified way. Additionally, the proposed UTel can process arbitrary-length
sequences by removing the 1D position embedding, while maintaining competitive
performance. Extensive experimental results show UTel learns better joint
representations and achieves better performance than previous methods on
various downstream tasks, while requiring no image modality. Code is available
at https://github.com/taosong2019/UTel.
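The CWE task above is a contrastive objective over word-block embeddings. As a rough illustration, here is a minimal InfoNCE-style sketch in PyTorch; the function name, the two-view setup, and the temperature are illustrative assumptions, not details taken from the UTel paper.

```python
# Hypothetical sketch of a contrastive word-embedding (CWE-style) loss.
# Assumes two embeddings per word block (e.g., from two augmented views);
# the matching block is the positive, all other blocks act as negatives.
import torch
import torch.nn.functional as F

def contrastive_word_loss(anchor: torch.Tensor,
                          positive: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """anchor, positive: (num_blocks, dim) embeddings of the same word blocks."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                     # (num_blocks, num_blocks)
    targets = torch.arange(a.size(0), device=a.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)
```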
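The abstract also swaps the usual absolute 1D position embedding for a clipped relative one. A common way to realize this is a relative attention bias whose distances are clamped to a fixed window; the sketch below follows that pattern, with the class name and clipping radius chosen for illustration (the paper's exact parameterization may differ).

```python
# Sketch of a 1D clipped relative position bias added to attention logits.
import torch
import torch.nn as nn

class ClippedRelativePositionBias(nn.Module):
    def __init__(self, num_heads: int, max_dist: int = 8):
        super().__init__()
        self.max_dist = max_dist
        # One learned bias per clipped distance in [-max_dist, max_dist].
        self.bias = nn.Embedding(2 * max_dist + 1, num_heads)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]                # (seq_len, seq_len) offsets
        rel = rel.clamp(-self.max_dist, self.max_dist) + self.max_dist
        return self.bias(rel).permute(2, 0, 1)           # (num_heads, seq_len, seq_len)
```

Because the bias depends only on clamped offsets rather than absolute indices, it applies to sequences of any length, which is consistent with the abstract's claim about arbitrary-length inputs.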
Related papers
- LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding [7.7514466231699455]
This paper proposes a novel multi-modal pre-training model, LayoutMask.
It can enhance the interactions between text and layout modalities in a unified model.
It achieves state-of-the-art results on a wide variety of visually-rich document understanding (VrDU) problems.
arXiv Detail & Related papers (2023-05-30T03:56:07Z)
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding [52.3895498789521]
We propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement.
It first rearranges input sequences in the serialization stage, then introduces a correlative pre-training task, reading order prediction, to learn the proper reading order of documents.
Experimental results show ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art results on key information extraction and document question answering.
arXiv Detail & Related papers (2022-10-12T12:59:24Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions of object classes.
We propose I2DFormer, a novel transformer-based zero-shot learning (ZSL) framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
arXiv Detail & Related papers (2022-04-18T16:19:52Z)
- BROS: A Layout-Aware Pre-trained Language Model for Understanding Documents [13.293166441041238]
This paper introduces a pre-trained language model, BERT Relying On Spatiality (BROS), which effectively utilizes the information included in individual text blocks and their layouts.
BROS encodes spatial information by utilizing relative positions and learns dependencies between OCR blocks with a novel area-masking strategy; a toy sketch of the area-masking idea follows.
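A toy version of such an area-masking step might sample a random rectangle on the page and mask every OCR block that falls inside it. The box format, page size, and sampling scheme below are assumptions for illustration, not BROS's published implementation.

```python
# Hypothetical area masking: mask all OCR blocks inside a random rectangle.
import random

def area_mask(boxes, page_w=1000, page_h=1000, frac=0.3):
    """boxes: list of (x0, y0, x1, y1) block coordinates on a page_w x page_h page.
    Returns the indices of blocks whose boxes lie fully inside the sampled area."""
    w, h = page_w * frac, page_h * frac
    rx0 = random.uniform(0, page_w - w)
    ry0 = random.uniform(0, page_h - h)
    rx1, ry1 = rx0 + w, ry0 + h
    return [i for i, (x0, y0, x1, y1) in enumerate(boxes)
            if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1]
```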
arXiv Detail & Related papers (2021-08-10T09:30:23Z)
- SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content of the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding [108.12766816023783]
We propose LayoutLM to jointly model interactions between text and layout information across scanned document images.
This is the first time that text and layout are jointly learned in a single framework for document-level pre-training.
It achieves new state-of-the-art results on several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24), and document image classification (from 93.07 to 94.42).
arXiv Detail & Related papers (2019-12-31T14:31:29Z)
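As a closing illustration of the text-layout fusion these models share, LayoutLM-style inputs add embeddings of each word block's bounding-box coordinates to its token embedding. The sketch below shows that idea in minimal form; the table sizes and coordinate normalization are illustrative defaults, not values from any specific paper.

```python
# Minimal sketch of fusing token and 2D layout embeddings (LayoutLM-style).
import torch
import torch.nn as nn

class TextLayoutEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, coord_size=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        # Separate tables for the x and y coordinates of the box corners.
        self.x_emb = nn.Embedding(coord_size, dim)
        self.y_emb = nn.Embedding(coord_size, dim)

    def forward(self, token_ids, boxes):
        """token_ids: (batch, seq) int64; boxes: (batch, seq, 4) int64 as
        (x0, y0, x1, y1), normalized to integers in [0, coord_size)."""
        x0, y0, x1, y1 = boxes.unbind(-1)
        return (self.tok(token_ids)
                + self.x_emb(x0) + self.x_emb(x1)
                + self.y_emb(y0) + self.y_emb(y1))
```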
This list is automatically generated from the titles and abstracts of the papers on this site.