ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich
Document Understanding
- URL: http://arxiv.org/abs/2210.06155v2
- Date: Fri, 14 Oct 2022 06:54:17 GMT
- Title: ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich
Document Understanding
- Authors: Qiming Peng, Yinxu Pan, Wenjin Wang, Bin Luo, Zhenyu Zhang, Zhengjie
Huang, Teng Hu, Weichong Yin, Yongfeng Chen, Yin Zhang, Shikun Feng, Yu Sun,
Hao Tian, Hua Wu, Haifeng Wang
- Abstract summary: We propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement.
We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents.
Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering.
- Score: 52.3895498789521
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed the rise and success of pre-training techniques
in visually-rich document understanding. However, most existing methods lack
the systematic mining and utilization of layout-centered knowledge, leading to
sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel
document pre-training solution with layout knowledge enhancement in the whole
workflow, to learn better representations that combine the features from text,
layout, and image. Specifically, we first rearrange input sequences in the
serialization stage, and then present a correlative pre-training task, reading
order prediction, to learn the proper reading order of documents. To improve
the layout awareness of the model, we integrate a spatial-aware disentangled
attention into the multi-modal transformer and a replaced regions prediction
task into the pre-training phase. Experimental results show that ERNIE-Layout
achieves superior performance on various downstream tasks, setting new
state-of-the-art on key information extraction, document image classification,
and document question answering datasets. The code and models are publicly
available at
http://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout.
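For illustration, below is a minimal single-head sketch of how a disentangled attention score can be extended with 2-D layout positions, in the spirit of the spatial-aware disentangled attention described in the abstract. This is not the authors' PaddlePaddle implementation (see the repository linked above); the module name, bucket sizes, and PyTorch usage are assumptions made for the sketch.

```python
import torch
import torch.nn as nn


class SpatialDisentangledAttention(nn.Module):
    """Single-head sketch: content-to-content plus content-to-position terms
    for 1-D (sequence) and 2-D (layout x/y) relative distances."""

    def __init__(self, hidden=768, buckets_1d=32, buckets_2d=64):
        super().__init__()
        self.scale = hidden ** -0.5
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        # Embedding tables for bucketed relative distances (illustrative sizes).
        self.rel_1d = nn.Embedding(buckets_1d, hidden)
        self.rel_x = nn.Embedding(buckets_2d, hidden)
        self.rel_y = nn.Embedding(buckets_2d, hidden)

    def forward(self, h, rel_1d_ids, rel_x_ids, rel_y_ids):
        # h: (batch, seq, hidden); *_ids: (seq, seq) integer bucket ids.
        q, k = self.q(h), self.k(h)
        content = torch.einsum("bqd,bkd->bqk", q, k)                       # content-to-content
        pos_1d = torch.einsum("bqd,qkd->bqk", q, self.rel_1d(rel_1d_ids))  # content-to-1D-position
        pos_x = torch.einsum("bqd,qkd->bqk", q, self.rel_x(rel_x_ids))     # content-to-x-position
        pos_y = torch.einsum("bqd,qkd->bqk", q, self.rel_y(rel_y_ids))     # content-to-y-position
        return ((content + pos_1d + pos_x + pos_y) * self.scale).softmax(dim=-1)
```

In such a sketch, relative x/y distances between token bounding boxes would be bucketed into integer ids before the forward pass, mirroring how 1-D relative positions are handled in DeBERTa-style disentangled attention.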
Related papers
- Enhancing Visually-Rich Document Understanding via Layout Structure
Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
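A hedged sketch of the pixel-text matching idea summarized above: dense image features are compared with text embeddings of class prompts to produce per-class score maps. Shapes, normalization, and the temperature value are illustrative assumptions, not DenseCLIP's exact code.

```python
import torch
import torch.nn.functional as F


def pixel_text_score_map(pixel_feats, text_feats, temperature=0.07):
    # pixel_feats: (batch, dim, H, W) dense features from an image encoder.
    # text_feats:  (num_classes, dim) embeddings of class-name prompts.
    # Returns per-class score maps of shape (batch, num_classes, H, W).
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feats = F.normalize(text_feats, dim=1)
    return torch.einsum("bdhw,cd->bchw", pixel_feats, text_feats) / temperature
```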
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- A Span Extraction Approach for Information Extraction on Visually-Rich Documents [2.3131309703965135]
We present a new approach to improve the capability of language model pre-training on visually-rich documents (VRDs).
Firstly, we introduce a new IE model that is query-based and employs the span extraction formulation instead of the commonly used sequence labelling approach.
We also propose a new training task which focuses on modelling the relationships between semantic entities within a document.
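As a rough illustration of the query-based span extraction formulation mentioned above (layer names and shapes are assumptions, not the paper's code), such a head predicts start and end logits over document tokens for each query:

```python
import torch
import torch.nn as nn


class SpanExtractionHead(nn.Module):
    # Predicts start/end logits over document tokens; the query (field name)
    # is assumed to be encoded jointly with the document by the backbone.
    def __init__(self, hidden=768):
        super().__init__()
        self.start = nn.Linear(hidden, 1)
        self.end = nn.Linear(hidden, 1)

    def forward(self, token_states):
        # token_states: (batch, seq, hidden) contextual token representations.
        return self.start(token_states).squeeze(-1), self.end(token_states).squeeze(-1)
```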
arXiv Detail & Related papers (2021-06-02T06:50:04Z)
- DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents [76.19748112897177]
We present a novel task and approach for document-to-slide generation.
We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner.
Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides.
arXiv Detail & Related papers (2021-01-28T03:21:17Z)
- Multiple Document Datasets Pre-training Improves Text Line Detection With Deep Neural Networks [2.5352713493505785]
We introduce a fully convolutional network for the document layout analysis task.
Our method Doc-UFCN relies on a U-shaped model trained from scratch for detecting objects from historical documents.
We show that Doc-UFCN outperforms state-of-the-art methods on various datasets.
arXiv Detail & Related papers (2020-12-28T09:48:33Z)
- Learning from similarity and information extraction from structured documents [0.0]
The aim is to improve micro F1 of per-word classification on a huge real-world document dataset.
Results confirm that all of the proposed architecture parts are required to beat the previous results.
The best model improves the previous state-of-the-art results by an 8.25 gain in F1 score.
arXiv Detail & Related papers (2020-10-17T21:34:52Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm which directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)