XDoc: Unified Pre-training for Cross-Format Document Understanding
- URL: http://arxiv.org/abs/2210.02849v1
- Date: Thu, 6 Oct 2022 12:07:18 GMT
- Title: XDoc: Unified Pre-training for Cross-Format Document Understanding
- Authors: Jingye Chen, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei
- Abstract summary: XDoc is a unified pre-trained model which deals with different document formats in a single model.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
- Score: 84.63416346227176
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The recent surge of pre-training has driven rapid progress in
document understanding. The pre-training and fine-tuning framework has been
used effectively to tackle texts in various formats, including plain texts,
document texts, and web texts. Despite achieving promising performance,
existing pre-trained models usually target one specific document format at one
time, making it difficult to combine knowledge from multiple document formats.
To address this, we propose XDoc, a unified pre-trained model which deals with
different document formats in a single model. For parameter efficiency, we
share backbone parameters for different formats such as the word embedding
layer and the Transformer layers. Meanwhile, we introduce adaptive layers with
lightweight parameters to enhance the distinction across different formats.
Experimental results demonstrate that with only 36.7% of the parameters, XDoc
achieves comparable or even better performance on a variety of downstream tasks
compared with the individual pre-trained models, which is cost-effective for
real-world deployment. The code and pre-trained models will be publicly
available at https://aka.ms/xdoc.
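To make the parameter-sharing design concrete, here is a minimal PyTorch sketch of a shared backbone with lightweight per-format adaptive layers. It only illustrates the idea in the abstract; the class name, the adapter shape (a single Linear layer), and the format keys are assumptions, not XDoc's actual code.

    import torch
    import torch.nn as nn

    class XDocSketch(nn.Module):
        """Shared backbone (embeddings + Transformer) across formats;
        each format adds only a lightweight adaptive layer."""

        def __init__(self, vocab_size=30522, hidden=768, layers=12,
                     formats=("plain", "document", "web")):
            super().__init__()
            # Shared parameters: one embedding table, one Transformer stack.
            self.embed = nn.Embedding(vocab_size, hidden)
            block = nn.TransformerEncoderLayer(d_model=hidden, nhead=12,
                                               batch_first=True)
            self.backbone = nn.TransformerEncoder(block, num_layers=layers)
            # Format-specific parameters: one small adapter per format.
            self.adapters = nn.ModuleDict(
                {f: nn.Linear(hidden, hidden) for f in formats})

        def forward(self, input_ids, fmt):
            h = self.embed(input_ids)    # (batch, seq, hidden)
            h = self.adapters[fmt](h)    # cheap format-specific adaptation
            return self.backbone(h)      # shared encoding for every format

    ids = torch.randint(0, 30522, (2, 16))    # toy batch of token ids
    out = XDocSketch()(ids, fmt="web")        # (2, 16, 768)

In this layout the adapters hold only a tiny fraction of the weights; almost everything lives in the shared embedding table and Transformer stack, which is where the parameter savings quantified above come from.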
Related papers
- DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception [16.301481927603554]
We introduce DocLayout-YOLO, a novel approach that enhances accuracy while maintaining speed advantages.
For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm.
In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module.
arXiv Detail & Related papers (2024-10-16T14:50:47Z)
- In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z)
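As a rough sketch of the In-Context Pretraining idea above, the function below greedily chains documents so that each is followed by a near neighbor in embedding space, using scikit-learn for the neighbor search. How the embeddings are produced, the function name, and the greedy chaining strategy are assumptions for illustration, not the paper's exact algorithm.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def chain_related_docs(doc_embeddings: np.ndarray) -> list:
        """Order documents so each one is followed by a near neighbor,
        approximating pretraining on sequences of related documents."""
        n = len(doc_embeddings)
        index = NearestNeighbors(n_neighbors=min(10, n), metric="cosine")
        index.fit(doc_embeddings)
        _, neighbors = index.kneighbors(doc_embeddings)

        visited, order = {0}, [0]
        while len(order) < n:
            current = order[-1]
            # Hop to the closest unvisited neighbor of the current doc;
            # if all of them are used up, fall back to any unvisited doc.
            nxt = next((j for j in neighbors[current] if j not in visited), None)
            if nxt is None:
                nxt = next(j for j in range(n) if j not in visited)
            visited.add(nxt)
            order.append(int(nxt))
        # Concatenate documents in this order to build training contexts.
        return order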
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow, OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
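A minimal sketch of how such multi-view representations might be scored at retrieval time, assuming an upstream encoder has already produced one vector per generated pseudo-query. The max-pooling over views is an assumed scoring rule for illustration, not necessarily the paper's.

    import numpy as np

    def multiview_score(query_vec: np.ndarray, doc_views: np.ndarray) -> float:
        """doc_views holds one row per query-informed view of a document;
        relevance is the best cosine similarity over all views."""
        sims = doc_views @ query_vec
        sims = sims / (np.linalg.norm(doc_views, axis=1)
                       * np.linalg.norm(query_vec) + 1e-9)
        return float(sims.max())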
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
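As a toy illustration of extending the Transformer to take multimodal embeddings as input, the module below projects token, layout, and visual features to a common width and sums them, so a standard Transformer encoder can consume the result. Names and dimensions follow the general style of layout-aware models and are assumptions, not UDoc's actual architecture.

    import torch.nn as nn

    class MultimodalInput(nn.Module):
        """Fuses token ids, per-token bounding boxes, and visual region
        features into one embedding sequence by projecting and summing."""

        def __init__(self, vocab_size=30522, hidden=768, visual_dim=2048):
            super().__init__()
            self.word = nn.Embedding(vocab_size, hidden)
            self.layout = nn.Linear(4, hidden)    # (x0, y0, x1, y1) boxes
            self.visual = nn.Linear(visual_dim, hidden)

        def forward(self, ids, boxes, region_feats):
            return self.word(ids) + self.layout(boxes) + self.visual(region_feats)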
- Value Retrieval with Arbitrary Queries for Form-like Documents [50.5532781148902]
We propose value retrieval with arbitrary queries for form-like documents.
Our method predicts target value for an arbitrary query based on the understanding of layout and semantics of a form.
We propose a simple document language modeling (simpleDLM) strategy to improve document understanding on large-scale model pre-training.
arXiv Detail & Related papers (2021-12-15T01:12:02Z)
- Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity [11.157086694203201]
We present a new scientific document similarity model based on matching fine-grained aspects.
Our model is trained using co-citation contexts that describe related paper aspects as a novel form of textual supervision.
arXiv Detail & Related papers (2021-11-16T11:12:30Z)
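A minimal sketch of matching fine-grained aspects: each document is a matrix of L2-normalized aspect vectors, every aspect is matched to its best counterpart in the other document, and the two directions are averaged. This symmetric max-matching score is a common multi-vector choice assumed here for illustration, not necessarily the paper's exact objective.

    import numpy as np

    def aspect_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """a and b hold one L2-normalized aspect vector per row."""
        sims = a @ b.T   # pairwise cosine similarities between aspects
        return 0.5 * float(sims.max(axis=1).mean() + sims.max(axis=0).mean())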
- SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)
- ERNIE-DOC: The Retrospective Long-Document Modeling Transformer [24.426571160930635]
We propose ERNIE-DOC, a document-level language pretraining model based on Recurrence Transformers.
Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, give ERNIE-DOC a much longer effective context length.
Various experiments on both English and Chinese document-level tasks are conducted.
arXiv Detail & Related papers (2020-12-31T16:12:48Z)
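Since ERNIE-DOC builds on Recurrence Transformers, the toy block below shows the basic recurrence ingredient: detached hidden states cached from the previous segment are attended over together with the current segment, which stretches the effective context across segment boundaries. It does not model the retrospective feed mechanism, and all names are illustrative.

    import torch
    import torch.nn as nn

    class RecurrentSegment(nn.Module):
        """One attention block with segment-level recurrence: queries come
        from the current segment, keys/values span memory + segment."""

        def __init__(self, hidden=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
            self.norm = nn.LayerNorm(hidden)

        def forward(self, segment, memory=None):
            context = (segment if memory is None
                       else torch.cat([memory, segment], dim=1))
            out, _ = self.attn(segment, context, context)
            h = self.norm(segment + out)
            # Detached states of this segment become the next segment's memory.
            return h, h.detach()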
This list is automatically generated from the titles and abstracts of the papers in this site.