Unified Pretraining Framework for Document Understanding
- URL: http://arxiv.org/abs/2204.10939v1
- Date: Fri, 22 Apr 2022 21:47:04 GMT
- Title: Unified Pretraining Framework for Document Understanding
- Authors: Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Nikolaos
Barmpalios, Rajiv Jain, Ani Nenkova, Tong Sun
- Abstract summary: We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
- Score: 52.224359498792836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document intelligence automates the extraction of information from documents
and supports many business applications. Recent self-supervised learning
methods on large-scale unlabeled document datasets have opened up promising
directions towards reducing annotation efforts by training models with
self-supervised objectives. However, most of the existing document pretraining
methods are still language-dominated. We present UDoc, a new unified
pretraining framework for document understanding. UDoc is designed to support
most document understanding tasks, extending the Transformer to take multimodal
embeddings as input. Each input element is composed of words and visual
features from a semantic region of the input document image. An important
feature of UDoc is that it learns a generic representation by making use of
three self-supervised losses, encouraging the representation to model
sentences, learn similarities, and align modalities. Extensive empirical
analysis demonstrates that the pretraining procedure learns better joint
representations and leads to improvements in downstream tasks.
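To make the three-part objective concrete, below is a minimal PyTorch sketch of how losses of this kind (masked sentence modeling, similarity learning, and cross-modal alignment) might be combined. The function name, loss forms, tensor shapes, and equal weighting are illustrative assumptions, not UDoc's actual implementation.
```python
import torch
import torch.nn.functional as F

def three_part_pretraining_loss(text_logits, masked_targets,
                                region_emb, region_emb_aug,
                                text_emb, visual_emb, tau=0.07):
    # 1) Sentence/language modeling: cross-entropy on masked tokens.
    #    text_logits: (B, L, V); masked_targets: (B, L), -100 where unmasked.
    lm_loss = F.cross_entropy(text_logits.transpose(1, 2), masked_targets,
                              ignore_index=-100)

    # 2) Similarity learning: InfoNCE between two views of each region.
    z1 = F.normalize(region_emb, dim=-1)       # (N, D)
    z2 = F.normalize(region_emb_aug, dim=-1)   # (N, D)
    sim_logits = z1 @ z2.t() / tau
    idx = torch.arange(z1.size(0), device=z1.device)
    sim_loss = F.cross_entropy(sim_logits, idx)

    # 3) Modality alignment: symmetric contrastive loss matching each
    #    region's textual embedding with its visual embedding.
    t = F.normalize(text_emb, dim=-1)          # (N, D)
    v = F.normalize(visual_emb, dim=-1)        # (N, D)
    align_logits = t @ v.t() / tau
    align_loss = (F.cross_entropy(align_logits, idx) +
                  F.cross_entropy(align_logits.t(), idx)) / 2

    # Equal weighting is an assumption; real setups typically tune these.
    return lm_loss + sim_loss + align_loss
```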
Related papers
- DoPTA: Improving Document Layout Analysis using Patch-Text Alignment [3.3181276611945267]
We present a novel image-text alignment technique designed to leverage the textual information in document images to improve performance on visual tasks.
Our document encoder, DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks without requiring OCR during inference.
DoPTA also sets new state-of-the-art results on D4LA and FUNSD, two challenging document visual analysis benchmarks.
arXiv Detail & Related papers (2024-12-17T13:26:31Z)
- DocFusion: A Unified Framework for Document Parsing Tasks [22.916911092946897]
DocFusion is a lightweight generative model with only 0.28B parameters.
It unifies task representations and achieves collaborative training through an improved objective function.
arXiv Detail & Related papers (2024-12-17T03:20:00Z)
- Unified Multimodal Interleaved Document Representation for Retrieval [57.65409208879344]
We propose a method that holistically embeds documents interleaved with multiple modalities.
We merge the representations of segmented passages into a single document representation.
We show that our approach substantially outperforms relevant baselines.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
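As an illustration of the passage-merging step described in this entry, here is a minimal sketch that collapses per-passage embeddings into one document vector via masked mean pooling; the pooling choice and tensor shapes are assumptions, not the paper's confirmed design.
```python
import torch

def merge_passage_representations(passage_embs: torch.Tensor,
                                  mask: torch.Tensor) -> torch.Tensor:
    """Collapse per-passage embeddings (batch, n_passages, dim) into one
    document embedding (batch, dim) by masked mean pooling. Mean pooling
    is an illustrative choice; the paper may merge differently."""
    mask = mask.unsqueeze(-1).float()          # (batch, n_passages, 1)
    summed = (passage_embs * mask).sum(dim=1)  # ignore padded passages
    counts = mask.sum(dim=1).clamp(min=1.0)    # avoid division by zero
    return summed / counts
```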
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
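The neighbor-aware contrastive idea in this entry can be sketched as follows: each query is scored against its positive document plus that document's corpus neighbors, so nearby documents serve as hard negatives. The loss form and shapes are illustrative assumptions, not the paper's exact objective.
```python
import torch
import torch.nn.functional as F

def neighbor_contrastive_loss(query_emb, doc_emb, neighbor_emb, tau=0.05):
    """Contrastive loss where neighbor documents act as hard negatives.
    Assumed shapes: query_emb/doc_emb (B, D); neighbor_emb (B, K, D)."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    n = F.normalize(neighbor_emb, dim=-1)

    pos = (q * d).sum(-1, keepdim=True) / tau        # (B, 1) positive scores
    neg = torch.einsum('bd,bkd->bk', q, n) / tau     # (B, K) neighbor scores
    logits = torch.cat([pos, neg], dim=1)            # positive sits at index 0
    targets = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, targets)
```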
- LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding [0.0]
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents.
Existing methods have been developed to enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure.
Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
arXiv Detail & Related papers (2024-03-21T09:25:24Z)
- In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z)
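To illustrate the idea, the sketch below greedily chains each document to its nearest unvisited neighbor in embedding space, producing an ordering in which related documents sit next to each other in the pretraining stream. This stand-in uses exact cosine similarity; the paper's approximate nearest-neighbor algorithms are a more scalable alternative.
```python
import torch
import torch.nn.functional as F

def greedy_neighbor_order(doc_embs: torch.Tensor) -> list[int]:
    """Order documents so embedding-space neighbors become stream neighbors.
    Greedy exact-similarity chaining, an illustrative stand-in only."""
    z = F.normalize(doc_embs, dim=-1)
    sims = z @ z.t()
    sims.fill_diagonal_(float('-inf'))      # never pick a document itself

    order, visited = [0], {0}
    current = 0
    for _ in range(doc_embs.size(0) - 1):
        # mask already-used documents, then hop to the closest remaining one
        row = sims[current].clone()
        row[list(visited)] = float('-inf')
        current = int(row.argmax())
        order.append(current)
        visited.add(current)
    return order
```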
- Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding [72.95838931445498]
Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks.
However, the way they model and exploit the interactions between vision and language in documents has limited their generalization ability and accuracy.
In this work, we investigate the problem of vision-language joint representation learning for VrDU mainly from the perspective of supervisory signals.
arXiv Detail & Related papers (2022-06-27T09:58:34Z)
- SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)
- Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning [5.109216329453963]
We introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks.
We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion.
arXiv Detail & Related papers (2020-09-30T05:39:04Z)
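As a rough illustration of a Document Shuffle Prediction task, the sketch below permutes a document's page order with some probability and emits a binary label for whether the order was disturbed; this binary formulation is an assumed instantiation, not necessarily the paper's exact setup.
```python
import random
import torch

def make_shuffle_prediction_batch(docs, shuffle_prob=0.5):
    """Prepare a batch for an assumed shuffle-prediction objective:
    label 1 if the page order of a document was permuted, else 0."""
    inputs, labels = [], []
    for pages in docs:                       # pages: list of page features
        order = list(range(len(pages)))
        if random.random() < shuffle_prob and len(pages) > 1:
            while order == sorted(order):    # ensure a real permutation
                random.shuffle(order)
            labels.append(1)                 # shuffled
        else:
            labels.append(0)                 # original order
        inputs.append([pages[i] for i in order])
    return inputs, torch.tensor(labels)
```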