TransferDoc: A Self-Supervised Transferable Document Representation
Learning Model Unifying Vision and Language
- URL: http://arxiv.org/abs/2309.05756v1
- Date: Mon, 11 Sep 2023 18:35:14 GMT
- Title: TransferDoc: A Self-Supervised Transferable Document Representation
Learning Model Unifying Vision and Language
- Authors: Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickael Coustaty,
Marçal Rusiñol, Oriol Ramos Terrades, Josep Lladós
- Abstract summary: TransferDoc is a cross-modal transformer-based architecture pre-trained in a self-supervised fashion.
It learns richer semantic concepts by unifying language and visual representations.
It outperforms other state-of-the-art approaches in a ``closer-to-real'' industrial evaluation scenario.
- Score: 4.629032441868536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field of visual document understanding has witnessed a rapid growth in
emerging challenges and powerful multi-modal strategies. However, these strategies rely on
an extensive amount of document data to learn their pretext objectives in a
``pre-train-then-fine-tune'' paradigm and thus suffer a significant
performance drop in real-world online industrial settings. One major reason is
their over-reliance on OCR engines to extract local positional information within
a document page, which hinders the model's generalizability, flexibility, and
robustness, since global information within the document image goes
uncaptured. We introduce TransferDoc, a cross-modal
transformer-based architecture pre-trained in a self-supervised fashion using
three novel pretext objectives. TransferDoc learns richer semantic concepts by
unifying language and visual representations, which enables the production of
more transferable models. In addition, two novel downstream tasks are
introduced for a ``closer-to-real'' industrial evaluation scenario, in which
TransferDoc outperforms other state-of-the-art approaches (a schematic sketch of the fusion idea follows below).
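The abstract describes, at a high level, a transformer that projects visual and textual tokens into one space and fuses them with joint self-attention. As orientation only, here is a minimal PyTorch sketch of such a cross-modal encoder; the module names, feature dimensions, and fusion-by-concatenation choice are illustrative assumptions, not TransferDoc's actual architecture or its three pretext objectives.

```python
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_layers=4):
        super().__init__()
        self.visual_proj = nn.Linear(2048, d_model)  # assumed image-patch feature size
        self.text_proj = nn.Linear(768, d_model)     # assumed word-embedding size
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patch_feats, word_embs):
        # Project both modalities into one space, then let self-attention fuse them.
        tokens = torch.cat([self.visual_proj(patch_feats),
                            self.text_proj(word_embs)], dim=1)
        return self.fusion(tokens)   # joint vision-language representation

enc = CrossModalEncoder()
joint = enc(torch.randn(2, 49, 2048), torch.randn(2, 32, 768))
print(joint.shape)   # torch.Size([2, 81, 768])
```

Concatenating projected patch and word tokens before joint self-attention is one common way to "unify" the two modalities; self-supervised pretext objectives would then be defined on top of these joint representations.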
Related papers
- LayeredDoc: Domain Adaptive Document Restoration with a Layer Separation Approach [9.643486775455841]
This paper introduces a text-graphic layer separation approach that enhances domain adaptability in document image restoration systems.
We propose LayeredDoc, which utilizes two layers of information: the first targets coarse-grained graphic components, while the second refines machine-printed textual content.
We evaluate our approach both qualitatively and quantitatively using a new real-world dataset, LayeredDocDB, developed for this study (a toy separation sketch follows this entry).
arXiv Detail & Related papers (2024-06-12T19:41:01Z)
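For intuition only, a crude way to mimic the two-layer decomposition this summary mentions is to split a page into a low-frequency (graphics) layer and a high-frequency (text) residual. LayeredDoc learns this separation; the snippet below is just a fixed blur-based heuristic with a made-up input image.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

page = np.random.rand(256, 256)                  # stand-in grayscale document image
graphics_layer = gaussian_filter(page, sigma=8)  # coarse graphic structure survives the blur
text_layer = page - graphics_layer               # fine machine-printed detail is the residual
```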
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations show that the model adapts readily to various tasks while requiring a significantly smaller parameter footprint (see the sketch after this entry).
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
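As a hedged illustration of "sequential autoregression" applied to vision: flatten an image into patch embeddings and train under a causal mask so each position predicts the next patch. The dimensions and the MSE objective below are placeholders, not the paper's actual setup.

```python
import torch
import torch.nn as nn

d, n = 256, 64                        # assumed embedding size / patches per image
layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(d, d)                # predicts the embedding of the next patch

patches = torch.randn(4, n, d)        # a batch of patch-embedded images
causal = nn.Transformer.generate_square_subsequent_mask(n)
hidden = model(patches, mask=causal)  # each position sees only earlier patches
loss = nn.functional.mse_loss(head(hidden)[:, :-1], patches[:, 1:])
loss.backward()
```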
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning framework centered on effective task personalization (see the sketch after this entry).
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
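The entry mentions a task-aware meta-learning framework; below is a generic MAML-style episode loop as a stand-in for that family of methods. The toy linear model, random episodes, and inner learning rate are all illustrative assumptions; the paper's task-personalization mechanism is more involved.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                       # toy entity classifier
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def episode():
    # One few-shot "task": a tiny support set and a query set (random stand-ins).
    return (torch.randn(8, 16), torch.randint(0, 4, (8,)),
            torch.randn(8, 16), torch.randint(0, 4, (8,)))

for _ in range(10):
    xs, ys, xq, yq = episode()
    # Inner loop: adapt a differentiable copy of the weights on the support set.
    fast = {n: p.clone() for n, p in model.named_parameters()}
    loss_s = nn.functional.cross_entropy(xs @ fast["weight"].T + fast["bias"], ys)
    grads = torch.autograd.grad(loss_s, list(fast.values()), create_graph=True)
    fast = {n: p - 0.1 * g for (n, p), g in zip(fast.items(), grads)}
    # Outer loop: the query loss of the adapted weights updates the meta-parameters.
    loss_q = nn.functional.cross_entropy(xq @ fast["weight"].T + fast["bias"], yq)
    meta_opt.zero_grad()
    loss_q.backward()
    meta_opt.step()
```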
- Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding [72.95838931445498]
Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks.
However, the way they model and exploit vision-language interactions in documents has limited their generalization ability and accuracy.
In this work, we investigate the problem of vision-language joint representation learning for VrDU mainly from the perspective of supervisory signals.
arXiv Detail & Related papers (2022-06-27T09:58:34Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering (a simplified skip-connection sketch follows this entry).
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
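One plausible reading of "cross-modal skip-connections" is that one modality bypasses part of the fusion stack while the other cross-attends to it. The sketch below encodes that reading with invented names and sizes; it should not be taken as mPLUG's actual design.

```python
import torch
import torch.nn as nn

class SkipFusionBlock(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, text, vision):
        # Text queries attend to vision; vision passes through untouched (the "skip").
        attended, _ = self.cross(text, vision, vision)
        text = text + attended
        return text + self.ff(text), vision

block = SkipFusionBlock()
t, v = torch.randn(2, 20, 512), torch.randn(2, 50, 512)
for _ in range(3):        # vision bypasses all three fusion layers
    t, v = block(t, v)
```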
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses (a generic multi-loss sketch follows this entry).
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
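To make the "three self-supervised losses" idea concrete in a generic way, the snippet below sums three placeholder pretext losses computed from one shared encoder. The heads, loss terms, and weights are common stand-ins, not UDoc's actual objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(32, 64)                 # stand-in for the multimodal encoder
recon_head = nn.Linear(64, 32)              # e.g. masked-feature reconstruction
proj_head = nn.Linear(64, 64)               # e.g. contrastive projection
align_head = nn.Linear(64, 2)               # e.g. text-image alignment classifier
opt = torch.optim.AdamW(
    [p for m in (encoder, recon_head, proj_head, align_head)
     for p in m.parameters()], lr=1e-4)

x = torch.randn(8, 32)
h1, h2 = encoder(x), encoder(x + 0.05 * torch.randn_like(x))  # two noisy "views"
loss = (F.mse_loss(recon_head(h1), x)                                       # loss 1
        + (1 - F.cosine_similarity(proj_head(h1), proj_head(h2)).mean())    # loss 2
        + F.cross_entropy(align_head(h1), torch.randint(0, 2, (8,))))       # loss 3
opt.zero_grad()
loss.backward()
opt.step()
```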
- SMDT: Selective Memory-Augmented Neural Document Translation [53.4627288890316]
We propose a Selective Memory-augmented Neural Document Translation model to deal with documents whose context spans a large hypothesis space.
We retrieve similar bilingual sentence pairs from the training corpus to augment the global context (illustrated after this entry).
We extend the two-stream attention model with a selective mechanism to capture local context and diverse global contexts.
arXiv Detail & Related papers (2022-01-05T14:23:30Z)
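The retrieval step in the summary can be illustrated with a toy nearest-neighbor lookup over sentence embeddings: fetch the most similar training pair and prepend it as global context. The corpus, the random embeddings, and the `|||` separator are invented for the example.

```python
import torch

corpus_src = ["the cat sat", "stocks fell sharply", "he signed the treaty"]
corpus_tgt = ["le chat s'assit", "les actions ont chuté", "il a signé le traité"]
emb = {s: torch.randn(64) for s in corpus_src}   # stand-in sentence embeddings

def retrieve(query_vec):
    # Nearest neighbor by cosine similarity over the (toy) training corpus.
    sims = torch.stack([torch.cosine_similarity(query_vec, emb[s], dim=0)
                        for s in corpus_src])
    i = int(sims.argmax())
    return corpus_src[i], corpus_tgt[i]

query = torch.randn(64)                          # embedding of the input sentence
src_mem, tgt_mem = retrieve(query)
augmented = f"{src_mem} ||| {tgt_mem} ||| <current source sentence>"
```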
- SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works (a component-fusion sketch follows this entry).
arXiv Detail & Related papers (2021-06-07T04:19:49Z)
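A minimal sketch of combining the positional, textual, and visual cues of each document component into a single input token, in the spirit of this summary; the projection dimensions and additive fusion are assumptions, not SelfDoc's exact design.

```python
import torch
import torch.nn as nn

pos_proj = nn.Linear(4, 128)    # bounding box (x0, y0, x1, y1) of the component
txt_proj = nn.Linear(300, 128)  # text embedding of the component
vis_proj = nn.Linear(512, 128)  # visual feature of the cropped component

bbox = torch.rand(10, 4)        # ten semantically meaningful components on a page
txt = torch.randn(10, 300)
vis = torch.randn(10, 512)
tokens = pos_proj(bbox) + txt_proj(txt) + vis_proj(vis)  # one token per component
```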
- ERNIE-DOC: The Retrospective Long-Document Modeling Transformer [24.426571160930635]
We propose ERNIE-DOC, a document-level language pretraining model based on Recurrence Transformers.
Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, give ERNIE-DOC a much longer effective context length (a simplified recurrence sketch follows this entry).
Various experiments on both English and Chinese document-level tasks are conducted.
arXiv Detail & Related papers (2020-12-31T16:12:48Z)
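ERNIE-DOC builds on recurrence transformers; the snippet below shows the basic segment-recurrence pattern (cache previous hidden states, attend over memory plus the current segment). It omits the paper's retrospective feed and enhanced recurrence mechanisms, so treat it as background intuition only.

```python
import torch
import torch.nn as nn

d = 128
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
segments = torch.randn(3, 1, 50, d)       # a long document split into 3 segments
memory = torch.zeros(1, 0, d)             # hidden-state cache, grows per segment

for seg in segments:
    context = torch.cat([memory, seg], dim=1)  # attend over memory + current segment
    out, _ = attn(seg, context, context)
    memory = out.detach()                      # carry states forward without gradients
```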