Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning
- URL: http://arxiv.org/abs/2009.14457v2
- Date: Wed, 5 Jan 2022 11:37:22 GMT
- Title: Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning
- Authors: Subhojeet Pramanik, Shashank Mujumdar, Hima Patel
- Abstract summary: We introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks.
We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion.
- Score: 5.109216329453963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent approaches in literature have exploited the multi-modal information in
documents (text, layout, image) to serve specific downstream document tasks.
However, they are limited by (i) an inability to learn cross-modal
representations across the text, layout, and image dimensions of documents and
(ii) an inability to process multi-page documents. Pre-training techniques have
been shown in the Natural Language Processing (NLP) domain to learn generic
textual representations from large unlabelled datasets, applicable to various
downstream NLP tasks. In this paper, we propose a multi-task learning-based
framework that utilizes a combination of self-supervised and supervised
pre-training tasks to learn a generic document representation applicable to
various downstream document tasks. Specifically, we introduce Document Topic
Modelling and Document Shuffle Prediction as novel pre-training tasks to learn
rich image representations along with the text and layout representations for
documents. We utilize the Longformer network architecture as the backbone to
encode the multi-modal information from multi-page documents in an end-to-end
fashion. We showcase the applicability of our pre-training framework on a
variety of different real-world document tasks such as document classification,
document information extraction, and document retrieval. We evaluate our
framework on different standard document datasets and conduct exhaustive
experiments to compare performance against various ablations of our framework
and state-of-the-art baselines.
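A minimal sketch (not the authors' released code) of how such a multi-task pre-training setup could be wired together, assuming a Longformer backbone from Hugging Face Transformers and illustrative heads and loss weights for the Document Topic Modelling and Document Shuffle Prediction tasks; the fusion of text, layout, and image features into the input sequence is elided for brevity:

```python
# Hedged sketch: names such as `num_topics`, `topic_head`, `shuffle_head`, and the
# loss weights are illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn
from transformers import LongformerModel


class MultiModalMultiTaskModel(nn.Module):
    def __init__(self, num_topics: int = 50, hidden_size: int = 768):
        super().__init__()
        # Text, layout, and image features are assumed to be fused into the input
        # sequence upstream; here the backbone consumes plain token ids for brevity.
        self.backbone = LongformerModel.from_pretrained("allenai/longformer-base-4096")
        # Supervised Document Topic Modelling head (multi-label over assumed topics).
        self.topic_head = nn.Linear(hidden_size, num_topics)
        # Self-supervised Document Shuffle Prediction head (pages shuffled or not).
        self.shuffle_head = nn.Linear(hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        doc_repr = out.last_hidden_state[:, 0]  # first-token document representation
        return self.topic_head(doc_repr), self.shuffle_head(doc_repr)


def multitask_loss(topic_logits, shuffle_logits, topic_labels, shuffle_labels,
                   w_topic: float = 1.0, w_shuffle: float = 1.0):
    # topic_labels: multi-hot float tensor; shuffle_labels: long tensor of 0/1.
    topic_loss = nn.functional.binary_cross_entropy_with_logits(topic_logits, topic_labels)
    shuffle_loss = nn.functional.cross_entropy(shuffle_logits, shuffle_labels)
    return w_topic * topic_loss + w_shuffle * shuffle_loss
```

Treating topic modelling as a multi-label objective and shuffle prediction as binary classification mirrors the supervised/self-supervised split described in the abstract; the actual heads, label spaces, and loss weighting used in the paper may differ.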
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z)
- LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding [0.0]
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents.
Existing methods have been developed to enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure.
Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
arXiv Detail & Related papers (2024-03-21T09:25:24Z)
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- Unifying Vision, Text, and Layout for Universal Document Processing [105.36490575974028]
We propose a Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation.
Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites.
arXiv Detail & Related papers (2022-12-05T22:14:49Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)