SelfDoc: Self-Supervised Document Representation Learning
- URL: http://arxiv.org/abs/2106.03331v1
- Date: Mon, 7 Jun 2021 04:19:49 GMT
- Title: SelfDoc: Self-Supervised Document Representation Learning
- Authors: Peizhao Li, Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao,
Rajiv Jain, Varun Manjunatha, Hongfu Liu
- Abstract summary: SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
- Score: 46.22910270334824
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose SelfDoc, a task-agnostic pre-training framework for document image
understanding. Because documents are multimodal and are intended for sequential
reading, our framework exploits the positional, textual, and visual information
of every semantically meaningful component in a document, and it models the
contextualization between each block of content. Unlike existing document
pre-training models, our model is coarse-grained instead of treating individual
words as input, thereby avoiding overly fine-grained representations with excessive
contextualization. Beyond that, we introduce cross-modal learning in the model
pre-training phase to fully leverage multimodal information from unlabeled
documents. For downstream usage, we propose a novel modality-adaptive attention
mechanism for multimodal feature fusion by adaptively emphasizing language and
vision signals. Our framework benefits from self-supervised pre-training on
documents, without requiring annotations, via a feature masking training strategy.
It achieves superior performance on multiple downstream tasks with
significantly fewer document images used in the pre-training stage compared to
previous works.
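The abstract names a modality-adaptive attention mechanism that fuses language and vision features by adaptively emphasizing one modality over the other, but gives no equations here. The following is a minimal NumPy sketch of one plausible gated-fusion form; the function name, weight shapes, and softmax gating are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_adaptive_fusion(lang_feat, vis_feat, w_gate):
    """Fuse per-block language and vision features by computing an
    adaptive weight for each modality (illustrative sketch only).

    lang_feat, vis_feat: (num_blocks, dim) features for each document block.
    w_gate: (2*dim, 2) hypothetical learned gating weights.
    """
    joint = np.concatenate([lang_feat, vis_feat], axis=-1)  # (num_blocks, 2*dim)
    gates = softmax(joint @ w_gate, axis=-1)                # per-block modality weights
    # Weighted sum of the two modality streams, block by block.
    fused = gates[:, :1] * lang_feat + gates[:, 1:] * vis_feat
    return fused

rng = np.random.default_rng(0)
lang = rng.normal(size=(4, 8))   # 4 document blocks, 8-dim language features
vis = rng.normal(size=(4, 8))    # matching vision features
w = rng.normal(size=(16, 2))
out = modality_adaptive_fusion(lang, vis, w)
print(out.shape)  # (4, 8)
```

In a trained model the gating weights would be learned end-to-end, so blocks whose text is more informative (e.g. paragraphs) would lean on the language signal while figure-like blocks would lean on vision.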
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens.
DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%.
Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
arXiv Detail & Related papers (2024-09-05T11:09:00Z)
- LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding [0.0]
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents.
Existing methods have been developed to enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure.
Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
arXiv Detail & Related papers (2024-03-21T09:25:24Z)
- Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding [22.00873805952277]
WebLM is a multimodal pre-training network designed to address the limitations of solely modeling text and structure modalities of HTML in webpages.
We propose several pre-training tasks to model the interaction among text, structure, and image modalities effectively.
Empirical results demonstrate that the pre-trained WebLM significantly surpasses previous state-of-the-art pre-trained models across several webpage understanding tasks.
arXiv Detail & Related papers (2024-02-28T11:50:36Z)
- DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents [18.080447065002392]
We propose DocumentCLIP to enforce vision-language pretraining models to comprehend the interaction between images and longer text within documents.
Our model is beneficial for real-world multimodal document understanding of content such as news articles, magazines, and product descriptions, which contain linguistically and visually richer content.
arXiv Detail & Related papers (2023-06-09T23:51:11Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
arXiv Detail & Related papers (2022-04-18T16:19:52Z)
- DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents [76.19748112897177]
We present a novel task and approach for document-to-slide generation.
We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner.
Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides.
arXiv Detail & Related papers (2021-01-28T03:21:17Z)
- Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning [5.109216329453963]
We introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks.
We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion.
arXiv Detail & Related papers (2020-09-30T05:39:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.