Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich
Document Understanding
- URL: http://arxiv.org/abs/2206.13155v1
- Date: Mon, 27 Jun 2022 09:58:34 GMT
- Title: Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich
Document Understanding
- Authors: Chuwei Luo, Guozhi Tang, Qi Zheng, Cong Yao, Lianwen Jin, Chenliang
Li, Yang Xue, Luo Si
- Abstract summary: Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks.
However, the way they model and exploit the interactions between vision and language in documents has limited their generalization ability and accuracy.
In this work, we investigate the problem of vision-language joint representation learning for VrDU mainly from the perspective of supervisory signals.
- Score: 72.95838931445498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal document pre-trained models have proven to be very effective in a
variety of visually-rich document understanding (VrDU) tasks. Though existing
document pre-trained models have achieved excellent performance on standard
benchmarks for VrDU, the way they model and exploit the interactions between vision and language in documents has limited their generalization ability and accuracy. In this work, we investigate the problem of
vision-language joint representation learning for VrDU mainly from the
perspective of supervisory signals. Specifically, a pre-training paradigm
called Bi-VLDoc is proposed, in which a bidirectional vision-language
supervision strategy and a vision-language hybrid-attention mechanism are
devised to fully explore and utilize the interactions between these two
modalities, to learn stronger cross-modal document representations with richer
semantics. Benefiting from the learned informative cross-modal document
representations, Bi-VLDoc significantly advances the state-of-the-art
performance on three widely-used document understanding benchmarks, including
Form Understanding (from 85.14% to 93.44%), Receipt Information Extraction
(from 96.01% to 97.84%), and Document Classification (from 96.08% to 97.12%).
On Document Visual QA, Bi-VLDoc achieves state-of-the-art performance compared to previous single-model methods.
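The abstract names a vision-language hybrid-attention mechanism but gives no architectural detail here. Below is a minimal, hypothetical PyTorch sketch of generic bidirectional cross-attention between text and vision token streams, intended only to illustrate the kind of cross-modal interaction such a mechanism models; the class and parameter names (HybridCrossAttention, text_to_vision, vision_to_text, dim=768) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a generic vision-language hybrid-attention block.
# The exact Bi-VLDoc architecture is not specified in the abstract; this module
# simply shows one common way to let text tokens attend to visual tokens and
# vice versa, then fuse each stream with its cross-modal context.
import torch
import torch.nn as nn


class HybridCrossAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        # Text queries attend over vision keys/values, and vice versa.
        self.text_to_vision = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vision_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_text = nn.LayerNorm(dim)
        self.norm_vision = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, vision_feats: torch.Tensor):
        # text_feats:   (batch, num_text_tokens, dim)
        # vision_feats: (batch, num_vision_tokens, dim)
        t2v, _ = self.text_to_vision(text_feats, vision_feats, vision_feats)
        v2t, _ = self.vision_to_text(vision_feats, text_feats, text_feats)
        # Residual fusion of each modality with the attended other modality.
        text_out = self.norm_text(text_feats + t2v)
        vision_out = self.norm_vision(vision_feats + v2t)
        return text_out, vision_out


if __name__ == "__main__":
    block = HybridCrossAttention()
    text = torch.randn(2, 128, 768)    # stand-in OCR token embeddings
    vision = torch.randn(2, 196, 768)  # stand-in page-image patch embeddings
    t, v = block(text, vision)
    print(t.shape, v.shape)  # torch.Size([2, 128, 768]) torch.Size([2, 196, 768])
```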
Related papers
- DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding [91.17151775296234]
This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding.
Unlike existing works, which either struggle with high-resolution documents or give up the large language model and thereby constrain their vision or language ability, our DocPedia directly processes visual input in the frequency domain rather than in pixel space (a generic frequency-domain sketch is given after the related-papers list below).
arXiv Detail & Related papers (2023-11-20T14:42:25Z)
- GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification [8.880856137902947]
We introduce GlobalDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised manner.
GlobalDoc improves the learning of richer semantic concepts by unifying language and visual representations.
For proper evaluation, we also propose two novel document-level downstream VDU tasks: Few-Shot Document Image Classification (DIC) and Content-based Document Image Retrieval (DIR).
arXiv Detail & Related papers (2023-09-11T18:35:14Z)
- UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding [93.92313947913831]
We introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities.
To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.
arXiv Detail & Related papers (2023-08-19T17:32:34Z)
- VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification [3.7798600249187295]
Multimodal learning from document data has recently achieved great success, as it allows semantically meaningful features to be pre-trained as a prior for learnable downstream tasks.
In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues.
The proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities.
arXiv Detail & Related papers (2022-05-24T12:28:12Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)
- Integrating Semantics and Neighborhood Information with Graph-Driven Generative Models for Document Retrieval [51.823187647843945]
In this paper, we encode the neighborhood information with a graph-induced Gaussian distribution, and propose to integrate the two types of information with a graph-driven generative model.
Under an approximation, we prove that the training objective can be decomposed into terms involving only singleton or pairwise documents, enabling the model to be trained as efficiently as uncorrelated ones.
arXiv Detail & Related papers (2021-05-27T11:29:03Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observation: pre-trained models exhibit a propensity for attending to text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
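As referenced in the DocPedia entry above, "processing visual input in the frequency domain rather than in pixel space" can be illustrated with a simple 2-D DCT. The sketch below is a generic, hypothetical example using SciPy's dctn/idctn, not DocPedia's actual pipeline; the function names and the choice to keep a 64x64 low-frequency block are assumptions for illustration.

```python
# Hypothetical illustration of frequency-domain processing of a document image:
# take the 2-D DCT of a page and keep only a low-frequency coefficient block,
# which is far more compact than the original pixel grid.
import numpy as np
from scipy.fft import dctn, idctn


def to_frequency_domain(image: np.ndarray, keep: int = 64) -> np.ndarray:
    """Return a (keep x keep) block of low-frequency DCT coefficients."""
    coeffs = dctn(image, norm="ortho")  # full 2-D DCT of the grayscale page
    return coeffs[:keep, :keep]         # low frequencies carry most page structure


def back_to_pixels(coeffs: np.ndarray, shape: tuple) -> np.ndarray:
    """Approximate reconstruction from the truncated coefficient block."""
    full = np.zeros(shape)
    full[:coeffs.shape[0], :coeffs.shape[1]] = coeffs
    return idctn(full, norm="ortho")


if __name__ == "__main__":
    page = np.random.rand(1024, 768)  # stand-in for a high-resolution page image
    freq = to_frequency_domain(page)
    print(freq.shape)                 # (64, 64): far fewer values than 1024x768
    recon = back_to_pixels(freq, page.shape)
    print(recon.shape)                # (1024, 768)
```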