Enhancing Visual Document Understanding with Contrastive Learning in
Large Visual-Language Models
- URL: http://arxiv.org/abs/2402.19014v1
- Date: Thu, 29 Feb 2024 10:17:27 GMT
- Title: Enhancing Visual Document Understanding with Contrastive Learning in
Large Visual-Language Models
- Authors: Xin Li, Yunfei Wu, Xinghua Jiang, Zhihao Guo, Mingming Gong, Haoyu
Cao, Yinsong Liu, Deqiang Jiang, Xing Sun
- Abstract summary: We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo)
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs)
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
- Score: 56.76307866160105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, the advent of Large Visual-Language Models (LVLMs) has received
increasing attention across various domains, particularly in the field of
visual document understanding (VDU). Unlike conventional
vision-language tasks, VDU is specifically concerned with text-rich scenarios
containing abundant document elements. Nevertheless, the importance of
fine-grained features remains largely unexplored within the community of LVLMs,
leading to suboptimal performance in text-rich scenarios, a limitation we term
the fine-grained feature collapse issue. With the aim of
filling this gap, we propose a contrastive learning framework, termed Document
Object COntrastive learning (DoCo), specifically tailored for the downstream
tasks of VDU. DoCo leverages an auxiliary multimodal encoder to obtain the
features of document objects and align them to the visual features generated by
the LVLM's vision encoder, which enhances visual representation in text-rich
scenarios. In this way, contrastive learning between the holistic visual
representations and the multimodal fine-grained features of document
objects assists the vision encoder in acquiring more effective visual cues,
thereby enhancing the comprehension of text-rich documents in LVLMs. We also
demonstrate that the proposed DoCo serves as a plug-and-play pre-training
method, which can be employed in the pre-training of various LVLMs without
inducing any increase in computational complexity during the inference process.
Extensive experimental results on multiple benchmarks of VDU reveal that LVLMs
equipped with our proposed DoCo can achieve superior performance and mitigate
the gap between VDU and generic vision-language tasks.
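As a concrete illustration of the objective described above, the following is a minimal PyTorch-style sketch of a DoCo-style alignment loss: holistic features pooled from the LVLM's vision encoder are contrasted against document-object features from the auxiliary multimodal encoder via a symmetric InfoNCE loss. The pooling choice, projection heads, and temperature are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of a DoCo-style contrastive alignment objective.
# Assumptions (not from the paper): each feature stream is mean-pooled to one
# vector per image, projected to a shared dimension, and aligned with a
# symmetric InfoNCE loss over the batch.
import torch
import torch.nn.functional as F


def doco_style_contrastive_loss(
    vision_feats: torch.Tensor,      # (B, N_patches, D_v) from the LVLM vision encoder
    doc_object_feats: torch.Tensor,  # (B, N_objects, D_m) from the auxiliary multimodal encoder
    proj_v: torch.nn.Module,         # projection head for vision features (assumed)
    proj_m: torch.nn.Module,         # projection head for document-object features (assumed)
    temperature: float = 0.07,       # illustrative value
) -> torch.Tensor:
    # Pool each stream to a single holistic vector per image (mean pooling is an assumption).
    v = F.normalize(proj_v(vision_feats.mean(dim=1)), dim=-1)      # (B, D)
    m = F.normalize(proj_m(doc_object_feats.mean(dim=1)), dim=-1)  # (B, D)

    # Similarity matrix between all image / document-object pairs in the batch.
    logits = v @ m.t() / temperature                               # (B, B)
    targets = torch.arange(v.size(0), device=v.device)

    # Symmetric InfoNCE: matching pairs sit on the diagonal.
    loss_v2m = F.cross_entropy(logits, targets)
    loss_m2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2m + loss_m2v)
```

Because the auxiliary encoder and projection heads would only be needed during pre-training, discarding them afterwards is consistent with the paper's claim that DoCo adds no computational cost at inference time.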
Related papers
- Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding [41.43688559565315]
We present a novel OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs)
Our approach employs multi-scale visual features to handle various font sizes within document images.
We introduce a novel instruction tuning task, which facilitates the model's text-reading capability by learning to predict the relative positions of input text.
arXiv Detail & Related papers (2024-11-08T00:58:12Z) - Instruction Tuning-free Visual Token Complement for Multimodal LLMs [51.138806401996696]
Multimodal large language models (MLLMs) have promised an elegant bridge between vision and language.
We propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features.
Our VTC uses text-to-image generation as a guide to identify text-irrelevant features, and a visual selector then generates complementary visual tokens.
arXiv Detail & Related papers (2024-08-09T12:13:01Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - HRVDA: High-Resolution Visual Document Assistant [32.51417315241559]
We propose a High-Resolution Visual Document Assistant (HRVDA) to bridge the gap between MLLMs and visual document understanding.
HRVDA employs a content filtering mechanism and an instruction filtering module to filter out content-agnostic and instruction-agnostic visual tokens (see the token-filtering sketch at the end of this section).
Our model achieves state-of-the-art performance across multiple document understanding datasets.
arXiv Detail & Related papers (2024-04-10T11:10:50Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Incorporating Visual Experts to Resolve the Information Loss in
Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLM training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z) - VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training.
VLMAE employs visual generative learning, facilitating the model to acquire fine-grained and unbiased features.
arXiv Detail & Related papers (2022-08-19T14:39:18Z)
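The HRVDA entry above describes pruning content-agnostic and instruction-agnostic visual tokens. The sketch below shows one generic way such token filtering could be implemented, using an assumed learned scorer and a top-k keep rule; it illustrates the general idea only and is not HRVDA's actual modules.

```python
# Illustrative sketch of visual-token filtering as described in the HRVDA
# entry above: score each visual token and keep only the most relevant ones.
# The scoring MLP and the top-k keep rule are assumptions for illustration.
import torch
import torch.nn as nn


class TokenFilter(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.GELU(), nn.Linear(dim // 2, 1)
        )
        self.keep_ratio = keep_ratio

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D); keep the top-k tokens by predicted relevance.
        scores = self.scorer(visual_tokens).squeeze(-1)                # (B, N)
        k = max(1, int(self.keep_ratio * visual_tokens.size(1)))
        top_idx = scores.topk(k, dim=1).indices                        # (B, k)
        batch_idx = torch.arange(
            visual_tokens.size(0), device=visual_tokens.device
        ).unsqueeze(-1)
        return visual_tokens[batch_idx, top_idx]                       # (B, k, D)
```

The keep ratio here is arbitrary; the point is simply that irrelevant visual tokens can be dropped before they reach the language model, reducing the sequence length for high-resolution document images.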