Related papers: Instruction Tuning-free Visual Token Complement for Multimodal LLMs

Instruction Tuning-free Visual Token Complement for Multimodal LLMs

URL: http://arxiv.org/abs/2408.05019v1
Date: Fri, 9 Aug 2024 12:13:01 GMT
Title: Instruction Tuning-free Visual Token Complement for Multimodal LLMs
Authors: Dongsheng Wang, Jiequan Cui, Miaoge Li, Wang Lin, Bo Chen, Hanwang Zhang,
Abstract summary: multimodal large language models (MLLMs) have promised an elegant bridge between vision and language. We propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features. Our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens.
Score: 51.138806401996696
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As the open community of large language models (LLMs) matures, multimodal LLMs (MLLMs) have promised an elegant bridge between vision and language. However, current research is inherently constrained by challenges such as the need for high-quality instruction pairs and the loss of visual information in image-to-text training objectives. To this end, we propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features and thus improve response accuracy. Specifically, our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens to enrich the original visual input. Moreover, an iterative strategy is further designed to extract more visual information by iteratively using the visual selector without any additional training. Notably, the training pipeline requires no additional image-text pairs, resulting in a desired instruction tuning-free property. Both qualitative and quantitative experiments demonstrate the superiority and efficiency of our VTC.

Related papers

Visual Jigsaw Post-Training Improves MLLMs [58.29961336087896]
We introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in large language models (MLLMs)<n>Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language.<n>Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding.
arXiv Detail & Related papers (2025-09-29T17:59:57Z)
Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guidedd Visual Selection [53.558449071113245]
Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM) Recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model. We propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process finegrained details.
arXiv Detail & Related papers (2025-03-14T18:33:31Z)
Visual Adaptive Prompting for Compositional Zero-Shot Learning [0.0]
Vision-Language Models (VLMs) have demonstrated impressive multimodal capabilities in learning joint representations of visual and textual data.<n>CZSL requires models to generalize to novel combinations of visual primitives--such as attributes and objects--that were not explicitly encountered during training.<n>We propose a Visual Adaptive Prompting System (VAPS) that bridges the gap between semantic and visual features.
arXiv Detail & Related papers (2025-02-27T17:17:43Z)
Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models [127.38740043393527]
We propose ViFT, a visual instruction-free fine-tuning framework for LVLMs. We only require the text-only instructions and image caption data during training, to separately learn the task-solving and visual perception abilities. Experimental results demonstrate that ViFT can achieve state-of-the-art performance on several visual reasoning and visual instruction following benchmarks.
arXiv Detail & Related papers (2025-02-17T04:38:12Z)
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning [57.35160715164359]
Visual-Predictive Instruction Tuning (VPiT) is a simple and effective extension to visual instruction tuning. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data. We train our MetaMorph model and achieve competitive performance on both visual understanding and generation.
arXiv Detail & Related papers (2024-12-18T18:58:50Z)
Attention Prompting on Image for Large Vision-Language Models [63.794304207664176]
We propose a new prompting technique named Attention Prompting on Image. We generate an attention heatmap for the input image dependent on the text query with an auxiliary model like CLIP. Experiments on various vison-language benchmarks verify the effectiveness of our technique.
arXiv Detail & Related papers (2024-09-25T17:59:13Z)
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo) DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs) We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex [4.57590454144072]
Recently, there has been a surge in the popularity of pre trained large language models (LLMs) This paper proposes a new multi-modal training paradigm, aligning with LLM, encoding fMRI activity in visual cortex.
arXiv Detail & Related papers (2024-01-08T12:30:23Z)
VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training. VLMAE employs visual generative learning, facilitating the model to acquire fine-grained and unbiased features.
arXiv Detail & Related papers (2022-08-19T14:39:18Z)
CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning. In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either visual or language branch. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.