Learning to Instruct for Visual Instruction Tuning
- URL: http://arxiv.org/abs/2503.22215v2
- Date: Sat, 11 Oct 2025 07:54:17 GMT
- Title: Learning to Instruct for Visual Instruction Tuning
- Authors: Zhihan Zhou, Feng Hong, Jiaan Luo, Jiangchao Yao, Dongsheng Li, Bo Han, Ya Zhang, Yanfeng Wang,
- Abstract summary: We propose L2T, an advancement of visual instruction tuning (VIT)<n>L2T adopts a simple yet effective approach by incorporating the loss function into both the instruction and response sequences.<n>Surprisingly, L2T attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance.
- Score: 79.76580302615174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose L2T, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the current design choices for VIT often result in overfitting and shortcut learning, potentially degrading performance. This gap arises from an overemphasis on instruction-following abilities, while neglecting the proactive understanding of visual information. Inspired by this, L2T adopts a simple yet effective approach by incorporating the loss function into both the instruction and response sequences. It seamlessly expands the training data, and regularizes the MLLMs from overly relying on language priors. Based on this merit, L2T achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, L2T attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance, while simultaneously alleviating hallucination in MLLMs. Github code: https://github.com/Feng-Hong/L2T.
Related papers
- MLLMEraser: Achieving Test-Time Unlearning in Multimodal Large Language Models through Activation Steering [36.80441487363007]
MLLMEraser is an input-aware, training-free framework for test-time unlearning.<n>We construct a multimodal erasure direction by contrasting adversarially perturbed, knowledge-recall image-text pairs.<n>Experiments on LLaVA-1.5 and Qwen-2.5-VL demonstrate that MLLMEraser consistently outperforms state-of-the-art MLLM unlearning baselines.
arXiv Detail & Related papers (2025-10-05T14:20:17Z) - True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
Multimodal Large Language Models (MLLMs) have enabled Multimodal In-Context Learning (MICL)-adapting to new tasks.<n>Current MLLMs tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation.<n>We introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context.
arXiv Detail & Related papers (2025-07-21T17:08:18Z) - FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance [7.889590793589825]
We introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence.<n>We analyze the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy.<n>FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens.
arXiv Detail & Related papers (2025-01-05T03:28:45Z) - Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning [125.79428219851289]
Inst-IT is a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning.
Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm.
arXiv Detail & Related papers (2024-12-04T18:58:10Z) - Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy [37.471419716572086]
There is a significant gap in instruction-following capabilities between Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs)
We propose Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI) strategies to alleviate this gap.
arXiv Detail & Related papers (2024-11-23T05:03:32Z) - Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z) - SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning [16.873306091966693]
Visual instruction tuning enables large language models (MLLMs) to handle a wide range of vision tasks by framing them as language-based instructions.<n>We identify a dual form of catastrophic forgetting in CVIT, where MLLMs forget previously learned visual understanding and also experience a decline in instruction following abilities.<n>We introduce the Separable Mixture of Low-Rank Adaptation (SMoLoRA) framework, which employs separable routing through two distinct modules-one for visual understanding and another for instruction following.
arXiv Detail & Related papers (2024-11-21T09:00:15Z) - Skipping Computations in Multimodal LLMs [63.29737699997859]
This study investigates redundancy in Multimodal Large Language Models (MLLMs) during inference.
We propose different methods to skip computations, such as skipping entire blocks, FFN or self-attention layers.
Our findings validate that significant amount of computations can be avoided at inference time.
arXiv Detail & Related papers (2024-10-12T09:21:45Z) - Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [48.455597568212944]
We present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure.<n>In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data.
arXiv Detail & Related papers (2024-10-10T17:59:22Z) - Instruction Tuning Vs. In-Context Learning: Revisiting Large Language Models in Few-Shot Computational Social Science [0.1499944454332829]
We evaluate the classification performance of large language models (LLMs) using in-context learning (ICL) and instruction tuning (IT)
ICL offers a rapid alternative for task adaptation by learning from examples without explicit gradient updates.
Our research highlights the significant advantages of ICL in handling CSS tasks in few-shot settings.
arXiv Detail & Related papers (2024-09-23T02:43:08Z) - MoExtend: Tuning New Experts for Modality and Task Extension [61.29100693866109]
MoExtend is an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models.
MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models.
arXiv Detail & Related papers (2024-08-07T02:28:37Z) - NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks.<n>Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored.<n>We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z) - Don't Half-listen: Capturing Key-part Information in Continual Instruction Tuning [13.535110749767451]
We propose a novel continual instruction tuning method based on Key-part Information Gain (KPIG)
Our method computes the information gain on masked parts to dynamically replay data and refine the training objective.
Experiments demonstrate our method achieves superior performance on both seen and held-out tasks.
arXiv Detail & Related papers (2024-03-15T06:54:20Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - Vision-Language Instruction Tuning: A Review and Analysis [52.218690619616474]
Vision-Language Instruction Tuning (VLIT) presents more complex characteristics compared to pure text instruction tuning.
We offer a detailed categorization for existing VLIT datasets and identify the characteristics that high-quality VLIT data should possess.
By incorporating these characteristics as guiding principles into the existing VLIT data construction process, we conduct extensive experiments and verify their positive impact on the performance of tuned multi-modal LLMs.
arXiv Detail & Related papers (2023-11-14T14:02:32Z) - InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.