VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning
- URL: http://arxiv.org/abs/2603.01195v1
- Date: Sun, 01 Mar 2026 17:26:02 GMT
- Title: VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning
- Authors: Mingkang Dong, Hongyi Cai, Jie Li, Sifan Zhou, Bin Ren, Kunyu Peng, Yuqian Fu,
- Abstract summary: We propose a principled data selection framework that measures the marginal contribution of visual input during instruction tuning. By comparing predictive loss with and without visual context, VisNec identifies whether a training instance is vision-critical, redundant, or misaligned. Across 10 benchmarks, training on only 15% of the LLaVA-665K dataset selected by VisNec achieves 100.2% of full-data performance.
- Score: 33.115992843637564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The effectiveness of multimodal instruction tuning depends not only on dataset scale, but critically on whether training samples genuinely require visual reasoning. However, existing instruction datasets often contain a substantial portion of visually redundant samples (solvable from text alone), as well as multimodally misaligned supervision that can degrade learning. To address this, we propose VisNec (Visual Necessity Score), a principled data selection framework that measures the marginal contribution of visual input during instruction tuning. By comparing predictive loss with and without visual context, VisNec identifies whether a training instance is vision-critical, redundant, or misaligned. To preserve task diversity, we combine VisNec with semantic clustering and select high-necessity samples within each cluster. Across 10 downstream benchmarks, training on only 15% of the LLaVA-665K dataset selected by VisNec achieves 100.2% of full-data performance. On the smaller Vision-Flan-186K dataset, our selection not only further reduces data size but also surpasses full-data training by 15.8%. These results demonstrate that measuring and leveraging visual necessity provides an effective solution for both efficient and robust multimodal instruction tuning. Codes and selected subsets will be released upon acceptance.
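The abstract describes the score as a loss gap (response loss without the image minus loss with it), combined with semantic clustering to preserve task diversity. A minimal Python sketch of that idea follows; it is an illustration, not the authors' released code, and `answer_loss`, the precomputed `embeddings`, the cluster count, and the even per-cluster budget split are all assumptions introduced for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def visual_necessity_scores(samples, answer_loss):
    """Score each sample by how much the image reduces the response loss.

    `answer_loss(sample, use_image)` is assumed to return the model's mean
    token loss on the response, with the image either provided or withheld.
    """
    scores = []
    for s in samples:
        loss_text_only = answer_loss(s, use_image=False)   # image withheld
        loss_with_image = answer_loss(s, use_image=True)   # full multimodal input
        # Large positive gap -> vision-critical; near zero -> visually
        # redundant; negative -> image may be misaligned with the supervision.
        scores.append(loss_text_only - loss_with_image)
    return np.asarray(scores)

def select_subset(samples, scores, embeddings, n_clusters=50, budget=0.15):
    """Keep the highest-necessity samples within each semantic cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    per_cluster = max(1, int(budget * len(samples) / n_clusters))  # even split (assumption)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        top = idx[np.argsort(scores[idx])[::-1][:per_cluster]]  # highest scores first
        keep.extend(top.tolist())
    return [samples[i] for i in keep]
```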
Related papers
- ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning [18.989158560585675]
Training on large-scale datasets is computationally expensive and inefficient due to redundancy in the data. We propose ScalSelect, a training-free multimodal data selection method with linear-time complexity. ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings.
arXiv Detail & Related papers (2026-02-12T06:38:49Z) - CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization [14.304308878028358]
Multimodal large language models rely heavily on instruction tuning to align vision and language capabilities. Existing data selection methods aim to select important and diverse subsets, but they often suffer from two critical drawbacks. We introduce CoIDO, a novel dual-objective framework that jointly optimizes data importance and diversity to overcome these challenges.
arXiv Detail & Related papers (2025-10-11T09:41:21Z) - $\Delta$-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation [1.9911692005669095]
Visual Instruction Finetuning (VIF) is pivotal for post-training Vision-Language Models (VLMs). VIF also requires multimodal data to enable joint visual and textual understanding. $\Delta$-AttnMask quantifies sample quality through attention-guided masking of the model's hidden states. $\Delta$-AttnMask achieves state-of-the-art performance with just 20% of data, accelerating training by 5x while surpassing full-dataset baselines by +10.1% in overall accuracy.
arXiv Detail & Related papers (2025-08-08T13:25:30Z) - MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning [69.7347209018861]
We introduce MLLM-Selector, an automated approach that identifies valuable data for visual instruction tuning. We calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector.
arXiv Detail & Related papers (2025-03-26T12:42:37Z) - Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm [50.492124556982674]
This paper introduces a novel choice-based sample selection framework. It shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples. We validate our approach on a larger medical dataset, highlighting its practical applicability in real-world applications.
arXiv Detail & Related papers (2025-03-04T07:32:41Z) - Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness [63.484378941471114]
We propose a collaborative framework, DataTailor, which leverages three key principles--informativeness, uniqueness, and representativeness--for effective data selection. Experiments on various benchmarks demonstrate that DataTailor achieves 101.3% of the performance of full-data fine-tuning with only 15% of the data.
arXiv Detail & Related papers (2024-12-09T08:36:10Z) - Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for lifelong instruction tuning. We construct pseudo-skill clusters by grouping gradient-based sample vectors. We select the best-performing data selector for each skill cluster from a pool of selector experts. This data selector samples a subset of the most important samples from each skill cluster for training.
arXiv Detail & Related papers (2024-10-14T15:48:09Z) - Concept-skill Transferability-based Data Selection for Large Vision-Language Models [56.0725292404808]
We introduce COINCIDE, an effective and scalable data selection technique for training vision-language models.
We cluster the training data using internal activations from a small model, which identifies concept-skill compositions needed by a target LVLM.
Experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines.
arXiv Detail & Related papers (2024-06-16T16:15:20Z) - Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose a high-value data selection approach TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost.
Our approach using only about 15% data can achieve comparable average performance to the full-data fine-tuned model across eight benchmarks.
arXiv Detail & Related papers (2024-03-14T16:47:25Z)