Related papers: $Δ$-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation

$Δ$-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation

URL: http://arxiv.org/abs/2508.09199v1
Date: Fri, 08 Aug 2025 13:25:30 GMT
Title: $Δ$-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation
Authors: Jucheng Hu, Suorong Yang, Dongzhan Zhou,
Abstract summary: Visual Instruction Finetuning (VIF) is pivotal for post-training Vision-Language Models (VLMs)<n>VIF also requires multimodal data to enable joint visual and textual understanding.<n>$Delta$-AttnMask quantifies sample quality through attention-guided masking of the model's hidden states.<n>$Delta$-AttnMask achieves state-of-the-art performance with just 20% of data, accelerating training by 5x while surpassing full-dataset baselines by +10.1% in overall accuracy.
Score: 1.9911692005669095
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual Instruction Finetuning (VIF) is pivotal for post-training Vision-Language Models (VLMs). Unlike unimodal instruction finetuning in plain-text large language models, which mainly requires instruction datasets to enable model instruction-following ability, VIF also requires multimodal data to enable joint visual and textual understanding; therefore, it typically requires more data. Consequently, VIF imposes stricter data selection challenges: the method must scale efficiently to handle larger data demands while ensuring the quality of both visual and textual content, as well as their alignment. Despite its critical impact on performance, data selection for VIF remains an understudied area. In this paper, we propose $\Delta$-AttnMask. This data-efficient framework quantifies sample quality through attention-guided masking of the model's hidden states, jointly evaluating image-text pairs without requiring domain labels, auxiliary models, or extra training. By computing loss differences ($\Delta$) between the original states and states masked using high-attention regions, $\Delta$-AttnMask intrinsically assesses sample quality. Experiments across multiple VLMs and datasets show that $\Delta$-AttnMask achieves state-of-the-art performance with just 20% of data, accelerating training by 5x while surpassing full-dataset baselines by +10.1% in overall accuracy. Its model-agnostic and data-agnostic design ensures broad applicability across modalities and architectures.

Related papers

VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning [33.115992843637564]
We propose a principled data selection framework that measures the marginal contribution of visual input during instruction tuning.<n>By comparing predictive loss with and without visual context, VisNec identifies whether a training instance is vision-critical, redundant, or misaligned.<n>Across 10 benchmarks, training on only 15% of the LLaVA-665K dataset selected by VisNec achieves 100.2% of full-data performance.
arXiv Detail & Related papers (2026-03-01T17:26:02Z)
ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning [18.989158560585675]
Training on large-scale datasets is computationally expensive and inefficient due to redundancy in the data.<n>We propose ScalSelect, a training-free multimodal data selection method with linear-time complexity.<n>ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings.
arXiv Detail & Related papers (2026-02-12T06:38:49Z)
Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring [26.174094671736686]
We propose a novel quality-driven data selection pipeline for visual instruction tuning datasets.<n>It integrates a cross-modality assessment framework that first assigns each data entry to its appropriate vision-language task.<n>It generates general and task-specific captions, and evaluates the alignment, clarity, task rarity, text coherence, and image clarity of each entry.
arXiv Detail & Related papers (2025-06-10T04:04:58Z)
D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding [36.321156992727055]
D2AF is a robust annotation framework for visual grounding using only input images.<n>By implementing dual-driven annotation strategies, we effectively generate detailed region-text pairs.<n>Our findings demonstrate that increasing data volume enhances model performance.
arXiv Detail & Related papers (2025-05-30T09:04:47Z)
mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data [71.352883755806]
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.<n>However, the limited labeled multimodal data often hinders embedding performance.<n>Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
arXiv Detail & Related papers (2025-02-12T15:03:33Z)
Scenario Understanding of Traffic Scenes Through Large Visual Language Models [2.3302708486956454]
Large Visual Language Models (LVLMs) present a compelling solution by automating image analysis and categorization through contextual queries.<n>In this study, we evaluate the capabilities of LVLMs to understand and classify urban traffic scenes on both an in-house dataset and the BDD100K.<n>We propose a scalable captioning pipeline that integrates state-of-the-art models, enabling a flexible deployment on new datasets.
arXiv Detail & Related papers (2025-01-28T18:23:12Z)
DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data [61.62554324594797]
We propose DreamMask, which explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data.<n>In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods.<n>For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
arXiv Detail & Related papers (2025-01-03T19:00:00Z)
Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness [65.01625761120924]
We argue that a valuable sample should be informative of the task, non-redundant, and represent the sample distribution (i.e., not an outlier)<n>We propose a collaborative framework, DataTailor, which leverages three key principles--informativeness, uniqueness, and representativeness--for effective data selection.<n>Experiments on various benchmarks demonstrate that DataTailor achieves 100.8% of the performance of full-data fine-tuning with only 15% of the data.
arXiv Detail & Related papers (2024-12-09T08:36:10Z)
Concept-skill Transferability-based Data Selection for Large Vision-Language Models [56.0725292404808]
We introduce COINCIDE, an effective and scalable data selection technique for training vision-language models. We cluster the training data using internal activations from a small model, which identifies concept-skill compositions needed by a target LVLM. Experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines.
arXiv Detail & Related papers (2024-06-16T16:15:20Z)
VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning. VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence. On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z)
SeiT++: Masked Token Modeling Improves Storage-efficient Training [36.95646819348317]
Recent advancements in Deep Neural Network (DNN) models have significantly improved performance across computer vision tasks. achieving highly generalizable and high-performing vision models requires expansive datasets, resulting in significant storage requirements. Recent breakthrough by SeiT proposed the use of Vector-Quantized (VQ) feature vectors (i.e., tokens) as network inputs for vision classification. In this paper, we extend SeiT by integrating Masked Token Modeling (MTM) for self-supervised pre-training.
arXiv Detail & Related papers (2023-12-15T04:11:34Z)
Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER) Our method exploits self-supervised pretraining to learn good feature representations from the target data. We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE) M3AE learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training. We experimentally verify that the new dataset can significantly improve the ability of the learned FER model. To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.