TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning
- URL: http://arxiv.org/abs/2510.07118v1
- Date: Wed, 08 Oct 2025 15:11:04 GMT
- Title: TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning
- Authors: Manish Nagaraj, Sakshi Choudhary, Utkarsh Saxena, Deepak Ravikumar, Kaushik Roy
- Abstract summary: We introduce a forward-only, token-centric framework for instruction tuning. Instead of using gradients, it operates by matching underlying representational patterns identified via attention-based "fingerprints". Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings.
- Score: 13.859040990742534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based "fingerprints" from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.
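The abstract describes selection by matching candidate samples against a "fingerprint" built from a handful of target samples, using forward-pass signals only. The sketch below illustrates that general idea and is not the authors' implementation. It assumes each sample is already represented as a matrix of per-token feature vectors (a stand-in for attention-derived features), that the fingerprint is the mean of the normalized target token vectors, and that a sample's score is the mean cosine similarity of its tokens to that fingerprint. The function names `build_fingerprint`, `score_sample`, and `select_coreset` are hypothetical.

```python
import numpy as np


def normalize(x, eps=1e-12):
    """L2-normalize vectors along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def build_fingerprint(target_token_feats):
    """Assumed fingerprint: mean of normalized token vectors over all target samples."""
    stacked = np.concatenate([normalize(t) for t in target_token_feats], axis=0)
    return normalize(stacked.mean(axis=0))


def score_sample(token_feats, fingerprint):
    """Assumed score: mean cosine similarity between a sample's tokens and the fingerprint."""
    return float((normalize(token_feats) @ fingerprint).mean())


def select_coreset(candidates, targets, k):
    """Return the indices of the k highest-scoring candidates and all scores."""
    fingerprint = build_fingerprint(targets)
    scores = np.array([score_sample(c, fingerprint) for c in candidates])
    top = np.argsort(-scores)[:k]  # highest score first
    return top.tolist(), scores


# Toy data: each sample is a (num_tokens, feature_dim) array of made-up features.
targets = [np.array([[1.0, 0.0], [1.0, 0.0]])]
candidates = [
    np.array([[1.0, 0.0], [1.0, 0.0]]),  # aligned with the target pattern
    np.array([[0.0, 1.0]]),              # orthogonal to it
    np.array([[1.0, 1.0]]),              # partially aligned
]
selected, scores = select_coreset(candidates, targets, k=2)
print(selected, scores)
```

No gradients or backward passes appear anywhere in this sketch, which is the property the abstract credits for TRIM's low cost. How the actual method derives token features from multi-layer attention and aggregates them is not specified in the abstract.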
Related papers
- D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning [49.16227597771663]
D2Pruner is a framework that combines debiased importance with a structural pruning mechanism. It reduces FLOPs by 74.2% while retaining 99.2% of its original performance. It marks a significant advancement with up to 63.53% improvement over existing methods.
arXiv Detail & Related papers (2025-12-22T14:42:31Z)
- TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning [24.98742538077939]
Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer. This work introduces a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer.
arXiv Detail & Related papers (2025-05-22T14:53:53Z)
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
- Scalable Fine-tuning from Multiple Data Sources: A First-Order Approximation Approach [17.79010397902909]
We study the problem of fine-tuning a language model (LM) for a target task by optimally using the information from $n$ auxiliary tasks. This problem has broad applications in NLP, such as targeted instruction tuning and data selection in chain-of-thought fine-tuning. We introduce a new algorithm for estimating model fine-tuning performance without requiring repeated training.
arXiv Detail & Related papers (2024-09-28T21:26:50Z)
- TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data [29.45013725650798]
It is essential to extract a subset of instruction datasets that achieves comparable performance to the full dataset.
We propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS).
Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection.
arXiv Detail & Related papers (2024-07-21T17:59:20Z)
- LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds [62.49198183539889]
We propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds.
Our method co-designs an efficient labeling process with semi/weakly supervised learning.
Our proposed method is even highly competitive compared to the fully supervised counterpart with 100% labels.
arXiv Detail & Related papers (2022-10-14T19:13:36Z)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we employ transformers and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Bi-level Alignment for Cross-Domain Crowd Counting [113.78303285148041]
Current methods rely on external data for training an auxiliary task or apply an expensive coarse-to-fine estimation.
We develop a new adversarial learning based method, which is simple and efficient to apply.
We evaluate our approach on five real-world crowd counting benchmarks, where we outperform existing approaches by a large margin.
arXiv Detail & Related papers (2022-05-12T02:23:25Z)
- On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points.
We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv Detail & Related papers (2020-02-15T23:25:12Z)