On Representation Redundancy in Large-Scale Instruction Tuning Data Selection
- URL: http://arxiv.org/abs/2602.13773v1
- Date: Sat, 14 Feb 2026 13:35:34 GMT
- Title: On Representation Redundancy in Large-Scale Instruction Tuning Data Selection
- Authors: Youwei Shu, Shaomian Zheng, Dingnan Jin, Wenjie Qu, Ziyao Guo, Qing Cui, Jun Zhou, Jiaheng Zhang
- Abstract summary: We study instruction-tuning data selection through the lens of semantic representation similarity. We propose Compressed Representation Data Selection (CRDS), a novel framework with two variants. Experimental results demonstrate that both variants substantially enhance data quality and consistently outperform state-of-the-art representation-based selection methods.
- Score: 20.850719141827664
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data quality is a crucial factor in large language model training. While prior work has shown that models trained on smaller, high-quality datasets can outperform those trained on much larger but noisy or low-quality corpora, systematic methods for industrial-scale data selection in instruction tuning remain underexplored. In this work, we study instruction-tuning data selection through the lens of semantic representation similarity and identify a key limitation of state-of-the-art LLM encoders: they produce highly redundant semantic embeddings. To mitigate this redundancy, we propose Compressed Representation Data Selection (CRDS), a novel framework with two variants. CRDS-R applies Rademacher random projection followed by concatenation of transformer hidden-layer representations, while CRDS-W employs whitening-based dimensionality reduction to improve representational quality. Experimental results demonstrate that both variants substantially enhance data quality and consistently outperform state-of-the-art representation-based selection methods. Notably, CRDS-W achieves strong performance using only 3.5% of the data, surpassing the full-data baseline by an average of 0.71% across four datasets. Our code is available at https://github.com/tdano1/CRDS.
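The two compression steps lend themselves to a short illustration. Below is a minimal sketch of both variants, assuming per-layer hidden representations are already extracted; the function names, dimensions, layer choices, and scaling conventions are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

# Minimal sketch of the two CRDS variants described in the abstract.
# Shapes and the downstream scoring step are assumptions; the paper's
# actual pipeline may differ.

rng = np.random.default_rng(0)

def crds_r(hidden_layers, out_dim=256):
    """CRDS-R: Rademacher random projection of each transformer
    hidden-layer representation, followed by concatenation."""
    projected = []
    for h in hidden_layers:                       # h: (n_samples, d)
        d = h.shape[1]
        # Rademacher matrix: i.i.d. +1/-1 entries, scaled to preserve norms
        r = rng.choice([-1.0, 1.0], size=(d, out_dim)) / np.sqrt(out_dim)
        projected.append(h @ r)
    return np.concatenate(projected, axis=1)      # (n_samples, n_layers * out_dim)

def crds_w(embeddings, out_dim=256, eps=1e-6):
    """CRDS-W: whitening-based dimensionality reduction (PCA whitening here),
    which decorrelates redundant embedding dimensions."""
    x = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = (x.T @ x) / (len(x) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = np.argsort(eigvals)[::-1][:out_dim]     # keep the leading directions
    w = eigvecs[:, top] / np.sqrt(eigvals[top] + eps)
    return x @ w                                  # whitened (n_samples, out_dim)
```

The compressed embeddings would then feed whatever similarity-based selector is used downstream, for example redundancy filtering by cosine similarity.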
Related papers
- Exploring Instruction Data Quality for Explainable Image Quality Assessment [58.345719195248314]
We investigate the role of data quality in instruction-tuning datasets for explainable IQA. We find that randomly selecting a subset of the data can even lead to better results than training on the entire instruction-tuning dataset. We propose a clustering-based data selection framework with three stages: clustering feature extraction, cluster quota allocation, and a cluster sampling strategy.
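A minimal sketch of such a three-stage pipeline, assuming precomputed features and using k-means, size-proportional quotas, and centroid-nearest sampling as stand-ins for the paper's unspecified choices:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical three-stage clustering pipeline; the concrete feature
# extractor, quota rule, and sampling strategy are assumptions.

def cluster_select(features, budget, n_clusters=10, seed=0):
    # Stage 1: cluster the (assumed precomputed) features
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(features)
    selected = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        # Stage 2: quota proportional to cluster size
        quota = max(1, round(budget * len(idx) / len(features)))
        # Stage 3: sample the points closest to the cluster centroid
        dist = np.linalg.norm(features[idx] - km.cluster_centers_[c], axis=1)
        selected.extend(idx[np.argsort(dist)[:quota]].tolist())
    return selected[:budget]
```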
arXiv Detail & Related papers (2025-10-04T17:12:54Z)
- RL-Selector: Reinforcement Learning-Guided Data Selection via Redundancy Assessment [10.284993431741377]
We introduce the concept of epsilon-sample cover, which quantifies sample redundancy based on inter-sample relationships. We reformulate data selection as a reinforcement learning process and propose RL-Selector. Our method consistently outperforms existing state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-26T06:28:56Z)
- When Dynamic Data Selection Meets Data Augmentation [10.217776379089093]
We propose a novel online data training framework that unifies dynamic data selection and augmentation. Our method estimates each sample's joint distribution of local density and multimodal semantic consistency, allowing for the targeted selection of augmentation-suitable samples. Our approach enhances noise resistance and improves model robustness, reinforcing its practical utility in real-world scenarios.
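As a toy illustration of the local-density half of that score, one could rate each sample by the inverse of its mean distance to its k nearest neighbours in embedding space; the multimodal-consistency term and the joint distribution itself are not modelled here, and the function name is hypothetical.

```python
import numpy as np

# Toy local-density score: inverse mean distance to k nearest neighbours.
# Quadratic in memory, so suitable only as a sketch for small sets.

def local_density(emb, k=10):
    d = np.linalg.norm(emb[:, None] - emb[None], axis=2)   # pairwise distances
    np.fill_diagonal(d, np.inf)                            # ignore self-distance
    knn = np.sort(d, axis=1)[:, :k]                        # k nearest per sample
    return 1.0 / (knn.mean(axis=1) + 1e-8)
```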
arXiv Detail & Related papers (2025-05-02T11:38:48Z)
- Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm [50.492124556982674]
This paper introduces a novel choice-based sample selection framework. It shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples. We validate our approach on a larger medical dataset, highlighting its applicability in real-world settings.
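A hedged sketch of what such a choice-based greedy loop could look like, using a facility-location-style coverage gain (an assumption, not the paper's criterion) to compare candidates' marginal contributions:

```python
import numpy as np

# At each step, compare the marginal contribution of every candidate to the
# already-selected set and add the best one. The coverage gain used here is
# an illustrative stand-in for the paper's contribution measure.

def greedy_add_one_in(emb, budget):
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                      # pairwise cosine similarity
    coverage = np.zeros(len(emb))          # how well each point is covered so far
    selected = []
    for _ in range(budget):
        # gain of adding candidate j = total improvement in coverage
        gains = np.maximum(sim, coverage[None, :]).sum(axis=1) - coverage.sum()
        gains[selected] = -np.inf          # never re-pick a selected sample
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, sim[j])
    return selected
```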
arXiv Detail & Related papers (2025-03-04T07:32:41Z)
- Adaptive Dataset Quantization [2.0105434963031463]
We introduce a versatile framework for dataset compression, namely Adaptive Dataset Quantization (ADQ). We propose a novel adaptive sampling strategy through the evaluation of generated bins' representativeness score, diversity score, and importance score. Our method not only exhibits superior generalization capability across different architectures, but also attains state-of-the-art results, surpassing DQ by an average of 3% on various datasets.
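One plausible reading of the scoring stage combines the three scores after z-normalization; the concrete score definitions, the loss-based importance proxy, and the weights below are all assumptions.

```python
import numpy as np

# Illustrative per-bin scoring for the adaptive sampling described above.
# All three score definitions are assumptions about how representativeness,
# diversity, and importance might be instantiated.

def adaptive_scores(bin_feats, selected_feats, losses, w=(1.0, 1.0, 1.0)):
    centroid = bin_feats.mean(axis=0)
    rep = -np.linalg.norm(bin_feats - centroid, axis=1)          # representativeness
    if len(selected_feats):
        d = np.linalg.norm(bin_feats[:, None] - selected_feats[None], axis=2)
        div = d.min(axis=1)                                      # diversity
    else:
        div = np.zeros(len(bin_feats))
    imp = losses                                                 # importance proxy
    z = lambda v: (v - v.mean()) / (v.std() + 1e-8)              # normalize scales
    return w[0] * z(rep) + w[1] * z(div) + w[2] * z(imp)
```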
arXiv Detail & Related papers (2024-12-22T07:08:29Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance. Data selection has shown promise in identifying the most representative samples from the entire dataset. We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
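A toy version of the multimodal idea: score each (image, text) pair by the cosine alignment of precomputed CLIP embeddings and keep the best-aligned samples. The paper's actual selection criterion is likely richer than this, and the function name is illustrative.

```python
import numpy as np

# Alignment-based filtering over precomputed CLIP image/text embeddings.
# In practice the embeddings would come from a CLIP encoder; here they are
# assumed given as arrays of shape (n_samples, dim).

def clip_alignment_select(img_emb, txt_emb, keep_ratio=0.5):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    align = (img * txt).sum(axis=1)            # per-sample cosine similarity
    k = int(len(align) * keep_ratio)
    return np.argsort(align)[::-1][:k]         # indices of best-aligned pairs
```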
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Improving Data Efficiency via Curating LLM-Driven Rating Systems [30.233724785974143]
We introduce DS2, a Diversity-aware Score curation method for Data Selection. By systematically modeling error patterns through a score transition matrix, DS2 corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks.
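A hedged sketch of score correction via a transition matrix T, where T[i, j] estimates the probability that a sample with true score i receives LLM score j; how DS2 actually estimates T and resolves ties is beyond this toy example.

```python
import numpy as np

# Bayesian-style correction of noisy discrete LLM scores given an estimated
# transition matrix T. The uniform prior and argmax decoding are assumptions.

def correct_scores(observed, T, prior=None):
    n_levels = T.shape[0]
    prior = np.ones(n_levels) / n_levels if prior is None else prior
    corrected = []
    for y in observed:                      # y: observed discrete score
        post = prior * T[:, y]              # P(true = i) * P(obs = y | true = i)
        corrected.append(int(np.argmax(post / post.sum())))
    return np.array(corrected)
```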
arXiv Detail & Related papers (2024-10-09T10:07:55Z)
- Reducing and Exploiting Data Augmentation Noise through Meta Reweighting Contrastive Learning for Text Classification [3.9889306957591755]
We propose a novel framework to boost deep learning models' performance when training with augmented samples in text classification tasks.
We propose novel weight-dependent enqueue and dequeue algorithms to utilize augmented samples' weight/quality information effectively.
Our framework achieves an average of 1.6%, up to 4.3% absolute improvement on Text-CNN encoders and an average of 1.4%, up to 4.4% absolute improvement on RoBERTa-base encoders.
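The weight-dependent enqueue/dequeue idea above can be caricatured as a bounded memory bank that evicts the lowest-weight augmented samples first; the eviction rule and class name below are assumptions, not the paper's algorithm.

```python
import heapq
import itertools

# Bounded queue keyed by sample weight: high-weight (high-quality) augmented
# samples persist, low-weight ones are evicted first.

class WeightedQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []                          # min-heap keyed by sample weight
        self._tie = itertools.count()           # tie-breaker for equal weights

    def enqueue(self, weight, item):
        entry = (weight, next(self._tie), item)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
        elif weight > self.heap[0][0]:          # evict the lowest-weight sample
            heapq.heapreplace(self.heap, entry)

    def contents(self):
        return [item for _, _, item in self.heap]
```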
arXiv Detail & Related papers (2024-09-26T02:19:13Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that embodies the necessary reasoning skills for the intended downstream application.
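In the spirit of LESS, a simplified stand-in: compress per-example gradients with a random projection (the low-rank step) and rank them by cosine similarity to a target-task gradient. LESS itself works with LoRA and Adam-aware gradient features, so this is a sketch, not the algorithm.

```python
import numpy as np

# Gradient-similarity selection with random-projection compression.
# train_grads: (n_samples, d) per-example gradients; target_grad: (d,)
# gradient on a few target-task examples. Both are assumed precomputed.

def less_style_select(train_grads, target_grad, budget, proj_dim=512, seed=0):
    rng = np.random.default_rng(seed)
    d = train_grads.shape[1]
    P = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)
    tr = train_grads @ P                               # compressed train grads
    tg = target_grad @ P                               # compressed target grad
    sims = (tr @ tg) / (np.linalg.norm(tr, axis=1) * np.linalg.norm(tg) + 1e-8)
    return np.argsort(sims)[::-1][:budget]             # most influential first
```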
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
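At a cartoon level, a datamodel approximates a target metric as a linear function of which training points were included in proxy training runs; DsDm's actual estimator differs, and the ridge fit and function names here are assumptions.

```python
import numpy as np

# Linear datamodel sketch: regress a target loss on 0/1 inclusion masks from
# proxy runs, then select points with the most loss-reducing estimated effect.

def fit_datamodel(masks, target_losses, l2=1.0):
    # masks: (n_runs, n_train) inclusion matrix; target_losses: (n_runs,)
    X, y = masks, target_losses
    theta = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ y)
    return theta                       # theta[i] ~ effect of including point i

def select(theta, budget):
    return np.argsort(theta)[:budget]  # most loss-reducing points first
```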
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Modality-Agnostic Variational Compression of Implicit Neural Representations [96.35492043867104]
We introduce a modality-agnostic neural compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR).
Bridging the gap between latent coding and sparsity, we obtain compact latent representations non-linearly mapped to a soft gating mechanism.
After obtaining a dataset of such latent representations, we directly optimise the rate/distortion trade-off in a modality-agnostic space using neural compression.
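A minimal INR illustration of the functional view: represent a 1-D signal as the weights of a small coordinate network, so that compressing the data becomes compressing parameters. The variational latent coding and soft gating from the paper are omitted, and the tiny network below is purely illustrative.

```python
import numpy as np

# Fit a two-layer MLP mapping coordinates x -> signal values y by plain
# gradient descent on 0.5 * MSE; the trained weights ARE the representation.

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 256)[:, None]             # coordinates
y = np.sin(4 * np.pi * x)                        # signal to represent

W1, b1 = rng.standard_normal((1, 64)) * 3.0, np.zeros(64)
W2, b2 = rng.standard_normal((64, 1)) * 0.1, np.zeros(1)

lr = 1e-2
for _ in range(2000):
    h = np.tanh(x @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y                               # dL/dpred for 0.5 * MSE
    gW2, gb2 = h.T @ err / len(x), err.mean(0)
    dh = (err @ W2.T) * (1 - h ** 2)             # backprop through tanh
    gW1, gb1 = x.T @ dh / len(x), dh.mean(0)
    W1, b1, W2, b2 = W1 - lr * gW1, b1 - lr * gb1, W2 - lr * gW2, b2 - lr * gb2
```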
arXiv Detail & Related papers (2023-01-23T15:22:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.