Related papers: Diversity-driven Data Selection for Language Model Tuning through Sparse Autoencoder

URL: http://arxiv.org/abs/2502.14050v2
Date: Mon, 31 Mar 2025 21:41:42 GMT
Title: Diversity-driven Data Selection for Language Model Tuning through Sparse Autoencoder
Authors: Xianjun Yang, Shaoliang Nie, Lijuan Liu, Suchin Gururangan, Ujjwal Karn, Rui Hou, Madian Khabsa, Yuning Mao
Abstract summary: We propose sparse autoencoders (SAEs) to tackle the challenge of data diversity measure. We experimentally prove that models trained on our selected data can outperform other methods in terms of model capabilities. We will release our trained SAEs for use by the broader community.
Score: 45.64824340565906
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Instruction tuning data are often quantity-saturated due to the large volume of data collection and fast model iteration, leaving data selection important but underexplored. Existing quality-driven data selection methods, such as LIMA (NeurIPS 2023) and AlpaGasus (ICLR 2024), generally ignore the equal importance of data diversity and complexity. In this work, we aim to design a diversity-aware data selection strategy and creatively propose using sparse autoencoders (SAEs) to tackle the challenge of data diversity measure. In addition, SAEs can also provide more interpretability of model behavior and explain, e.g., the surprising effectiveness of selecting the longest response (ICML 2024). Using effective data selection, we experimentally prove that models trained on our selected data can outperform other methods in terms of model capabilities, reduce training cost, and potentially gain more control over model behaviors. We prove that SAEs can serve as a good alternative to diversity measure and design our method to be scalable for potential industrial large-scale pruning, and we will also release our trained SAEs for use by the broader community.

Related papers

Does This Look Familiar to You? Knowledge Analysis via Model Internal Representations [0.0]
There is no clearly established methodology for effective training data selection. Knowledge Analysis via Model Internal Representations (KAMIR) is a novel approach that overcomes these limitations. It can be applied to a wide range of tasks such as machine reading comprehension and summarization.
arXiv Detail & Related papers (2025-09-09T01:08:15Z)
Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning [59.56171041796373]
We harvest multi-modal instructional data in a robust and efficient manner. We take interaction style as diversity indicator and use a multi-modal rich styler to identify data instruction patterns. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods.
arXiv Detail & Related papers (2025-03-17T17:11:22Z)
Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm [50.492124556982674]
This paper introduces a novel choice-based sample selection framework. It shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples. We validate our approach on a larger medical dataset, highlighting its practical applicability in real-world applications.
arXiv Detail & Related papers (2025-03-04T07:32:41Z)
Diversified Batch Selection for Training Acceleration [68.67164304377732]
A prevalent research line, known as online batch selection, explores selecting informative subsets during the training process. Vanilla reference-model-free methods involve independently scoring and selecting data in a sample-wise manner. We propose Diversified Batch Selection (DivBS), which is reference-model-free and can efficiently select diverse and representative samples.
arXiv Detail & Related papers (2024-06-07T12:12:20Z)
LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality. We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data. Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective. The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets. Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z)
Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets. Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly. FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z)
Reinforced Data Sampling for Model Diversification [15.547681142342846]
This paper proposes a new Reinforced Data Sampling (RDS) method to learn how to sample data adequately. We formulate the optimisation problem of model diversification $\delta$-div in data sampling to maximise learning potentials and optimum allocation by injecting model diversity. Our results suggest that the trainable sampling for model diversification is useful for competition organisers, researchers, or even starters to pursue full potentials of various machine learning tasks.
arXiv Detail & Related papers (2020-06-12T11:46:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.