Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning
- URL: http://arxiv.org/abs/2510.16882v1
- Date: Sun, 19 Oct 2025 15:32:01 GMT
- Title: Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning
- Authors: Heming Zou, Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji
- Abstract summary: Supervised fine-tuning (SFT) is computationally expensive and sometimes suffers from overfitting or bias amplification. This work studies the online batch selection family that dynamically scores and filters samples during the training process. We develop UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT.
- Score: 49.04912820721943
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This motivates data curation for SFT, which prioritizes the most valuable data for optimization. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning. Code is available at https://github.com/gfyddha/UDS.
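The abstract describes two scoring ingredients: a nuclear norm of each sample's logits matrix as a joint utility/intra-sample-diversity signal, and similarity against a small buffer of historical embeddings for inter-sample diversity. The sketch below is a minimal, hypothetical illustration of that scheme, not the authors' released implementation; the function names, the buffer handling, the min-max normalization, and the mixing weight `alpha` are all assumptions made for illustration.

```python
# Hypothetical sketch of a UDS-style batch scorer (not the official code).
import torch
import torch.nn.functional as F

def utility_score(logits: torch.Tensor) -> torch.Tensor:
    """Nuclear norm of each sample's (seq_len x vocab) logits matrix.

    logits: (batch, seq_len, vocab). A larger nuclear norm is used here as a
    proxy for utility and intra-sample diversity, per the abstract.
    """
    # matrix_norm with ord='nuc' reduces over the last two dimensions.
    return torch.linalg.matrix_norm(logits.float(), ord="nuc")

def diversity_score(emb: torch.Tensor, buffer: torch.Tensor) -> torch.Tensor:
    """Inter-sample diversity against a memory buffer of past embeddings.

    emb: (batch, d) low-dimensional sample embeddings.
    buffer: (m, d) embeddings of previously selected samples.
    Lower similarity to the buffer means higher diversity.
    """
    if buffer.numel() == 0:
        return torch.ones(emb.shape[0], device=emb.device)
    sim = F.normalize(emb, dim=-1) @ F.normalize(buffer, dim=-1).T  # (batch, m)
    return 1.0 - sim.max(dim=-1).values  # distance to the closest buffered sample

def select_batch(logits, emb, buffer, k, alpha=0.5):
    """Keep the top-k samples in the batch by a weighted utility + diversity score."""
    u = utility_score(logits)
    d = diversity_score(emb, buffer)
    # Normalize the utility term within the batch before mixing (assumed choice).
    u = (u - u.min()) / (u.max() - u.min() + 1e-8)
    score = alpha * u + (1.0 - alpha) * d
    return torch.topk(score, k).indices
```

In an actual training loop, one would presumably run the forward pass once, keep only the returned indices for the gradient step, and push the kept samples' embeddings into the buffer, which matches the paper's claim of avoiding backpropagation on discarded samples.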
Related papers
- CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization [14.304308878028358]
Multimodal large language models rely heavily on instruction tuning to align vision and language capabilities. Existing data selection methods aim to select important and diverse subsets, but they often suffer from two critical drawbacks. We introduce CoIDO, a novel dual-objective framework that jointly optimizes data importance and diversity to overcome these challenges.
arXiv Detail & Related papers (2025-10-11T09:41:21Z) - InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities [27.09178257629886]
InfiAlign is a scalable and sample-efficient post-training framework for large language models (LLMs). At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets. Our results highlight the effectiveness of combining principled data selection with full-stage post-training.
arXiv Detail & Related papers (2025-08-07T15:34:06Z) - SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z) - OASIS: Online Sample Selection for Continual Visual Instruction Tuning [55.92362550389058]
In continual instruction tuning (CIT) scenarios, new instruction tuning data continuously arrive in an online streaming manner. Data selection can mitigate this overhead, but existing strategies often rely on pretrained reference models. Recent reference-model-free online sample selection methods address this, but typically select a fixed number of samples per batch.
arXiv Detail & Related papers (2025-05-27T20:32:43Z) - Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for lifelong instruction tuning. We construct pseudo-skill clusters by grouping gradient-based sample vectors. We select the best-performing data selector for each skill cluster from a pool of selector experts. This data selector samples a subset of the most important samples from each skill cluster for training.
arXiv Detail & Related papers (2024-10-14T15:48:09Z) - Rethinking Data Selection at Scale: Random Selection is Almost All You Need [39.14807071480125]
Supervised fine-tuning is crucial for aligning Large Language Models with human instructions. Most existing data selection techniques are designed for small-scale data pools.
arXiv Detail & Related papers (2024-10-12T02:48:34Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z) - Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that repeatedly alternates between time-consuming model training and batch data selection.
FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z)