IDEAL: Data Equilibrium Adaptation for Multi-Capability Language Model Alignment
- URL: http://arxiv.org/abs/2505.12762v1
- Date: Mon, 19 May 2025 06:42:44 GMT
- Title: IDEAL: Data Equilibrium Adaptation for Multi-Capability Language Model Alignment
- Authors: Chenlin Ming, Chendi Qu, Mengzhang Cai, Qizhi Pei, Zhuoshi Pan, Yu Li, Xiaoming Duan, Lijun Wu, Conghui He
- Abstract summary: Large Language Models (LLMs) have achieved impressive performance through Supervised Fine-tuning (SFT) on diverse instructional datasets. When training on multiple capabilities simultaneously, the mixture training dataset, governed by volumes of data from different domains, is a critical factor that directly impacts the final model's performance. We introduce an innovative data equilibrium framework designed to effectively optimize volumes of data from different domains within mixture SFT datasets.
- Score: 29.703775936837012
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have achieved impressive performance through Supervised Fine-tuning (SFT) on diverse instructional datasets. When training on multiple capabilities simultaneously, the mixture training dataset, governed by volumes of data from different domains, is a critical factor that directly impacts the final model's performance. Unlike many studies that focus on enhancing the quality of training datasets through data selection methods, few works explore the intricate relationship between the compositional quantity of mixture training datasets and the emergent capabilities of LLMs. Given the availability of a high-quality multi-domain training dataset, understanding the impact of data from each domain on the model's overall capabilities is crucial for preparing SFT data and training a well-balanced model that performs effectively across diverse domains. In this work, we introduce IDEAL, an innovative data equilibrium adaptation framework designed to effectively optimize volumes of data from different domains within mixture SFT datasets, thereby enhancing the model's alignment and performance across multiple capabilities. IDEAL employs a gradient-based approach to iteratively refine the training data distribution, dynamically adjusting the volumes of domain-specific data based on their impact on downstream task performance. By leveraging this adaptive mechanism, IDEAL ensures a balanced dataset composition, enabling the model to achieve robust generalization and consistent proficiency across diverse tasks. Experiments across different capabilities demonstrate that IDEAL outperforms conventional uniform data allocation strategies, achieving a comprehensive improvement of approximately 7% in multi-task evaluation scores.
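The abstract describes iteratively adjusting per-domain data volumes according to their estimated impact on downstream performance. The following is only a toy sketch of that general idea, not IDEAL's actual algorithm: the function name `rebalance`, the `influence` estimates, and the `step` and `max_change` parameters are all hypothetical, and how the influence signal would be computed from gradients is not specified here.

```python
import math

def rebalance(volumes, influence, step=0.5, max_change=0.3):
    """Toy multiplicative update of per-domain SFT data volumes.

    volumes:   dict mapping domain -> current number of training examples
    influence: dict mapping domain -> assumed estimate of how validation loss
               changes as that domain's volume grows (negative = more data
               from the domain is expected to help)
    """
    updated = {}
    for domain, count in volumes.items():
        # Grow domains with negative influence, shrink those with positive.
        factor = math.exp(-step * influence[domain])
        # Clip the per-round change so one noisy estimate cannot swing the mix.
        factor = min(1 + max_change, max(1 - max_change, factor))
        updated[domain] = max(1, round(count * factor))
    return updated

# Hypothetical example: three domains starting from a uniform allocation.
volumes = {"math": 1000, "code": 1000, "chat": 1000}
influence = {"math": -0.2, "code": 0.0, "chat": 0.4}
print(rebalance(volumes, influence))
```

In a real loop, one would retrain or fine-tune on the updated mixture, re-estimate the influence values, and repeat until the allocation stabilizes. The clipping bound is a design choice made here for stability, not something taken from the paper.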
Related papers
- LLM Data Selection and Utilization via Dynamic Bi-level Optimization [100.20933466418786]
We propose a new Data Weighting Model (DWM) to adjust the weight of selected data within each batch to achieve a dynamic data utilization during training. Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data. We further analyze how a model's data preferences evolve throughout training, providing new insights into the data preference of the model during training.
arXiv Detail & Related papers (2025-07-22T02:47:12Z)
- Multimodal-Guided Dynamic Dataset Pruning for Robust and Efficient Data-Centric Learning [49.10890099624699]
We introduce a dynamic dataset pruning framework that adaptively selects training samples based on task-driven difficulty and cross-modality semantic consistency. Our work highlights the potential of integrating cross-modality alignment for robust sample selection, advancing data-centric learning toward more efficient and robust practices across application domains.
arXiv Detail & Related papers (2025-07-17T03:08:26Z)
- HBO: Hierarchical Balancing Optimization for Fine-Tuning Large Language Models [71.12193680015622]
Fine-tuning large language models on diverse datasets poses challenges due to data imbalance and heterogeneity. We introduce Hierarchical Balancing Optimization (HBO), a novel method that enables LLMs to autonomously adjust data allocation. We evaluate HBO on three LLM backbones across nine diverse tasks in multilingual and multitask setups.
arXiv Detail & Related papers (2025-05-18T08:31:44Z)
- Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data [36.277423093218275]
We study the role of data diversity in enhancing the overall abilities of large language models (LLMs). We propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data.
arXiv Detail & Related papers (2025-02-05T17:21:01Z)
- On the Diversity of Synthetic Data and its Impact on Training Large Language Models [34.00031258223175]
Large Language Models (LLMs) have accentuated the need for diverse, high-quality pre-training data.
Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility.
We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-10-19T22:14:07Z)
- Mixture-of-Skills: Learning to Optimize Data Usage for Fine-Tuning Large Language Models [45.51085356985464]
Large language models (LLMs) are typically fine-tuned on diverse and extensive datasets sourced from various origins.
MoS learns to optimize data usage automatically during the fine-tuning process.
MoSpec harnesses the utilities of various datasets for a specific purpose.
arXiv Detail & Related papers (2024-06-13T05:01:28Z)
- Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning [64.5243480989869]
Coding data is known to boost reasoning abilities during pretraining. Its role in activating internal reasoning capacities during instruction fine-tuning (IFT) remains understudied. This paper investigates how coding data impacts LLMs' reasoning capacities during the IFT stage.
arXiv Detail & Related papers (2024-05-30T23:20:25Z)
- BiMix: A Bivariate Data Mixing Law for Language Model Pretraining [47.77701041534746]
The impact of pretraining data composition on model performance remains poorly understood. BiMix provides a systematic framework for understanding and optimizing data mixtures. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency.
arXiv Detail & Related papers (2024-05-23T09:44:02Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning [87.08902493524556]
Federated learning (FL) has recently attracted increasing attention from academia and industry.
We propose FedDM to build the global training objective from multiple local surrogate functions.
In detail, we construct synthetic sets of data on each client to locally match the loss landscape from original data.
arXiv Detail & Related papers (2022-07-20T04:55:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.