LLM Data Selection and Utilization via Dynamic Bi-level Optimization
- URL: http://arxiv.org/abs/2507.16178v1
- Date: Tue, 22 Jul 2025 02:47:12 GMT
- Title: LLM Data Selection and Utilization via Dynamic Bi-level Optimization
- Authors: Yang Yu, Kai Han, Hang Zhou, Yehui Tang, Kaiqi Huang, Yunhe Wang, Dacheng Tao,
- Abstract summary: We propose a new Data Weighting Model (DWM) to adjust the weight of selected data within each batch to achieve a dynamic data utilization during training.<n>Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data.<n>We further analyze how a model's data preferences evolve throughout training, providing new insights into the data preference of the model during training.
- Score: 100.20933466418786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for the dynamic model training and data interactions. In this paper, we propose a new Data Weighting Model (DWM) to adjust the weight of selected data within each batch to achieve a dynamic data utilization during LLM training. Specially, to better capture the dynamic data preference of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data, and the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we further analyze how a model's data preferences evolve throughout training, providing new insights into the data preference of the model during training.
Related papers
- IDEAL: Data Equilibrium Adaptation for Multi-Capability Language Model Alignment [29.703775936837012]
Large Language Models (LLMs) have achieved impressive performance through Supervised Fine-tuning (SFT) on diverse instructional datasets.<n>When training on multiple capabilities simultaneously, the mixture training dataset, governed by volumes of data from different domains, is a critical factor that directly impacts the final model's performance.<n>We introduce an innovative data equilibrium framework designed to effectively optimize volumes of data from different domains within mixture SFT datasets.
arXiv Detail & Related papers (2025-05-19T06:42:44Z) - Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective [4.548047308860141]
This study investigates the impact of different type of preference data on model performance.
It aims to reduce their dependency on extensive amounts of preference data, which is expensive to collect.
arXiv Detail & Related papers (2024-10-22T00:11:41Z) - Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws [59.03420759554073]
We introduce Adaptive Data Optimization (ADO), an algorithm that optimize data distributions in an online fashion, concurrent with model training.
ADO does not require external knowledge, proxy models, or modifications to the model update.
ADO uses per-domain scaling laws to estimate the learning potential of each domain during training and adjusts the data mixture accordingly.
arXiv Detail & Related papers (2024-10-15T17:47:44Z) - A Self-Supervised Paradigm for Data-Efficient Medical Foundation Model Pre-training: V-information Optimization Framework [15.413974936297082]
Self-supervised pre-training medical foundation models on large-scale datasets demonstrate exceptional performance.<n>Recent research challenges this common paradigm by introducing data-effective learning approaches.<n>We introduce V-information into self-supervised pre-training of foundation models to provide a theoretical foundation for sample selection.
arXiv Detail & Related papers (2024-08-13T10:28:54Z) - MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [16.654859430784825]
Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining.
We introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress.
Experiments of pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks.
arXiv Detail & Related papers (2024-06-10T06:27:42Z) - Smaller Language Models are capable of selecting Instruction-Tuning
Training Data for Larger Language Models [39.65879784788677]
We introduce a novel training data selection based on the learning percentage of the samples.
We assert that current language models possess the capability to autonomously select high-quality training data.
Our paper introduces a novel approach to training data selection, showcasing a more efficient alternative.
arXiv Detail & Related papers (2024-02-16T03:39:37Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training language models (LLMs)
We find that Ask-LLM and Density sampling are the best methods in their respective categories.
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Dynamic Model Switching for Improved Accuracy in Machine Learning [0.0]
We introduce an adaptive ensemble that intuitively transitions between CatBoost and XGBoost.
The user sets a benchmark, say 80% accuracy, prompting the system to dynamically shift to the new model only if it guarantees improved performance.
This dynamic model-switching mechanism aligns with the evolving nature of data in real-world scenarios.
arXiv Detail & Related papers (2024-01-31T00:13:02Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.