Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework
- URL: http://arxiv.org/abs/2503.21023v1
- Date: Wed, 26 Mar 2025 22:19:47 GMT
- Title: Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework
- Authors: Thomson Yen, Andrew Wei Tung Siah, Haozhe Chen, Tianyi Peng, Daniel Guetta, Hongseok Namkoong
- Abstract summary: We build a simulator based on 472 language model pre-training runs with varying data compositions from the SlimPajama dataset. We observe that even simple kernels and acquisition functions can enable principled training decisions across models from 20M to 1B parameters.
- Score: 10.317740844867913
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Careful curation of data sources can significantly improve the performance of LLM pre-training, but predominant approaches rely heavily on intuition or costly trial-and-error, making them difficult to generalize across different data domains and downstream tasks. Although scaling laws can provide a principled and general approach for data curation, standard deterministic extrapolation from small-scale experiments to larger scales requires strong assumptions on the reliability of such extrapolation, whose brittleness has been highlighted in prior works. In this paper, we introduce a *probabilistic extrapolation framework* for data mixture optimization that avoids rigid assumptions and explicitly models the uncertainty in performance across decision variables. We formulate data curation as a sequential decision-making problem (multi-fidelity, multi-scale Bayesian optimization) where {data mixtures, model scale, training steps} are adaptively selected to balance training cost and potential information gain. Our framework naturally gives rise to algorithm prototypes that leverage noisy information from inexpensive experiments to systematically inform costly training decisions. To accelerate methodological progress, we build a simulator based on 472 language model pre-training runs with varying data compositions from the SlimPajama dataset. We observe that even simple kernels and acquisition functions can enable principled decisions across training models from 20M to 1B parameters and achieve **2.6x** and **3.3x** speedups compared to multi-fidelity BO and random search baselines. Taken together, our framework underscores potential efficiency gains achievable by developing principled and transferable data mixture optimization methods.
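To make the decision loop concrete, below is a minimal Python sketch of the kind of multi-fidelity, multi-scale Bayesian optimization the abstract describes: a Gaussian-process surrogate with a simple Matern kernel scores candidate {data mixture, model scale, training steps} configurations, and a cost-aware acquisition picks the next run. The toy simulator, cost model, domain count, and acquisition weights here are illustrative assumptions, not the authors' implementation or their SlimPajama-based simulator.

```python
# Minimal sketch (assumptions only, not the paper's code) of a multi-fidelity,
# multi-scale BO loop over {data mixture, model scale, training steps}.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

N_DOMAINS = 4                          # e.g. a handful of SlimPajama source domains
SCALES = np.array([20e6, 160e6, 1e9])  # candidate model sizes (parameters)
STEPS = np.array([1e3, 1e4, 1e5])      # candidate training-step budgets

def candidate_configs(n=256):
    """Sample random {mixture, scale, steps} candidates; mixtures lie on the simplex."""
    mix = rng.dirichlet(np.ones(N_DOMAINS), size=n)
    scale = rng.choice(SCALES, size=n)
    steps = rng.choice(STEPS, size=n)
    return np.column_stack([mix, np.log10(scale), np.log10(steps)])

def run_cost(x):
    """Toy training-cost proxy: parameters * steps (features store log10 values)."""
    return 10 ** x[-2] * 10 ** x[-1] / 1e12

def simulator(x):
    """Hypothetical stand-in for a pre-training run / the paper's simulator:
    returns a noisy validation loss for a {mixture, scale, steps} configuration."""
    mix, log_scale, log_steps = x[:N_DOMAINS], x[-2], x[-1]
    ideal = np.array([0.4, 0.3, 0.2, 0.1])          # unknown "good" mixture (toy)
    loss = 3.0 + np.sum((mix - ideal) ** 2) - 0.1 * log_scale - 0.05 * log_steps
    return loss + rng.normal(scale=0.02)

# Seed the surrogate with a few cheap observations, then iterate.
X = candidate_configs(8)
y = np.array([simulator(x) for x in X])

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)
    cands = candidate_configs()
    mu, sigma = gp.predict(cands, return_std=True)
    # Cost-aware lower confidence bound: favor low predicted loss and high
    # uncertainty, lightly penalized by how expensive the run would be.
    costs = np.array([run_cost(c) for c in cands])
    score = (mu - sigma) + 0.1 * np.log1p(costs)
    x_next = cands[np.argmin(score)]
    X = np.vstack([X, x_next])
    y = np.append(y, simulator(x_next))

best = X[np.argmin(y)]
print("best mixture:", np.round(best[:N_DOMAINS], 3),
      "model size:", int(10 ** best[-2]), "steps:", int(10 ** best[-1]))
```

In the paper's setting the expensive oracle is an actual pre-training run (or the 472-run simulator), and the acquisition must trade off predicted loss, uncertainty, and training cost; the lower-confidence-bound-per-cost rule above is just one simple choice of kernel and acquisition in the spirit of the abstract.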
Related papers
- Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection [65.96556073745197]
DiverSified File selection algorithm (DiSF) is proposed to select the most decorrelated text files in the feature space.
DiSF saves 98.5% of 590M training files in SlimPajama, outperforming the full-data pre-training within a 50B training budget.
arXiv Detail & Related papers (2025-04-29T11:13:18Z) - Probabilistic Federated Prompt-Tuning with Non-IID and Imbalanced Data [35.47385526394076]
Fine-tuning pre-trained models is a popular approach in machine learning for solving complex tasks with moderate data.
Fine-tuning the entire pre-trained model is ineffective in federated data scenarios where local data distributions are diversely skewed.
Our approach transforms federated learning into a distributed set modeling task, aggregating diverse sets of prompts to globally fine-tune the pre-trained model.
arXiv Detail & Related papers (2025-02-27T04:31:34Z) - Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning [18.381178799923514]
Post-training of Large Language Models often involves a pipeline of Supervised Finetuning (SFT) followed by Preference Finetuning (PFT). We study how to optimally allocate a fixed training data budget between the two stages.
arXiv Detail & Related papers (2025-02-16T21:57:35Z) - Optimizing Pretraining Data Mixtures with LLM-Estimated Utility [52.08428597962423]
Large Language Models improve with increasing amounts of high-quality training data. We find that token-count heuristics outperform manual and learned mixes, indicating that simple approaches for dataset size and diversity are surprisingly effective. We propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by roughly 200x.
arXiv Detail & Related papers (2025-01-20T21:10:22Z) - Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing [20.445496441396028]
We introduce a novel framework combining model cascading and inference-time self-testing algorithms to find multiple near-optimal self-testing options on the cost-accuracy tradeoff.
Our approach leverages self-generated tests to both enhance accuracy and evaluate model cascading decisions.
Experimental results show that our cascading approach reduces costs by an average of 26%, and up to 70% in the best case.
arXiv Detail & Related papers (2024-05-24T16:20:04Z) - BiMix: A Bivariate Data Mixing Law for Language Model Pretraining [47.77701041534746]
The impact of pretraining data composition on model performance remains poorly understood. **BiMix** provides a systematic framework for understanding and optimizing data mixtures. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency.
arXiv Detail & Related papers (2024-05-23T09:44:02Z) - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance as a function of mixture proportions, expressed in functional forms.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens on RedPajama (a toy sketch of the mixing-law idea appears after this list).
arXiv Detail & Related papers (2024-03-25T17:14:00Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - Towards Accelerated Model Training via Bayesian Data Selection [45.62338106716745]
Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss, though applying that principle in practice poses challenges.
This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models.
arXiv Detail & Related papers (2023-08-21T07:58:15Z) - AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to 50x), with significantly reduced training data generation (up to 30x) and better accuracy (+8.7%) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)
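The "Data Mixing Laws" entry above describes fitting functional forms that predict performance from mixture proportions. Below is a minimal, hypothetical Python sketch of that general idea under strong simplifying assumptions (two domains, a simple exponential form, made-up small-run losses); the exact functional form and the nested scaling-law procedure of that paper may differ.

```python
# Toy sketch of fitting a parametric "mixing law" from cheap small-scale runs,
# then using it to pick a mixture. Data and functional form are illustrative.
import numpy as np
from scipy.optimize import curve_fit

# Made-up validation losses from small-scale runs at different shares r of domain A
# (remainder assigned to domain B).
r_obs = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
loss_on_A = np.array([3.40, 3.10, 2.95, 2.86, 2.80])   # improves as A's share grows
loss_on_B = np.array([2.70, 2.78, 2.90, 3.08, 3.35])   # degrades as A's share grows

def mixing_law(r, c, k, t):
    """Illustrative exponential-in-proportion form: loss(r) = c + k * exp(t * r)."""
    return c + k * np.exp(t * r)

# Fit one law per validation domain from the cheap observations.
pA, _ = curve_fit(mixing_law, r_obs, loss_on_A, p0=[2.5, 1.0, -2.0], maxfev=10000)
pB, _ = curve_fit(mixing_law, r_obs, loss_on_B, p0=[2.5, 0.2, 2.0], maxfev=10000)

# Predict the average loss over a grid of candidate shares and pick the minimizer.
grid = np.linspace(0.0, 1.0, 101)
avg_loss = 0.5 * mixing_law(grid, *pA) + 0.5 * mixing_law(grid, *pB)
print("predicted-best share of domain A:", grid[np.argmin(avg_loss)])
```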
This list is automatically generated from the titles and abstracts of the papers in this site.