Related papers: BiMix: Bivariate Data Mixing Law for Language Model Pretraining

URL: http://arxiv.org/abs/2405.14908v3
Date: Tue, 15 Oct 2024 03:40:30 GMT
Title: BiMix: Bivariate Data Mixing Law for Language Model Pretraining
Authors: Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding
Abstract summary: The impact of pretraining data composition on model performance remains poorly understood. $\textbf{BiMix}$ provides a systematic framework for understanding and optimizing data mixtures. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency.
Score: 47.77701041534746
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly understood. This paper introduces $\textbf{BiMix}$, a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining. $\textbf{BiMix}$ provides a systematic framework for understanding and optimizing data mixtures across diverse domains. Through extensive experiments on two large-scale datasets, we demonstrate $\textbf{BiMix}$'s high accuracy in loss extrapolation (mean relative error < 0.2%) and its generalization to unseen mixtures ($R^{2}$ > 0.97). Optimization of domain proportions yields superior model performance compared to existing methods. Furthermore, we establish entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight strategy. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency, paving the way for more effective scaling strategies in language model development.

Related papers

Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework [10.317740844867913]
We build a simulator based on 472 language model pre-training runs with varying data compositions from the SlimPajama dataset. We observe that even simple acquisition functions can enable principled training decisions across models from 20M to 1B parameters.
arXiv Detail & Related papers (2025-03-26T22:19:47Z)
Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models [24.396525123797073]
We propose a method to optimize language model pre-training data mixtures through efficient approximation of the cross-entropy loss corresponding to each candidate mixture. We use this approximation as a source of additional features in a regression model, trained from observations of model loss for a small number of mixtures.
arXiv Detail & Related papers (2025-02-21T21:27:48Z)
Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
Optimizing Pretraining Data Mixtures with LLM-Estimated Utility [52.08428597962423]
Large Language Models improve with increasing amounts of high-quality training data. We find token-count heuristics outperform manual and learned mixes, indicating that simple approaches for dataset size and diversity are surprisingly effective. We propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by $\sim$200x.
arXiv Detail & Related papers (2025-01-20T21:10:22Z)
On the Diversity of Synthetic Data and its Impact on Training Large Language Models [34.00031258223175]
Large Language Models (LLMs) have accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-10-19T22:14:07Z)
AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs [61.13296177652599]
This paper demonstrates that the optimal composition of training data from different domains is scale-dependent. We introduce *AutoScale*, a novel, practical approach for optimizing data compositions at potentially large training data scales. Our evaluation on GPT-2 Large and BERT pre-training demonstrates *AutoScale*'s effectiveness in improving training convergence and downstream performance.
arXiv Detail & Related papers (2024-07-29T17:06:30Z)
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance regarding the mixture proportions in function forms. We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law. Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training [42.142924806184425]
Mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. TiMix exhibits a comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods.
arXiv Detail & Related papers (2023-12-14T12:02:24Z)
Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes. Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together. We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
Mixed Semi-Supervised Generalized-Linear-Regression with applications to Deep-Learning and Interpolators [6.537685198688539]
We present a methodology for using unlabeled data to design semi-supervised learning (SSL) methods. We include in each of them a mixing parameter $\alpha$, controlling the weight given to the unlabeled data. We demonstrate the effectiveness of our methodology in delivering substantial improvement compared to the standard supervised models.
arXiv Detail & Related papers (2023-02-19T09:55:18Z)
DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix. DoubleMix first generates several perturbed samples for each training sample. It then uses the perturbed data and original data to carry out a two-step interpolation in the hidden space of neural models.
arXiv Detail & Related papers (2022-09-12T15:01:04Z)
DRFLM: Distributionally Robust Federated Learning with Inter-client Noise via Local Mixup [58.894901088797376]
Federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data. We propose a general framework to solve the above two challenges simultaneously. We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z)
Distributed Learning of Finite Gaussian Mixtures [21.652015112462]
We study split-and-conquer approaches for the distributed learning of finite Gaussian mixtures. The new estimator is shown to be consistent and to retain root-n consistency under some general conditions. Experiments based on simulated and real-world data show that the proposed split-and-conquer approach has comparable statistical performance with the global estimator.
arXiv Detail & Related papers (2020-10-20T16:17:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.