Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
- URL: http://arxiv.org/abs/2505.18091v1
- Date: Fri, 23 May 2025 16:46:24 GMT
- Title: Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
- Authors: Xinran Gu, Kaifeng Lyu, Jiazheng Li, Jingzhao Zhang
- Abstract summary: We show that knowledge acquisition from knowledge-dense datasets does not always follow a smooth scaling law. These phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size.
- Score: 21.923905379872863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.
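The capacity-allocation argument above lends itself to a small numerical illustration. The sketch below is a hypothetical toy, not the paper's framework: it treats model capacity as a fixed budget of bits, assumes a diminishing-returns loss curve for web data and a linear memorization curve for the biography set, enumerates how the loss-minimizing split changes with the mixing ratio, and then sweeps the capacity budget to check how the resulting critical mixing ratio scales. The loss curves, all constants, the 5% memorization cutoff, and the use of a bit budget as a stand-in for model size are assumptions made purely for illustration.

```python
# A toy sketch of the capacity-allocation ("knapsack") intuition described in
# the abstract. Everything below is a hypothetical illustration, not the
# paper's actual framework: the loss curves, constants, the 5% threshold, and
# the use of a bit budget as a stand-in for model size are all assumptions.
import numpy as np

BIO_BITS = 200.0  # hypothetical bits needed to memorize every biography


def web_loss(bits):
    # Hypothetical smooth diminishing-returns curve for web data.
    return 3.0 * (1.0 + bits) ** -0.3


def bio_loss(bits):
    # Hypothetical curve for the knowledge-dense set: loss falls linearly with
    # the fraction of biographies that fit into the allocated bits.
    frac = np.clip(bits / BIO_BITS, 0.0, 1.0)
    return 4.0 * (1.0 - frac) + 0.1


def optimal_bio_bits(capacity, mix_ratio, grid=2001):
    # Enumerate splits of the capacity budget and keep the split that
    # minimizes the mixing-ratio-weighted total test loss.
    bio = np.linspace(0.0, capacity, grid)
    total = (1.0 - mix_ratio) * web_loss(capacity - bio) + mix_ratio * bio_loss(bio)
    return bio[np.argmin(total)]


def critical_ratio(capacity, ratios=np.logspace(-4, 0, 400)):
    # Smallest mixing ratio at which a non-trivial share of the biographies
    # (arbitrarily, >5% of BIO_BITS) receives any capacity at all.
    for r in ratios:
        if optimal_bio_bits(capacity, r) > 0.05 * BIO_BITS:
            return r
    return float("nan")


if __name__ == "__main__":
    # (1) Allocation to biographies changes sharply around a threshold ratio.
    for r in (0.01, 0.02, 0.03, 0.05, 0.10):
        b = optimal_bio_bits(300.0, r)
        print(f"mixing ratio {r:.2f}: {b:6.1f} bits on biographies "
              f"(~{100.0 * min(b / BIO_BITS, 1.0):.0f}% memorized)")

    # (2) In this toy, the critical ratio follows a power law in capacity.
    caps = np.array([300.0, 600.0, 1200.0, 2400.0, 4800.0])
    crit = np.array([critical_ratio(c) for c in caps])
    slope, _ = np.polyfit(np.log(caps), np.log(crit), 1)
    print(f"fitted power-law exponent: {slope:.2f}")  # close to -1.3 here
```

With these toy curves, the split jumps from essentially no biography bits below a threshold ratio to nearly all of them slightly above it, and the fitted exponent is determined by the assumed web-loss exponent; the paper derives the actual relationship within its information-theoretic framework rather than assuming such forms.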
Related papers
- A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops [55.07063067759609]
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). Some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding.
arXiv Detail & Related papers (2025-02-26T06:18:13Z)
- Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining [39.75559743003037]
We formalize the concept of two-phase pretraining and conduct a systematic study on how to select and mix data to maximize model accuracy. We provide in-depth guidance on crafting optimal blends based on the quality of the data source and the number of epochs to be seen. We propose to design blends using downsampled data at a smaller scale of 1T tokens, and then demonstrate effective scaling of our approach to a larger token horizon of 15T tokens and a larger model size of 25B parameters.
arXiv Detail & Related papers (2024-12-18T18:41:18Z)
- BiMix: A Bivariate Data Mixing Law for Language Model Pretraining [47.77701041534746]
The impact of pretraining data composition on model performance remains poorly understood. BiMix provides a systematic framework for understanding and optimizing data mixtures. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency.
arXiv Detail & Related papers (2024-05-23T09:44:02Z)
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance with respect to the mixture proportions in function forms. We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law. Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
- On Memorization in Diffusion Models [44.031805633114985]
We show that memorization behaviors tend to occur on smaller-sized datasets. We quantify the impact of the influential factors on these memorization behaviors in terms of effective model memorization (EMM). Our study holds practical significance for diffusion model users and offers clues to theoretical research in deep generative models.
arXiv Detail & Related papers (2023-10-04T09:04:20Z)
- Over-training with Mixup May Hurt Generalization [32.64382185990981]
We report a previously unobserved phenomenon in Mixup training.
On a number of standard datasets, the performance of Mixup-trained models starts to decay after training for a large number of epochs.
We show theoretically that Mixup training may introduce undesired data-dependent label noises to the synthesized data.
arXiv Detail & Related papers (2023-03-02T18:37:34Z)
- Semiparametric Language Models Are Scalable Continual Learners [83.74414880208334]
Semiparametric language models (LMs) have shown promise in continuously learning from new text data.
We present a simple and intuitive approach called Selective Memorization (SeMem).
SeMem only memorizes difficult samples that the model is likely to struggle with.
arXiv Detail & Related papers (2023-03-02T17:15:02Z)
- Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.