Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training
- URL: http://arxiv.org/abs/2512.21515v1
- Date: Thu, 25 Dec 2025 05:40:46 GMT
- Title: Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training
- Authors: Lei Liu, Hao Zhu, Yue Shen, Zhixuan Chu, Jian Wang, Jinjie Gu, Kui Ren
- Abstract summary: Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM. We propose a novel perplexity-aware data scaling law to establish a predictive relationship between the perplexity landscape of domain-specific data and the test loss. Our method consistently identifies near-optimal training subsets and achieves superior performance on both medical and general-domain benchmarks.
- Score: 46.54209378000497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continual Pre-training (CPT) serves as a fundamental approach for adapting foundation models to domain-specific applications. Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM. However, the marginal gains from simply increasing data for CPT diminish rapidly, yielding suboptimal data utilization and inefficient training. To address this challenge, we propose a novel perplexity-aware data scaling law to establish a predictive relationship between the perplexity landscape of domain-specific data and the test loss. Our approach leverages the perplexity derived from the pre-trained model on domain data as a proxy for estimating the knowledge gap, effectively quantifying the informational perplexity landscape of candidate training samples. By fitting this scaling law across diverse perplexity regimes, we enable adaptive selection of high-utility data subsets, prioritizing content that maximizes knowledge absorption while minimizing redundancy and noise. Extensive experiments demonstrate that our method consistently identifies near-optimal training subsets and achieves superior performance on both medical and general-domain benchmarks.
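To make the recipe concrete, here is a minimal sketch, assuming a HuggingFace causal LM and SciPy: it scores each candidate document by base-model perplexity (the abstract's knowledge-gap proxy), buckets documents into perplexity regimes, and fits the power-law form L(D) = E + A * D^(-alpha) to pilot runs so regimes can be ranked by predicted loss at the full token budget. The function names, the quantile bucketing policy, and all numbers are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch (not the paper's code): perplexity-aware data selection.
import numpy as np
import torch
from scipy.optimize import curve_fit
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in base model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def doc_perplexity(text: str) -> float:
    """Base-model perplexity on one document: the proxy for its knowledge gap."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    return float(torch.exp(lm(ids, labels=ids).loss))  # exp(mean token NLL)

def bucket_by_perplexity(docs, n_regimes=4):
    """Split candidate documents into perplexity-quantile regimes."""
    ppl = np.array([doc_perplexity(d) for d in docs])
    edges = np.quantile(ppl, np.linspace(0, 1, n_regimes + 1)[1:-1])
    return np.digitize(ppl, edges), ppl                # regime index per doc

def power_law(d, E, A, alpha):
    """Scaling-law form L(D) = E + A * D^(-alpha), fitted per regime."""
    return E + A * np.power(d, -alpha)

def predicted_loss_at_budget(pilot_tokens, pilot_losses, budget):
    """Fit the law to a regime's pilot CPT runs, then extrapolate to budget."""
    p0 = (float(np.min(pilot_losses)), 1.0, 0.5)
    (E, A, alpha), _ = curve_fit(power_law, pilot_tokens, pilot_losses,
                                 p0=p0, maxfev=10_000)
    return power_law(budget, E, A, alpha)

# Usage: rank each regime by the loss its fitted law predicts at the full
# budget, then train on the best regimes first. Numbers below are synthetic.
pilot_tokens = np.array([10.0, 30.0, 100.0, 300.0])    # millions of tokens
pilot_losses = np.array([2.31, 2.18, 2.07, 1.99])      # test loss per pilot run
print(predicted_loss_at_budget(pilot_tokens, pilot_losses, budget=1000.0))
```

Consistent with the abstract's framing, very low-perplexity data is largely redundant and very high-perplexity data is more likely noise; fitting a separate law per regime is what makes that trade-off explicit and selectable.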
Related papers
- Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation [27.59197535041953]
Large Language Models (LLMs) represent a promising frontier for recommender systems. This paper introduces a novel, layered framework for generating high-quality synthetic data. We empirically demonstrate, for the first time, robust power-law scaling for an LLM that is continually pre-trained on our high-quality, recommendation-specific data.
arXiv Detail & Related papers (2026-02-07T01:15:15Z)
- Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice [109.9635246405237]
We show that experimental conclusions about data quality can flip with even minor adjustments to training hyperparameters. We introduce a simple patch to the evaluation protocol: using reduced learning rates for proxy-model training. Empirically, we validate this approach across 23 data recipes covering four critical dimensions of data curation.
arXiv Detail & Related papers (2025-12-30T23:02:44Z)
- Stable Coresets via Posterior Sampling: Aligning Induced and Full Loss Landscapes [7.446140380340418]
Coreset selection aims to accelerate training by identifying small, representative subsets of data that approximate the performance of the full dataset. Gradient-based methods stand out due to their strong theoretical underpinnings and practical benefits, particularly under limited data budgets. We propose a novel framework that addresses these limitations. First, we establish a connection between posterior sampling and loss landscapes, enabling robust coreset selection even under heavy data corruption.
arXiv Detail & Related papers (2025-11-21T17:00:00Z)
- Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data [68.85234898614571]
The prevailing paradigm for enhancing the reasoning abilities of LLMs revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is also increasingly incorporated during the mid-training stage, the role of such data in pretraining remains unclear. We conduct the first systematic study of how reasoning data, varying in scale, diversity, and quality, affects LLM performance when introduced at different stages of training.
arXiv Detail & Related papers (2025-09-26T20:08:51Z)
- A Scalable Pretraining Framework for Link Prediction with Efficient Adaptation [16.82426251068573]
Link Prediction (LP) is a critical task in graph machine learning. Existing methods face key challenges, including limited supervision from sparse connectivity. We explore pretraining as a solution to address these challenges.
arXiv Detail & Related papers (2025-08-06T17:10:31Z)
- APT: Adaptive Personalized Training for Diffusion Models with Limited Data [6.455553965143672]
We propose a novel framework that mitigates overfitting by employing adaptive training strategies and regularizing the model's internal representations during fine-tuning. Through extensive experiments, we demonstrate that APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data.
arXiv Detail & Related papers (2025-07-03T14:58:08Z)
- Reasoning to Learn from Latent Thoughts [61.2395150828168]
We show that explicitly modeling and inferring the latent thoughts that underlie the text generation process can significantly improve pretraining data efficiency. We show that a 1B LM can bootstrap its performance across at least three iterations and significantly outperform baselines trained on raw data.
arXiv Detail & Related papers (2025-03-24T16:41:23Z)
- Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification [34.37262622415682]
We propose a new adaptation framework called Data Adaptive Traceback.
Specifically, we utilize a zero-shot method to extract the subset of the pre-training data most relevant to the downstream task.
We adopt a pseudo-label-based semi-supervised technique to reuse the pre-training images and a vision-language contrastive learning method to address the confirmation bias issue in semi-supervised learning.
arXiv Detail & Related papers (2024-07-11T18:01:58Z)
- Impact of Noisy Supervision in Foundation Model Learning [91.56591923244943]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets. We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Learning to Limit Data Collection via Scaling Laws: Data Minimization Compliance in Practice [62.44110411199835]
We build on literature in machine learning and law to propose a framework for limiting data collection, based on an interpretation of data minimization that ties collection to system performance.
We formalize a data minimization criterion based on performance-curve derivatives and provide an effective and interpretable piecewise power-law technique (a toy sketch follows this entry).
arXiv Detail & Related papers (2021-07-16T19:59:01Z)
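For the data-minimization entry above, the sketch below illustrates a derivative-based stopping rule: fit a power law to observed performance checkpoints and stop collecting once the fitted curve's marginal gain per sample falls below a threshold. A single power-law segment stands in for the paper's full piecewise fit, and the accuracy curve, the `eps` threshold, and all numbers are assumptions for illustration.

```python
# Toy sketch of a derivative-based data-collection stopping rule; the single
# power-law segment and eps threshold are assumptions, not the paper's method.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, c, a, b):
    """Performance as a function of collected samples n: c - a * n^(-b)."""
    return c - a * np.power(n, -b)

def keep_collecting(ns, accs, eps=1e-5):
    """Collect more data only while the fitted curve's slope d(acc)/dn > eps."""
    (c, a, b), _ = curve_fit(power_law, ns, accs,
                             p0=(float(accs.max()), 1.0, 0.5), maxfev=10_000)
    slope = a * b * np.power(ns[-1], -b - 1.0)   # derivative at current size
    return slope > eps

ns = np.array([1_000.0, 2_000.0, 4_000.0, 8_000.0])   # samples collected so far
accs = np.array([0.71, 0.76, 0.79, 0.81])             # synthetic checkpoints
print("collect more data:", keep_collecting(ns, accs))
```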