Related papers: Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs

URL: http://arxiv.org/abs/2507.10613v1
Date: Sun, 13 Jul 2025 15:15:24 GMT
Title: Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs
Authors: Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang
Abstract summary: We examine the impact of data quality and training strategies on model performance. We identify high data density and non-optimal resource allocation as key factors contributing to sub-scaling. We propose a sub-optimal scaling law that better predicts performance in sub-scaling regimes.
Score: 35.95748363172419
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance improvements decelerate, which is a phenomenon known as sub-scaling. This paper revisits these scaling laws by examining the impact of data quality and training strategies on model performance. Through extensive empirical analysis of over 400 models, we identify high data density and non-optimal resource allocation as key factors contributing to sub-scaling. High data density leads to diminishing returns due to redundant information, while optimal resource allocation is crucial for sustained performance improvements. We propose a sub-optimal scaling law that better predicts performance in sub-scaling regimes, highlighting the importance of data quality and diversity.

Related papers

Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training [46.54209378000497]
Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM. We propose a novel perplexity-aware data scaling law to establish a predictive relationship between the perplexity landscape of domain-specific data and the test loss. Our method consistently identifies near-optimal training subsets and achieves superior performance on both medical and general-domain benchmarks.
arXiv Detail & Related papers (2025-12-25T05:40:46Z)
Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression [53.39128997308138]
We introduce information capacity, a measure of model efficiency based on text compression performance. Empirical evaluations on mainstream open-source models show that models of varying sizes within a series exhibit consistent information capacity. A distinctive feature of information capacity is that it incorporates tokenizer efficiency, which affects both input and output token counts.
arXiv Detail & Related papers (2025-11-11T10:07:32Z)
Layer-Aware Influence for Online Data Valuation Estimation [32.294500546369136]
Data-centric learning emphasizes curating high-quality training samples to boost performance. A central problem is to estimate the influence of training samples efficiently. We develop a layer-aware online estimator that requires only loss-to-output gradients.
arXiv Detail & Related papers (2025-10-14T15:34:22Z)
Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining [13.89166201149496]
We propose a quality-aware scaling law extending the Chinchilla framework to predict loss as a joint function of model size, data volume, and data quality. We show that loss scales predictably with data quality and that higher-quality data can substantially reduce model size and hence compute requirements.
arXiv Detail & Related papers (2025-09-30T22:45:06Z)
Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning [42.80470927369973]
We study how model scale, data volume, and computational budget interact to shape performance. We find that larger models trained for fewer steps consistently outperform smaller models trained for more steps. In data-constrained regimes, repeated reuse of high-quality data proves highly effective.
arXiv Detail & Related papers (2025-09-29T17:10:35Z)
Scaling DRL for Decision Making: A Survey on Data, Network, and Training Budget Strategies [66.83950068218033]
Scaling laws demonstrate that scaling model parameters and training data enhances learning performance. Despite its potential to improve performance, the integration of scaling laws into deep reinforcement learning has not been fully realized. This review addresses this gap by systematically analyzing scaling strategies in three dimensions: data, network, and training budget.
arXiv Detail & Related papers (2025-08-05T08:03:12Z)
LearnAlign: Reasoning Data Selection for Reinforcement Learning in Large Language Models Based on Improved Gradient Alignment [14.655048266761783]
Reinforcement learning (RL) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. We present LearnAlign, which intelligently selects the learnable and representative training reasoning data for RL post-training. Experiments across three mathematical reasoning benchmarks demonstrate that our method significantly reduces training data requirements.
arXiv Detail & Related papers (2025-06-13T06:05:58Z)
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws [21.053622641336744]
Loss-to-loss scaling laws relate losses across pretraining datasets and downstream tasks. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend.
arXiv Detail & Related papers (2025-02-17T18:45:25Z)
Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
Optimizing Sequential Recommendation Models with Scaling Laws and Approximate Entropy [104.48511402784763]
Performance Law for SR models aims to theoretically investigate and model the relationship between model performance and data quality. We propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data quantity metrics.
arXiv Detail & Related papers (2024-11-30T10:56:30Z)
AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs [61.13296177652599]
We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales. We propose AutoScale, a two-stage, scale-aware data composition framework.
arXiv Detail & Related papers (2024-07-29T17:06:30Z)
HARE: HumAn pRiors, a key to small language model Efficiency [6.253561984966316]
Human priors play a crucial role in efficiently utilizing data in deep learning. Existing Small Language Models mainly rely on web-scraped large-scale training data. We propose a principle to leverage human priors for data construction.
arXiv Detail & Related papers (2024-06-17T10:56:03Z)
Rethinking Overlooked Aspects in Vision-Language Models [32.525916879333145]
Recent advancements in large vision-language models (LVLMs) have been substantial. Recent works mainly focus on introducing more pre-training and instruction tuning data to improve model performance. This paper delves into the often-neglected aspects of data efficiency during pre-training and the selection process for instruction tuning datasets.
arXiv Detail & Related papers (2024-05-20T07:53:41Z)
Scaling Laws For Dense Retrieval [22.76001461620846]
We investigate whether the performance of dense retrieval models follows scaling laws as other neural models do. Results indicate that, under our settings, the performance of dense retrieval models follows a precise power-law scaling related to the model size and the number of annotations.
arXiv Detail & Related papers (2024-03-27T15:27:36Z)
Data-Centric Long-Tailed Image Recognition [49.90107582624604]
Long-tail models exhibit a strong demand for high-quality data. Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance. There is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation.
arXiv Detail & Related papers (2023-11-03T06:34:37Z)
Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets. We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z)
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs. We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting. We also examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.