Relative-Based Scaling Law for Neural Language Models
- URL: http://arxiv.org/abs/2510.20387v1
- Date: Thu, 23 Oct 2025 09:37:00 GMT
- Title: Relative-Based Scaling Law for Neural Language Models
- Authors: Baoqing Yue, Jinyuan Zhou, Zixi Wei, Jingtao Zhan, Qingyao Ai, Yiqun Liu
- Abstract summary: Scaling laws aim to accurately predict model performance across different scales. Existing scaling-law studies almost exclusively rely on cross-entropy as the evaluation metric. We propose the Relative-Based Scaling Law, which characterizes how RBP improves with increasing model size.
- Score: 26.899273082543612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling laws aim to accurately predict model performance across different scales. Existing scaling-law studies almost exclusively rely on cross-entropy as the evaluation metric. However, cross-entropy provides only a partial view of performance: it measures the absolute probability assigned to the correct token, but ignores the relative ordering between correct and incorrect tokens. Yet relative ordering is crucial for language models, for example in greedy-sampling scenarios. To address this limitation, we investigate scaling from the perspective of relative ordering. We first propose the Relative-Based Probability (RBP) metric, which quantifies the probability that the correct token is ranked among the top predictions. Building on this metric, we establish the Relative-Based Scaling Law, which characterizes how RBP improves with increasing model size. Through extensive experiments on four datasets and four model families spanning five orders of magnitude, we demonstrate the robustness and accuracy of this law. Finally, we illustrate the broad applicability of this law with two examples: providing a deeper explanation of emergence phenomena and facilitating the search for fundamental theories of scaling laws. In summary, the Relative-Based Scaling Law complements the cross-entropy perspective and contributes to a more complete understanding of scaling large language models, offering valuable insights for both practical development and theoretical exploration.
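To make the metric concrete, below is a minimal Python sketch of how a top-k variant of RBP could be computed from model logits, together with a fit of a saturating power law against model size. The function names, the top-k formulation, the power-law form, and all numeric values are illustrative assumptions; the abstract does not give the paper's exact definitions.

```python
"""Hedged sketch of the Relative-Based Probability (RBP) idea.

Everything here is an illustrative reading of the abstract above, not the
paper's code: the top-k definition of RBP and the saturating power-law
form used in the fit are assumptions.
"""
import numpy as np
from scipy.optimize import curve_fit


def relative_based_probability(logits: np.ndarray, targets: np.ndarray, k: int = 1) -> float:
    """Fraction of positions where the correct token ranks in the top-k.

    logits:  (num_positions, vocab_size) model scores.
    targets: (num_positions,) correct token ids.
    """
    target_scores = logits[np.arange(len(targets)), targets]
    # Rank = number of vocabulary entries scored strictly higher than the target.
    ranks = (logits > target_scores[:, None]).sum(axis=1)
    return float((ranks < k).mean())


def rbp_curve(n_params, a, b):
    # Assumed saturating power law: RBP(N) = 1 - a * N^(-b).
    return 1.0 - a * n_params ** (-b)


if __name__ == "__main__":
    # Placeholder measurements (synthetic, for illustration only).
    model_sizes = np.array([1e7, 1e8, 1e9, 1e10])   # parameter counts
    rbp_at_1 = np.array([0.42, 0.55, 0.66, 0.74])   # hypothetical RBP@1 values
    (a, b), _ = curve_fit(rbp_curve, model_sizes, rbp_at_1, p0=(1.0, 0.1))
    print(f"fitted: RBP(N) ~= 1 - {a:.3f} * N^(-{b:.4f})")
```

Note that the strictly-greater comparison in the rank computation counts ties in the correct token's favor; a faithful implementation would need to match whatever tie-breaking rule the paper specifies.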
Related papers
- Scaling Laws for Reranking in Information Retrieval [24.00475965133032]
We present the first systematic study of scaling laws for rerankers. Using a detailed case study with cross-encoder rerankers, we demonstrate that performance follows a predictable power law. Our results establish scaling principles for reranking and provide actionable insights for building industrial-grade retrieval systems.
arXiv Detail & Related papers (2026-03-05T05:03:07Z) - Towards a Comprehensive Scaling Law of Mixture-of-Experts [54.117786590884776]
We propose a comprehensive and precise joint MoE scaling law that considers all essential factors. Our results demonstrate that the optimal settings for $G$ and $S$ are independent of both the model architecture and data size. Our proposed MoE scaling law can serve as accurate and insightful guidance for future MoE model design and training.
arXiv Detail & Related papers (2025-09-28T06:35:34Z) - Scaling Laws for Uncertainty in Deep Learning [18.87399857008617]
We show the existence of scaling laws associated with various measures of predictive uncertainty with respect to dataset and model sizes. This work provides strong evidence to dispel recurring skepticism against Bayesian approaches.
arXiv Detail & Related papers (2025-06-11T12:09:05Z) - Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets [5.8465717270452195]
We show how scaling law derivation can be used for model and dataset comparison. For the first time, full scaling laws are derived for two important language-vision learning procedures, CLIP and MaMMUT. We show that comparison can also be performed when deriving scaling laws with a constant learning rate schedule.
arXiv Detail & Related papers (2025-06-05T03:35:59Z) - Unified Scaling Laws for Compressed Representations [69.72517034565467]
We investigate whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations. Our main finding, demonstrated both theoretically and empirically, is that there exists a simple "capacity" metric. We extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.
arXiv Detail & Related papers (2025-06-02T16:52:51Z) - Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time [73.22651918134808]
This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in language models (LMs). We pretrain LMs from scratch on a synthetic implicit multi-hop reasoning environment designed to replicate the structure and distribution of real-world large-scale knowledge graphs. We then assess the LMs' ability to complete the missing edges in the graph, which requires multi-hop reasoning that can be viewed as a simplification of implicit reasoning during real-world pretraining.
arXiv Detail & Related papers (2025-04-04T17:57:22Z) - A Simple Model of Inference Scaling Laws [1.3597551064547502]
We study scaling laws in the context of inference, specifically how performance improves with multiple inference attempts. Our simple framework lays the groundwork for incorporating inference scaling with other known scaling laws.
arXiv Detail & Related papers (2024-10-21T18:00:06Z) - Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z) - Predicting Emergent Abilities with Infinite Resolution Evaluation [85.89911520190711]
We introduce PassUntil, an evaluation strategy with theoretically infinite resolution, achieved through massive sampling in the decoding phase.
We predict the performance of the 2.4B model on code generation with merely 0.05% deviation before training starts.
We identify a kind of accelerated emergence whose scaling curve cannot be fitted by the standard scaling-law function.
arXiv Detail & Related papers (2023-10-05T02:35:00Z) - A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near internet-scale number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z) - Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments [42.793379799720434]
We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at finetuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models.
arXiv Detail & Related papers (2022-02-13T19:13:00Z)