Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time
- URL: http://arxiv.org/abs/2504.03635v3
- Date: Sun, 28 Sep 2025 21:01:31 GMT
- Title: Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time
- Authors: Xinyi Wang, Shawn Tan, Shenbo Xu, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen
- Abstract summary: This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in language models (LMs). We pretrain LMs from scratch on a synthetic implicit multi-hop reasoning environment designed to replicate the structure and distribution of real-world large-scale knowledge graphs. We then assess the LMs' ability to complete the missing edges in the graph, which requires multi-hop reasoning that can be viewed as a simplification of implicit reasoning during real-world pretraining.
- Score: 73.22651918134808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reasoning is an integral part of many tasks performed by language models (LMs). However, the effects of scaling model size and data on reasoning abilities at pretraining time remain understudied. To rigorously investigate this problem, we pretrain LMs from scratch on a synthetic implicit multi-hop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. We then assess the LMs' ability to complete the missing edges in the graph, which requires multi-hop reasoning that can be viewed as a simplification of implicit reasoning during real-world pretraining. Interestingly, we observe that overparameterization can impair implicit reasoning performance due to excessive memorization. We investigate the factors that affect the loss curve when scaling different components of the knowledge graph, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling law showing that optimal-sized LMs can approximately reason over 0.008 bits of information per parameter. This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in LMs.
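As a back-of-the-envelope illustration of how the reported law could be applied, the sketch below inverts the 0.008 bits-per-parameter constant to estimate an optimal model size for a given knowledge graph. The edge-entropy estimate and helper names are illustrative assumptions, not the paper's actual measure of graph information content.

```python
import math

BITS_PER_PARAM = 0.008  # empirical constant reported in the abstract


def kg_information_bits(num_entities: int, num_relations: int, num_edges: int) -> float:
    """Rough upper bound on a knowledge graph's information content (bits),
    treating each (head, relation, tail) edge as an independent choice."""
    bits_per_edge = 2 * math.log2(num_entities) + math.log2(num_relations)
    return num_edges * bits_per_edge


def optimal_param_count(kg_bits: float) -> float:
    """Invert the scaling law: an optimal-sized LM reasons over
    ~0.008 bits per parameter, so N_opt ~ kg_bits / 0.008."""
    return kg_bits / BITS_PER_PARAM


# Example: a toy graph with 10k entities, 200 relation types, 1M edges.
bits = kg_information_bits(10_000, 200, 1_000_000)
print(f"KG information: {bits / 1e6:.1f} Mbits -> "
      f"optimal model size ~{optimal_param_count(bits) / 1e9:.2f}B parameters")
```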
Related papers
- Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks [17.067788440109137]
Mixture-of-Experts (MoE) models are now standard in state-of-the-art systems. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills.
arXiv Detail & Related papers (2025-08-26T04:31:28Z) - Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones [19.390296419822562]
We show the existence of reasoning settings where sequential scaling offers an exponential advantage over parallel scaling. We validate our theoretical findings with comprehensive experiments across a range of language models.
arXiv Detail & Related papers (2025-05-27T23:23:34Z) - G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning [58.73279333365234]
Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale graph reasoning abilities. With RL on Erdos, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x its size). Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks.
arXiv Detail & Related papers (2025-05-24T04:33:41Z) - Compute-Optimal LLMs Provably Generalize Better With Scale [102.29926217670926]
We develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. We produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
arXiv Detail & Related papers (2025-04-21T16:26:56Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs [4.701165676405066]
It is critical not only to retrieve relevant information but also to provide causal reasoning and explainability. This paper proposes a novel pipeline that filters large knowledge graphs to emphasize cause-effect edges. Experiments on medical question-answering tasks show consistent gains, with up to a 10% absolute improvement.
arXiv Detail & Related papers (2025-01-24T19:31:06Z) - What Do LLMs Need to Understand Graphs: A Survey of Parametric Representation of Graphs [69.48708136448694]
Large language models (LLMs) are gaining traction in the AI community for their expected reasoning and inference abilities. We believe this kind of parametric representation of graphs, graph laws, can be a solution for making LLMs understand graph data as input.
arXiv Detail & Related papers (2024-10-16T00:01:31Z) - Revisiting the Graph Reasoning Ability of Large Language Models: Case Studies in Translation, Connectivity and Shortest Path [53.71787069694794]
We focus on the graph reasoning ability of Large Language Models (LLMs). We revisit the ability of LLMs on three fundamental graph tasks: graph description translation, graph connectivity, and the shortest-path problem. Our findings suggest that LLMs can fail to understand graph structures through text descriptions and exhibit varying performance across all these fundamental tasks.
arXiv Detail & Related papers (2024-08-18T16:26:39Z) - Temporal Scaling Law for Large Language Models [70.74571133406958]
We propose the novel concept of a Temporal Scaling Law, studying how the test loss of an LLM evolves as training steps scale up. In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss at each token position. We derive a much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamical hyperbolic law.
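A minimal sketch of what such a fine-grained, per-position fit might look like, assuming, purely as an illustration, a hyperbolic form L(i) = a/(i + b) + c for the test loss at token position i (the paper's exact parameterization may differ):

```python
import numpy as np
from scipy.optimize import curve_fit


def hyperbolic(i, a, b, c):
    """Hypothetical hyperbolic form for per-position test loss."""
    return a / (i + b) + c


# Synthetic per-position losses: later positions are easier to predict.
positions = np.arange(1, 513)
losses = 5.0 / (positions + 8.0) + 2.1 + np.random.normal(0, 0.01, positions.shape)

(a, b, c), _ = curve_fit(hyperbolic, positions, losses, p0=(1.0, 1.0, 1.0))
print(f"fitted a={a:.2f}, b={b:.2f}, c={c:.2f}")  # c ~ irreducible loss
```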
arXiv Detail & Related papers (2024-04-27T05:49:11Z) - Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation [110.71955853831707]
We view LMs as deriving new conclusions by aggregating indirect reasoning paths seen at pre-training time.
We formalize the reasoning paths as random walk paths on the knowledge/reasoning graphs.
Experiments and analysis on multiple KG and CoT datasets reveal the effect of training on random walk paths.
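To make the random-walk formalization concrete, here is a toy sketch of sampling random-walk paths over a knowledge graph and verbalizing them as pretraining sequences; the graph, relation names, and sampler are illustrative assumptions, not the paper's datasets.

```python
import random

# Toy knowledge graph: entity -> list of (relation, neighbor) edges.
KG = {
    "Paris": [("capital_of", "France")],
    "France": [("in_continent", "Europe"), ("borders", "Spain")],
    "Spain": [("in_continent", "Europe")],
}


def random_walk(start: str, max_hops: int) -> list[str]:
    """Sample one random-walk path, verbalized as a token sequence."""
    path, node = [start], start
    for _ in range(max_hops):
        edges = KG.get(node)
        if not edges:
            break
        relation, node = random.choice(edges)
        path += [relation, node]
    return path


# Paths like ["Paris", "capital_of", "France", "in_continent", "Europe"]
# serve as pretraining sequences; aggregating many such paths supports
# multi-hop conclusions such as (Paris, in_continent, Europe).
print(random_walk("Paris", max_hops=2))
```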
arXiv Detail & Related papers (2024-02-05T18:25:51Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - GraphLLM: Boosting Graph Reasoning Ability of Large Language Model [7.218768686958888]
GraphLLM is a pioneering end-to-end approach that integrates graph learning models with Large Language Models.
Our empirical evaluations across four fundamental graph reasoning tasks validate the effectiveness of GraphLLM.
The results show a substantial average accuracy improvement of 54.44%, alongside a notable context reduction of 96.45%.
arXiv Detail & Related papers (2023-10-09T16:42:00Z) - The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning [34.76303922401322]
We study two natural scaling techniques -- weight pruning and dense scaling, i.e., simply training a smaller or larger model.
We find a striking difference in how the two abilities, fact recall and in-context learning, evolve with scaling (see the sketch after this entry).
The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.
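For context on what weight pruning means here, a minimal magnitude-pruning sketch follows; it is illustrative only, not the paper's actual pruning procedure.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)


w = np.random.randn(4, 4)
print(magnitude_prune(w, sparsity=0.5))  # ~50% of entries zeroed
```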
arXiv Detail & Related papers (2023-10-07T03:36:39Z) - GraphReason: Enhancing Reasoning Capabilities of Large Language Models through A Graph-Based Verification Approach [0.0]
Large Language Models (LLMs) have showcased impressive reasoning capabilities.
In this paper, we introduce a novel graph-based method to further augment the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2023-08-18T03:12:59Z) - Inverse Scaling: When Bigger Isn't Better [80.42834197416444]
Large language models (LMs) show predictable improvements to overall loss with increased scale.
We present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale.
arXiv Detail & Related papers (2023-06-15T20:11:23Z) - Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale [5.759319006531332]
We show the benefits of pre-training with a masked language modeling (MLM) objective in models as small as 1.25M parameters.
We examine downscaling effects, extending scaling laws to models as small as 1M parameters.
arXiv Detail & Related papers (2023-05-26T21:22:10Z) - Beyond spectral gap (extended): The role of the topology in decentralized learning [58.48291921602417]
In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model.
Current theory fails to explain why collaboration enables larger learning rates than training alone.
This paper aims to paint an accurate picture of sparsely-connected distributed optimization.
arXiv Detail & Related papers (2023-01-05T16:53:38Z) - CLEAR: Generative Counterfactual Explanations on Graphs [60.30009215290265]
We study the problem of counterfactual explanation generation on graphs.
A few studies have explored counterfactual explanations on graphs, but many challenges of this problem are still not well-addressed.
We propose CLEAR, a novel framework that aims to generate counterfactual explanations on graphs for graph-level prediction models.
arXiv Detail & Related papers (2022-10-16T04:35:32Z) - Beyond spectral gap: The role of the topology in decentralized learning [58.48291921602417]
In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model.
This paper aims to paint an accurate picture of sparsely-connected distributed optimization when workers share the same data distribution.
Our theory matches empirical observations in deep learning, and accurately describes the relative merits of different graph topologies.
arXiv Detail & Related papers (2022-06-07T08:19:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.