Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling
- URL: http://arxiv.org/abs/2508.16745v1
- Date: Fri, 22 Aug 2025 18:57:08 GMT
- Title: Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling
- Authors: Ivan Rodkin, Daniil Orel, Konstantin Smirnov, Arman Bolatov, Bilal Elbouardi, Besher Hassan, Yuri Kuratov, Aydar Bulatov, Preslav Nakov, Timothy Baldwin, Artem Shelmanov, Mikhail Burtsev
- Abstract summary: We show how different architectures and training methods affect model multi-step reasoning capabilities. We confirm that increasing model depth plays a crucial role for sequential computations.
- Score: 60.63703438729223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning is a core capability of large language models, yet understanding how they learn and perform multi-step reasoning remains an open problem. In this study, we explore how different architectures and training methods affect model multi-step reasoning capabilities within a cellular automata framework. By training on state sequences generated by random Boolean functions from random initial conditions, which rules out memorization, we demonstrate that most neural architectures learn to abstract the underlying rules. While models achieve high accuracy in next-state prediction, their performance declines sharply when multi-step reasoning is required. We confirm that increasing model depth plays a crucial role for sequential computations. We demonstrate that extending the effective model depth with recurrence, memory, and test-time compute scaling substantially enhances reasoning capabilities.
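The data-generation setup described in the abstract can be sketched as follows. This is a minimal illustration only: the neighborhood width, lattice size, and function names are assumptions for the sketch, not the paper's exact configuration.

```python
import random

def random_rule(neighborhood=3, seed=0):
    """Random Boolean function over all 2**neighborhood input patterns."""
    rng = random.Random(seed)
    return {i: rng.randint(0, 1) for i in range(2 ** neighborhood)}

def step(state, rule):
    """One synchronous update of a 1-D cellular automaton with periodic boundaries."""
    n = len(state)
    nxt = []
    for i in range(n):
        # Encode the (left, center, right) neighborhood as an integer key.
        pattern = (state[(i - 1) % n] << 2) | (state[i] << 1) | state[(i + 1) % n]
        nxt.append(rule[pattern])
    return nxt

def rollout(width=16, steps=5, seed=0):
    """State sequence from a random initial condition under a random Boolean rule."""
    rng = random.Random(seed)
    rule = random_rule(seed=seed)
    state = [rng.randint(0, 1) for _ in range(width)]
    trajectory = [state]
    for _ in range(steps):
        state = step(state, rule)
        trajectory.append(state)
    return trajectory

traj = rollout()
```

Because both the rule and the initial condition are sampled at random, a model trained on such trajectories cannot succeed by memorizing specific sequences; next-state prediction corresponds to one `step`, while multi-step reasoning corresponds to predicting deeper slices of `rollout`.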
Related papers
- Are More Tokens Rational? Inference-Time Scaling in Language Models as Adaptive Resource Rationality [1.5994376682356057]
We introduce a Variable Attribution Task in which models infer which variables determine outcomes given candidate variables, input-output trials, and predefined logical functions. Both models exhibit a transition from brute-force to analytic strategies as complexity increases. These findings suggest that models can adjust their reasoning behavior in response to task complexity, even without explicit cost-based reward.
arXiv Detail & Related papers (2026-02-10T22:07:05Z)
- Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts [19.518525241726916]
Encode-Think-Decode (ETD) is a method that enhances the reasoning capabilities of a base model by training it to iterate over a small subset of reasoning-relevant layers during the mid-training stage. ETD models yield substantial gains on 17 reasoning benchmarks, including a +28.4% relative accuracy improvement on GSM8K and +36% on MATH with the OLMo-2 1B Base model.
arXiv Detail & Related papers (2025-10-08T15:58:35Z)
- RARE: Retrieval-Augmented Reasoning Modeling [41.24577920467858]
We propose Retrieval-Augmented Reasoning Modeling (RARE), a novel paradigm that decouples knowledge storage from reasoning optimization. RARE externalizes domain knowledge to retrievable sources and internalizes domain-specific reasoning patterns during training. Experiments demonstrate that lightweight-trained models (e.g., Llama-3.1-8B) can achieve state-of-the-art performance, surpassing retrieval-augmented GPT-4 and DeepSeek-R1 by up to approximately 20% in accuracy.
arXiv Detail & Related papers (2025-03-30T16:49:44Z)
- In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
- Learning Elementary Cellular Automata with Transformers [3.7013865226473848]
We show that Transformers can learn to abstract and generalize the rules governing Elementary Cellular Automata. Our analysis reveals that including future states or rule prediction in the training loss enhances the models' ability to form internal representations of the rules.
arXiv Detail & Related papers (2024-12-02T11:57:49Z)
- Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
- Understanding the Language Model to Solve the Symbolic Multi-Step Reasoning Problem from the Perspective of Buffer Mechanism [68.05754701230039]
We construct a symbolic multi-step reasoning task to investigate the information propagation mechanisms in Transformer models. We propose a random matrix-based algorithm to enhance the model's reasoning ability.
arXiv Detail & Related papers (2024-05-24T07:41:26Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- A Brain-Inspired Sequence Learning Model based on a Logic [6.653734987585243]
In this paper, a model of sequence learning, which is interpretable through Non-Axiomatic Logic, is designed and tested.
The results show that the model works well within different levels of difficulty.
arXiv Detail & Related papers (2023-08-24T01:01:41Z)
- Deep networks for system identification: a Survey [56.34005280792013]
System identification learns mathematical descriptions of dynamic systems from input-output data.
The main aim of the identified model is to predict new data from previous observations.
We discuss architectures commonly adopted in the literature, like feedforward, convolutional, and recurrent networks.
arXiv Detail & Related papers (2023-01-30T12:38:31Z)
- TimeSHAP: Explaining Recurrent Models through Sequence Perturbations [3.1498833540989413]
Recurrent neural networks are a standard building block in numerous machine learning domains.
The complex decision-making in these models is seen as a black box, creating a tension between accuracy and interpretability.
In this work, we contribute to filling these gaps by presenting TimeSHAP, a model-agnostic recurrent explainer.
arXiv Detail & Related papers (2020-11-30T19:48:57Z)
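The perturbation idea behind an explainer like TimeSHAP (which itself builds on the KernelSHAP framework) can be illustrated with a much simpler occlusion-style attribution over timesteps. The scorer and function names below are toy assumptions for the sketch, not TimeSHAP's actual API.

```python
def perturbation_importance(score, sequence, baseline=0.0):
    """Occlusion-style attribution: replace each timestep with a baseline
    value and record how much the model's score changes."""
    full = score(sequence)
    importances = []
    for t in range(len(sequence)):
        perturbed = list(sequence)
        perturbed[t] = baseline  # mask a single timestep
        importances.append(full - score(perturbed))
    return importances

# Toy stand-in for a recurrent scorer: a weighted sum that decays with age,
# so recent timesteps contribute more to the output.
def toy_score(seq):
    return sum(x * 0.5 ** (len(seq) - 1 - t) for t, x in enumerate(seq))

attr = perturbation_importance(toy_score, [1.0, 0.0, 2.0])
```

The scheme is model-agnostic in the same sense as TimeSHAP: it only queries the scorer as a black box, so any sequence model can be plugged in for `toy_score`.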
This list is automatically generated from the titles and abstracts of the papers on this site.