CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning
- URL: http://arxiv.org/abs/2509.04027v2
- Date: Thu, 25 Sep 2025 06:48:38 GMT
- Title: CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning
- Authors: Zeyu Gan, Hao Yi, Yong Liu
- Abstract summary: CoT-Space is a novel theoretical framework that recasts Large Language Model (LLM) reasoning from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. We show that convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting.
- Score: 14.337056020596465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning (RL) has become a pivotal approach for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists, as traditional token-level RL frameworks fail to align with the reasoning-level nature of complex, multi-step thought processes like Chain-of-Thought (CoT). To address this challenge, we introduce CoT-Space, a novel theoretical framework that recasts LLM reasoning from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. This shift in perspective serves as a conceptual bridge, revitalizing foundational principles from classical learning theory to analyze the unique dynamics of LLMs. By analyzing this process from both a noise perspective and a risk perspective, we demonstrate that the convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. Furthermore, extensive experiments provide strong empirical validation for our theoretical findings. Our framework not only provides a coherent explanation for empirical phenomena such as overthinking but also offers a solid theoretical foundation to guide the future development of more effective and generalizable reasoning agents. We open-source our code at https://github.com/ZyGan1999/CoT-Space.
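The abstract's central claim, that an optimal CoT length falls out of the underfitting-overfitting trade-off, can be sketched numerically. The functional forms and constants below are hypothetical illustrations of such a trade-off, not the paper's actual risk decomposition:

```python
# Illustrative sketch: model total reasoning risk as an underfitting term that
# shrinks as the chain grows plus an overfitting term that grows with it, then
# locate the chain length minimizing the sum. Constants a and b are made up.

def total_risk(length: int, a: float = 2.0, b: float = 0.05) -> float:
    """Underfitting term a/length: short chains under-specify the reasoning.
    Overfitting term b*length: overly long chains fit noise (overthinking)."""
    return a / length + b * length

# Scan candidate CoT lengths and pick the minimizer.
lengths = range(1, 51)
optimal_length = min(lengths, key=total_risk)
print(optimal_length)  # prints 6 for these constants
```

With these toy constants the minimizer sits near sqrt(a/b), mirroring how, in the paper's account, neither very short nor very long chains minimize risk.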
Related papers
- A Theoretical Framework for LLM Fine-tuning Using Early Stopping for Non-random Initialization [2.635536317968963]
A central question is why only a few epochs of fine-tuning are typically sufficient to achieve strong performance on many different tasks. We develop a statistical framework, combining rigorous early stopping theory with the attention-based Neural Tangent Kernel (NTK) for large language models.
arXiv Detail & Related papers (2026-02-15T00:43:21Z)
- Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts [11.888882732753922]
Mixture-of-Experts models enable large language models to scale efficiently, as they only activate a subset of experts for each input. We build the first unified theoretical framework that derives these practices as optimal posterior approximation and prior regularization from a Bayesian perspective. Our work offers essential theoretical support and technical assurance for a deeper understanding and novel designs of MoE.
arXiv Detail & Related papers (2026-01-07T04:45:07Z)
- Beyond the Black Box: Theory and Mechanism of Large Language Models [39.10631426330405]
The rapid emergence of Large Language Models (LLMs) has precipitated a profound paradigm shift in Artificial Intelligence. This survey proposes a unified lifecycle-based taxonomy that organizes the research landscape into six distinct stages: Data Preparation, Model Preparation, Training, Alignment, Inference, and Evaluation.
arXiv Detail & Related papers (2026-01-06T10:45:53Z)
- How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns [51.02752099869218]
Large Language Models (LLMs) display strikingly different generalization behaviors. We introduce a novel benchmark that decomposes reasoning into atomic core skills. We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z)
- Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. We empirically demonstrate that the proposed method enhances state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z)
- Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models [57.42778606399764]
Diffusion language models (dLLMs) offer a promising, non-autoregressive paradigm for text generation. Current reinforcement learning approaches often rely on sparse, outcome-based rewards. We argue that this stems from a fundamental mismatch with the natural structure of reasoning.
arXiv Detail & Related papers (2025-10-02T00:34:15Z)
- Emergent Cognitive Convergence via Implementation: A Structured Loop Reflecting Four Theories of Mind [0.0]
We report a structural convergence among four influential theories of mind. The convergence emerges unintentionally within a practical AI architecture known as Agentic Flow. This paper proposes that intelligent architectures may evolve toward shared structural patterns shaped by practical constraints.
arXiv Detail & Related papers (2025-07-22T02:54:45Z)
- CTRLS: Chain-of-Thought Reasoning via Latent State-Transition [57.51370433303236]
Chain-of-thought (CoT) reasoning enables large language models to break down complex problems into interpretable intermediate steps. We introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions. We show improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.
arXiv Detail & Related papers (2025-07-10T21:32:18Z)
- Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective [6.963986923957048]
VAPO is a framework for reinforcement learning for large language models. It addresses challenges such as value model bias, heterogeneous sequence lengths, and sparse reward signals. This paper explores VAPO from a theoretical perspective, highlighting areas where its assumptions might be challenged.
arXiv Detail & Related papers (2025-05-23T15:03:41Z)
- The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning [39.613595533503144]
Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models. We show that CoT consistently underperforms direct answering across varying model scales and benchmark complexities. Our analysis uncovers a fundamental explicit-implicit duality driving CoT's performance in pattern-based ICL.
arXiv Detail & Related papers (2025-04-07T13:51:06Z)
- Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought [61.588465852846646]
Chain-of-Thought (CoT) reasoning has emerged as a promising approach for enhancing the performance of large language models (LLMs).
In this work, we introduce a novel reasoning boundary framework (RBF) to address these challenges.
arXiv Detail & Related papers (2024-10-08T05:26:28Z)
- Understanding Reasoning in Chain-of-Thought from the Hopfieldian View [17.18897746431302]
We introduce a novel perspective grounded in the Hopfieldian view of cognition in cognitive neuroscience.
We establish a connection between Chain-of-Thought (CoT) reasoning and key cognitive elements such as stimuli, actions, neural populations, and representation spaces.
We propose the Representation-of-Thought (RoT) framework, which leverages the robustness of low-dimensional representation spaces to enhance the robustness of the reasoning process in CoTs.
arXiv Detail & Related papers (2024-10-04T16:55:30Z)
- Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF).
We first identify the primary challenge of existing popular methods like offline PPO and offline DPO as a lack of strategic exploration of the environment.
We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
- A Principled Framework for Knowledge-enhanced Large Language Model [58.1536118111993]
Large Language Models (LLMs) are versatile, yet they often falter in tasks requiring deep and reliable reasoning.
This paper introduces a rigorously designed framework for creating LLMs that effectively anchor knowledge and employ a closed-loop reasoning process.
arXiv Detail & Related papers (2023-11-18T18:10:02Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.