When Can Model-Free Reinforcement Learning be Enough for Thinking?
- URL: http://arxiv.org/abs/2506.17124v1
- Date: Fri, 20 Jun 2025 16:23:46 GMT
- Title: When Can Model-Free Reinforcement Learning be Enough for Thinking?
- Authors: Josiah P. Hanna, Nicholas E. Corrado,
- Abstract summary: This paper builds a domain-independent understanding of when model-free RL will lead to "thinking" as a strategy for reward maximization. We show formally that thought actions are equivalent to the agent choosing to perform a step of policy improvement before continuing to act. We then show that open-source LLMs satisfy the conditions that our theory predicts are necessary for model-free RL to produce thinking-like behavior.
- Score: 3.5253513747455303
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work on large language models has demonstrated the use of model-free reinforcement learning (RL) to train reasoning-like capabilities. The emergence of "thinking" through model-free RL is interesting as thinking actions neither produce reward nor change the external world state to one where the agent is more likely to get reward. This paper seeks to build a domain-independent understanding of when model-free RL will lead to "thinking" as a strategy for reward maximization. To build this understanding, we first introduce a theoretical model which we call a \textit{thought Markov decision process} (MDP). Thought MDPs minimally extend the classical MDP model to include an abstract notion of thought state and thought action. Using the thought MDP model, we prove the importance of policy initialization in determining whether or not thinking emerges and show formally that thought actions are equivalent to the agent choosing to perform a step of policy improvement before continuing to act. We then show that open-source LLMs satisfy the conditions that our theory predicts are necessary for model-free RL to produce thinking-like behavior. Finally, we hypothesize sufficient conditions that would enable thinking to be learned outside of language generation and introduce a toy domain where a combination of multi-task pre-training and designated thought actions enable more data-efficient RL compared to non-thinking agents.
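As a rough, illustrative sketch of the thought MDP idea (the names, toy dynamics, and structure below are assumptions for illustration, not the paper's formal definitions), a thought action can be modeled as leaving the external state and reward untouched while replacing the agent's internal policy with a one-step greedy improvement of it:

```python
# Minimal sketch of a "thought MDP": a tabular MDP augmented with a THINK action
# that performs one step of policy improvement on the agent's internal policy.
# Thought actions yield no reward and do not change the external state.
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.9
THINK = N_ACTIONS  # index of the extra "thought" action

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # P[s, a] -> dist over s'
R = rng.uniform(size=(N_STATES, N_ACTIONS))                       # R[s, a]

def policy_evaluation(pi, iters=200):
    """Evaluate a deterministic policy pi (array of action indices) by iterating the Bellman equation."""
    V = np.zeros(N_STATES)
    for _ in range(iters):
        V = np.array([R[s, pi[s]] + GAMMA * P[s, pi[s]] @ V for s in range(N_STATES)])
    return V

def step(state, thought_policy, action):
    """External actions transition the MDP; the THINK action only updates the internal policy."""
    if action == THINK:
        V = policy_evaluation(thought_policy)
        Q = R + GAMMA * P @ V                      # Q[s, a] under the current internal policy's values
        return state, np.argmax(Q, axis=1), 0.0    # one step of policy improvement, zero reward
    next_state = rng.choice(N_STATES, p=P[state, action])
    return next_state, thought_policy, R[state, action]
```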
Related papers
- Towards Machine Theory of Mind with Large Language Model-Augmented Inverse Planning [0.022940141855172035]
We propose a hybrid approach to machine Theory of Mind (ToM) that uses large language models (LLMs) as a mechanism for generating hypotheses and likelihood functions. We also exhibit the model's potential to predict mental states on open-ended tasks.
arXiv Detail & Related papers (2025-07-04T16:01:27Z) - Deontically Constrained Policy Improvement in Reinforcement Learning Agents [0.0]
Markov Decision Processes (MDPs) are the most common model for decision making under uncertainty in the Machine Learning community. An MDP captures non-determinism, probabilistic uncertainty, and an explicit model of action. A Reinforcement Learning (RL) agent learns to act in an MDP by maximizing a utility function.
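As a generic illustration of the model-free RL setup described above (standard tabular Q-learning, not code from the paper; the `env` object is assumed to follow the Gymnasium reset/step interface):

```python
# Generic tabular Q-learning: a model-free agent maximizing return in an MDP
# without ever learning the transition model.
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD update: move Q(s, a) toward the bootstrapped one-step target.
            target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```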
arXiv Detail & Related papers (2025-06-08T01:01:06Z) - ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models [89.37819814048288]
We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models.
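To make the "KL divergence control" idea concrete, here is a minimal sketch of a KL-penalized policy-gradient objective (an illustrative stand-in, not ProRL's actual loss; `logprobs`, `ref_logprobs`, and `advantages` are assumed to be per-token tensors from the current policy, a frozen reference policy, and an advantage estimator):

```python
# Sketch of a policy-gradient loss with a KL penalty that keeps the trained
# policy close to a reference policy (illustrative only).
import torch

def kl_regularized_pg_loss(logprobs, ref_logprobs, advantages, beta=0.01):
    pg_loss = -(advantages * logprobs).mean()      # standard policy-gradient term
    kl = (logprobs - ref_logprobs).mean()          # simple per-token KL(pi || pi_ref) estimate
    return pg_loss + beta * kl
```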
arXiv Detail & Related papers (2025-05-30T17:59:01Z) - Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models [45.33952788910874]
TON is a two-stage training strategy for vision-language models. It introduces a think-or-not format that serves as a cold start for selective reasoning. TON can reduce the completion length by up to 90% compared to vanilla GRPO.
arXiv Detail & Related papers (2025-05-22T16:13:29Z) - Enter the Void - Planning to Seek Entropy When Reward is Scarce [6.208654695856247]
We propose a novel approach that anticipates and actively seeks out high-entropy states using short-horizon latent predictions. We present a hierarchical planner that dynamically decides when to replan, the planning horizon length, and the weighting between reward and entropy. Our method finishes the Miniworld procedurally generated mazes 50% faster than base Dreamer at convergence, and the policy trained in imagination converges in only 60% of the environment steps that base Dreamer needs.
arXiv Detail & Related papers (2025-05-22T15:28:50Z) - Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models [50.4652276723694]
Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities. Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%.
arXiv Detail & Related papers (2025-05-22T05:56:11Z) - Let LLMs Break Free from Overthinking via Self-Braking Tuning [60.08396797526657]
Large reasoning models (LRMs) have significantly enhanced their reasoning capabilities by generating longer chains of thought. This performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process. We propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process.
arXiv Detail & Related papers (2025-05-20T16:53:40Z) - What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models [50.97705264224828]
We propose Counterfactual Inception, a novel method that implants counterfactual thinking into Large Multi-modal Models.
We aim for the models to engage with and generate responses that reflect a broader contextual understanding of the scene.
Comprehensive analyses across various LMMs, including both open-source and proprietary models, corroborate that counterfactual thinking significantly reduces hallucination.
arXiv Detail & Related papers (2024-03-20T11:27:20Z) - Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities [63.90227161974381]
SimToM is a novel prompting framework inspired by Simulation Theory's notion of perspective-taking.
Our approach, which requires no additional training and minimal prompt-tuning, shows substantial improvement over existing methods.
arXiv Detail & Related papers (2023-11-16T22:49:27Z) - Predictable MDP Abstraction for Unsupervised Model-Based RL [93.91375268580806]
We propose predictable MDP abstraction (PMA).
Instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space.
We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches.
arXiv Detail & Related papers (2023-02-08T07:37:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.