Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards
- URL: http://arxiv.org/abs/2510.16614v2
- Date: Thu, 23 Oct 2025 04:29:49 GMT
- Title: Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards
- Authors: Xuan Zhang, Ruixiao Li, Zhijian Zhou, Long Li, Yulei Qin, Ke Li, Xing Sun, Xiaoyu Tan, Chao Qu, Yuan Qi
- Abstract summary: We introduce MERCI (Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards), a novel RL algorithm that augments policy optimization with a principled intrinsic reward. We show that MERCI encourages richer and more varied chains of thought, significantly improves performance over strong baselines, and helps the policy escape local routines to discover better solutions.
- Score: 33.42935710088259
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning (RL) has become a compelling way to strengthen the multi-step reasoning ability of Large Language Models (LLMs). However, prevalent RL paradigms still lean on sparse outcome-based rewards and limited exploration, which often drives LLMs toward repetitive and suboptimal reasoning patterns. In this paper, we study the central question of how to design exploration for LLM reasoning and introduce MERCI (Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards), a novel RL algorithm that augments policy optimization with a principled intrinsic reward. Building on the idea of count-based exploration, MERCI leverages a lightweight Coin Flipping Network (CFN) to estimate pseudo-counts, and from them the epistemic uncertainty, over reasoning trajectories, and converts them into an intrinsic reward that values novelty while preserving the learning signal from task rewards. We integrate MERCI into advanced RL frameworks such as Group Relative Policy Optimization (GRPO). Experiments on complex reasoning benchmarks demonstrate that MERCI encourages richer and more varied chains of thought, significantly improves performance over strong baselines, and helps the policy escape local routines to discover better solutions. These results indicate that targeted intrinsic motivation can make exploration reliable for language model reasoning.
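A minimal PyTorch sketch of the mechanism described above, under assumptions the abstract does not specify (the layer sizes, the bonus coefficient `beta`, and the use of a pooled trajectory embedding as the CFN input are all illustrative, not the authors' implementation). A Coin Flipping Network regresses each trajectory embedding onto fresh random ±1 targets; as in the CFN literature, the mean-squared prediction then approximates the inverse pseudo-count, and the resulting novelty bonus is added to the task reward before computing GRPO-style group-normalized advantages.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoinFlipNetwork(nn.Module):
    """CFN sketch: regress onto fresh Rademacher (+/-1) targets on each visit.
    E[f(x)] shrinks toward 0 as visits accumulate, so mean_d f(x)^2
    approximates 1 / n(x), the inverse pseudo-count."""

    def __init__(self, embed_dim: int, num_flips: int = 32):
        super().__init__()
        self.num_flips = num_flips
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_flips),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

    @torch.no_grad()
    def intrinsic_reward(self, x: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
        # beta * sqrt(mean_d f(x)^2) ~ beta / sqrt(n(x)): large for novel
        # trajectories, decaying as they are revisited.
        f = self.net(x)
        return beta * f.pow(2).mean(dim=-1).sqrt()

def cfn_update_loss(cfn: CoinFlipNetwork, x: torch.Tensor) -> torch.Tensor:
    """One training step's loss: fresh coin flips as regression targets."""
    flips = torch.randint(0, 2, (x.shape[0], cfn.num_flips), device=x.device)
    targets = flips.float() * 2.0 - 1.0  # uniform +/-1
    return F.mse_loss(cfn(x), targets)

def group_relative_advantages(task_r: torch.Tensor, intrinsic_r: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages over a group of rollouts for one prompt:
    combine task and intrinsic rewards, then normalize within the group."""
    r = task_r + intrinsic_r
    return (r - r.mean()) / (r.std() + 1e-6)

# Hypothetical usage for a group of 8 rollouts with 768-dim embeddings:
cfn = CoinFlipNetwork(embed_dim=768)
emb = torch.randn(8, 768)  # stand-in for pooled trajectory embeddings
task_r = torch.tensor([1., 0., 0., 1., 0., 0., 0., 1.])  # verifiable rewards
adv = group_relative_advantages(task_r, cfn.intrinsic_reward(emb))
loss = cfn_update_loss(cfn, emb)  # train the CFN alongside the policy
```

In a full training loop the CFN would be updated alongside the policy, so the bonus for frequently revisited reasoning patterns decays while genuinely novel trajectories keep earning exploration credit.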
Related papers
- Reinforced Efficient Reasoning via Semantically Diverse Exploration [73.41112984160992]
Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). We propose ROSE, reinforced efficient reasoning via semantically diverse exploration, for LLMs. Our method incorporates a semantic-entropy-based branching strategy and an $\varepsilon$-exploration mechanism.
arXiv Detail & Related papers (2026-01-08T15:56:44Z)
- Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). We present a theoretical characterization of LLM reasoning grounded in the information bottleneck (IB) principle. We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
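For context, a minimal statement of the generic IB objective (not necessarily IBRO's exact formulation), treating the reasoning trace $T$ as a bottleneck between the prompt $X$ and the correct answer $Y$:

$$\max_{p(t \mid x)} \; I(T; Y) - \beta \, I(X; T)$$

where $\beta > 0$ trades off predictiveness of the answer against compression of prompt-specific detail.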
arXiv Detail & Related papers (2025-07-24T13:14:25Z)
- Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs [112.40801692473723]
Balancing exploration and exploitation is a central goal in reinforcement learning (RL). We introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Our method achieves significant gains on the Pass@K metric, even when evaluated with extremely large K values.
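A rough illustration of that one-line change (the `alpha` coefficient and tensor shapes are assumptions, not the paper's exact code):

```python
import torch

def entropy_augmented_advantage(adv: torch.Tensor, logits: torch.Tensor,
                                alpha: float = 0.01) -> torch.Tensor:
    """Add a scaled per-token policy-entropy bonus to the RL advantage,
    nudging the policy toward exploratory, higher-entropy tokens."""
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)  # per-token policy entropy
    return adv + alpha * entropy  # the "one line" modification
```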
arXiv Detail & Related papers (2025-06-17T17:54:03Z)
- Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning [87.7836502955847]
We propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood. We introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy.
arXiv Detail & Related papers (2025-06-10T12:40:39Z)
- Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models [22.796496516709514]
This paper provides a systematic review of recent advances in reinforcement learning (RL)-based reasoning for Multimodal Large Language Models (MLLMs). We highlight two main RL paradigms, value-model-free and value-model-based methods, and analyze how RL enhances reasoning abilities by optimizing reasoning trajectories and aligning multimodal information. We provide an extensive overview of benchmark datasets, evaluation protocols, and current limitations, and propose future research directions to address challenges such as sparse rewards, inefficient cross-modal reasoning, and real-world deployment constraints.
arXiv Detail & Related papers (2025-04-30T03:14:28Z)
- ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [53.817538122688944]
We introduce Reinforced Meta-thinking Agents (ReMA) to elicit meta-thinking behaviors from Large Language Models (LLMs). ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
arXiv Detail & Related papers (2025-03-12T16:05:31Z)
- RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement [85.08223786819532]
Existing large language models (LLMs) show exceptional problem-solving capabilities but might struggle with complex reasoning tasks. We propose RAG-Star, a novel RAG approach that integrates retrieved information to guide the tree-based deliberative reasoning process. Our experiments involving Llama-3.1-8B-Instruct and GPT-4o demonstrate that RAG-Star significantly outperforms previous RAG and reasoning methods.
arXiv Detail & Related papers (2024-12-17T13:05:36Z)