Outcome-based Exploration for LLM Reasoning
- URL: http://arxiv.org/abs/2509.06941v1
- Date: Mon, 08 Sep 2025 17:52:56 GMT
- Title: Outcome-based Exploration for LLM Reasoning
- Authors: Yuda Song, Julia Kempe, Remi Munos
- Abstract summary: Reinforcement learning (RL) has emerged as a powerful method for improving the reasoning abilities of large language models. We show that RL can reduce effective diversity even on the training set relative to the base model. We propose outcome-based exploration, which assigns exploration bonuses according to final outcomes.
- Score: 18.33816564983908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) has emerged as a powerful method for improving the reasoning abilities of large language models (LLMs). Outcome-based RL, which rewards policies solely for the correctness of the final answer, yields substantial accuracy gains but also induces a systematic loss in generation diversity. This collapse undermines real-world performance, where diversity is critical for test-time scaling. We analyze this phenomenon by viewing RL post-training as a sampling process and show that, strikingly, RL can reduce effective diversity even on the training set relative to the base model. Our study highlights two central findings: (i) a transfer of diversity degradation, where reduced diversity on solved problems propagates to unsolved ones, and (ii) the tractability of the outcome space, since reasoning tasks admit only a limited set of distinct answers. Motivated by these insights, we propose outcome-based exploration, which assigns exploration bonuses according to final outcomes. We introduce two complementary algorithms: historical exploration, which encourages rarely observed answers via UCB-style bonuses, and batch exploration, which penalizes within-batch repetition to promote test-time diversity. Experiments on standard competition math with Llama and Qwen models demonstrate that both methods improve accuracy while mitigating diversity collapse. On the theoretical side, we formalize the benefit of outcome-based exploration through a new model of outcome-based bandits. Together, these contributions chart a practical path toward RL methods that enhance reasoning without sacrificing the diversity essential for scalable deployment.
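The two algorithms described in the abstract operate on final answers rather than token-level states: historical exploration adds a UCB-style bonus for answers rarely observed on a given problem, while batch exploration penalizes answers repeated within the current sampling batch. Below is a minimal sketch of that idea, assuming the shaping terms are simply added to the correctness reward; the class name, coefficients, and exact combination are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict

class OutcomeExploration:
    """Illustrative outcome-based exploration bonuses (assumed names/coefficients)."""

    def __init__(self, ucb_coef=0.1, batch_penalty=0.05):
        self.ucb_coef = ucb_coef            # weight of the historical (UCB-style) bonus
        self.batch_penalty = batch_penalty  # weight of the within-batch repetition penalty
        self.counts = defaultdict(lambda: defaultdict(int))  # problem -> answer -> count

    def historical_bonus(self, problem, answer):
        # UCB-style bonus: final answers seen rarely for this problem get a larger bonus.
        n = self.counts[problem][answer]
        return self.ucb_coef / math.sqrt(n + 1)

    def shaped_rewards(self, problem, answers, correctness):
        """Combine outcome reward, historical bonus, and batch repetition penalty."""
        batch_counts = defaultdict(int)
        for a in answers:
            batch_counts[a] += 1
        rewards = []
        for a, correct in zip(answers, correctness):
            r = float(correct)                               # outcome-based reward (0 or 1)
            r += self.historical_bonus(problem, a)           # encourage rarely seen answers
            r -= self.batch_penalty * (batch_counts[a] - 1)  # discourage within-batch repeats
            rewards.append(r)
        # Update historical counts after scoring the whole batch.
        for a in answers:
            self.counts[problem][a] += 1
        return rewards

# Hypothetical usage: the repeated "42" is penalized relative to a unique answer,
# while a never-before-seen answer receives the largest historical bonus.
explorer = OutcomeExploration()
print(explorer.shaped_rewards("problem-1", ["42", "42", "17"], [True, True, False]))
```

In this sketch the historical bonus decays as an answer accumulates observations across training, while the batch penalty only depends on the current batch, mirroring the abstract's split between historical and batch exploration.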
Related papers
- Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration [49.9937230730202]
We propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories. We show that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales.
arXiv Detail & Related papers (2026-02-03T15:32:09Z) - SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning [50.93295951454092]
We introduce a set-level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage-shaping term for policy optimization. Experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.
arXiv Detail & Related papers (2026-02-01T07:13:20Z) - TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization [32.17940023097263]
Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. Current reinforcement learning (RL) frameworks for search-augmented reasoning rely on sparse outcome-level rewards. We propose Turn-level Stage-aware Policy Optimization (TSPO) to address this problem.
arXiv Detail & Related papers (2026-01-30T09:58:45Z) - The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling [39.65138471548881]
Reinforcement learning (RL) has been pivotal in enhancing the reasoning capabilities of large language models (LLMs). We propose SESA, a novel SEquential SAmpling framework that generates diverse solution sketches sequentially before expanding them into full reasoning paths. Our experiments on a synthetic task show that sequential sampling consistently outperforms traditional RL methods in terms of path diversity and recovery from collapse.
arXiv Detail & Related papers (2025-10-17T10:15:11Z) - Representation-Based Exploration for Language Models: From Test-Time to Post-Training [50.144031964319424]
Reinforcement learning (RL) promises to expand the capabilities of language models. It is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. We investigate the value of deliberate exploration -- explicitly incentivizing the model to discover novel and diverse behaviors.
arXiv Detail & Related papers (2025-10-13T17:49:05Z) - CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models. We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z) - Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective [52.38531288378491]
Reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs). In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration.
arXiv Detail & Related papers (2025-09-26T17:39:48Z) - RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z) - Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [82.43575191712726]
We introduce a fine-grained analytic framework to dissect the impact of reinforcement learning on reasoning. Our framework specifically investigates key elements that have been hypothesized to benefit from RL training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z) - Diversity-Aware Policy Optimization for Large Language Model Reasoning [30.460540027658173]
We investigate the impact of diversity in RL-based training for large language models. We propose a novel diversity-aware policy optimization method. Our method achieves a 3.5 percent average improvement across four mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-05-29T13:27:44Z) - Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning [55.36978389831446]
We recast reflective exploration within the Bayes-Adaptive RL framework. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on observed outcomes.
arXiv Detail & Related papers (2025-05-26T22:51:00Z) - Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration [33.807927649100805]
Reinforcement learning (RL) has emerged as a pivotal method for improving the reasoning capabilities of Large Language Models (LLMs). RL approaches face critical limitations due to their reliance on sparse outcome-based rewards and inadequate mechanisms for incentivizing exploration. We propose an Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR). i-MENTOR introduces three key innovations: trajectory-aware exploration rewards that mitigate bias in token-level strategies; dynamic reward scaling to stabilize exploration and exploitation in large action spaces; and advantage-preserving reward implementation that maintains
arXiv Detail & Related papers (2025-05-23T08:30:28Z) - LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization [30.95342819013663]
Large language models (LLMs) have demonstrated impressive capabilities in reasoning. Recent research focuses on integrating reasoning capabilities into the realm of retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches. We propose Learning to Think-and-Search (LeTS), a novel framework that brings a hybrid of stepwise process rewards and outcome-based rewards to current RL methods for RAG.
arXiv Detail & Related papers (2025-05-23T04:04:05Z) - GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z) - Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Preference-Based Multi-Agent Reinforcement Learning (PbMARL). We identify the Nash equilibrium from a preference-only offline dataset in general-sum games. Our findings underscore the multifaceted approach required for PbMARL.
arXiv Detail & Related papers (2024-09-01T13:14:41Z) - BADDr: Bayes-Adaptive Deep Dropout RL for POMDPs [22.78390558602203]
We present a representation-agnostic formulation of BRL under partial observability, unifying the previous models under one theoretical umbrella.
We also propose a novel derivation, Bayes-Adaptive Deep Dropout RL (BADDr), based on dropout networks.
arXiv Detail & Related papers (2022-02-17T19:48:35Z)