Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error
- URL: http://arxiv.org/abs/2510.26109v1
- Date: Thu, 30 Oct 2025 03:36:19 GMT
- Title: Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error
- Authors: Chenming Tang, Hsiu-Yuan Huang, Weijie Liu, Saiyong Yang, Yunfang Wu
- Abstract summary: LTE (Learning to reason from Trial and Error) is an approach that hints LLMs with their previously self-generated incorrect answers and the problem of overlong responses. Experiments validate the effectiveness of LTE, which outperforms normal group relative policy optimization (GRPO) by 6.38 in Pass@1 and 9.00 in Pass@k on average across six mathematics benchmarks for Qwen3-4B-Base.
- Score: 13.24687763539952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly boosted the reasoning capability of large language models (LLMs) recently. However, existing RLVR approaches merely train LLMs on their own generated responses and are constrained by the initial capability of LLMs, and are thus prone to exploration stagnation, in which LLMs fail to solve more training problems and cannot further learn from the training data. Some work tries to address this by leveraging off-policy solutions to training problems, but it requires external guidance from experts, which suffers from limited availability. In this work, we propose LTE (Learning to reason from Trial and Error), an approach that hints LLMs with their previously self-generated incorrect answers and the problem of overlong responses, and requires no external expert guidance. Experiments validate the effectiveness of LTE, which outperforms normal group relative policy optimization (GRPO) by 6.38 in Pass@1 and 9.00 in Pass@k on average across six mathematics benchmarks for Qwen3-4B-Base. Further analysis confirms that LTE successfully mitigates the problem of exploration stagnation and enhances both exploitation and exploration during training.
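The abstract describes the mechanism only at a high level. Below is a minimal, hypothetical sketch of what such trial-and-error hinting could look like inside a GRPO-style sampling loop, together with the standard unbiased Pass@k estimator used to report such results. The prompt template, the wrong-answer memory heuristic, and the helper names (`generate`, `verify`) are assumptions for illustration, not the paper's actual implementation.

```python
import math

# Hypothetical sketch of LTE-style trial-and-error hinting; the prompt
# wording, memory policy, and helper signatures are illustrative only.

HINT_TEMPLATE = (
    "{problem}\n\n"
    "Note: a previous attempt produced the incorrect answer {wrong}. "
    "Avoid repeating this mistake and keep the solution concise."
)

def build_prompt(problem: str, wrong_answers: list[str]) -> str:
    """Hint the model with its most recent self-generated wrong answer."""
    if not wrong_answers:
        return problem
    return HINT_TEMPLATE.format(problem=problem, wrong=wrong_answers[-1])

def lte_group(problem, gold, generate, verify, wrong_answers, group_size=8):
    """Sample one GRPO group and record new incorrect answers as future hints."""
    prompt = build_prompt(problem, wrong_answers)
    responses = [generate(prompt) for _ in range(group_size)]
    rewards = [1.0 if verify(r, gold) else 0.0 for r in responses]
    for r, rw in zip(responses, rewards):
        if rw == 0.0 and r not in wrong_answers:
            wrong_answers.append(r)  # do not step into the same river twice
    return responses, rewards

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize rewards within the group."""
    mu = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mu) / std for r in rewards]

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k),
    given c correct out of n sampled solutions."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

Pass@1 is the k = 1 special case and reduces to the fraction of correct samples, which is why Pass@1 and Pass@k gains can diverge: the latter rewards solving a problem at least once across the group.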
Related papers
- Are LLMs The Way Forward? A Case Study on LLM-Guided Reinforcement Learning for Decentralized Autonomous Driving [9.255259913388096]
Small, locally deployed Large Language Models (LLMs) can support autonomous highway driving through reward shaping rather than direct control. We present a case study comparing RL-only, LLM-only, and hybrid approaches. Our findings reveal that RL-only agents achieve moderate success rates (73-89%) with reasonable efficiency, LLM-only agents can reach higher success rates (up to 94%) but with severely degraded speed performance, and hybrid approaches consistently fall between these extremes. (A minimal reward-shaping sketch follows this entry.)
arXiv Detail & Related papers (2025-11-16T19:31:42Z)
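Assuming a gymnasium-style environment and a hypothetical `llm_score()` rater, the reward-shaping idea above can be sketched as blending the environment reward with an LLM-derived bonus, rather than letting the LLM pick actions:

```python
def shaped_reward(env_reward: float, state, action, llm_score, weight: float = 0.1) -> float:
    """Blend the environment reward with an LLM-derived shaping bonus.

    llm_score(state, action) -> [0, 1] is a hypothetical rater; the LLM
    nudges the reward signal instead of controlling the vehicle directly.
    """
    return env_reward + weight * llm_score(state, action)
```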
- Guiding Exploration in Reinforcement Learning Through LLM-Augmented Observations [0.0]
Large Language Models (LLMs) possess procedural knowledge and reasoning capabilities from text pretraining. We propose a framework that provides LLM-generated action recommendations through augmented observation spaces. (A minimal observation-augmentation sketch follows this entry.)
arXiv Detail & Related papers (2025-10-09T19:54:31Z)
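A minimal sketch of the observation-augmentation pattern, assuming a gymnasium environment with a Box observation space and discrete actions, plus a hypothetical `recommend_action()` LLM helper; this illustrates the general idea, not the paper's framework:

```python
import gymnasium as gym
import numpy as np

class LLMAdviceWrapper(gym.ObservationWrapper):
    """Append a one-hot LLM action recommendation to each observation."""

    def __init__(self, env, recommend_action):
        super().__init__(env)
        self.recommend_action = recommend_action  # hypothetical LLM helper
        self.n_actions = env.action_space.n
        low = np.concatenate([env.observation_space.low, np.zeros(self.n_actions)])
        high = np.concatenate([env.observation_space.high, np.ones(self.n_actions)])
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, obs):
        advice = np.zeros(self.n_actions, dtype=np.float32)
        advice[self.recommend_action(obs)] = 1.0  # LLM suggestion as a feature
        return np.concatenate([obs.astype(np.float32), advice])
```

The RL agent remains free to ignore the advice; the recommendation is just an extra input feature rather than a constraint on the policy.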
- Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning [46.610146536866445]
Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. We propose Rubric-Scaffolded Reinforcement Learning (RuscaRL) to break the exploration bottleneck for general reasoning. We show that RuscaRL significantly boosts Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1.
arXiv Detail & Related papers (2025-08-23T08:47:31Z)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [111.1749164063616]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem [53.3188041952701]
We show that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. With just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results are comparable to or even surpass those from RL with 20x less compute.
arXiv Detail & Related papers (2025-06-03T18:35:52Z)
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
- Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning [68.57166425493283]
Refusal-Aware Instruction Tuning (RAIT) enables Large Language Models (LLMs) to refuse to answer unknown questions. This crude approach can cause LLMs to excessively refuse answering questions they could have correctly answered. We introduce Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning (CRaFT) to address this issue.
arXiv Detail & Related papers (2024-10-09T14:12:51Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Rethinking with Retrieval: Faithful Large Language Model Inference [91.66406351103484]
We propose a novel post-processing approach, rethinking with retrieval (RR).
RR retrieves relevant external knowledge based on the reasoning steps obtained from chain-of-thought prompting. (A minimal per-step retrieval sketch follows this entry.)
We evaluate the effectiveness of RR through extensive experiments with GPT-3 on three complex reasoning tasks.
arXiv Detail & Related papers (2022-12-31T22:35:34Z)
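For the RR entry above, a minimal sketch of per-step retrieval scoring, assuming hypothetical `split_steps()`, `retrieve()`, and `supports()` helpers; the paper's actual faithfulness procedure may differ:

```python
def rethink_with_retrieval(cot_answer: str, split_steps, retrieve, supports) -> float:
    """Score a chain-of-thought answer by how well retrieved evidence
    supports each reasoning step; low-scoring answers can then be revised."""
    steps = split_steps(cot_answer)              # decompose CoT into steps
    scores = []
    for step in steps:
        evidence = retrieve(step)                # fetch external knowledge per step
        scores.append(supports(step, evidence))  # e.g., entailment score in [0, 1]
    return sum(scores) / max(len(scores), 1)     # faithfulness of the full chain
```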