RRO: LLM Agent Optimization Through Rising Reward Trajectories
- URL: http://arxiv.org/abs/2505.20737v1
- Date: Tue, 27 May 2025 05:27:54 GMT
- Title: RRO: LLM Agent Optimization Through Rising Reward Trajectories
- Authors: Zilong Wang, Jingfeng Yang, Sreyashi Nag, Samarth Varshney, Xianfeng Tang, Haoming Jiang, Jingbo Shang, Sheikh Muhammad Sarwar
- Abstract summary: Large language models (LLMs) have exhibited extraordinary performance in a variety of tasks. In practice, agents are sensitive to the outcomes of certain key steps, which makes them likely to fail the task. We propose Reward Rising Optimization (RRO) to mitigate this issue.
- Score: 52.579992804584464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have exhibited extraordinary performance on a variety of tasks, yet solving complex multi-step tasks as agents remains challenging for them. In practice, agents are sensitive to the outcomes of certain key steps, so a subtle mistake in the planning trajectory can make them fail the whole task. Recent approaches calibrate the reasoning process through reinforcement learning: they reward or penalize every reasoning step with process supervision, using what are known as Process Reward Models (PRMs). However, PRMs are difficult and costly to scale to a large number of next-action candidates, since acquiring their training data requires extensive computation for per-step trajectory exploration. To mitigate this issue, we focus on the relative reward trend across successive reasoning steps and propose maintaining an increasing reward over the collected trajectories for process supervision, which we term Reward Rising Optimization (RRO). Specifically, we incrementally expand the pool of next-action candidates until we identify a step exhibiting a positive reward differential, i.e., a rising reward, relative to the preceding step. This method dynamically expands the search space for next-action candidates, efficiently capturing high-quality data. We provide mathematical grounding and empirical results on the WebShop and InterCode-SQL benchmarks, showing that the proposed RRO achieves superior performance while requiring much lower exploration cost.
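As described, RRO keeps enlarging the pool of next-action candidates until one of them yields a reward strictly higher than the previous step's, then commits to it for process supervision. Below is a minimal sketch of that loop in Python; `propose_action` and `process_reward` are hypothetical helpers, and the doubling schedule with a `max_k` budget is an illustrative assumption, not the paper's specification.

```python
def collect_rising_step(trajectory, prev_reward, propose_action, process_reward,
                        init_k=2, max_k=16):
    """Expand the next-action candidate pool until some candidate's process
    reward rises above the previous step's reward (a sketch of RRO's
    reward-rising criterion, reconstructed from the abstract)."""
    candidates = []
    k = init_k
    while k <= max_k:
        # Incrementally grow the search space for next-action candidates.
        while len(candidates) < k:
            action = propose_action(trajectory)                # sample from the agent policy
            candidates.append((action, process_reward(trajectory, action)))
        best_action, best_reward = max(candidates, key=lambda c: c[1])
        if best_reward > prev_reward:                          # positive reward differential
            return best_action, best_reward                    # rising step found: keep it
        k *= 2                                                 # otherwise widen the search
    # Fallback (assumption): no rising candidate within budget; keep the best seen.
    return max(candidates, key=lambda c: c[1])
```

Because the expansion stops as soon as a rising reward appears, easy steps cost only a couple of rollouts, which is where the exploration savings over exhaustive per-step PRM data collection come from.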
Related papers
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents [43.806220882212386]
RLVMR integrates dense, process-level supervision into end-to-end RL by rewarding verifiable, meta-reasoning behaviors. On the challenging ALFWorld and ScienceWorld benchmarks, RLVMR achieves new state-of-the-art results.
arXiv Detail & Related papers (2025-07-30T17:00:48Z)
- Reward Reasoning Model [104.39256985858428]
Reward Reasoning Models (RRMs) are designed to execute a deliberate reasoning process before generating final rewards. We implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities. Notably, RRMs can adaptively exploit test-time compute to further improve reward accuracy.
arXiv Detail & Related papers (2025-05-20T17:58:03Z)
- Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning [45.10424242207931]
Retrieval-augmented generation (RAG) enhances the text generation capabilities of large language models (LLMs). We introduce a novel method, ReasonRAG, that automatically constructs RAG-ProGuide, a high-quality dataset providing process-level rewards for query generation, evidence extraction, and answer generation. With process-level policy optimization, the proposed framework empowers LLMs to autonomously invoke search, generate queries, extract relevant evidence, and produce final answers.
arXiv Detail & Related papers (2025-05-20T08:21:00Z)
- Process Reinforcement through Implicit Rewards [95.7442934212076]
Dense process rewards have proven a more effective alternative to sparse outcome-level rewards in the inference-time scaling of large language models (LLMs). Dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs, since their fine-grained signal has the potential to address some inherent issues of outcome rewards. However, training process reward models (PRMs) online is challenging, because collecting high-quality process labels is prohibitively expensive. We propose PRIME, which enables online PRM updates using only policy rollouts and outcome labels, through implicit process rewards.
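The implicit process reward here is, roughly, a scaled log-probability ratio between the implicit PRM (trained with outcome labels only) and a frozen reference model. The sketch below illustrates that idea; the tensor inputs and the `beta` coefficient are assumptions for illustration, not the paper's actual code.

```python
import torch

def implicit_process_rewards(logp_prm: torch.Tensor,
                             logp_ref: torch.Tensor,
                             beta: float = 0.05) -> torch.Tensor:
    """Per-step implicit rewards as a scaled log-prob ratio between an
    implicit PRM and a frozen reference model (a sketch of the idea
    behind PRIME, not its implementation).

    logp_prm, logp_ref: (seq_len,) tensors of per-token log-probabilities.
    """
    return beta * (logp_prm - logp_ref)
```

Because the reward falls out of the model's own token probabilities, no per-step labels are needed: outcome labels alone suffice to keep the PRM updated online.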
arXiv Detail & Related papers (2025-02-03T15:43:48Z)
- Outcome-Refining Process Supervision for Code Generation [28.6680126802249]
Large Language Models struggle with complex programming tasks that require deep algorithmic reasoning. We propose Outcome-Refining Process Supervision, a novel paradigm that treats outcome refinement itself as the process to be supervised. Our approach achieves significant improvements across 5 models and 3 datasets: an average increase of 26.9% in correctness and 42.2% in efficiency.
arXiv Detail & Related papers (2024-12-19T17:59:42Z) - Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [90.23629291067763]
A promising approach for improving reasoning in large language models is to use process reward models (PRMs).
PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs).
To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?"
We theoretically characterize the set of good provers, and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL.
arXiv Detail & Related papers (2024-10-10T17:31:23Z)
- Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement [50.481380478458945]
The Iterative step-level Process Refinement (IPR) framework provides detailed step-by-step guidance to enhance agent training.
Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines.
arXiv Detail & Related papers (2024-06-17T03:29:13Z)
- Let's Reinforce Step by Step [10.65244642965387]
We use Reinforcement Learning from Human Feedback to shape model reasoning processes.
Our results show that the fine-grained reward provided by PRM-based methods enhances accuracy on simple mathematical reasoning.
We also show the critical role reward aggregation functions play in model performance.
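The reward aggregation functions mentioned here collapse per-step PRM rewards into a single trajectory-level score. The toy sketch below shows three choices that are common in the PRM literature (min, mean, product); it is an illustration, not this paper's exact setup.

```python
import math

def aggregate(step_rewards: list[float], how: str = "min") -> float:
    """Collapse per-step PRM rewards into one trajectory-level score.
    These aggregators are common in the PRM literature; the paper's
    exact functions may differ."""
    if how == "min":
        return min(step_rewards)                 # trajectory is only as good as its worst step
    if how == "mean":
        return sum(step_rewards) / len(step_rewards)
    if how == "prod":
        return math.prod(step_rewards)           # treats rewards like step-success probabilities
    raise ValueError(f"unknown aggregator: {how}")
```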
arXiv Detail & Related papers (2023-11-10T01:35:51Z)
- Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
The Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs; a minimal sketch of this kind of search follows below.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similarly improved performance on code generation tasks.
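The sketch below illustrates PRM-guided greedy search: at each reasoning step, sample several candidate steps and keep the one the PRM scores highest. The `sample_step` and `score_step` helpers and the `<END>` terminal marker are hypothetical stand-ins, not the paper's components.

```python
def greedy_prm_search(prompt, sample_step, score_step,
                      max_steps=8, n_candidates=4):
    """Greedy decoding guided by a step-level reward model: at every
    step, keep the candidate the PRM ranks highest (a sketch, with
    hypothetical helpers)."""
    steps = []
    for _ in range(max_steps):
        candidates = [sample_step(prompt, steps) for _ in range(n_candidates)]
        best = max(candidates, key=lambda s: score_step(prompt, steps, s))
        steps.append(best)
        if best.endswith("<END>"):               # assumed terminal marker on string steps
            break
    return steps
```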
arXiv Detail & Related papers (2023-10-16T05:21:50Z)