From Scalar Rewards to Potential Trends: Shaping Potential Landscapes for Model-Based Reinforcement Learning
- URL: http://arxiv.org/abs/2602.03201v2
- Date: Tue, 10 Feb 2026 07:16:29 GMT
- Title: From Scalar Rewards to Potential Trends: Shaping Potential Landscapes for Model-Based Reinforcement Learning
- Authors: Yao-Hui Li, Zeyu Wang, Xin Li, Wei Pang, Yingfang Yuan, Zhengkun Chen, Boya Zhang, Riashat Islam, Alex Lamb, Yonggang Zhang
- Abstract summary: Shaping Landscapes with Optimistic Potential Estimates (SLOPE) is a novel framework that shifts reward modeling from predicting scalars to constructing informative potential landscapes. SLOPE employs optimistic distributional regression to estimate high-confidence upper bounds, which amplifies rare success signals and ensures sufficient exploration gradients. Evaluations on 30+ tasks across 5 benchmarks demonstrate that SLOPE consistently outperforms leading baselines in fully sparse, semi-sparse, and dense reward settings.
- Score: 22.59885243102632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model-based reinforcement learning (MBRL) achieves high sample efficiency by simulating future trajectories with learned dynamics and reward models. However, its effectiveness is severely compromised in sparse reward settings. The core limitation lies in the standard paradigm of regressing ground-truth scalar rewards: in sparse environments, this yields a flat, gradient-free landscape that fails to provide directional guidance for planning. To address this challenge, we propose Shaping Landscapes with Optimistic Potential Estimates (SLOPE), a novel framework that shifts reward modeling from predicting scalars to constructing informative potential landscapes. SLOPE employs optimistic distributional regression to estimate high-confidence upper bounds, which amplifies rare success signals and ensures sufficient exploration gradients. Evaluations on 30+ tasks across 5 benchmarks demonstrate that SLOPE consistently outperforms leading baselines in fully sparse, semi-sparse, and dense reward settings.
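The abstract does not give the regression objective, but "optimistic distributional regression to estimate high-confidence upper bounds" can be illustrated with quantile (pinball-loss) regression toward a high quantile of the reward. A minimal sketch, not the authors' implementation; the network shape and the 0.95 quantile are assumptions:

```python
import torch
import torch.nn as nn

class OptimisticRewardModel(nn.Module):
    """Predicts a high quantile of the reward instead of its mean,
    so rare successes lift the predicted landscape around them."""

    def __init__(self, state_dim: int, tau: float = 0.95):
        super().__init__()
        self.tau = tau  # target quantile; tau close to 1 = optimism
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)

    def loss(self, s: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # Pinball (quantile) loss: under-predicting a rare positive
        # reward costs tau/(1 - tau) times more than over-predicting,
        # pulling the estimate toward an optimistic upper bound.
        u = r - self.forward(s)
        return torch.mean(torch.maximum(self.tau * u, (self.tau - 1.0) * u))
```

Planning against such an upper-bound estimate yields nonzero gradients even in regions where the observed reward is almost always zero, which is the failure mode of mean regression the abstract describes.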
Related papers
- MARS: Margin-Aware Reward-Modeling with Self-Refinement [30.002638947792416]
Reward modeling is a core component of modern alignment pipelines including RLHF and RLAIF. We propose an adaptive, margin-aware augmentation and sampling strategy that explicitly targets ambiguous examples and failure modes of the reward model. We provide theoretical guarantees showing that this strategy increases the average curvature of the loss function, thereby enhancing its information content and improving conditioning.
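One way to realize margin-aware sampling is to upweight preference pairs where the reward model's margin is small (ambiguous) or negative (misranked). A hedged sketch under those assumptions; the sigmoid weighting and the temperature are illustrative choices, not the paper's:

```python
import torch

def margin_aware_weights(r_chosen: torch.Tensor,
                         r_rejected: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """Sampling weights that concentrate on ambiguous / failed pairs.

    Small or negative margins (the model barely prefers, or misranks,
    the chosen response) receive the largest weight.
    """
    margin = r_chosen - r_rejected          # negative = misranked pair
    weights = torch.sigmoid(-margin / temperature)
    return weights / weights.sum()          # normalize into a distribution

# Usage: pass `weights` to torch.utils.data.WeightedRandomSampler
# to resample the preference dataset for the next training round.
```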
arXiv Detail & Related papers (2026-02-19T18:59:03Z)
- Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling [49.41422138354821]
We propose a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process. We show that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
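The combination of a non-negative latent factorization with the Bradley-Terry likelihood can be sketched as follows; this is a point-estimate simplification of the paper's Bayesian treatment, and the softplus parameterization is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonNegativeBTReward(nn.Module):
    """Reward as a non-negative mixture of non-negative latent factors,
    trained with the Bradley-Terry preference likelihood."""

    def __init__(self, feat_dim: int, n_factors: int = 8):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, n_factors)
        self.mixing = nn.Parameter(torch.zeros(n_factors))

    def reward(self, x: torch.Tensor) -> torch.Tensor:
        factors = F.softplus(self.encoder(x))   # non-negative factors
        weights = F.softplus(self.mixing)       # non-negative mixing weights
        return factors @ weights                # (batch,) scalar rewards

    def bt_loss(self, x_chosen: torch.Tensor,
                x_rejected: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry: -log P(chosen preferred over rejected)
        return -F.logsigmoid(self.reward(x_chosen)
                             - self.reward(x_rejected)).mean()
```

Keeping both factors and mixing weights non-negative makes each factor contribute additively to the reward, which is where the interpretable decompositions come from.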
arXiv Detail & Related papers (2026-02-11T08:14:11Z)
- Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning [12.864604506942294]
We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration. OWMs incorporate optimism directly into model learning by augmenting the training objective with an optimistic dynamics loss. We instantiate OWMs within two state-of-the-art world model architectures, yielding Optimistic DreamerV3 and Optimistic STORM.
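The abstract does not specify the loss, but a common way to fold optimism into model learning is to trade off data likelihood against the value achievable under the learned dynamics. A purely hypothetical sketch of that shape; the additive form and the `beta` trade-off are assumptions, not the paper's objective:

```python
import torch

def optimistic_dynamics_loss(next_state_logprob: torch.Tensor,
                             predicted_value: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Standard negative log-likelihood for the dynamics model, minus a
    bonus for transitions under which the agent's predicted value is high.
    beta controls how far the model may bend toward optimistic dynamics."""
    nll = -next_state_logprob.mean()
    optimism_bonus = predicted_value.mean()
    return nll - beta * optimism_bonus
```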
arXiv Detail & Related papers (2026-02-10T18:11:00Z)
- Generative Actor Critic [74.04971271003869]
Generative Actor Critic (GAC) is a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-12-25T06:31:11Z)
- Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning [3.6333725470852443]
We explore how Monte Carlo Tree Search can be repurposed to improve policy optimization in preference-based reinforcement learning. We propose a staged GRPO training paradigm where completions are derived from partially revealed MCTS rollouts, introducing a novel tree-structured setting for advantage estimation. Our initial results indicate that while structured advantage estimation can stabilize training and better reflect reasoning quality, challenges such as advantage saturation and reward signal collapse remain.
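In GRPO-style training the advantage of each completion is standardized against its group; in a tree-structured setting, a natural group is the set of rollouts expanded from the same MCTS prefix. A sketch of that group-relative computation; the grouping-by-prefix convention is an assumption about the paper's setup:

```python
import torch

def tree_group_advantages(rewards: torch.Tensor,
                          group_ids: torch.Tensor,
                          eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages: standardize each completion's reward
    against the other completions sharing the same tree prefix."""
    adv = torch.zeros_like(rewards)
    for g in group_ids.unique():
        mask = group_ids == g
        r = rewards[mask]
        adv[mask] = (r - r.mean()) / (r.std() + eps)
    return adv
```

When every sibling under a prefix gets the same reward, the advantages collapse to zero, which is one concrete face of the "reward signal collapse" the summary mentions.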
arXiv Detail & Related papers (2025-09-11T09:18:07Z)
- Vision-driven River Following of UAV via Safe Reinforcement Learning using Semantic Dynamics Model [11.28895057233897]
Vision-driven autonomous river following by Unmanned Aerial Vehicles is critical for applications such as rescue, surveillance, and environmental monitoring. First, we introduce Marginal Gain Advantage Estimation, which refines the reward advantage function. Second, we develop a Semantic Dynamics Model (SDM) based on patchified water semantic masks. Third, we present the Constrained Actor Dynamics Estimator architecture, which integrates the actor, cost estimator, and SDM for cost advantage estimation.
arXiv Detail & Related papers (2025-08-13T17:39:09Z)
- GoIRL: Graph-Oriented Inverse Reinforcement Learning for Multimodal Trajectory Prediction [35.36975133932852]
Trajectory prediction for surrounding agents is a challenging task in autonomous driving. We introduce a novel Graph-oriented Inverse Reinforcement Learning framework, which is an IRL-based predictor equipped with vectorized context representations. Our approach achieves state-of-the-art performance on the large-scale Argoverse and nuScenes motion forecasting benchmarks.
arXiv Detail & Related papers (2025-06-26T09:46:53Z)
- LARES: Latent Reasoning for Sequential Recommendation [96.26996622771593]
We present LARES, a novel and scalable LAtent REasoning framework for Sequential recommendation. Our proposed approach employs a recurrent architecture that allows flexible expansion of reasoning depth without increasing parameter complexity. Our framework exhibits seamless compatibility with existing advanced models, further improving their recommendation performance.
arXiv Detail & Related papers (2025-05-22T16:22:54Z)
- DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing [60.21269454707625]
DreamSmooth learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep.
We show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks.
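The smoothed regression target can be produced by convolving an episode's reward sequence with a smoothing kernel before training the reward model. A minimal sketch assuming a Gaussian kernel; the kernel choice and width are illustrative, and the paper considers other smoothing functions as well:

```python
import numpy as np

def smooth_rewards(rewards: np.ndarray, sigma: float = 3.0) -> np.ndarray:
    """Temporally smooth a sparse reward sequence so the reward model
    regresses a gradual signal instead of a single spike."""
    t = np.arange(-3 * int(sigma), 3 * int(sigma) + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()                     # preserve total reward mass
    return np.convolve(rewards, kernel, mode="same")

# A reward of 1.0 at the goal timestep becomes a smooth bump around it,
# giving the learned reward model a target it can actually fit.
```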
arXiv Detail & Related papers (2023-11-02T17:57:38Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Acting upon Imagination: when to trust imagined trajectories in model based reinforcement learning [1.26990070983988]
Model-based reinforcement learning (MBRL) aims to learn a model of the environment dynamics that can predict the outcomes of the agent's actions.
We propose uncertainty estimation methods for online evaluation of imagined trajectories.
Results highlight a significant reduction in computational cost without sacrificing performance.
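A standard instantiation of such online evaluation is ensemble disagreement: trust an imagined rollout only while the dynamics ensemble agrees on where it is going. A sketch under that assumption; the threshold rule is illustrative, not necessarily the paper's criterion:

```python
import numpy as np

def trusted_horizon(ensemble_predictions: np.ndarray,
                    threshold: float = 0.05) -> int:
    """ensemble_predictions: (n_models, horizon, state_dim) imagined states.

    Returns the number of leading imagination steps where model
    disagreement (mean per-dimension std across the ensemble) stays
    below the threshold; later steps are discarded as untrustworthy."""
    disagreement = ensemble_predictions.std(axis=0).mean(axis=-1)  # (horizon,)
    above = np.nonzero(disagreement > threshold)[0]
    return int(above[0]) if above.size else disagreement.shape[0]
```

Truncating imagination at this horizon is one way to cut wasted planning compute without acting on predictions the model cannot support.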
arXiv Detail & Related papers (2021-05-12T15:04:07Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of such variations can be covered by a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.