Reward-Conditioned Reinforcement Learning
- URL: http://arxiv.org/abs/2603.05066v1
- Date: Thu, 05 Mar 2026 11:29:17 GMT
- Title: Reward-Conditioned Reinforcement Learning
- Authors: Michal Nauman, Marek Cygan, Pieter Abbeel,
- Abstract summary: We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications. RCRL conditions the agent on reward parameterizations and learns multiple reward objectives from shared replay data entirely off-policy. Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.
- Score: 56.417273471201845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: RL agents are typically trained under a single, fixed reward function, which makes them brittle to reward misspecification and limits their ability to adapt to changing task preferences. We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications while collecting experience under only one nominal objective. RCRL conditions the agent on reward parameterizations and learns multiple reward objectives from shared replay data entirely off-policy, enabling a single policy to represent reward-specific behaviors. Across single-task, multi-task, and vision-based benchmarks, we show that RCRL not only improves performance under the nominal reward parameterization, but also enables efficient adaptation to new parameterizations. Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.
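The core mechanism described in the abstract (collect experience under one nominal objective, then relabel the same replay data under many reward parameterizations for off-policy updates) can be sketched with tabular Q-learning. This is a hypothetical illustration, not the authors' implementation: the 5-state chain, the linear reward family r_w(s') = w . phi(s'), the epsilon-greedy collection policy, and all names (features, train_rcrl, greedy_action) are assumptions. The agent only ever acts under the nominal parameterization; each replay update samples a parameterization w and recomputes the reward from stored features.

```python
import random

# Hypothetical toy MDP (not from the paper): a 5-state chain with terminal
# states at both ends. The agent starts in the middle; actions: 0=left, 1=right.
N_STATES, START, GAMMA = 5, 2, 0.9
TERMINALS = {0, N_STATES - 1}


def features(next_state):
    """Reward features phi(s'): (reached left end, reached right end)."""
    return (float(next_state == 0), float(next_state == N_STATES - 1))


def reward(w, phi):
    """Assumed linear reward family: r_w = w . phi (an illustrative choice)."""
    return sum(wi * fi for wi, fi in zip(w, phi))


def step(state, action):
    next_state = state - 1 if action == 0 else state + 1
    return next_state, next_state in TERMINALS


def greedy_action(q, w, state):
    # Conditioning on w: in this tabular sketch w is part of the Q-table key;
    # a neural agent would instead receive w as an extra network input.
    return max((0, 1), key=lambda a: q.get((w, state, a), 0.0))


def train_rcrl(nominal_w, reward_family, episodes=2000, updates_per_episode=10,
               epsilon=0.3, alpha=0.1, seed=0):
    rng = random.Random(seed)
    q, replay = {}, []
    for _ in range(episodes):
        # 1) Collect experience with the policy conditioned on the nominal w only.
        state = START
        for _ in range(50):  # cap episode length
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = greedy_action(q, nominal_w, state)
            next_state, done = step(state, action)
            # Store reward *features* rather than a scalar reward,
            # so any parameterization w can relabel the transition later.
            replay.append((state, action, features(next_state), next_state, done))
            state = next_state
            if done:
                break
        # 2) Off-policy updates: relabel replayed transitions under a sampled w.
        for _ in range(updates_per_episode):
            s, a, phi, s2, done = rng.choice(replay)
            w = rng.choice(reward_family)
            target = reward(w, phi)
            if not done:
                target += GAMMA * max(q.get((w, s2, b), 0.0) for b in (0, 1))
            old = q.get((w, s, a), 0.0)
            q[(w, s, a)] = old + alpha * (target - old)
    return q


nominal, alt = (0.0, 1.0), (1.0, 0.0)  # reach right end vs. reach left end
q = train_rcrl(nominal, [nominal, alt])
print("nominal:", greedy_action(q, nominal, START), "alt:", greedy_action(q, alt, START))
```

Under the alternative parameterization w = (1, 0) with gamma = 0.9, the optimal values at the start state are Q*(2, left) = 0.9 and Q*(2, right) = 0.9^3 = 0.729, so the trained table should prefer moving left there, even though every episode was collected while pursuing the nominal objective.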
Related papers
- Decoupling Task and Behavior: A Two-Stage Reward Curriculum in Reinforcement Learning for Robotics [7.115267332079192]
We propose a two-stage reward curriculum where we decouple task-specific objectives from behavioral terms. In our method, we first train the agent on a simplified task-only reward function to ensure effective exploration. We validate our approach on the DeepMind Control Suite, ManiSkill3, and a mobile robot environment, modified to include auxiliary behavioral objectives.
(2026-03-05T12:34:27Z)
- Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers [55.33468902405567]
We propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback. ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
(2026-02-09T03:42:16Z)
- Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking [13.417125511014447]
We propose an automated framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or modify the proxy reward function using other approaches.
(2025-10-14T23:18:24Z)
- Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL). We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
(2025-09-23T16:15:42Z)
- Reset-free Reinforcement Learning with World Models [6.151562278670799]
We propose a model-based reset-free (MoReFree) agent for reinforcement learning. MoReFree adapts two key mechanisms, exploration and policy learning, to handle reset-free tasks. It exhibits superior data efficiency across various reset-free tasks without access to environmental reward or demonstrations.
(2024-08-19T08:56:00Z)
- Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning [49.87923965553233]
Reinforcement learning can lead to reward over-optimization (ROO) in large language models.
We introduce Reward Calibration from Demonstration (RCfD) to recalibrate the reward objective.
We show that RCfD achieves comparable performance to carefully tuned baselines while mitigating ROO.
(2024-04-30T09:57:21Z)
- REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
(2023-12-22T04:56:37Z)
- Teacher Forcing Recovers Reward Functions for Text Generation [21.186397113834506]
We propose a task-agnostic approach that derives a step-wise reward function directly from a model trained with teacher forcing.
We additionally propose a simple modification to stabilize the RL training on non-parallel datasets with our induced reward function.
(2022-10-17T02:48:58Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observing its behavior.
(2022-08-09T17:29:49Z)
- Hindsight Reward Tweaking via Conditional Deep Reinforcement Learning [37.61951923445689]
We propose a novel paradigm for deep reinforcement learning to model the influences of reward functions within a near-optimal space.
We demonstrate the feasibility of this approach and study one of its potential applications, policy performance boosting, on multiple MuJoCo tasks.
(2021-09-06T10:06:48Z)
- Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
(2021-02-24T18:46:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.