Reward Adaptation Via Q-Manipulation
- URL: http://arxiv.org/abs/2503.13414v1
- Date: Mon, 17 Mar 2025 17:42:54 GMT
- Title: Reward Adaptation Via Q-Manipulation
- Authors: Kevin Vora, Yu Zhang
- Abstract summary: We propose a new solution to reward adaptation (RA), the problem where the learning agent adapts to a target reward function based on one or multiple existing behaviors. Our work represents a new approach to RA via the manipulation of Q-functions. We refer to such a method as Q-Manipulation (Q-M).
- Score: 3.8065968624597324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a new solution to reward adaptation (RA), the problem where the learning agent adapts to a target reward function based on one or multiple existing behaviors learned a priori under the same domain dynamics but different reward functions. Learning the target behavior from scratch is possible but often inefficient given the available source behaviors. Our work represents a new approach to RA via the manipulation of Q-functions. Assuming that the target reward function is a known function of the source reward functions, our approach to RA computes bounds on the Q-function. We introduce an iterative process to tighten the bounds, similar to value iteration. This enables action pruning in the target domain before learning even starts. We refer to such a method as Q-Manipulation (Q-M). We formally prove that our pruning strategy does not affect the optimality of the returned policy, and we empirically show that it improves sample complexity. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.
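The abstract does not spell out the exact update rules, so the following is only a minimal tabular sketch of the bound-and-prune idea it describes, assuming known shared dynamics `P` and element-wise bounds `R_low <= R_target <= R_high` on the target reward derived from the source reward functions (all names here are hypothetical, not taken from the paper):

```python
import numpy as np

def tighten_and_prune(P, R_low, R_high, gamma, n_iter=200):
    """Sketch of a Q-Manipulation-style procedure: iteratively tighten lower/upper
    bounds on the target Q-function with value-iteration-like backups, then
    discard actions that are provably suboptimal before learning starts.
    P: (S, A, S) transition tensor; R_low, R_high: (S, A) reward bounds."""
    S, A, _ = P.shape
    Q_low, Q_high = np.zeros((S, A)), np.zeros((S, A))
    for _ in range(n_iter):
        Q_low = R_low + gamma * (P @ Q_low.max(axis=1))     # lower-bound backup
        Q_high = R_high + gamma * (P @ Q_high.max(axis=1))  # upper-bound backup
    # Prune action a in state s when its best case is below some action's worst case:
    # Q_high[s, a] < max_a' Q_low[s, a'] implies a cannot be optimal in s.
    admissible = Q_high >= Q_low.max(axis=1, keepdims=True)
    return Q_low, Q_high, admissible
```

Because a pruned action's upper bound lies below another action's lower bound, removing it cannot change the optimal policy, which is the optimality-preservation property the abstract claims; the learner then only explores the `admissible` actions.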
Related papers
- Improving the Effectiveness of Potential-Based Reward Shaping in Reinforcement Learning [0.5524804393257919]
We show how a simple linear shift of the potential function can be used to improve the effectiveness of reward shaping.
We show the theoretical limitations of continuous potential functions for correctly assigning positive and negative reward shaping values.
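For context, potential-based shaping adds F(s, s') = gamma * Phi(s') - Phi(s) to the environment reward, and the simplest instance of the linear shift mentioned above is adding a constant c to Phi, which moves every shaping term by c * (gamma - 1) without changing the optimal policy. The snippet below only checks that identity on a toy potential (all values hypothetical); it is not the construction studied in the paper:

```python
gamma = 0.99

def shaping(phi, s, s_next):
    # Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s).
    return gamma * phi(s_next) - phi(s)

phi = lambda s: -abs(s - 10)            # toy potential function
phi_shifted = lambda s: phi(s) + 50.0   # shift the potential by a constant c = 50

s, s_next = 3, 7
delta = shaping(phi_shifted, s, s_next) - shaping(phi, s, s_next)
# A constant shift changes every shaping term by c * (gamma - 1), so it can make
# shaping values uniformly more positive or negative while leaving relative
# action preferences (and hence the optimal policy) intact.
assert abs(delta - 50.0 * (gamma - 1)) < 1e-9
```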
arXiv Detail & Related papers (2025-02-03T12:32:50Z)
- Boosting Soft Q-Learning by Bounding [4.8748194765816955]
We show how any value function estimate can also be used to derive double-sided bounds on the optimal value function.
The derived bounds lead to new approaches for boosting training performance.
arXiv Detail & Related papers (2024-06-26T03:02:22Z)
- REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Dynamics-Aware Comparison of Learned Reward Functions [21.159457412742356]
The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world.
Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it.
We propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric.
arXiv Detail & Related papers (2022-01-25T03:48:00Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
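The mechanism this approach is known for is expectile regression: the state-value function is regressed toward Q-values of actions that actually appear in the dataset, with an asymmetric loss whose expectile parameter tau > 0.5 approximates a maximum over in-dataset actions. The numpy sketch below illustrates only that loss on hypothetical batch arrays; it is not the authors' implementation:

```python
import numpy as np

def expectile_value_loss(q_sa, v_s, tau=0.7):
    """Asymmetric (expectile) regression loss: fit V(s) toward Q(s, a) evaluated
    only at dataset actions; tau > 0.5 weights positive errors more heavily,
    pushing V toward an upper expectile of in-dataset Q-values without ever
    querying actions outside the dataset."""
    diff = q_sa - v_s
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))

# Hypothetical batch: Q-values of logged (s, a) pairs and current V(s) estimates.
q_sa = np.array([1.2, 0.4, 2.0, -0.3])
v_s = np.array([0.9, 0.8, 1.5, 0.1])
print(expectile_value_loss(q_sa, v_s))
```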
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Outcome-Driven Reinforcement Learning via Variational Inference [95.82770132618862]
We discuss a new perspective on reinforcement learning, recasting it as the problem of inferring actions that achieve desired outcomes, rather than a problem of maximizing rewards.
To solve the resulting outcome-directed inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function.
We empirically demonstrate that this method eliminates the need to design reward functions and leads to effective goal-directed behaviors.
arXiv Detail & Related papers (2021-04-20T18:16:21Z)
- Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
arXiv Detail & Related papers (2021-03-23T16:19:55Z)
- Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers [138.68213707587822]
We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning.
We show that this goal can be achieved by modifying the reward function to compensate for the difference in dynamics.
Our approach is applicable to domains with continuous states and actions and does not require learning an explicit model of the dynamics.
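One way to realize this, consistent with the "domain classifiers" in the title, is to augment the source-domain reward with the log-ratio of target to source transition probabilities, estimated from learned classifiers that distinguish the two domains. The snippet below is only a schematic of that correction with the densities passed in directly (hypothetical values), not the authors' training code:

```python
import numpy as np

def dynamics_reward_correction(p_target_sas, p_source_sas):
    """Schematic reward correction for off-dynamics transfer:
        delta_r(s, a, s') = log p_target(s' | s, a) - log p_source(s' | s, a).
    In practice both densities are unknown; the log-ratio is estimated from
    domain classifiers rather than from explicit dynamics models."""
    return np.log(p_target_sas) - np.log(p_source_sas)

# Hypothetical next-state probabilities under target vs. source dynamics.
print(dynamics_reward_correction(np.array([0.2, 0.5]), np.array([0.4, 0.5])))
```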
arXiv Detail & Related papers (2020-06-24T17:47:37Z)
- Quantifying Differences in Reward Functions [24.66221171351157]
We introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly.
We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy.
arXiv Detail & Related papers (2020-06-24T17:35:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.