Provably Efficient Reward Transfer in Reinforcement Learning with Discrete Markov Decision Processes
- URL: http://arxiv.org/abs/2503.13414v3
- Date: Wed, 22 Oct 2025 17:22:42 GMT
- Title: Provably Efficient Reward Transfer in Reinforcement Learning with Discrete Markov Decision Processes
- Authors: Kevin Vora, Yu Zhang
- Abstract summary: We propose a new solution to reward adaptation (RA) in reinforcement learning. We introduce a new approach to RA through the manipulation of Q-functions. We refer to this method as "Q-Manipulation" (Q-M).
- Score: 2.9388795721577328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a new solution to reward adaptation (RA) in reinforcement learning, where the agent adapts to a target reward function based on one or more existing source behaviors learned a priori under the same domain dynamics but different reward functions. While learning the target behavior from scratch is possible, it is often inefficient given the available source behaviors. Our work introduces a new approach to RA through the manipulation of Q-functions. Assuming the target reward function is a known function of the source reward functions, we compute bounds on the Q-function and present an iterative process (akin to value iteration) to tighten these bounds. Such bounds enable action pruning in the target domain before learning even starts. We refer to this method as "Q-Manipulation" (Q-M). The iteration process assumes access to a lite-model, which is easy to provide or learn. We formally prove that Q-M, under discrete domains, does not affect the optimality of the returned policy and show that it is provably efficient in terms of sample complexity in a probabilistic sense. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.
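The bound-then-prune idea in the abstract can be made concrete in a small tabular setting. The sketch below is a minimal illustration only: it assumes the target reward is the sum of two source rewards and that the full transition tensor `P` is available, whereas the paper derives its bounds from source Q-functions and a lite-model. All names (`solve_q`, `qm_prune`) are hypothetical, not the authors' code.

```python
import numpy as np

def solve_q(P, R, gamma, n_iters=500):
    """Standard tabular value iteration on Q.

    P: transition tensor of shape (S, A, S'); R: reward matrix (S, A).
    """
    q = np.zeros(R.shape)
    for _ in range(n_iters):
        q = R + gamma * P @ q.max(axis=1)  # Bellman optimality backup
    return q

def qm_prune(P, R1, R2, gamma, n_tighten=50):
    """Bound Q* of the target reward R1 + R2, tighten the bounds with
    value-iteration-like backups, and prune dominated actions."""
    q1, q2 = solve_q(P, R1, gamma), solve_q(P, R2, gamma)
    R = R1 + R2
    # Q*_{R1+R2} <= Q*_{R1} + Q*_{R2}, so the sum is a valid upper bound.
    q_up = q1 + q2
    # Pessimistic initialization gives a valid lower bound.
    q_lo = np.full_like(R, R.min() / (1.0 - gamma))
    for _ in range(n_tighten):
        # The Bellman operator is monotone, so each backup preserves
        # (and tightens) both bounds.
        q_up = R + gamma * P @ q_up.max(axis=1)
        q_lo = R + gamma * P @ q_lo.max(axis=1)
    best_lo = q_lo.max(axis=1, keepdims=True)
    # Action a is provably suboptimal in s if its upper bound falls
    # below the best lower bound in s; it can be skipped before
    # learning the target behavior even starts.
    return q_up >= best_lo  # boolean mask of actions kept
```

Because the optimal action's upper bound always dominates the best lower bound, pruning never removes an optimal action, matching the paper's claim that the returned policy's optimality is unaffected.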
Related papers
- Recursive Reward Aggregation [51.552609126905885]
We propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function. By introducing an algebraic perspective on Markov decision processes (MDPs), we show that the Bellman equations naturally emerge from the generation and aggregation of rewards. Our approach applies to both deterministic and stochastic settings and seamlessly integrates with value-based and actor-critic algorithms.
arXiv Detail & Related papers (2025-07-11T12:37:20Z) - Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits [58.63897489864948]
Reinforcement learning with outcome-based feedback faces a fundamental challenge: how do we assign credit to the right actions? This paper provides the first comprehensive analysis of this problem in online RL with general function approximation.
arXiv Detail & Related papers (2025-05-26T17:44:08Z) - Improving the Effectiveness of Potential-Based Reward Shaping in Reinforcement Learning [0.5524804393257919]
We show how a simple linear shift of the potential function can be used to improve the effectiveness of reward shaping. We show the theoretical limitations of continuous potential functions for correctly assigning positive and negative reward shaping values.
arXiv Detail & Related papers (2025-02-03T12:32:50Z) - Proto Successor Measure: Representing the Behavior Space of an RL Agent [37.55496993803242]
"Zero-shot learning" is elusive for general-purpose reinforcement learning algorithms. We present Proto Successor Measure: the basis set for all possible behaviors of a Reinforcement Learning Agent. We derive a practical algorithm to learn these basis functions using reward-free interaction data from the environment.
arXiv Detail & Related papers (2024-11-29T00:09:39Z) - Boosting Soft Q-Learning by Bounding [4.8748194765816955]
We show how any value function estimate can also be used to derive double-sided bounds on the optimal value function.
The derived bounds lead to new approaches for boosting training performance.
arXiv Detail & Related papers (2024-06-26T03:02:22Z) - A Generalized Acquisition Function for Preference-based Reward Learning [12.158619866176487]
Preference-based reward learning is a popular technique for teaching robots and autonomous systems how a human user wants them to perform a task.
Previous works have shown that actively synthesizing preference queries to maximize information gain about the reward function parameters improves data efficiency.
We show that it is possible to optimize for learning the reward function up to a behavioral equivalence class, such as inducing the same ranking over behaviors, distribution over choices, or other related definitions of what makes two rewards similar.
arXiv Detail & Related papers (2024-03-09T20:32:17Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z) - Dynamics-Aware Comparison of Learned Reward Functions [21.159457412742356]
The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world.
Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it.
We propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric.
arXiv Detail & Related papers (2022-01-25T03:48:00Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Online reinforcement learning with sparse rewards through an active inference capsule [62.997667081978825]
This paper introduces an active inference agent which minimizes a novel objective, the free energy of the expected future.
Our model is capable of solving sparse-reward problems with a very high sample efficiency.
We also introduce a novel method for approximating the prior model from the reward function, which simplifies the expression of complex objectives.
arXiv Detail & Related papers (2021-06-04T10:03:36Z) - Outcome-Driven Reinforcement Learning via Variational Inference [95.82770132618862]
We discuss a new perspective on reinforcement learning, recasting it as the problem of inferring actions that achieve desired outcomes, rather than a problem of maximizing rewards.
To solve the resulting outcome-directed inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function.
We empirically demonstrate that this method eliminates the need to design reward functions and leads to effective goal-directed behaviors.
arXiv Detail & Related papers (2021-04-20T18:16:21Z) - Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
arXiv Detail & Related papers (2021-03-23T16:19:55Z) - Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers [138.68213707587822]
We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning.
We show that we can achieve this goal by compensating for the difference in dynamics by modifying the reward function.
Our approach is applicable to domains with continuous states and actions and does not require learning an explicit model of the dynamics.
arXiv Detail & Related papers (2020-06-24T17:47:37Z) - Quantifying Differences in Reward Functions [24.66221171351157]
We introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly.
We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy.
arXiv Detail & Related papers (2020-06-24T17:35:15Z)
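The EPIC entry above lends itself to a short tabular sketch. The snippet below follows the published recipe (canonicalize each reward, then take the Pearson distance) but simplifies it by assuming uniform coverage distributions over states and actions and rewards given as dense `(S, A, S')` arrays; it is an illustration, not the authors' implementation.

```python
import numpy as np

def canonicalize(R, gamma):
    """Canonically shape a reward array R of shape (S, A, S') so that any
    potential-based shaping term gamma*phi(s') - phi(s) cancels out.
    Expectations are taken under uniform state/action distributions."""
    mean_next = R.mean(axis=(1, 2))  # E_{A,S'}[R(s, A, S')] for each state s
    mean_all = R.mean()              # E_{S,A,S'}[R]
    return (R
            + gamma * mean_next[None, None, :]  # + gamma * E[R(s', A, S')]
            - mean_next[:, None, None]          # - E[R(s, A, S')]
            - gamma * mean_all)                 # - gamma * E[R(S, A, S')]

def epic_distance(Ra, Rb, gamma):
    """EPIC-style distance: Pearson distance between canonicalized rewards,
    in [0, 1], with 0 meaning the rewards induce the same optimal policy."""
    ca = canonicalize(Ra, gamma).ravel()
    cb = canonicalize(Rb, gamma).ravel()
    rho = np.corrcoef(ca, cb)[0, 1]
    return np.sqrt(max(0.0, (1.0 - rho) / 2.0))
```

Because canonicalization cancels potential-based shaping and Pearson correlation is invariant to positive rescaling, a reward and its shaped, positively scaled variant get distance approximately 0, which is exactly the equivalence class the paper's invariance result describes.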
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.