Admissible Policy Teaching through Reward Design
- URL: http://arxiv.org/abs/2201.02185v1
- Date: Thu, 6 Jan 2022 18:49:57 GMT
- Title: Admissible Policy Teaching through Reward Design
- Authors: Kiarash Banihashem, Adish Singla, Jiarui Gan, Goran Radanovic
- Abstract summary: We study reward design strategies for incentivizing a reinforcement learning agent to adopt a policy from a set of admissible policies.
The goal of the reward designer is to modify the underlying reward function cost-efficiently while ensuring that any approximately optimal deterministic policy under the new reward function is admissible.
- Score: 32.39785256112934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reward design strategies for incentivizing a reinforcement learning
agent to adopt a policy from a set of admissible policies. The goal of the
reward designer is to modify the underlying reward function cost-efficiently
while ensuring that any approximately optimal deterministic policy under the
new reward function is admissible and performs well under the original reward
function. This problem can be viewed as a dual to the problem of optimal reward
poisoning attacks: instead of forcing an agent to adopt a specific policy, the
reward designer incentivizes an agent to avoid taking actions that are
inadmissible in certain states. Perhaps surprisingly, and in contrast to the
problem of optimal reward poisoning attacks, we first show that the reward
design problem for admissible policy teaching is computationally challenging,
and it is NP-hard to find an approximately optimal reward modification. We then
proceed by formulating a surrogate problem whose optimal solution approximates
the optimal solution to the reward design problem in our setting, but is more
amenable to optimization techniques and analysis. For this surrogate problem,
we present characterization results that provide bounds on the value of the
optimal solution. Finally, we design a local search algorithm to solve the
surrogate problem and showcase its utility using simulation-based experiments.
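As a concrete illustration of the setting (not the paper's surrogate formulation or its local search algorithm), the toy sketch below greedily penalizes inadmissible state-action pairs in a small random MDP until the optimal deterministic policy is admissible, and reports the size of the reward modification. All symbols and values (P, R, ADMISSIBLE, gamma, the L1 modification cost) are illustrative assumptions.

```python
# Toy sketch only: greedily penalize inadmissible state-action pairs until the
# optimal deterministic policy is admissible; this is NOT the paper's
# surrogate-based local search, and all quantities below are assumptions.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a]: distribution over next states
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # original reward function
ADMISSIBLE = {0: {0}, 1: {0, 1}, 2: {1}}                           # assumed admissible actions per state

def q_values(reward, tol=1e-8):
    """Optimal Q-values via value iteration."""
    V = np.zeros(n_states)
    while True:
        Q = reward + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q
        V = V_new

R_hat = R.copy()
for _ in range(100):                        # greedy local search over reward modifications
    Q = q_values(R_hat)
    pi = Q.argmax(axis=1)                   # optimal deterministic policy under the modified reward
    bad = [s for s in range(n_states) if pi[s] not in ADMISSIBLE[s]]
    if not bad:
        break                               # every optimal action is now admissible
    s = bad[0]
    best_adm = max(ADMISSIBLE[s], key=lambda a: Q[s, a])
    # push the offending action's reward just below the best admissible alternative
    R_hat[s, pi[s]] -= (Q[s, pi[s]] - Q[s, best_adm]) + 0.01

print("policy under modified reward:", q_values(R_hat).argmax(axis=1))
print("modification cost (L1):", round(float(np.abs(R_hat - R).sum()), 3))
```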
Related papers
- Informativeness of Reward Functions in Reinforcement Learning [34.40155383189179]
We study the problem of designing informative reward functions so that the designed rewards speed up the agent's convergence.
Existing works have considered several different reward design formulations.
We propose a reward informativeness criterion that adapts w.r.t. the agent's current policy and can be optimized under specified structural constraints.
arXiv Detail & Related papers (2024-02-10T18:36:42Z) - REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Behavior Alignment via Reward Function Optimization [23.92721220310242]
We introduce a new framework that integrates auxiliary rewards reflecting a designer's domain knowledge with the environment's primary rewards.
We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges.
arXiv Detail & Related papers (2023-10-29T13:45:07Z) - Reward Learning as Doubly Nonparametric Bandits: Optimal Design and Scaling Laws [22.099915149343957]
We propose a theoretical framework for studying reward learning and the associated optimal experiment design problem.
We first derive non-asymptotic excess risk bounds for a simple plug-in estimator based on ridge regression.
We then solve the query design problem by optimizing these risk bounds with respect to the choice of query set and obtain a finite sample statistical rate.
arXiv Detail & Related papers (2023-02-23T22:07:33Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective; a toy sketch of such a penalized clipped objective appears after this list.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
arXiv Detail & Related papers (2022-05-24T06:15:51Z) - Hindsight Reward Tweaking via Conditional Deep Reinforcement Learning [37.61951923445689]
We propose a novel paradigm for deep reinforcement learning to model the influences of reward functions within a near-optimal space.
We demonstrate the feasibility of this approach and study one of its potential applications, policy performance boosting, on multiple MuJoCo tasks.
arXiv Detail & Related papers (2021-09-06T10:06:48Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Outcome-Driven Reinforcement Learning via Variational Inference [95.82770132618862]
We discuss a new perspective on reinforcement learning, recasting it as the problem of inferring actions that achieve desired outcomes, rather than a problem of maximizing rewards.
To solve the resulting outcome-directed inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function.
We empirically demonstrate that this method eliminates the need to design reward functions and leads to effective goal-directed behaviors.
arXiv Detail & Related papers (2021-04-20T18:16:21Z) - Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL).
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
arXiv Detail & Related papers (2020-11-05T05:34:14Z)
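For the Penalized Proximal Policy Optimization (P3O) entry above, the rough sketch below shows one way a penalized clipped objective of that flavor can be written: a PPO-style clipped surrogate for reward combined with an exact-penalty term on the cost constraint, so the constrained update reduces to a single unconstrained minimization. The exact objective in the paper may differ; the names and constants here (kappa, d, eps, the advantage estimates) are assumptions.

```python
# Rough sketch of a penalized, clipped objective in the spirit of P3O; the
# paper's exact formulation may differ, and all constants here are assumptions.
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate (a quantity to be maximized)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv).mean()

def p3o_style_loss(ratio, adv_r, adv_c, prev_cost, d=25.0, kappa=20.0, eps=0.2):
    """Single unconstrained loss: maximize the reward surrogate and pay an
    exact penalty whenever the estimated cost return exceeds the limit d."""
    reward_term = clipped_surrogate(ratio, adv_r, eps)
    est_cost = prev_cost + clipped_surrogate(ratio, adv_c, eps)  # first-order estimate of the new cost return
    return -reward_term + kappa * max(0.0, est_cost - d)

# Toy usage with a random batch of probability ratios and advantages.
rng = np.random.default_rng(1)
ratio = rng.lognormal(mean=0.0, sigma=0.1, size=64)   # new/old policy probability ratios
adv_r = rng.normal(size=64)                           # reward advantages
adv_c = rng.normal(size=64)                           # cost advantages
print(p3o_style_loss(ratio, adv_r, adv_c, prev_cost=24.0))
```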