Risk-averse Batch Active Inverse Reward Design
- URL: http://arxiv.org/abs/2311.12004v1
- Date: Mon, 20 Nov 2023 18:36:10 GMT
- Title: Risk-averse Batch Active Inverse Reward Design
- Authors: Panagiotis Liampas
- Abstract summary: Active Inverse Reward Design (AIRD) proposed the use of a series of queries, comparing possible reward functions in a single training environment.
It ignores the possibility of unknown features appearing in real-world environments, and the safety measures needed until the agent completely learns the reward function.
I improved this method and created Risk-averse Batch Active Inverse Reward Design (RBAIRD), which constructs batches, sets of environments the agent encounters when being used in the real world, processes them sequentially, and, for a predetermined number of iterations, asks queries that the human needs to answer for each environment of the batch.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Designing a perfect reward function that captures all aspects of the
intended behavior is almost impossible, especially one that generalizes beyond
the training environments. Active Inverse Reward Design (AIRD) proposed the use
of a series of queries, comparing possible reward functions in a single
training environment. This allows the human to give information to the agent
about suboptimal behaviors, in order to compute a probability distribution over
the intended reward function. However, it ignores the possibility of unknown
features appearing in real-world environments, and the safety measures needed
until the agent completely learns the reward function. I improved this method
and created Risk-averse Batch Active Inverse Reward Design (RBAIRD), which
constructs batches (sets of environments the agent encounters when deployed in
the real world), processes them sequentially, and, for a predetermined number
of iterations, asks queries that the human answers for each environment in the
batch. Once a batch has been processed, the refined probability distribution is
transferred to the next batch. This makes the agent capable
of adapting to real-world scenarios and learning how to treat unknown features
it encounters for the first time. I also integrated a risk-averse planner,
similar to that of Inverse Reward Design (IRD), which samples a set of reward
functions from the probability distribution and computes a trajectory that
pursues the most certain rewards available. This ensures safety while the agent
is still learning the reward function, and enables the use of this approach in
situations where caution is vital. RBAIRD outperformed the previous
approaches in terms of efficiency, accuracy, and action certainty, demonstrated
quick adaptability to new, unknown features, and can be more widely used for
the alignment of crucial, powerful AI models.
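
To make the batch process concrete, below is a minimal Python sketch of the query loop the abstract describes. It is an illustrative reconstruction under simplifying assumptions, not the paper's implementation: the names (rbaird, evaluate, human_choice, beta), the query-selection heuristic, and the Boltzmann-style choice model are placeholders standing in for the actual AIRD machinery.

import numpy as np

def rbaird(batches, proxies, prior, evaluate, human_choice,
           num_iterations, query_size=2, beta=5.0):
    """Sketch of the RBAIRD outer loop (all names and models are assumptions).

    batches        -- list of batches, each a list of environments
    proxies        -- candidate (proxy) reward functions
    prior          -- initial probability vector over `proxies`
    evaluate       -- evaluate(true_reward, proxy, env): return obtained by
                      acting optimally for `proxy`, scored under `true_reward`
    human_choice   -- human_choice(env, query): index of the proxy in `query`
                      that the human prefers for this environment
    num_iterations -- fixed number of query rounds per environment
    """
    belief = np.asarray(prior, dtype=float)
    for batch in batches:
        for _ in range(num_iterations):
            for env in batch:
                # Query: show the human a small set of proxy rewards; here
                # simply the ones the current belief is least decided about.
                order = np.argsort(-belief * (1.0 - belief))
                query = [proxies[i] for i in order[:query_size]]
                picked = query[human_choice(env, query)]

                # Reweight every hypothesis about the true reward by how
                # plausible the human's pick is under it (Boltzmann model).
                likelihood = np.empty(len(proxies))
                for i, true_r in enumerate(proxies):
                    scores = np.array([evaluate(true_r, p, env) for p in query])
                    chosen = evaluate(true_r, picked, env)
                    likelihood[i] = np.exp(beta * chosen) / np.exp(beta * scores).sum()
                belief = belief * likelihood
                belief = belief / belief.sum()
        # The refined belief carries over unchanged to the next batch, so
        # knowledge about previously encountered features is retained.
    return belief

Because the belief is simply carried from one batch to the next, features first seen in a later batch are handled with whatever has already been learned, and only the remaining uncertainty drives further queries.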
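
The risk-averse planner can be sketched in the same spirit: sample reward functions from the current belief and prefer trajectories whose worst-case return over those samples is highest. This maximin rule is one plausible reading of "a trajectory that pursues the most certain rewards"; the planner in the paper (following IRD) may use a different risk measure and optimize trajectories directly rather than scoring a fixed candidate set, so treat the functions below as a sketch, not the method itself.

import numpy as np

def sample_rewards(proxies, belief, n_samples, rng=None):
    """Draw reward functions from the current belief over proxies."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(proxies), size=n_samples, p=belief)
    return [proxies[i] for i in idx]

def risk_averse_plan(candidate_trajectories, reward_samples, env):
    """Choose the trajectory whose worst-case return over the sampled
    reward functions is highest (maximin)."""
    best_traj, best_score = None, -np.inf
    for traj in candidate_trajectories:
        returns = np.array([r(env, traj) for r in reward_samples])
        worst_case = returns.min()   # pessimistic evaluation
        if worst_case > best_score:
            best_traj, best_score = traj, worst_case
    return best_traj

A trajectory only scores well here if every sampled reward agrees it is good, which is what keeps behavior cautious on features the agent has not yet learned about.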
Related papers
- No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery [53.08822154199948]
Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks.
This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics.
We develop a method that directly trains on scenarios with high learnability.
arXiv Detail & Related papers (2024-08-27T14:31:54Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Behavior Alignment via Reward Function Optimization [23.92721220310242]
We introduce a new framework that integrates auxiliary rewards reflecting a designer's domain knowledge with the environment's primary rewards.
We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges.
arXiv Detail & Related papers (2023-10-29T13:45:07Z)
- Generalized Differentiable RANSAC [95.95627475224231]
$\nabla$-RANSAC is a differentiable RANSAC that allows learning the entire randomized robust estimation pipeline.
$\nabla$-RANSAC is superior to the state-of-the-art in terms of accuracy while running at a similar speed to its less accurate alternatives.
arXiv Detail & Related papers (2022-12-26T15:13:13Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Efficient Exploration of Reward Functions in Inverse Reinforcement Learning via Bayesian Optimization [43.51553742077343]
Inverse reinforcement learning (IRL) is relevant to a variety of tasks, including value alignment and robot learning from demonstration.
This paper presents an IRL framework called Bayesian optimization-IRL (BO-IRL) which identifies multiple solutions consistent with the expert demonstrations.
arXiv Detail & Related papers (2020-11-17T10:17:45Z)
- Bayesian Robust Optimization for Imitation Learning [34.40385583372232]
Inverse reinforcement learning can enable generalization to new states by learning a parameterized reward function.
Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework.
BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors.
arXiv Detail & Related papers (2020-07-24T01:52:11Z)
- Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.