Predictable Reinforcement Learning Dynamics through Entropy Rate
Minimization
- URL: http://arxiv.org/abs/2311.18703v3
- Date: Mon, 19 Feb 2024 12:52:32 GMT
- Title: Predictable Reinforcement Learning Dynamics through Entropy Rate
Minimization
- Authors: Daniel Jarne Ornia, Giannis Delimpaltadakis, Jens Kober, Javier
Alonso-Mora
- Abstract summary: In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors.
We propose a novel method to induce predictable behavior in RL agents, referred to as Predictability-Aware RL (PA-RL).
We show how the entropy rate can be formulated as an average-reward objective; since its entropy reward function is policy-dependent, we introduce an action-dependent surrogate entropy.
- Score: 17.845518684835913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Reinforcement Learning (RL), agents have no incentive to exhibit
predictable behaviors, and are often pushed (through e.g. policy entropy
regularization) to randomize their actions in favor of exploration. From a
human perspective, this makes RL agents hard to interpret and predict, and from
a safety perspective, even harder to formally verify. We propose a novel method
to induce predictable behavior in RL agents, referred to as
Predictability-Aware RL (PA-RL), which employs the state sequence entropy rate
as a predictability measure. We show how the entropy rate can be formulated as
an average reward objective, and since its entropy reward function is
policy-dependent, we introduce an action-dependent surrogate entropy enabling
the use of policy-gradient methods. We prove that deterministic policies
minimizing the average surrogate reward exist and also minimize the actual
entropy rate, and show how, given a learned dynamical model, we are able to
approximate the value function associated with the true entropy rate. Finally, we
demonstrate the effectiveness of the approach in RL tasks inspired by
human-robot use-cases, and show how it produces agents with more predictable
behavior while achieving near-optimal rewards.
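To make the surrogate-entropy idea concrete, here is a minimal tabular sketch (our illustration, not the paper's implementation): given a learned transition model P_hat of shape (states, actions, next states), the action-dependent surrogate entropy is the entropy of the model's predicted next-state distribution, and it is subtracted from the task reward. The function names and the trade-off weight beta are assumptions.

```python
import numpy as np

def surrogate_entropy_reward(P_hat: np.ndarray) -> np.ndarray:
    """Action-dependent surrogate entropy h(s, a) = H[P_hat(. | s, a)]:
    entropy of the next-state distribution predicted by the learned
    model for each state-action pair. Returns shape (S, A)."""
    p = np.clip(P_hat, 1e-12, 1.0)        # guard against log(0)
    return -(p * np.log(p)).sum(axis=-1)

def predictability_aware_reward(r_task: np.ndarray, P_hat: np.ndarray,
                                beta: float = 0.5) -> np.ndarray:
    """Penalize the task reward with the surrogate entropy so that
    actions with easily predictable outcomes are preferred.
    beta is an assumed trade-off weight, not a value from the paper."""
    return r_task - beta * surrogate_entropy_reward(P_hat)
```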
Related papers
- MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention [81.56607128684723]
We introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient alignment from human intervention.
MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions.
It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function.
arXiv Detail & Related papers (2024-06-24T01:51:09Z)
- Predicting AI Agent Behavior through Approximation of the Perron-Frobenius Operator [4.076790923976287]
We treat AI agents as nonlinear dynamical systems and adopt a probabilistic perspective to predict their statistical behavior.
We formulate the approximation of the Perron-Frobenius (PF) operator as an entropy minimization problem.
Our data-driven methodology simultaneously approximates the PF operator to predict both the evolution and the terminal probability density of the agents; a discretized sketch follows this entry.
arXiv Detail & Related papers (2024-06-04T19:06:49Z)
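For intuition about the operator-based prediction above, here is a minimal discretized sketch (ours): estimate a row-stochastic matrix approximating the Perron-Frobenius operator from transition counts, then push a state density forward to predict the agents' statistical and terminal behavior. The count-based fit stands in for the paper's entropy-minimization formulation, and all names are assumptions.

```python
import numpy as np

def estimate_pf_operator(trajectories, n_states: int) -> np.ndarray:
    """Finite-dimensional approximation of the Perron-Frobenius
    operator as a row-stochastic matrix of normalized transition
    counts. Rows with no observed data default to uniform."""
    counts = np.zeros((n_states, n_states))
    for traj in trajectories:
        for s, s_next in zip(traj[:-1], traj[1:]):
            counts[s, s_next] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums,
                     out=np.full_like(counts, 1.0 / n_states),
                     where=row_sums > 0)

def propagate_density(P: np.ndarray, rho0: np.ndarray, steps: int) -> np.ndarray:
    """Push a probability density over states forward through the
    estimated operator to predict the agent's future distribution."""
    rho = rho0
    for _ in range(steps):
        rho = rho @ P
    return rho
```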
- Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning [6.937243101289336]
Entropy-minimizing and entropy-maximizing objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments.
We propose an agent that can adapt its objective online, depending on the entropy conditions of the environment, by framing the choice as a multi-armed bandit problem; a minimal sketch of this framing follows the entry.
We demonstrate that such agents can learn to control entropy and exhibit emergent behaviors in both high- and low-entropy regimes.
arXiv Detail & Related papers (2024-05-27T14:58:24Z)
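A minimal sketch of the bandit framing in the entry above: a two-armed UCB bandit that chooses online between the entropy-minimizing and entropy-maximizing objectives. The UCB rule and the generic scalar feedback are our assumptions; the paper defines its own surprise-based feedback signal.

```python
import numpy as np

class EntropyObjectiveBandit:
    """Two-armed UCB bandit choosing between an entropy-minimizing
    objective (arm 0) and an entropy-maximizing one (arm 1)."""

    def __init__(self, c: float = 2.0):
        self.c = c                  # exploration coefficient (assumed)
        self.counts = np.zeros(2)   # pulls per arm
        self.values = np.zeros(2)   # running mean feedback per arm

    def select_arm(self) -> int:
        if self.counts.min() == 0:  # play each arm once first
            return int(np.argmin(self.counts))
        t = self.counts.sum()
        ucb = self.values + self.c * np.sqrt(np.log(t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm: int, feedback: float) -> None:
        """feedback: scalar measure of how well the chosen objective
        performed this episode (the paper's signal is surprise-based)."""
        self.counts[arm] += 1
        self.values[arm] += (feedback - self.values[arm]) / self.counts[arm]
```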
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods benefit significantly from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is typically employed in RL as a passive tool for re-weighting historical samples.
We instead look for the best behavioral policy from which to collect samples in order to reduce the variance of the policy gradient; a sketch of the reweighted estimator follows this entry.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
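As a sketch of the reweighting described above (ours, with assumed shapes and names): a vanilla importance-sampled policy-gradient estimate over N trajectories. The paper's "active" contribution, choosing the behavioral policy that minimizes this estimator's variance, is only indicated in the docstring.

```python
import numpy as np

def is_policy_gradient(grad_log_pi: np.ndarray, returns: np.ndarray,
                       logp_target: np.ndarray, logp_behavior: np.ndarray) -> np.ndarray:
    """Importance-sampled policy gradient over N trajectories:
        g = (1/N) * sum_i w_i * R_i * grad log pi_target(tau_i),
        w_i = exp(log pi_target(tau_i) - log pi_behavior(tau_i)).
    grad_log_pi: (N, d) per-trajectory score vectors.
    returns, logp_target, logp_behavior: (N,) arrays.
    The 'active' step would pick the behavioral policy that
    minimizes the variance of this estimator."""
    w = np.exp(logp_target - logp_behavior)          # importance weights
    return ((w * returns)[:, None] * grad_log_pi).mean(axis=0)
```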
- REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Model Predictive Control with Gaussian-Process-Supported Dynamical Constraints for Autonomous Vehicles [82.65261980827594]
We propose a model predictive control approach for autonomous vehicles that exploits learned Gaussian processes for predicting human driving behavior.
A multi-mode predictive control approach considers the possible intentions of the human drivers.
arXiv Detail & Related papers (2023-03-08T17:14:57Z)
- Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed action distribution (sketched after this entry).
We evaluate our method on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
arXiv Detail & Related papers (2022-12-14T22:43:56Z)
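A minimal sketch of the perturbed action distribution, under the assumption that the perturbation adds uniform noise U(-alpha, alpha) to the Gaussian mean before sampling; the function name and the default alpha are ours.

```python
import numpy as np

def rpo_sample_action(mu, sigma, alpha: float = 0.5, rng=None):
    """RPO-style sampling: perturb the mean of the Gaussian action
    distribution with uniform noise U(-alpha, alpha) before sampling,
    keeping the policy from collapsing to a deterministic mean.
    alpha is a tunable width (0.5 is an assumed default here)."""
    rng = rng or np.random.default_rng()
    mu_perturbed = mu + rng.uniform(-alpha, alpha, size=np.shape(mu))
    return rng.normal(mu_perturbed, sigma)
```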
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
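The error relationship in the entry above can be illustrated in a few lines (ours): with belief b over hidden states and predicted distribution q, the expected prediction reward is sum_s b(s) log q(s), which coincides with the negative belief entropy when q = b; the paper bounds the gap for general prediction rewards.

```python
import numpy as np

def negative_belief_entropy(belief: np.ndarray) -> float:
    """Belief-based reward: -H(b) = sum_s b(s) log b(s)."""
    b = np.clip(belief, 1e-12, 1.0)
    return float((belief * np.log(b)).sum())

def expected_prediction_reward(belief: np.ndarray, q: np.ndarray) -> float:
    """Expected prediction reward sum_s b(s) log q(s) for a predicted
    distribution q over hidden states; with q = belief this equals the
    negative belief entropy."""
    q = np.clip(q, 1e-12, 1.0)
    return float((belief * np.log(q)).sum())
```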