Predictable Reinforcement Learning Dynamics through Entropy Rate
Minimization
- URL: http://arxiv.org/abs/2311.18703v3
- Date: Mon, 19 Feb 2024 12:52:32 GMT
- Title: Predictable Reinforcement Learning Dynamics through Entropy Rate
Minimization
- Authors: Daniel Jarne Ornia, Giannis Delimpaltadakis, Jens Kober, Javier
Alonso-Mora
- Abstract summary: In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors.
We propose a novel method to induce predictable behavior in RL agents, referred to as Predictability-Aware RL (PA-RL).
We show how the entropy rate can be formulated as an average reward objective, and since its entropy reward function is policy-dependent, we introduce an action-dependent surrogate entropy.
- Score: 17.845518684835913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Reinforcement Learning (RL), agents have no incentive to exhibit
predictable behaviors, and are often pushed (through e.g. policy entropy
regularization) to randomize their actions in favor of exploration. From a
human perspective, this makes RL agents hard to interpret and predict, and from
a safety perspective, even harder to formally verify. We propose a novel method
to induce predictable behavior in RL agents, referred to as
Predictability-Aware RL (PA-RL), which employs the state sequence entropy rate
as a predictability measure. We show how the entropy rate can be formulated as
an average reward objective, and since its entropy reward function is
policy-dependent, we introduce an action-dependent surrogate entropy enabling
the use of policy-gradient methods. We prove that deterministic policies
minimizing the average surrogate reward exist and also minimize the actual
entropy rate, and show how, given a learned dynamical model, we can
approximate the value function associated with the true entropy rate. Finally, we
demonstrate the effectiveness of the approach in RL tasks inspired by
human-robot use-cases, and show how it produces agents with more predictable
behavior while achieving near-optimal rewards.
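As a concrete illustration of the abstract's core idea, the following is a minimal sketch (not the paper's implementation) of how an action-dependent surrogate entropy reward, computed from a learned dynamics model's predicted next-state distribution, could be combined with the task reward so that a standard policy-gradient method trades return against predictability. The function names, the Shannon-entropy form of the surrogate, and the trade-off weight `beta` are illustrative assumptions.
```python
import numpy as np

def surrogate_entropy_reward(p_next, eps=1e-12):
    """Action-dependent surrogate entropy: Shannon entropy of the predicted
    next-state distribution P(s' | s, a) from a learned dynamics model.
    (Illustrative form only; the paper defines its own surrogate.)"""
    p = np.clip(p_next, eps, 1.0)
    return -float(np.sum(p * np.log(p)))  # low value = more predictable transition

def predictability_aware_reward(task_reward, p_next, beta=0.1):
    """Combine the task reward with a penalty on the surrogate entropy, so an
    average-reward policy-gradient method trades off return against the
    predictability of the induced state sequence. beta is an assumed weight."""
    return task_reward - beta * surrogate_entropy_reward(p_next)

# A nearly deterministic predicted transition is penalized less than an
# uncertain one, even when the task reward is identical.
print(predictability_aware_reward(1.0, np.array([0.98, 0.01, 0.01])))
print(predictability_aware_reward(1.0, np.array([0.34, 0.33, 0.33])))
```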
Related papers
- MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention [81.56607128684723]
We introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient alignment from human intervention.
MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions.
It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function.
arXiv Detail & Related papers (2024-06-24T01:51:09Z) - The Limits of Pure Exploration in POMDPs: When the Observation Entropy is Enough [40.82741665804367]
We study a simple approach of maximizing the entropy over observations in place of the true latent states.
We show how knowledge of the latter can be exploited to compute a regularization of the observation entropy that improves performance in a principled way.
arXiv Detail & Related papers (2024-06-18T17:00:13Z) - Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning [6.937243101289336]
Entropy-minimizing and entropy-maximizing objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments.
We propose an agent that can adapt its objective online, depending on the entropy conditions, by framing the choice as a multi-armed bandit problem (a minimal bandit sketch appears after this list).
We demonstrate that such agents can learn to control entropy and exhibit emergent behaviors in both high- and low-entropy regimes.
arXiv Detail & Related papers (2024-05-27T14:58:24Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
arXiv Detail & Related papers (2022-12-14T22:43:56Z) - Do You Need the Entropy Reward (in Practice)? [29.811723497181486]
It is believed that the regularization imposed by entropy, on both policy improvement and policy evaluation, contributes to good exploration, training convergence, and robustness of learned policies.
This paper takes a closer look at entropy as an intrinsic reward by conducting various ablation studies on soft actor-critic (SAC); a sketch of where the entropy term enters the SAC objectives appears after this list.
Our findings reveal that in general, entropy rewards should be applied with caution to policy evaluation.
arXiv Detail & Related papers (2022-01-28T21:43:21Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - APS: Active Pretraining with Successor Features [96.24533716878055]
We show that by reinterpreting and combining successor features with nonparametric entropy maximization, the intractable mutual information can be efficiently optimized.
The proposed method, Active Pretraining with Successor Features (APS), explores the environment via nonparametric entropy maximization, and the explored data can be efficiently leveraged to learn behavior.
arXiv Detail & Related papers (2021-08-31T16:30:35Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Maximizing Information Gain in Partially Observable Environments via
Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
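Two of the related entries above describe their mechanisms concretely enough to sketch. First, for the surprise-adaptive intrinsic-motivation entry, here is a minimal sketch of framing the choice between an entropy-minimizing and an entropy-maximizing objective as a two-armed bandit; the UCB1 rule and the feedback signal used below are assumptions, not that paper's exact algorithm.
```python
import math, random

class ObjectiveBandit:
    """Two-armed bandit that picks between an entropy-minimizing and an
    entropy-maximizing intrinsic objective each episode (UCB1 is an assumed
    choice of bandit rule)."""
    ARMS = ("minimize_entropy", "maximize_entropy")

    def __init__(self):
        self.counts = [0, 0]
        self.values = [0.0, 0.0]

    def select(self):
        for i, n in enumerate(self.counts):
            if n == 0:
                return i  # try each arm once before using UCB
        total = sum(self.counts)
        ucb = [self.values[i] + math.sqrt(2 * math.log(total) / self.counts[i])
               for i in range(2)]
        return max(range(2), key=lambda i: ucb[i])

    def update(self, arm, feedback):
        """feedback: a scalar measuring how well the agent controlled the
        environment's entropy this episode; running-average value update."""
        self.counts[arm] += 1
        self.values[arm] += (feedback - self.values[arm]) / self.counts[arm]

bandit = ObjectiveBandit()
for episode in range(5):
    arm = bandit.select()
    feedback = random.random()  # placeholder for the entropy-control signal
    bandit.update(arm, feedback)
    print(episode, ObjectiveBandit.ARMS[arm], round(feedback, 2))
```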
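Second, for the entropy-reward entry ("Do You Need the Entropy Reward (in Practice)?"), this is a minimal sketch of where the entropy term enters soft actor-critic: as a bonus in the critic's soft Bellman target (policy evaluation) and as a term in the actor loss (policy improvement). Toggling `entropy_in_evaluation` mimics the kind of ablation that paper studies; the flag name and toy numbers are assumptions.
```python
import numpy as np

def sac_critic_target(reward, q_next, logp_next, gamma=0.99, alpha=0.2,
                      entropy_in_evaluation=True):
    """Soft Bellman target r + gamma * (Q(s', a') - alpha * log pi(a'|s')).
    With entropy_in_evaluation=False, the entropy bonus is dropped from
    policy evaluation, which is the ablation of interest."""
    bonus = -alpha * logp_next if entropy_in_evaluation else 0.0
    return reward + gamma * (q_next + bonus)

def sac_actor_loss(q_value, logp, alpha=0.2):
    """Policy-improvement objective: minimize alpha * log pi - Q, i.e.
    maximize Q plus policy entropy (entropy kept here in both variants)."""
    return float(np.mean(alpha * logp - q_value))

# Same toy transition, with and without the entropy reward in evaluation.
print(sac_critic_target(1.0, q_next=5.0, logp_next=-1.5))
print(sac_critic_target(1.0, q_next=5.0, logp_next=-1.5, entropy_in_evaluation=False))
```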