Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization
- URL: http://arxiv.org/abs/2311.18703v4
- Date: Sun, 02 Feb 2025 19:19:53 GMT
- Title: Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization
- Authors: Daniel Jarne Ornia, Giannis Delimpaltadakis, Jens Kober, Javier Alonso-Mora
- Abstract summary: In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors.
We propose a novel method to induce predictable behavior in RL agents, termed Predictability-Aware RL (PARL).
Our method maximizes a linear combination of a standard discounted reward and the negative entropy rate, thus trading off optimality with predictability.
- Score: 16.335645061396455
- Abstract: In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors, and are often pushed (through e.g. policy entropy regularisation) to randomise their actions in favor of exploration. This often makes it challenging for other agents and humans to predict an agent's behavior, triggering unsafe scenarios (e.g. in human-robot interaction). We propose a novel method to induce predictable behavior in RL agents, termed Predictability-Aware RL (PARL), employing the agent's trajectory entropy rate to quantify predictability. Our method maximizes a linear combination of a standard discounted reward and the negative entropy rate, thus trading off optimality with predictability. We show how the entropy rate can be formally cast as an average reward, how entropy-rate value functions can be estimated from a learned model and incorporate this in policy-gradient algorithms, and demonstrate how this approach produces predictable (near-optimal) policies in tasks inspired by human-robot use-cases.
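As an illustration of the trade-off described in the abstract, here is a minimal sketch (not the authors' implementation), assuming a discrete MDP and a learned transition model `p_hat`: the entropy of the predicted next-state distribution serves as the local contribution to the trajectory entropy rate and is subtracted from the task reward with a weight `beta` before any standard policy-gradient update. The paper casts the entropy rate as an average reward; this sketch simply folds the per-step entropy into the reward signal.
```python
import numpy as np

def local_entropy(p_next):
    """Shannon entropy (in nats) of a predicted next-state distribution."""
    p = np.clip(p_next, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def augmented_reward(r_task, p_next, beta=0.1):
    """Linear trade-off from the abstract: task reward minus a weighted
    local entropy term, so maximizing it also pushes the entropy rate down."""
    return r_task - beta * local_entropy(p_next)

def shape_trajectory(rewards, states, actions, p_hat, beta=0.1):
    """Apply the trade-off along a trajectory, given a learned model
    p_hat[s, a] -> next-state distribution (an assumed interface)."""
    return [augmented_reward(r, p_hat[s, a], beta)
            for r, s, a in zip(rewards, states, actions)]
```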
Related papers
- MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention [81.56607128684723]
We introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient alignment from human intervention.
MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions.
It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function.
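A rough tabular sketch of the residual idea (a simplified stand-in, not MEReQ itself, which uses a maximum-entropy formulation): a residual Q-function is trained on the inferred residual reward, and the deployed policy acts greedily on the sum of the prior and residual Q-values. The names `q_prior`, `r_residual`, and the learning-rate values are assumptions.
```python
import numpy as np

def residual_q_update(q_res, s, a, r_residual, s_next, alpha=0.1, gamma=0.99):
    """One TD step for the residual Q-table, driven only by the residual reward."""
    target = r_residual + gamma * np.max(q_res[s_next])
    q_res[s, a] += alpha * (target - q_res[s, a])
    return q_res

def act(q_prior, q_res, s):
    """Deployed policy: greedy on the prior Q plus the learned residual Q."""
    return int(np.argmax(q_prior[s] + q_res[s]))
```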
arXiv Detail & Related papers (2024-06-24T01:51:09Z) - Predicting AI Agent Behavior through Approximation of the Perron-Frobenius Operator [4.076790923976287]
We treat AI agents as nonlinear dynamical systems and adopt a probabilistic perspective to predict their statistical behavior.
We formulate the approximation of the Perron-Frobenius (PF) operator as an entropy minimization problem.
Our data-driven methodology approximates the PF operator to predict both the evolution of the agents' state distribution and their terminal probability density.
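A minimal count-based sketch of the transfer-operator view (the paper instead fits the Perron-Frobenius operator by solving an entropy-minimization problem): estimate a discretized, row-stochastic transition matrix from binned trajectory data and push an initial density forward to predict the terminal density. The binning and variable names are assumptions.
```python
import numpy as np

def estimate_pf_matrix(state_bins, n_bins):
    """Count-based estimate of a discretized Perron-Frobenius (transfer) operator."""
    P = np.zeros((n_bins, n_bins))
    for s, s_next in zip(state_bins[:-1], state_bins[1:]):
        P[s, s_next] += 1.0
    rows = P.sum(axis=1, keepdims=True)
    # Unvisited bins fall back to a uniform row so P stays row-stochastic.
    return np.divide(P, rows, out=np.full_like(P, 1.0 / n_bins), where=rows > 0)

def propagate_density(p0, P, steps):
    """Predict the agents' state density after `steps` transitions."""
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = p @ P
    return p
```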
arXiv Detail & Related papers (2024-06-04T19:06:49Z) - Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning [6.937243101289336]
Entropy-minimizing and entropy-maximizing objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments.
We propose an agent that can adapt its objective online, depending on the entropy conditions, by framing the choice as a multi-armed bandit problem.
We demonstrate that such agents can learn to control entropy and exhibit emergent behaviors in both high- and low-entropy regimes.
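A minimal sketch of the bandit framing mentioned above, under assumptions of my own (UCB1 as the bandit rule; the per-episode feedback signal measuring control over entropy is left abstract): each episode, one of the two intrinsic objectives, entropy-minimizing or entropy-maximizing, is chosen as an arm.
```python
import numpy as np

class ObjectiveBandit:
    """UCB1 over two arms: 0 = entropy-minimizing, 1 = entropy-maximizing."""

    def __init__(self, n_arms=2, c=2.0):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.c = c

    def select(self):
        untried = np.flatnonzero(self.counts == 0)
        if untried.size > 0:
            return int(untried[0])
        t = self.counts.sum()
        ucb = self.values + np.sqrt(self.c * np.log(t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm, feedback):
        """`feedback` is whatever signal rates the chosen objective this episode."""
        self.counts[arm] += 1
        self.values[arm] += (feedback - self.values[arm]) / self.counts[arm]
```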
arXiv Detail & Related papers (2024-05-27T14:58:24Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
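A minimal sketch of the importance-sampled policy-gradient estimate the blurb builds on: samples collected under a behavioural policy are re-weighted by the likelihood ratio before the REINFORCE-style gradient is formed. The "active" part of the paper, choosing the behavioural policy that minimizes the variance of this estimator, is not shown.
```python
import numpy as np

def is_policy_gradient(log_pi_target, log_pi_behavior, grad_log_pi_target, returns):
    """Off-policy REINFORCE-style estimate:
    mean over samples of w * grad(log pi_target) * G, with w = pi_target / pi_behavior."""
    w = np.exp(log_pi_target - log_pi_behavior)     # importance weights, shape (N,)
    scale = w * returns                              # per-sample scalar, shape (N,)
    return np.mean(scale[:, None] * grad_log_pi_target, axis=0)   # shape (d,)
```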
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Model Predictive Control with Gaussian-Process-Supported Dynamical Constraints for Autonomous Vehicles [82.65261980827594]
We propose a model predictive control approach for autonomous vehicles that exploits learned Gaussian processes for predicting human driving behavior.
A multi-mode predictive control approach considers the possible intentions of the human drivers.
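A minimal sketch of the prediction side only, on assumed toy data (the multi-mode MPC itself is omitted): a Gaussian process is fitted to the observed human-vehicle trajectory and queried over the control horizon; its predictive mean and standard deviation are what a chance or robust constraint in the MPC would consume.
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D observations of a human driver's position over time (assumed data).
t_obs = np.linspace(0.0, 4.0, 20)[:, None]
y_obs = 1.5 * t_obs.ravel() + 0.1 * np.random.default_rng(0).standard_normal(20)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(t_obs, y_obs)

# Predicted human trajectory (mean and uncertainty) over the MPC horizon.
t_horizon = np.linspace(4.0, 6.0, 10)[:, None]
mean, std = gp.predict(t_horizon, return_std=True)
```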
arXiv Detail & Related papers (2023-03-08T17:14:57Z) - Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed action distribution.
We evaluate our method on various continuous control tasks from DeepMind Control, OpenAI Gym, PyBullet, and IsaacGym.
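A minimal sketch of a perturbed Gaussian action distribution in the spirit of the description above (the uniform perturbation of the mean and the value of `alpha` are assumptions of this illustration, not a verbatim account of RPO):
```python
import numpy as np

def sample_perturbed_gaussian(mu, sigma, alpha=0.5, rng=None):
    """Sample from N(mu + z, sigma) with z ~ Uniform(-alpha, alpha), which keeps
    some exploration even if the learned sigma collapses during training."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.uniform(-alpha, alpha, size=np.shape(mu))
    return rng.normal(mu + z, sigma)
```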
arXiv Detail & Related papers (2022-12-14T22:43:56Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
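A numeric sketch of a performance-risk trade-off consistent with the description above (the lambda/alpha values and the use of an empirical CVaR over sampled reward hypotheses are assumptions; the policy-gradient machinery of PG-BROIL is omitted): lambda = 1 recovers the risk-neutral objective and lambda = 0 the fully risk-averse one.
```python
import numpy as np

def risk_blended_objective(returns_per_hypothesis, lam=0.5, alpha=0.95):
    """lam * E[J] + (1 - lam) * CVaR_alpha(J), where J are policy returns
    evaluated under sampled reward-function hypotheses."""
    J = np.asarray(returns_per_hypothesis, dtype=float)
    var = np.quantile(J, 1.0 - alpha)       # lower-tail value at risk
    cvar = J[J <= var].mean()               # mean return in the worst tail
    return lam * J.mean() + (1.0 - lam) * cvar
```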
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
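A small numeric check of the kind of relation the blurb points to (the paper's exact statement may differ; this is only the standard identity): for a belief b and a predicted distribution q, the expected log-prediction reward equals the negative entropy of b minus KL(b || q), so the gap between the two quantities is exactly that KL term.
```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def kl(p, q):
    p, q = np.clip(p, 1e-12, 1.0), np.clip(q, 1e-12, 1.0)
    return np.sum(p * np.log(p / q))

b = np.array([0.7, 0.2, 0.1])   # belief over hidden states
q = np.array([0.5, 0.3, 0.2])   # prediction used for the reward

expected_prediction_reward = np.sum(b * np.log(q))   # E_{s~b}[log q(s)]
assert np.isclose(expected_prediction_reward, -entropy(b) - kl(b, q))
```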
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.