Examining Policy Entropy of Reinforcement Learning Agents for Personalization Tasks
- URL: http://arxiv.org/abs/2211.11869v4
- Date: Sat, 27 Apr 2024 12:14:48 GMT
- Title: Examining Policy Entropy of Reinforcement Learning Agents for Personalization Tasks
- Authors: Anton Dereventsov, Andrew Starnes, Clayton G. Webster,
- Abstract summary: This effort is focused on examining the behavior of reinforcement learning systems in personalization environments.
We provide a wide range of numerical experiments as well as theoretical justification to show that these differences in entropy are due to the type of learning being employed.
- Score: 0.40964539027092917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This effort is focused on examining the behavior of reinforcement learning systems in personalization environments and detailing the differences in policy entropy associated with the type of learning algorithm utilized. We demonstrate that Policy Optimization agents often possess low-entropy policies during training, which in practice results in agents prioritizing certain actions and avoiding others. Conversely, we also show that Q-Learning agents are far less susceptible to such behavior and generally maintain high-entropy policies throughout training, which is often preferable in real-world applications. We provide a wide range of numerical experiments as well as theoretical justification to show that these differences in entropy are due to the type of learning being employed.
Related papers
- Overestimation, Overfitting, and Plasticity in Actor-Critic: the Bitter Lesson of Reinforcement Learning [1.0762853848552156]
We implement over 60 different off-policy agents, each integrating established regularization techniques from recent state-of-the-art algorithms.
We tested these agents across 14 diverse tasks from 2 simulation benchmarks, measuring training metrics related to overestimation, overfitting, and plasticity loss.
A simple Soft Actor-Critic agent, appropriately regularized, reliably finds a better-performing policy within the training regime.
arXiv Detail & Related papers (2024-03-01T13:25:10Z) - Invariant Causal Imitation Learning for Generalizable Policies [87.51882102248395]
We propose Invariant Causal Learning (ICIL) to learn an imitation policy.
ICIL learns a representation of causal features that is disentangled from the specific representations of noise variables.
We show that ICIL is effective in learning imitation policies capable of generalizing to unseen environments.
arXiv Detail & Related papers (2023-11-02T16:52:36Z) - Increasing Entropy to Boost Policy Gradient Performance on
Personalization Tasks [0.46040036610482665]
We consider the impact of regularization on the diversity of actions taken by policies generated from reinforcement learning agents trained using a policy gradient.
numerical evidence is given to show that policy regularization increases performance without losing accuracy.
arXiv Detail & Related papers (2023-10-09T01:03:05Z) - Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step
Q-learning: A Novel Correction Approach [0.0]
We introduce a novel policy similarity measure to mitigate the effects of such discrepancy in continuous control.
Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks.
arXiv Detail & Related papers (2022-08-01T11:33:12Z) - Learning Dynamics and Generalization in Reinforcement Learning [59.530058000689884]
We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training.
We show that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly networks and gradient networks trained with policy methods.
arXiv Detail & Related papers (2022-06-05T08:49:16Z) - Influencing Long-Term Behavior in Multiagent Reinforcement Learning [59.98329270954098]
We propose a principled framework for considering the limiting policies of other agents as the time approaches infinity.
Specifically, we develop a new optimization objective that maximizes each agent's average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will take on.
Thanks to our farsighted evaluation, we demonstrate better long-term performance than state-of-the-art baselines in various domains.
arXiv Detail & Related papers (2022-03-07T17:32:35Z) - Instance based Generalization in Reinforcement Learning [24.485597364200824]
We analyze policy learning in the context of Partially Observable Markov Decision Processes (POMDPs)
We prove that, independently of the exploration strategy, reusing instances introduces significant changes on the effective Markov dynamics the agent observes during training.
We propose training a shared belief representation over an ensemble of specialized policies, from which we compute a consensus policy that is used for data collection, disallowing instance specific exploitation.
arXiv Detail & Related papers (2020-11-02T16:19:44Z) - Dynamics Generalization via Information Bottleneck in Deep Reinforcement
Learning [90.93035276307239]
We propose an information theoretic regularization objective and an annealing-based optimization method to achieve better generalization ability in RL agents.
We demonstrate the extreme generalization benefits of our approach in different domains ranging from maze navigation to robotic tasks.
This work provides a principled way to improve generalization in RL by gradually removing information that is redundant for task-solving.
arXiv Detail & Related papers (2020-08-03T02:24:20Z) - Invariant Policy Optimization: Towards Stronger Generalization in
Reinforcement Learning [5.476958867922322]
A fundamental challenge in reinforcement learning is to learn policies that generalize beyond the operating domains experienced during training.
We propose a novel learning algorithm, Invariant Policy Optimization (IPO), that implements this principle and learns an invariant policy during training.
arXiv Detail & Related papers (2020-06-01T17:28:19Z) - Reward-Conditioned Policies [100.64167842905069]
imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.