Latent State Marginalization as a Low-cost Approach for Improving
Exploration
- URL: http://arxiv.org/abs/2210.00999v1
- Date: Mon, 3 Oct 2022 15:09:12 GMT
- Title: Latent State Marginalization as a Low-cost Approach for Improving
Exploration
- Authors: Dinghuai Zhang, Aaron Courville, Yoshua Bengio, Qinqing Zheng, Amy
Zhang, Ricky T. Q. Chen
- Abstract summary: We propose the adoption of latent variable policies within the MaxEnt framework.
We show that latent variable policies naturally emerge under the use of world models with a latent belief state.
We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training.
- Score: 79.12247903178934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the maximum entropy (MaxEnt) reinforcement learning (RL) framework --
often touted for its exploration and robustness capabilities -- is usually
motivated from a probabilistic perspective, the use of deep probabilistic
models has not gained much traction in practice due to their inherent
complexity. In this work, we propose the adoption of latent variable policies
within the MaxEnt framework, which we show can provably approximate any policy
distribution, and additionally, naturally emerge under the use of world models
with a latent belief state. We discuss why latent variable policies are
difficult to train, how naive approaches can fail, then subsequently introduce
a series of improvements centered around low-cost marginalization of the latent
state, allowing us to make full use of the latent state at minimal additional
cost. We instantiate our method under the actor-critic framework, marginalizing
both the actor and critic. The resulting algorithm, referred to as Stochastic
Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally
validate our method on continuous control tasks, showing that effective
marginalization can lead to better exploration and more robust training.
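
To make the marginalization idea concrete: in MaxEnt RL the entropy bonus needs log pi(a|s), and with a latent variable policy pi(a|s) = E_{z ~ p(z|s)}[pi(a|s, z)] that marginal has no closed form. The sketch below is not the authors' SMAC implementation; it is a minimal, self-contained illustration (plain NumPy, with toy stand-ins for the learned prior and decoder) of estimating the marginal log-density from a handful of latent samples via a log-mean-exp, which is the kind of low-cost marginalization the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned networks (illustrative assumptions only):
# p(z|s) is a standard Gaussian prior, and the "decoder" mapping (s, z) to an
# action mean is a fixed random linear map.
STATE_DIM, LATENT_DIM, ACT_DIM = 3, 4, 2
DECODER_W = rng.standard_normal((STATE_DIM + LATENT_DIM, ACT_DIM)) * 0.1

def action_log_prob(a, mean, std=1.0):
    """Log-density of a diagonal Gaussian pi(a | s, z), summed over action dims."""
    return np.sum(-0.5 * ((a - mean) / std) ** 2 - np.log(std)
                  - 0.5 * np.log(2.0 * np.pi))

def marginal_log_prob(a, s, num_latent_samples=16):
    """Monte Carlo estimate of log pi(a|s) = log E_{z ~ p(z|s)}[pi(a|s, z)].

    A log-mean-exp over a few latent samples gives a cheap estimate of the
    marginal log-density; it is biased low for finite sample counts (Jensen's
    inequality) but consistent as the number of samples grows.
    """
    log_ps = []
    for _ in range(num_latent_samples):
        z = rng.standard_normal(LATENT_DIM)              # z ~ p(z|s)
        mean = np.concatenate([s, z]) @ DECODER_W        # decoder(s, z) -> action mean
        log_ps.append(action_log_prob(a, mean))
    log_ps = np.array(log_ps)
    m = log_ps.max()
    return m + np.log(np.mean(np.exp(log_ps - m)))       # log-mean-exp

# Example: a MaxEnt entropy bonus for one state-action pair would use
# -marginal_log_prob(a, s); per the abstract, SMAC's contribution is making this
# marginalization cheap for both the actor and the critic.
s = np.zeros(STATE_DIM)
a = np.array([0.1, -0.2])
print(marginal_log_prob(a, s))
```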
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Probabilistic Inference in Reinforcement Learning Done Right [37.31057328219418]
A popular perspective in reinforcement learning casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP).
Previous approaches to approximate this quantity can be arbitrarily poor, leading to algorithms that do not implement genuine statistical inference.
We first reveal that this quantity can indeed be used to generate a policy that explores efficiently, as measured by regret.
arXiv Detail & Related papers (2023-11-22T10:23:14Z)
- Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior [51.60683890503293]
We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling.
We show that pure supervised cloning can generate trajectories matching the per-time-step distribution of arbitrary expert trajectories.
arXiv Detail & Related papers (2023-07-27T04:27:26Z)
- Mimicking Better by Matching the Approximate Action Distribution [48.95048003354255]
We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations.
We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
arXiv Detail & Related papers (2023-06-16T12:43:47Z)
- Achieving Fairness in Multi-Agent Markov Decision Processes Using Reinforcement Learning [30.605881670761853]
We propose a Reinforcement Learning approach to achieve fairness in finite-horizon episodic MDPs.
We show that such an approach achieves sub-linear regret in terms of the number of episodes.
arXiv Detail & Related papers (2023-06-01T03:43:53Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm [4.128216503196621]
We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner.
We show that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches.
arXiv Detail & Related papers (2022-10-14T06:53:02Z)
- Reward Biased Maximum Likelihood Estimation for Reinforcement Learning [13.820705458648233]
The Reward-Biased Maximum Likelihood Estimate (RBMLE) was originally proposed for the adaptive control of Markov chains.
We show that it has a regret of $\mathcal{O}(\log T)$ over a time horizon of $T$ steps, similar to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-11-16T06:09:56Z)
- Efficient Empowerment Estimation for Unsupervised Stabilization [75.32013242448151]
The empowerment principle enables unsupervised stabilization of dynamical systems at upright positions.
We propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel (a minimal formulation of this view is sketched below).
We show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images.
arXiv Detail & Related papers (2020-07-14T21:10:16Z)
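
On the last entry, the "trainable representation of a dynamical system as a Gaussian channel" can be read through the standard linear-Gaussian channel formula below. This is textbook information theory rather than the paper's exact estimator, and the map $A$ and covariances $\Sigma_a$, $\Sigma_\varepsilon$ are illustrative placeholders for learned quantities.

```latex
% Empowerment is the channel capacity from actions to resulting states.
% If the learned representation behaves like a linear-Gaussian channel,
%   s' = A a + \varepsilon,  with  \varepsilon \sim \mathcal{N}(0, \Sigma_\varepsilon)
%   and a Gaussian action distribution  a \sim \mathcal{N}(0, \Sigma_a),
% then the mutual information has a closed form:
\[
  I(a; s') \;=\; \tfrac{1}{2} \log\det\!\bigl( I + \Sigma_\varepsilon^{-1} A \Sigma_a A^{\top} \bigr),
\]
% which can be evaluated and optimized without sampling -- consistent with the
% entry's claims of lower sample complexity and more stable training.
```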