Variational Inference for Model-Free and Model-Based Reinforcement
Learning
- URL: http://arxiv.org/abs/2209.01693v1
- Date: Sun, 4 Sep 2022 21:03:14 GMT
- Title: Variational Inference for Model-Free and Model-Based Reinforcement
Learning
- Authors: Felix Leibfried
- Abstract summary: Variational inference (VI) is a type of approximate Bayesian inference that approximates an intractable posterior distribution with a tractable one.
Reinforcement learning (RL), on the other hand, deals with autonomous agents and how to make them act optimally.
This manuscript shows how the apparently different subjects of VI and RL are linked in two fundamental ways.
- Score: 4.416484585765028
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Variational inference (VI) is a specific type of approximate Bayesian
inference that approximates an intractable posterior distribution with a
tractable one. VI casts the inference problem as an optimization problem, more
specifically, the goal is to maximize a lower bound of the logarithm of the
marginal likelihood with respect to the parameters of the approximate
posterior. Reinforcement learning (RL), on the other hand, deals with autonomous
agents and how to make them act optimally so as to maximize some notion of
expected future cumulative reward. In the non-sequential setting, where agents'
actions do not have an impact on future states of the environment, RL is
covered by contextual bandits and Bayesian optimization. In a proper sequential
scenario, however, where agents' actions affect future states, instantaneous
rewards need to be carefully traded off against potential long-term rewards.
This manuscript shows how the apparently different subjects of VI and RL are
linked in two fundamental ways. First, the optimization objective of RL to
maximize future cumulative rewards can be recovered via a VI objective under a
soft policy constraint in both the non-sequential and the sequential setting.
This policy constraint is not merely artificial but has proven to be a useful
regularizer in many RL tasks, yielding significant improvements in agent
performance. Second, in model-based RL, where agents aim to learn about the
environment they are operating in, the model-learning part can be naturally
phrased as an inference problem over the process that governs environment
dynamics. We are going to distinguish between two scenarios for the latter: VI
when environment states are fully observable by the agent and VI when they are
only partially observable through an observation distribution.
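To make the first link concrete, here is a brief sketch in standard RL-as-inference notation (the symbols below are illustrative and not necessarily the manuscript's own): the generic variational lower bound, and the KL-regularized (soft) control objective it reduces to when the reward plays the role of a log-likelihood and the policy acts as the variational distribution.

% Generic variational lower bound (ELBO) on the log marginal likelihood:
\[
  \log p(x) \;\ge\;
  \mathbb{E}_{q(z)}\bigl[\log p(x \mid z)\bigr]
  \;-\; \mathrm{KL}\bigl(q(z)\,\|\,p(z)\bigr).
\]
% Treating the reward as the log-likelihood of an auxiliary "optimality"
% variable and using the policy \pi as the variational distribution yields a
% KL-regularized (soft) control objective, with \pi_0 a prior/reference policy:
\[
  \max_{\pi}\;
  \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^{t}
    \Bigl( r(s_t, a_t)
    \;-\; \mathrm{KL}\bigl(\pi(\cdot \mid s_t)\,\|\,\pi_0(\cdot \mid s_t)\bigr)
    \Bigr)\right].
\]
% With a uniform prior \pi_0, the KL term reduces (up to a constant) to a
% maximum-entropy bonus, i.e. the "soft" policy constraint mentioned above.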
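For the second link, a similarly hedged sketch of model learning as inference, again with illustrative notation (theta, phi, q, o_t, s_t, a_t and the factorizations below are assumptions for exposition, not the manuscript's exact formulation): with fully observable states, VI is applied to the dynamics model itself; with partially observable states, VI over latent state trajectories yields a sequential ELBO of the usual state-space-model form.

% Fully observable states: VI over the process governing the dynamics, e.g.
% over parameters \theta with prior p(\theta) and approximate posterior
% q(\theta), given transition data \mathcal{D} = \{(s_t, a_t, s_{t+1})\}:
\[
  \log p(\mathcal{D}) \;\ge\;
  \mathbb{E}_{q(\theta)}\!
    \Bigl[\textstyle\sum_{t} \log p_{\theta}(s_{t+1} \mid s_t, a_t)\Bigr]
  \;-\; \mathrm{KL}\bigl(q(\theta)\,\|\,p(\theta)\bigr).
\]
% Partially observable states: observations o_t are emitted from latent states
% s_t through an observation distribution p_\theta(o_t | s_t); an approximate
% posterior q_\phi over latent state trajectories gives the sequential ELBO
\[
  \log p_{\theta}(o_{1:T} \mid a_{1:T}) \;\ge\;
  \mathbb{E}_{q_{\phi}(s_{1:T} \mid o_{1:T}, a_{1:T})}\!
    \Bigl[\textstyle\sum_{t=1}^{T} \log p_{\theta}(o_t \mid s_t)\Bigr]
  \;-\; \mathrm{KL}\bigl(q_{\phi}(s_{1:T} \mid o_{1:T}, a_{1:T})
    \,\|\, p_{\theta}(s_{1:T} \mid a_{1:T})\bigr).
\]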
Related papers
- ProSpec RL: Plan Ahead, then Execute [7.028937493640123]
We propose the Prospective (ProSpec) RL method, which makes higher-value, lower-risk optimal decisions by imagining future n-stream trajectories.
ProSpec employs a dynamic model to predict future states based on the current state and a series of sampled actions.
We validate the effectiveness of our method on the DMControl benchmarks, where our approach achieved significant performance improvements.
arXiv Detail & Related papers (2024-07-31T06:04:55Z) - A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning [48.59516337905877]
Learning a good representation is a crucial challenge for Reinforcement Learning (RL) agents.
Recent work has developed theoretical insights into these algorithms.
We take a step towards bridging the gap between theory and practice by analyzing an action-conditional self-predictive objective.
arXiv Detail & Related papers (2024-06-04T07:22:12Z) - Double Duality: Variational Primal-Dual Policy Optimization for
Constrained Reinforcement Learning [132.7040981721302]
We study the Constrained Convex Markov Decision Process (MDP), where the goal is to minimize a convex functional of the visitation measure.
Designing algorithms for a constrained convex MDP faces several challenges, including handling the large state space.
arXiv Detail & Related papers (2024-02-16T16:35:18Z) - SAFE-SIM: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries [94.84458417662407]
We introduce SAFE-SIM, a controllable closed-loop safety-critical simulation framework.
Our approach yields two distinct advantages: 1) generating realistic long-tail safety-critical scenarios that closely reflect real-world conditions, and 2) providing controllable adversarial behavior for more comprehensive and interactive evaluations.
We validate our framework empirically using the nuScenes and nuPlan datasets across multiple planners, demonstrating improvements in both realism and controllability.
arXiv Detail & Related papers (2023-12-31T04:14:43Z) - A Tractable Inference Perspective of Offline RL [36.563229330549284]
A popular paradigm for offline Reinforcement Learning (RL) tasks is to first fit the offline trajectories to a sequence model, and then prompt the model for actions that lead to high expected return.
This paper highlights that tractability, the ability to exactly and efficiently answer various probabilistic queries, plays an important role in offline RL.
We propose Trifle, which bridges the gap between good sequence models and high expected returns at evaluation time.
arXiv Detail & Related papers (2023-10-31T19:16:07Z) - STEEL: Singularity-aware Reinforcement Learning [14.424199399139804]
Batch reinforcement learning (RL) aims at leveraging pre-collected data to find an optimal policy.
We propose a new batch RL algorithm that allows for singularity for both state and action spaces.
By leveraging the idea of pessimism and under some technical conditions, we derive the first finite-sample regret guarantee for our proposed algorithm.
arXiv Detail & Related papers (2023-01-30T18:29:35Z) - Diffusion Policies as an Expressive Policy Class for Offline
Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z) - Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium
Learning from Offline Datasets [101.5329678997916]
We study episodic two-player zero-sum Markov games (MGs) in the offline setting.
The goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori.
arXiv Detail & Related papers (2022-02-15T15:39:30Z) - Regularizing Variational Autoencoder with Diversity and Uncertainty
Awareness [61.827054365139645]
The Variational Autoencoder (VAE) approximates the posterior of latent variables based on amortized variational inference.
We propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space.
arXiv Detail & Related papers (2021-10-24T07:58:13Z) - Foresee then Evaluate: Decomposing Value Estimation with Latent Future
Prediction [37.06232589005015]
The value function is the central notion of Reinforcement Learning (RL).
We propose Value Decomposition with Future Prediction (VDFP).
We analytically decompose the value function into a latent future dynamics part and a policy-independent trajectory return part, inducing a way to model latent dynamics and returns separately in value estimation.
arXiv Detail & Related papers (2021-03-03T07:28:56Z)