ContraBAR: Contrastive Bayes-Adaptive Deep RL
- URL: http://arxiv.org/abs/2306.02418v1
- Date: Sun, 4 Jun 2023 17:50:20 GMT
- Title: ContraBAR: Contrastive Bayes-Adaptive Deep RL
- Authors: Era Choshen, Aviv Tamar
- Abstract summary: In meta reinforcement learning (meta RL), an agent seeks a Bayes-optimal policy -- the optimal policy when facing an unknown task.
We investigate whether contrastive methods can be used for learning Bayes-optimal behavior.
We propose a simple meta RL algorithm that uses contrastive predictive coding (CPC) in lieu of variational belief inference.
- Score: 22.649531458557206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In meta reinforcement learning (meta RL), an agent seeks a Bayes-optimal
policy -- the optimal policy when facing an unknown task that is sampled from
some known task distribution. Previous approaches tackled this problem by
inferring a belief over task parameters, using variational inference methods.
Motivated by recent successes of contrastive learning approaches in RL, such as
contrastive predictive coding (CPC), we investigate whether contrastive methods
can be used for learning Bayes-optimal behavior. We begin by proving that
representations learned by CPC are indeed sufficient for Bayes optimality.
Based on this observation, we propose a simple meta RL algorithm that uses CPC
in lieu of variational belief inference. Our method, ContraBAR, achieves
comparable performance to state-of-the-art in domains with state-based
observation and circumvents the computational toll of future observation
reconstruction, enabling learning in domains with image-based observations. It
can also be combined with image augmentations for domain randomization and used
seamlessly in both online and offline meta RL settings.
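The abstract above describes using contrastive predictive coding (CPC) in place of variational belief inference to learn a history representation sufficient for Bayes-optimal control. The following is a minimal PyTorch sketch of an InfoNCE-style objective over trajectory histories; the module names, tensor shapes, and in-batch negative scheme are illustrative assumptions rather than the ContraBAR reference implementation.

```python
# Minimal InfoNCE-style CPC sketch for learning a belief-like history representation.
# Shapes, module names, and the in-batch negative scheme are illustrative assumptions,
# not the ContraBAR reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HistoryCPC(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=64):
        super().__init__()
        # Encodes the history (o_t, a_t, r_t) into a latent c_t.
        self.encoder = nn.GRU(obs_dim + act_dim + 1, latent_dim, batch_first=True)
        # Projects a candidate future observation into the same space as c_t.
        self.future_proj = nn.Linear(obs_dim, latent_dim)

    def forward(self, obs, act, rew):
        # obs: [B, T, obs_dim], act: [B, T, act_dim], rew: [B, T, 1]
        history = torch.cat([obs, act, rew], dim=-1)
        latents, _ = self.encoder(history)           # [B, T, latent_dim]
        return latents

    def infonce_loss(self, latents, future_obs):
        # latents: c_t for one time step, [B, latent_dim]
        # future_obs: the matching future observation per trajectory, [B, obs_dim]
        z = self.future_proj(future_obs)              # [B, latent_dim]
        # Score every (history, future) pair; other rows act as negatives.
        logits = latents @ z.t()                      # [B, B]
        targets = torch.arange(latents.size(0), device=latents.device)
        return F.cross_entropy(logits, targets)

# Usage: train so that c_t scores its own future observation above futures
# drawn from other trajectories in the batch.
model = HistoryCPC(obs_dim=8, act_dim=2)
obs, act, rew = torch.randn(16, 10, 8), torch.randn(16, 10, 2), torch.randn(16, 10, 1)
latents = model(obs, act, rew)
loss = model.infonce_loss(latents[:, -1], torch.randn(16, 8))
loss.backward()
```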
Related papers
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning [25.071018803326254]
Distribution shift is a major obstacle in offline reinforcement learning.
Previous conservative offline RL algorithms struggle to generalize to unseen actions.
We propose to use the gradient fields of the dataset density generated from a pre-trained offline RL algorithm to adjust the original actions.
arXiv Detail & Related papers (2024-06-11T17:59:29Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches for dealing with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - DPO: Differential reinforcement learning with application to optimal configuration search [3.2857981869020327]
Reinforcement learning with continuous state and action spaces remains one of the most challenging problems within the field.
We propose the first differential RL framework that can handle settings with limited training samples and short-length episodes.
arXiv Detail & Related papers (2024-04-24T03:11:12Z) - Probabilistic Inference in Reinforcement Learning Done Right [37.31057328219418]
A popular perspective in reinforcement learning casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP).
Previous approaches to approximating the resulting posterior can be arbitrarily poor, leading to algorithms that do not implement genuine statistical inference.
We first reveal that this quantity can indeed be used to generate a policy that explores efficiently, as measured by regret.
arXiv Detail & Related papers (2023-11-22T10:23:14Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z) - Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interaction between the agent and the environment.
We propose a new method that uses unsupervised model-based RL to pre-train the agent.
We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z) - Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning [92.18524491615548]
Contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL).
We study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions.
Under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs; a sketch of this style of feature-based UCB bonus appears after this list.
arXiv Detail & Related papers (2022-07-29T17:29:08Z) - Provably Correct Optimization and Exploration with Non-linear Policies [65.60853260886516]
ENIAC is an actor-critic method that allows non-linear function approximation in the critic.
We show that under certain assumptions, the learner finds a near-optimal policy in $O(\mathrm{poly}(d))$ exploration rounds.
We empirically evaluate this adaptation and show that it outperforms priors inspired by linear methods.
arXiv Detail & Related papers (2021-03-22T03:16:33Z) - Pareto Deterministic Policy Gradients and Its Application in 5G Massive MIMO Networks [32.099949375036495]
We consider jointly optimizing cell load balance and network throughput via a reinforcement learning (RL) approach.
Our rationale behind using RL is to circumvent the challenges of analytically modeling user mobility and network dynamics.
To accomplish this joint optimization, we integrate vector rewards into the RL value network and conduct RL action via a separate policy network.
arXiv Detail & Related papers (2020-12-02T15:35:35Z) - Mixed Reinforcement Learning with Additive Stochastic Uncertainty [19.229447330293546]
Reinforcement learning (RL) methods often rely on massive exploration data to search for optimal policies, and suffer from poor sample efficiency.
This paper presents a mixed RL algorithm that simultaneously uses dual representations of environmental dynamics to search for the optimal policy.
The effectiveness of the mixed RL is demonstrated by a typical optimal control problem of non-affine nonlinear systems.
arXiv Detail & Related papers (2020-02-28T08:02:34Z)
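The Contrastive UCB entry above pairs a contrastive representation with UCB-style exploration. Below is a minimal sketch of a generic feature-based (elliptical-potential) UCB bonus, assuming the state-action features come from a separately trained contrastive encoder; the class name, the scale beta, and the ridge term lam are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch of a feature-based UCB exploration bonus, assuming the
# state-action features phi(s, a) come from a separately trained contrastive
# encoder. beta and the ridge term lam are illustrative hyperparameters.
import numpy as np

class FeatureUCB:
    def __init__(self, feat_dim, beta=1.0, lam=1.0):
        self.beta = beta
        self.cov = lam * np.eye(feat_dim)   # running design matrix Lambda

    def update(self, phi):
        # Accumulate the outer product of each observed feature vector.
        self.cov += np.outer(phi, phi)

    def bonus(self, phi):
        # Elliptical-potential bonus: beta * sqrt(phi^T Lambda^{-1} phi).
        inv_cov = np.linalg.inv(self.cov)
        return self.beta * np.sqrt(phi @ inv_cov @ phi)

# Usage: add the bonus to the reward (or value target) of the online RL update.
ucb = FeatureUCB(feat_dim=16)
phi = np.random.randn(16)        # stand-in for a contrastive feature phi(s, a)
optimistic_reward = 0.5 + ucb.bonus(phi)
ucb.update(phi)
```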
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.