A Study of Global and Episodic Bonuses for Exploration in Contextual MDPs
- URL: http://arxiv.org/abs/2306.03236v1
- Date: Mon, 5 Jun 2023 20:45:30 GMT
- Title: A Study of Global and Episodic Bonuses for Exploration in Contextual MDPs
- Authors: Mikael Henaff, Minqi Jiang, Roberta Raileanu
- Abstract summary: We show that episodic bonuses are most effective when there is little shared structure across episodes.
We also find that combining the two bonuses can lead to more robust performance across different degrees of shared structure.
This results in an algorithm which sets a new state of the art across 16 tasks from the MiniHack suite used in prior work.
- Score: 21.31346761487944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exploration in environments which differ across episodes has received
increasing attention in recent years. Current methods use some combination of
global novelty bonuses, computed using the agent's entire training experience,
and episodic novelty bonuses, computed using only experience from the
current episode. However, the use of these two types of bonuses has been ad hoc
and poorly understood. In this work, we shed light on the behavior of these two
types of bonuses through controlled experiments on easily interpretable tasks
as well as challenging pixel-based settings. We find that the two types of
bonuses succeed in different settings, with episodic bonuses being most
effective when there is little shared structure across episodes and global
bonuses being effective when more structure is shared. We develop a conceptual
framework which makes this notion of shared structure precise by considering
the variance of the value function across contexts, and which provides a
unifying explanation of our empirical results. We furthermore find that
combining the two bonuses can lead to more robust performance across different
degrees of shared structure, and investigate different algorithmic choices for
defining and combining global and episodic bonuses based on function
approximation. This results in an algorithm which sets a new state of the art
across 16 tasks from the MiniHack suite used in prior work, and also performs
robustly on Habitat and Montezuma's Revenge.
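For intuition, a minimal sketch of the two bonus types and one way of combining them is given below. The count-based bonuses and the multiplicative combination are illustrative simplifications (the paper itself studies bonuses defined via function approximation), so this is not the exact algorithm proposed in the work.

```python
# Illustrative sketch: a global bonus computed over all training experience
# and an episodic bonus reset at the start of each episode, combined
# multiplicatively. Count-based bonuses and the combination rule are
# simplifying assumptions, not the paper's exact method.
from collections import defaultdict
import math


class CombinedNoveltyBonus:
    def __init__(self):
        self.global_counts = defaultdict(int)    # persists across episodes
        self.episodic_counts = defaultdict(int)  # reset every episode

    def new_episode(self):
        self.episodic_counts.clear()

    def bonus(self, state):
        self.global_counts[state] += 1
        self.episodic_counts[state] += 1
        global_bonus = 1.0 / math.sqrt(self.global_counts[state])
        episodic_bonus = 1.0 / math.sqrt(self.episodic_counts[state])
        # Multiplying the two terms rewards states that are novel both within
        # the current episode and over the whole of training.
        return global_bonus * episodic_bonus


# Usage: add the (scaled) bonus to the task reward at each step.
bonuses = CombinedNoveltyBonus()
bonuses.new_episode()
task_reward = 0.0
shaped_reward = task_reward + 0.1 * bonuses.bonus(state=("room_3", 4, 2))
```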
Related papers
- Reward Augmentation in Reinforcement Learning for Testing Distributed Systems [6.0560257343687995]
Bugs in popular distributed protocol implementations have been the source of many downtimes in popular internet services.
We describe a randomized testing approach for distributed protocol implementations based on reinforcement learning.
We show two different techniques that build on one another.
arXiv Detail & Related papers (2024-09-02T15:07:05Z)
- Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards [101.7246658985579]
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data.
We propose embracing the heterogeneity of diverse rewards by following a multi-policy strategy.
We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding, VQA), and control (locomotion) tasks.
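A hedged sketch of the weight-interpolation idea behind this multi-policy strategy is given below; the toy models, helper name, and interpolation coefficients are illustrative assumptions rather than the authors' implementation.

```python
# Sketch: linearly interpolate the weights of networks fine-tuned from the
# same initialization on different reward functions. Models and coefficients
# are toy placeholders.
import torch
import torch.nn as nn


def interpolate_state_dicts(state_dicts, weights):
    """Return a weighted average of parameter dictionaries."""
    assert abs(sum(weights) - 1.0) < 1e-6
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged


# Toy example: two policies fine-tuned on different rewards, same architecture.
policy_a = nn.Linear(8, 4)
policy_b = nn.Linear(8, 4)
soup = nn.Linear(8, 4)
soup.load_state_dict(
    interpolate_state_dicts([policy_a.state_dict(), policy_b.state_dict()],
                            weights=[0.3, 0.7])
)
```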
arXiv Detail & Related papers (2023-06-07T14:58:15Z)
- Explore to Generalize in Zero-Shot RL [38.43215023828472]
We study zero-shot generalization in reinforcement learning: optimizing a policy on a set of training tasks to perform well on a similar but unseen test task.
We show that our approach achieves state-of-the-art performance on tasks of the ProcGen challenge that have so far eluded effective generalization, yielding a success rate of $83\%$ on the Maze task and $74\%$ on Heist with $200$ training levels.
arXiv Detail & Related papers (2023-06-05T17:49:43Z)
- Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning [18.762198598488066]
EMAX is a framework to seamlessly extend value-based MARL algorithms with ensembles of value functions.
EMAX uses the uncertainty of value estimates across the ensemble in a UCB policy to guide the exploration.
During optimisation, EMAX computes target values as average value estimates across the ensemble.
During evaluation, EMAX selects actions following a majority vote across the ensemble, which reduces the likelihood of selecting sub-optimal actions.
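The three mechanisms above can be sketched for a single agent as follows; this is a simplified illustration with assumed shapes and constants, not the EMAX implementation.

```python
# Sketch of ensemble-based exploration and evaluation: UCB-style action
# selection from an ensemble of value estimates, ensemble-mean bootstrap
# targets, and majority-vote action selection at evaluation time.
import numpy as np


def ucb_action(q_values, c=1.0):
    """q_values: (n_members, n_actions) value estimates from the ensemble."""
    mean = q_values.mean(axis=0)
    std = q_values.std(axis=0)
    return int(np.argmax(mean + c * std))  # optimism in the face of uncertainty


def ensemble_target(reward, next_q_values, gamma=0.99):
    """Bootstrapped target using the average of the ensemble's estimates."""
    return reward + gamma * next_q_values.mean(axis=0).max()


def majority_vote_action(q_values):
    """At evaluation, each ensemble member votes for its greedy action."""
    votes = np.argmax(q_values, axis=1)
    return int(np.bincount(votes).argmax())


q = np.random.randn(5, 4)  # 5 ensemble members, 4 actions
print(ucb_action(q), majority_vote_action(q))
```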
arXiv Detail & Related papers (2023-02-07T12:51:20Z)
- Reward Bonuses with Gain Scheduling Inspired by Iterative Deepening Search [8.071506311915396]
This paper introduces a novel method of adding intrinsic bonuses to a task-oriented reward function in order to efficiently facilitate reinforcement learning search.
Although various bonuses have been designed to date, they are analogous to the depth-first and breadth-first search algorithms in graph theory.
Gain scheduling is applied to the designed bonuses, inspired by iterative deepening search, which is known to inherit the advantages of both search algorithms.
arXiv Detail & Related papers (2022-12-21T04:52:13Z)
- Contextual Bandits for Advertising Campaigns: A Diffusion-Model Independent Approach (Extended Version) [73.59962178534361]
We study an influence maximization problem in which little is assumed to be known about the diffusion network or about the model that determines how information may propagate.
In this setting, an explore-exploit approach could be used to learn the key underlying diffusion parameters, while running the campaign.
We describe and compare two methods of contextual multi-armed bandits, with upper-confidence bounds on the remaining potential of influencers.
arXiv Detail & Related papers (2022-01-13T22:06:10Z)
- Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
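For reference, the exact elliptical bonus that such methods approximate can be written directly for small feature dimensions; the sketch below is illustrative, with a placeholder feature vector and ridge term, and does not show the paper's approximation scheme.

```python
# Sketch of the standard elliptical bonus: maintain a regularized covariance of
# visited feature embeddings and reward directions not yet well covered.
import numpy as np


class EllipticalBonus:
    def __init__(self, dim, ridge=1.0):
        self.cov = ridge * np.eye(dim)  # running regularized covariance

    def bonus(self, phi):
        """phi: feature embedding of the current state (or state-action)."""
        b = float(np.sqrt(phi @ np.linalg.solve(self.cov, phi)))
        self.cov += np.outer(phi, phi)  # update with the visited feature
        return b


eb = EllipticalBonus(dim=3)
print(eb.bonus(np.array([1.0, 0.0, 0.5])))
```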
arXiv Detail & Related papers (2021-10-21T15:25:15Z)
- Bayesian decision-making under misspecified priors with applications to meta-learning [64.38020203019013]
Thompson sampling and other sequential decision-making algorithms are popular approaches to tackle explore/exploit trade-offs in contextual bandits.
We show that performance degrades gracefully with misspecified priors.
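As a point of reference, a minimal Thompson sampling loop for a Bernoulli bandit with a Beta prior is sketched below; the deliberately skewed prior hyperparameters are an illustrative way to emulate the kind of misspecification discussed, not the paper's experimental setup.

```python
# Thompson sampling for a Bernoulli bandit with a (possibly misspecified)
# Beta prior. True arm means and prior hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
alpha = np.ones(3) * 2.0  # prior successes (deliberately skewed)
beta = np.ones(3) * 8.0   # prior failures (deliberately skewed)

for t in range(1000):
    samples = rng.beta(alpha, beta)   # sample a mean for each arm
    arm = int(np.argmax(samples))     # play the arm with the best sample
    reward = rng.random() < true_means[arm]
    alpha[arm] += reward              # posterior update
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```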
arXiv Detail & Related papers (2021-07-03T23:17:26Z)
- Combinatorial Pure Exploration with Bottleneck Reward Function and its Extension to General Reward Functions [13.982295536546728]
We study the Combinatorial Pure Exploration problem with the bottleneck reward function (CPE-B) under the fixed-confidence and fixed-budget settings.
We present both fixed-confidence and fixed-budget algorithms, and provide the sample complexity lower bound for the fixed-confidence setting.
In addition, we extend CPE-B to general reward functions (CPE-G) and propose the first fixed-confidence algorithm for general non-linear reward functions with non-trivial sample complexity.
arXiv Detail & Related papers (2021-02-24T06:47:51Z)
- Efficient Pure Exploration for Combinatorial Bandits with Semi-Bandit Feedback [51.21673420940346]
Combinatorial bandits generalize multi-armed bandits, where the agent chooses sets of arms and observes a noisy reward for each arm contained in the chosen set.
We focus on the pure-exploration problem of identifying the best arm with fixed confidence, as well as a more general setting, where the structure of the answer set differs from the one of the action set.
Based on a projection-free online learning algorithm for finite polytopes, our method is the first computationally efficient algorithm that is asymptotically optimal and has competitive empirical performance.
arXiv Detail & Related papers (2021-01-21T10:35:09Z)
- Never Give Up: Learning Directed Exploration Strategies [63.19616370038824]
We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies.
We construct an episodic memory-based intrinsic reward using k-nearest neighbors over the agent's recent experience to train the directed exploratory policies.
A self-supervised inverse dynamics model is used to train the embeddings for the nearest-neighbor lookup, biasing the novelty signal towards what the agent can control.
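A simplified sketch of such an episodic kNN bonus is given below; the kernel, constants, and placeholder embedding are illustrative and omit the learned inverse-dynamics representation.

```python
# Sketch of an episodic memory bonus: store embeddings visited this episode
# and reward states far from their k nearest neighbours. The embedding is a
# placeholder rather than a learned inverse-dynamics representation.
import numpy as np


def episodic_knn_bonus(embedding, memory, k=10, eps=1e-3):
    if not memory:
        return 1.0  # first state of the episode is maximally novel
    dists = np.array([np.sum((embedding - m) ** 2) for m in memory])
    nearest = np.sort(dists)[:k]
    nearest = nearest / (nearest.mean() + 1e-8)   # normalize distances
    similarities = eps / (nearest + eps)          # inverse kernel
    return 1.0 / np.sqrt(similarities.sum() + 1e-8)


memory = []  # cleared at the start of every episode
obs_embedding = np.array([0.2, 0.4])  # placeholder for a learned embedding
bonus = episodic_knn_bonus(obs_embedding, memory)
memory.append(obs_embedding)
```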
arXiv Detail & Related papers (2020-02-14T13:57:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.