Explore to Generalize in Zero-Shot RL
- URL: http://arxiv.org/abs/2306.03072v3
- Date: Mon, 15 Jan 2024 13:58:51 GMT
- Title: Explore to Generalize in Zero-Shot RL
- Authors: Ev Zisselman, Itai Lavie, Daniel Soudry, Aviv Tamar
- Abstract summary: We study zero-shot generalization in reinforcement learning: optimizing a policy on a set of training tasks to perform well on a similar but unseen test task.
We show that our approach is the state-of-the-art on tasks of the ProcGen challenge that have thus far eluded effective generalization, yielding a success rate of $83\%$ on the Maze task and $74\%$ on Heist with $200$ training levels.
- Score: 38.43215023828472
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study zero-shot generalization in reinforcement learning: optimizing a
policy on a set of training tasks to perform well on a similar but unseen test
task. To mitigate overfitting, previous work explored different notions of
invariance to the task. However, on problems such as the ProcGen Maze, an
adequate solution that is invariant to the task visualization does not exist,
and therefore invariance-based approaches fail. Our insight is that learning a
policy that effectively $\textit{explores}$ the domain is harder to memorize
than a policy that maximizes reward for a specific task, and therefore we
expect such learned behavior to generalize well; we indeed demonstrate this
empirically on several domains that are difficult for invariance-based
approaches. Our $\textit{Explore to Generalize}$ algorithm (ExpGen) builds on
this insight: we train an additional ensemble of agents that optimize reward.
At test time, either the ensemble agrees on an action, and we generalize well,
or we take exploratory actions, which generalize well and drive us to a novel
part of the state space, where the ensemble may potentially agree again. We
show that our approach is the state-of-the-art on tasks of the ProcGen
challenge that have thus far eluded effective generalization, yielding a
success rate of $83\%$ on the Maze task and $74\%$ on Heist with $200$ training
levels. ExpGen can also be combined with an invariance-based approach to gain
the best of both worlds, setting new state-of-the-art results on ProcGen.
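The test-time rule described in the abstract (act on the ensemble's agreed action, otherwise fall back to an exploration policy) lends itself to a short sketch. The snippet below is a minimal illustration, not the authors' implementation: the names (`expgen_act`, `reward_agents`, `explore_agent`) and the majority-vote rule with an `agreement_threshold` knob are assumptions for illustration; the paper's exact agreement criterion and its exploration policy may differ.

```python
import numpy as np

def expgen_act(obs, reward_agents, explore_agent, agreement_threshold=1.0):
    """Test-time action selection in the spirit of ExpGen (illustrative sketch).

    reward_agents: ensemble of reward-maximizing policies, each obs -> discrete action.
    explore_agent: a task-agnostic exploration policy, obs -> discrete action.
    agreement_threshold: fraction of the ensemble that must vote for the same
        action before it is trusted (1.0 = full agreement; a hypothetical knob).
    """
    votes = np.array([int(agent(obs)) for agent in reward_agents])
    actions, counts = np.unique(votes, return_counts=True)
    best = np.argmax(counts)
    if counts[best] / len(reward_agents) >= agreement_threshold:
        return int(actions[best])    # ensemble agrees: exploit its action
    return int(explore_agent(obs))   # ensemble disagrees: take an exploratory action
```

In this sketch, disagreement (which tends to occur on unfamiliar test states) hands control to the exploration policy, which drives the agent toward new states where the reward ensemble may agree again and take over.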
Related papers
- Explore-Go: Leveraging Exploration for Generalisation in Deep Reinforcement Learning [5.624791703748109]
We show that increased exploration during training can be leveraged to increase the generalisation performance of the agent.
We propose a novel method Explore-Go that exploits this intuition by increasing the number of states on which the agent trains.
arXiv Detail & Related papers (2024-06-12T10:39:31Z)
- Active Ranking of Experts Based on their Performances in Many Tasks [72.96112117037465]
We consider the problem of ranking n experts based on their performances on d tasks.
We make a monotonicity assumption stating that for each pair of experts, one outperforms the other on all tasks.
arXiv Detail & Related papers (2023-06-05T06:55:39Z)
- Improved Active Multi-Task Representation Learning via Lasso [44.607652031235716]
In this paper, we show the dominance of the L1-regularized-relevance-based ($\nu^1$-based) strategy by giving a lower bound for the $\nu^2$-based strategy.
We also characterize the potential of our $\nu^1$-based strategy in sample-cost-sensitive settings.
arXiv Detail & Related papers (2023-06-05T03:08:29Z)
- Inverse Reinforcement Learning with the Average Reward Criterion [3.719493310637464]
We study the problem of Inverse Reinforcement Learning (IRL) with an average-reward criterion.
The goal is to recover an unknown policy and a reward function when the agent only has samples of states and actions from an experienced agent.
arXiv Detail & Related papers (2023-05-24T01:12:08Z)
- Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z)
- Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
We study episodic reinforcement learning in a reward-mixing Markov decision process (MDP).
We learn an $\epsilon$-optimal policy after exploring $\tilde{O}(\mathrm{poly}(H,\epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the numbers of states and actions, respectively.
arXiv Detail & Related papers (2021-10-07T18:55:49Z)
- Influence-based Reinforcement Learning for Intrinsically-motivated Agents [0.0]
We present an algorithmic framework of two reinforcement learning agents each with a different objective.
We introduce a novel function approximation approach to assess the influence $F$ of a certain policy on others.
Our method was evaluated on the suite of OpenAI gym tasks as well as cooperative and mixed scenarios.
arXiv Detail & Related papers (2021-08-28T05:36:10Z)
- Explore and Control with Adversarial Surprise [78.41972292110967]
Reinforcement learning (RL) provides a framework for learning goal-directed policies given user-specified rewards.
We propose a new unsupervised RL technique based on an adversarial game which pits two policies against each other to compete over the amount of surprise an RL agent experiences.
We show that our method leads to the emergence of complex skills by exhibiting clear phase transitions.
arXiv Detail & Related papers (2021-07-12T17:58:40Z)
- Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z)
- Bandit Labor Training [2.28438857884398]
On-demand labor platforms aim to train a skilled workforce to serve their incoming demand for jobs.
Since limited jobs are available for training, and it is usually not necessary to train all workers, efficient matching of training jobs requires prioritizing fast learners over slow ones.
We show that any policy must incur an instance-dependent regret of $\Omega(\log T)$ and a worst-case regret of $\Omega(K^{2/3})$.
arXiv Detail & Related papers (2020-06-11T21:59:13Z)