Explore to Generalize in Zero-Shot RL
- URL: http://arxiv.org/abs/2306.03072v3
- Date: Mon, 15 Jan 2024 13:58:51 GMT
- Title: Explore to Generalize in Zero-Shot RL
- Authors: Ev Zisselman, Itai Lavie, Daniel Soudry, Aviv Tamar
- Abstract summary: We study zero-shot generalization in reinforcement learning: optimizing a policy on a set of training tasks to perform well on a similar but unseen test task.
We show that our approach is the state-of-the-art on tasks of the ProcGen challenge that have thus far eluded effective generalization, yielding a success rate of $83\%$ on the Maze task and $74\%$ on Heist with $200$ training levels.
- Score: 38.43215023828472
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study zero-shot generalization in reinforcement learning-optimizing a
policy on a set of training tasks to perform well on a similar but unseen test
task. To mitigate overfitting, previous work explored different notions of
invariance to the task. However, on problems such as the ProcGen Maze, an
adequate solution that is invariant to the task visualization does not exist,
and therefore invariance-based approaches fail. Our insight is that learning a
policy that effectively $\textit{explores}$ the domain is harder to memorize
than a policy that maximizes reward for a specific task, and therefore we
expect such learned behavior to generalize well; we indeed demonstrate this
empirically on several domains that are difficult for invariance-based
approaches. Our $\textit{Explore to Generalize}$ algorithm (ExpGen) builds on
this insight: we train an additional ensemble of agents that optimize reward.
At test time, either the ensemble agrees on an action, and we generalize well,
or we take exploratory actions, which generalize well and drive us to a novel
part of the state space, where the ensemble may potentially agree again. We
show that our approach is the state-of-the-art on tasks of the ProcGen
challenge that have thus far eluded effective generalization, yielding a
success rate of $83\%$ on the Maze task and $74\%$ on Heist with $200$ training
levels. ExpGen can also be combined with an invariance-based approach to gain
the best of both worlds, setting new state-of-the-art results on ProcGen.
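The test-time decision rule described above lends itself to a short sketch. The snippet below is a minimal illustration only, not the authors' implementation: `ensemble` stands for the reward-maximizing agents, `explore_policy` for a separately trained exploration policy, and the `agreement_threshold` parameter is an assumption made for the sake of the example.

```python
import numpy as np


def expgen_action(obs, ensemble, explore_policy, agreement_threshold=1.0, rng=None):
    """Hypothetical sketch of ExpGen-style action selection at test time.

    ensemble: list of reward-maximizing policies; each maps an observation to a
        discrete action (as in ProcGen).
    explore_policy: a policy trained to explore the domain; used as the fallback
        when the ensemble is uncertain.
    agreement_threshold: fraction of members that must vote for the same action
        before we trust the reward-maximizing behavior (assumed value).
    """
    rng = rng if rng is not None else np.random.default_rng()
    votes = np.array([policy(obs) for policy in ensemble])
    actions, counts = np.unique(votes, return_counts=True)
    top = counts.argmax()
    if counts[top] / len(votes) >= agreement_threshold:
        # The ensemble agrees: the shared greedy action is likely to generalize.
        return actions[top]
    # Disagreement signals an unfamiliar state: take an exploratory action,
    # which tends to generalize and can drive the agent to states where the
    # ensemble may agree again.
    return explore_policy(obs, rng)
```

With `agreement_threshold=1.0` the exploratory fallback fires on any disagreement; the paper's exact agreement rule and the objective used to train the exploration policy are design choices not reproduced in this sketch.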
Related papers
- $\beta$-DQN: Improving Deep Q-Learning By Evolving the Behavior [41.13282452752521]
$\beta$-DQN is a simple and efficient exploration method that augments the standard DQN with a behavior function.
An adaptive meta-controller is designed to select an effective policy for each episode, enabling flexible and explainable exploration.
Experiments on both simple and challenging exploration domains show that $\beta$-DQN outperforms existing baseline methods.
arXiv Detail & Related papers (2025-01-01T18:12:18Z) - Agent-Temporal Credit Assignment for Optimal Policy Preservation in Sparse Multi-Agent Reinforcement Learning [14.003793644193605]
In multi-agent environments, agents often struggle to learn optimal policies due to sparse or delayed global rewards.
We introduce Temporal-Agent Reward Redistribution (TAR$^2$), a novel approach designed to address the agent-temporal credit assignment problem.
TAR$^2$ decomposes sparse global rewards into time-step-specific rewards and calculates agent-specific contributions to these rewards.
arXiv Detail & Related papers (2024-12-19T12:05:13Z) - MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization [91.80034860399677]
Reinforcement learning algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards.
We introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration.
We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits.
arXiv Detail & Related papers (2024-12-16T18:59:53Z) - Active Ranking of Experts Based on their Performances in Many Tasks [72.96112117037465]
We consider the problem of ranking n experts based on their performances on d tasks.
We make a monotonicity assumption stating that for each pair of experts, one outperforms the other on all tasks.
arXiv Detail & Related papers (2023-06-05T06:55:39Z) - Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z) - Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
We consider episodic reinforcement learning in a reward-mixing Markov decision process (MDP).
We learn an $\epsilon$-optimal policy after exploring $\tilde{O}(\mathrm{poly}(H,\epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time-horizon and $S, A$ are the number of states and actions, respectively.
arXiv Detail & Related papers (2021-10-07T18:55:49Z) - Influence-based Reinforcement Learning for Intrinsically-motivated Agents [0.0]
We present an algorithmic framework of two reinforcement learning agents each with a different objective.
We introduce a novel function approximation approach to assess the influence $F$ of a certain policy on others.
Our method was evaluated on the suite of OpenAI gym tasks as well as cooperative and mixed scenarios.
arXiv Detail & Related papers (2021-08-28T05:36:10Z) - Explore and Control with Adversarial Surprise [78.41972292110967]
Reinforcement learning (RL) provides a framework for learning goal-directed policies given user-specified rewards.
We propose a new unsupervised RL technique based on an adversarial game which pits two policies against each other to compete over the amount of surprise an RL agent experiences.
We show that our method leads to the emergence of complex skills by exhibiting clear phase transitions.
arXiv Detail & Related papers (2021-07-12T17:58:40Z) - Bandit Labor Training [2.28438857884398]
On-demand labor platforms aim to train a skilled workforce to serve their incoming demand for jobs.
Since limited jobs are available for training, and it is usually not necessary to train all workers, efficient matching of training jobs requires prioritizing fast learners over slow ones.
We show that any policy must incur an instance-dependent regret of $\Omega(\log T)$ and a worst-case regret of $\Omega(K^{2/3})$.
arXiv Detail & Related papers (2020-06-11T21:59:13Z)