Anti-Exploration by Random Network Distillation
- URL: http://arxiv.org/abs/2301.13616v2
- Date: Wed, 17 May 2023 12:23:26 GMT
- Title: Anti-Exploration by Random Network Distillation
- Authors: Alexander Nikulin, Vladislav Kurenkov, Denis Tarasov, Sergey
Kolesnikov
- Abstract summary: We show that a naive choice of conditioning for the Random Network Distillation (RND) is not discriminative enough to be used as an uncertainty estimator.
We show that this limitation can be avoided with conditioning based on Feature-wise Linear Modulation (FiLM)
We evaluate it on the D4RL benchmark, showing that it is capable of achieving performance comparable to ensemble-based methods and outperforming ensemble-free approaches by a wide margin.
- Score: 63.04360288089277
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the success of Random Network Distillation (RND) in various domains,
it was shown as not discriminative enough to be used as an uncertainty
estimator for penalizing out-of-distribution actions in offline reinforcement
learning. In this paper, we revisit these results and show that, with a naive
choice of conditioning for the RND prior, it becomes infeasible for the actor
to effectively minimize the anti-exploration bonus and discriminativity is not
an issue. We show that this limitation can be avoided with conditioning based
on Feature-wise Linear Modulation (FiLM), resulting in a simple and efficient
ensemble-free algorithm based on Soft Actor-Critic. We evaluate it on the D4RL
benchmark, showing that it is capable of achieving performance comparable to
ensemble-based methods and outperforming ensemble-free approaches by a wide
margin.
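The sketch below is a rough, assumption-laden illustration of the idea rather than the authors' exact implementation: a frozen random prior conditioned via FiLM, a trainable predictor fitted on dataset state-action pairs, and the prediction error used as an anti-exploration penalty inside SAC-style losses. Which input is embedded and which one drives the FiLM parameters, all layer sizes, and the penalty coefficient are assumptions made only for illustration.

```python
import torch
import torch.nn as nn


class FiLMPrior(nn.Module):
    """Frozen random prior: embeds the action and modulates the embedding
    feature-wise with a scale/shift produced from the state (FiLM).
    Which input is embedded vs. which drives FiLM is an assumption here."""
    def __init__(self, state_dim, action_dim, hidden=256, out_dim=32):
        super().__init__()
        self.action_net = nn.Sequential(nn.Linear(action_dim, hidden), nn.ReLU())
        self.film = nn.Linear(state_dim, 2 * hidden)  # produces (gamma, beta)
        self.head = nn.Linear(hidden, out_dim)
        for p in self.parameters():                   # the prior is never trained
            p.requires_grad_(False)

    def forward(self, state, action):
        h = self.action_net(action)
        gamma, beta = self.film(state).chunk(2, dim=-1)
        return self.head(gamma * h + beta)


class Predictor(nn.Module):
    """Trainable network fitted to match the frozen prior on dataset pairs."""
    def __init__(self, state_dim, action_dim, hidden=256, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def anti_exploration_bonus(prior, predictor, state, action):
    """Prediction error: small on in-distribution (s, a), large on OOD actions."""
    return (predictor(state, action) - prior(state, action)).pow(2).mean(dim=-1)

# Schematic use inside a SAC-style update (beta is an assumed penalty coefficient):
#   rnd_loss   = anti_exploration_bonus(prior, predictor, s, a_data).mean()
#   q_target   = r + gamma * (min_q_next - alpha * logp_next
#                             - beta * anti_exploration_bonus(prior, predictor, s_next, a_next))
#   actor_loss = (alpha * logp - min_q
#                 + beta * anti_exploration_bonus(prior, predictor, s, a_pi)).mean()
```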
Related papers
- Batch Ensemble for Variance Dependent Regret in Stochastic Bandits [41.95653110232677]
Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL).
Inspired by practical ensemble methods, in this work we propose a simple and novel batch ensemble scheme that achieves near-optimal regret for Multi-Armed Bandits (MAB).
Our algorithm has just a single parameter, namely the number of batches, and its value does not depend on distributional properties such as the scale and variance of the losses.
arXiv Detail & Related papers (2024-09-13T06:40:56Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Reward Certification for Policy Smoothed Reinforcement Learning [14.804252729195513]
Reinforcement Learning (RL) has achieved remarkable success in safety-critical areas.
Recent studies have introduced "smoothed policies" in order to enhance its robustness.
It is still challenging, however, to establish a provable guarantee certifying a bound on the total reward.
arXiv Detail & Related papers (2023-12-11T15:07:58Z)
- Observation-Guided Diffusion Probabilistic Models [41.749374023639156]
We propose a novel diffusion-based image generation method called the observation-guided diffusion probabilistic model (OGDM).
Our approach reestablishes the training objective by integrating the guidance of the observation process with the Markov chain.
We demonstrate the effectiveness of our training algorithm using diverse inference techniques on strong diffusion model baselines.
arXiv Detail & Related papers (2023-10-06T06:29:06Z)
- STEEL: Singularity-aware Reinforcement Learning [14.424199399139804]
Batch reinforcement learning (RL) aims at leveraging pre-collected data to find an optimal policy.
We propose a new batch RL algorithm that allows for singularity for both state and action spaces.
By leveraging the idea of pessimism and under some technical conditions, we derive the first finite-sample regret guarantee for our proposed algorithm.
arXiv Detail & Related papers (2023-01-30T18:29:35Z)
- Mitigating Algorithmic Bias with Limited Annotations [65.060639928772]
When sensitive attributes are not disclosed or available, it becomes necessary to manually annotate a small part of the training data to mitigate bias.
We propose Active Penalization Of Discrimination (APOD), an interactive framework to guide the limited annotations towards maximally eliminating the effect of algorithmic bias.
APOD shows comparable performance to fully annotated bias mitigation, which demonstrates that APOD could benefit real-world applications when sensitive information is limited.
arXiv Detail & Related papers (2022-07-20T16:31:19Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves the SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
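As a rough illustration of the uncertainty-weighting idea in the UWAC entry above (not necessarily the paper's exact formulation), the sketch below estimates the uncertainty of the bootstrapped target with MC dropout and scales each transition's TD error accordingly; the dropout-based estimator, the inverse-uncertainty weighting, and the normalization are assumptions.

```python
import torch

def mc_dropout_std(q_net, state, action, n_samples=10):
    """Std of repeated dropout forward passes as a proxy for epistemic uncertainty.
    Assumes q_net contains dropout layers and maps (state, action) -> Q value."""
    q_net.train()  # keep dropout active during sampling
    qs = torch.stack([q_net(state, action) for _ in range(n_samples)], dim=0)
    return qs.std(dim=0)

def uncertainty_weighted_critic_loss(q_net, target_q_net, batch, gamma=0.99, beta=1.0):
    """Bellman error down-weighted where the target action looks out-of-distribution."""
    s, a, r, s_next, a_next, done = batch
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next, a_next)
        unc = mc_dropout_std(target_q_net, s_next, a_next)
        w = beta / (unc + 1e-6)          # assumed weighting: inverse to uncertainty
        w = w / w.mean()                 # keep the effective learning rate comparable
    td_error = q_net(s, a) - target
    return (w * td_error.pow(2)).mean()
```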
- Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
arXiv Detail & Related papers (2020-06-25T03:27:59Z)
- Discriminative Adversarial Search for Abstractive Summarization [29.943949944682196]
We introduce a novel approach for sequence decoding, Discriminative Adversarial Search (DAS).
DAS has the desirable properties of alleviating the effects of exposure bias without requiring external metrics.
We investigate the effectiveness of the proposed approach on the task of Abstractive Summarization.
arXiv Detail & Related papers (2020-02-24T17:07:32Z)
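One plausible (assumed) reading of DAS-style decoding is to let a learned discriminator re-score candidate sequences produced by the generator, mixing the model's log-probability with a "looks human-written" score; the names and the linear mixing rule below are illustrative, not the paper's exact procedure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Hypothesis:
    tokens: List[int]   # candidate summary as token ids
    logprob: float      # generator log-probability of the candidate

def rerank_with_discriminator(
    hypotheses: Sequence[Hypothesis],
    discriminator: Callable[[List[int]], float],  # higher = more human-like
    mix: float = 0.5,
) -> Hypothesis:
    """Pick the candidate maximizing a mix of likelihood and discriminator score."""
    def score(h: Hypothesis) -> float:
        return (1.0 - mix) * h.logprob + mix * discriminator(h.tokens)
    return max(hypotheses, key=score)
```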
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.