Optimistic Exploration even with a Pessimistic Initialisation
- URL: http://arxiv.org/abs/2002.12174v1
- Date: Wed, 26 Feb 2020 17:15:53 GMT
- Title: Optimistic Exploration even with a Pessimistic Initialisation
- Authors: Tabish Rashid, Bei Peng, Wendelin Böhmer, Shimon Whiteson
- Abstract summary: Optimistic initialisation is an effective strategy for efficient exploration in reinforcement learning (RL).
In particular, in scenarios with only positive rewards, Q-values are initialised at their lowest possible values.
We propose a simple count-based augmentation to pessimistically initialised Q-values that separates the source of optimism from the neural network.
- Score: 57.41327865257504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimistic initialisation is an effective strategy for efficient exploration
in reinforcement learning (RL). In the tabular case, all provably efficient
model-free algorithms rely on it. However, model-free deep RL algorithms do not
use optimistic initialisation despite taking inspiration from these provably
efficient tabular algorithms. In particular, in scenarios with only positive
rewards, Q-values are initialised at their lowest possible values due to
commonly used network initialisation schemes, a pessimistic initialisation.
Merely initialising the network to output optimistic Q-values is not enough,
since we cannot ensure that they remain optimistic for novel state-action
pairs, which is crucial for exploration. We propose a simple count-based
augmentation to pessimistically initialised Q-values that separates the source
of optimism from the neural network. We show that this scheme is provably
efficient in the tabular setting and extend it to the deep RL setting. Our
algorithm, Optimistic Pessimistically Initialised Q-Learning (OPIQ), augments
the Q-value estimates of a DQN-based agent with count-derived bonuses to ensure
optimism during both action selection and bootstrapping. We show that OPIQ
outperforms non-optimistic DQN variants that utilise a pseudocount-based
intrinsic motivation in hard exploration tasks, and that it predicts optimistic
estimates for novel state-action pairs.
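The abstract does not give the exact form of the count-derived bonus, so the following is only a minimal tabular sketch of the idea, not the authors' implementation. It assumes a bonus of the form c / (N(s,a) + 1)^m added to a pessimistically (zero) initialised Q-table, with separate constants for action selection (`c_action`) and bootstrapping (`c_boot`). The function names, constants and bonus form are illustrative assumptions. It shows the two places the abstract says optimism is injected: greedy action selection and the bootstrapped target. In the deep RL setting the exact count table would have to be replaced by some approximate count, which is not sketched here.

```python
import numpy as np


def optimistic_q(q: np.ndarray, counts: np.ndarray, s: int, c: float, m: float) -> np.ndarray:
    """Count-augmented values for every action in state s.

    ASSUMED bonus form (illustrative): Q+(s, a) = Q(s, a) + c / (N(s, a) + 1) ** m.
    Unvisited pairs (N = 0) get the full bonus c, so they stay optimistic even
    though Q itself was initialised pessimistically (to zero here).
    """
    return q[s] + c / (counts[s] + 1.0) ** m


def select_action(q: np.ndarray, counts: np.ndarray, s: int, c_action: float, m: float) -> int:
    # Greedy w.r.t. the augmented values; np.argmax breaks ties by lowest index.
    return int(np.argmax(optimistic_q(q, counts, s, c_action, m)))


def opiq_update(q, counts, s, a, r, s_next, done, alpha, gamma, c_boot, m):
    """One hypothetical OPIQ-style tabular Q-learning step."""
    counts[s, a] += 1  # record the visit to (s, a)
    if done:
        target = r
    else:
        # Bootstrap from augmented values so optimism about rarely tried
        # next-state actions propagates back to earlier states.
        target = r + gamma * np.max(optimistic_q(q, counts, s_next, c_boot, m))
    q[s, a] += alpha * (target - q[s, a])


# Usage: two states, two actions, pessimistic (all-zero) initialisation.
q = np.zeros((2, 2))
counts = np.zeros((2, 2))
opiq_update(q, counts, s=0, a=0, r=0.0, s_next=1, done=False,
            alpha=0.5, gamma=0.9, c_boot=1.0, m=2)
# Tried action 0: Q+ = 0.45 + 1/4 = 0.7; untried action 1 keeps Q+ = 1.0.
print(select_action(q, counts, 0, c_action=1.0, m=2))  # 1
```

After a single zero-reward step, the untried action in state 0 still has the larger augmented value, so the greedy policy picks it without any epsilon-greedy noise. This is the "optimistic estimates for novel state-action pairs" behaviour the abstract describes, here obtained from the count term rather than from the Q-table's initial values.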
Related papers
- Merit-Based Sortition in Decentralized Systems [0.0]
We introduce a simple algorithm for 'merit-based sortition'.
We show that our algorithm boosts the quality metric describing the performance of the active set by more than 2 times the intrinsicity.
This implies that merit-based sortition ensures a statistically significant performance boost to the drafted, 'active' set.
arXiv Detail & Related papers (2024-11-11T19:00:31Z)
- Discovering Preference Optimization Algorithms with and for Large Language Models [50.843710797024805]
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs.
We perform objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention.
Experiments demonstrate the state-of-the-art performance of DiscoPOP, a novel algorithm that adaptively blends logistic and exponential losses.
arXiv Detail & Related papers (2024-06-12T16:58:41Z)
- Poisson Process for Bayesian Optimization [126.51200593377739]
We propose a ranking-based surrogate model based on the Poisson process and introduce an efficient BO framework, namely Poisson Process Bayesian Optimization (PoPBO).
Compared to the classic GP-BO method, our PoPBO has lower costs and better robustness to noise, as verified by extensive experiments.
arXiv Detail & Related papers (2024-02-05T02:54:50Z)
- Pseudo-Likelihood Inference [16.934708242852558]
Pseudo-Likelihood Inference (PLI) is a new method that brings neural approximation into ABC, making it competitive on challenging Bayesian system identification tasks.
PLI allows for optimizing neural posteriors via gradient descent, does not rely on summary statistics, and enables multiple observations as input.
The effectiveness of PLI is evaluated on four classical SBI benchmark tasks and on a highly dynamic physical system.
arXiv Detail & Related papers (2023-11-28T10:17:52Z)
- Pointer Networks with Q-Learning for Combinatorial Optimization [55.2480439325792]
We introduce the Pointer Q-Network (PQN), a hybrid neural architecture that integrates model-free Q-value policy approximation with Pointer Networks (Ptr-Nets).
Our empirical results demonstrate the efficacy of this approach; we also test the model in unstable environments.
arXiv Detail & Related papers (2023-11-05T12:03:58Z)
- Training-Free Neural Active Learning with Initialization-Robustness Guarantees [27.38525683635627]
We introduce our expected variance with Gaussian processes (EV-GP) criterion for neural active learning.
Our EV-GP criterion is training-free, i.e., it does not require any training of the NN during data selection.
arXiv Detail & Related papers (2023-06-07T14:28:42Z)
- A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks [65.34977803841007]
Predictive coding networks are neuroscience-inspired models with roots in both Bayesian statistics and neuroscience.
We show that simply changing the temporal scheduling of the update rule for the synaptic weights leads to an algorithm that is much more efficient and stable than the original one.
arXiv Detail & Related papers (2022-11-16T00:11:04Z)
- Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural network (NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics, defined by minimizing the population loss, that are more suitable for active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z)
- From Understanding Genetic Drift to a Smart-Restart Mechanism for Estimation-of-Distribution Algorithms [16.904475483445452]
We develop a smart-restart mechanism for estimation-of-distribution algorithms (EDAs).
By stopping runs when the risk for genetic drift is high, it automatically runs the EDA in good parameter regimes.
We show that the smart-restart mechanism finds much better values for the population size than those suggested in the literature.
arXiv Detail & Related papers (2022-06-18T02:46:52Z)