Related papers: Optimistic Exploration even with a Pessimistic Initialisation

URL: http://arxiv.org/abs/2002.12174v1
Date: Wed, 26 Feb 2020 17:15:53 GMT
Title: Optimistic Exploration even with a Pessimistic Initialisation
Authors: Tabish Rashid, Bei Peng, Wendelin Böhmer, Shimon Whiteson
Abstract summary: Optimistic initialisation is an effective strategy for efficient exploration in reinforcement learning (RL). In particular, in scenarios with only positive rewards, Q-values are initialised at their lowest possible values. We propose a simple count-based augmentation to pessimistically initialised Q-values that separates the source of optimism from the neural network.
Score: 57.41327865257504
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Optimistic initialisation is an effective strategy for efficient exploration in reinforcement learning (RL). In the tabular case, all provably efficient model-free algorithms rely on it. However, model-free deep RL algorithms do not use optimistic initialisation despite taking inspiration from these provably efficient tabular algorithms. In particular, in scenarios with only positive rewards, Q-values are initialised at their lowest possible values due to commonly used network initialisation schemes, a pessimistic initialisation. Merely initialising the network to output optimistic Q-values is not enough, since we cannot ensure that they remain optimistic for novel state-action pairs, which is crucial for exploration. We propose a simple count-based augmentation to pessimistically initialised Q-values that separates the source of optimism from the neural network. We show that this scheme is provably efficient in the tabular setting and extend it to the deep RL setting. Our algorithm, Optimistic Pessimistically Initialised Q-Learning (OPIQ), augments the Q-value estimates of a DQN-based agent with count-derived bonuses to ensure optimism during both action selection and bootstrapping. We show that OPIQ outperforms non-optimistic DQN variants that utilise a pseudocount-based intrinsic motivation in hard exploration tasks, and that it predicts optimistic estimates for novel state-action pairs.
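The count-based augmentation described in the abstract can be sketched for the tabular case as below. This is a minimal sketch, assuming a bonus of the form C / (N(s, a) + 1)^M added to Q-values that start at zero (pessimistic when rewards are non-negative), with separate constants for action selection and bootstrapping. The function names and default constants are hypothetical and not taken from the authors' code.

```python
import numpy as np


def optimistic_q(q, counts, state, c, m):
    """Count-augmented values Q+(s, .) = Q(s, .) + c / (N(s, .) + 1) ** m."""
    return q[state] + c / (counts[state] + 1.0) ** m


def select_action(q, counts, s, c_action=1.0, m=2.0):
    """Greedy action with respect to the optimistic values (ties -> lowest index)."""
    return int(np.argmax(optimistic_q(q, counts, s, c_action, m)))


def opiq_update(q, counts, s, a, r, s_next, done,
                alpha=0.1, gamma=0.99, c_bootstrap=1.0, m=2.0):
    """One tabular Q-learning step whose bootstrap target is also optimistic."""
    counts[s, a] += 1  # record the visit to (s, a)
    if done:
        target = r  # no bootstrapping from a terminal state
    else:
        # Bootstrap from the count-augmented values of the next state, so
        # rarely tried next actions keep the target optimistic.
        target = r + gamma * np.max(optimistic_q(q, counts, s_next, c_bootstrap, m))
    q[s, a] += alpha * (target - q[s, a])
```

Because an untried pair has N = 0, it receives the full bonus C, so a greedy agent over Q+ still tries it even though the underlying table was initialised at zero; as N grows the bonus decays and the learned Q-values dominate.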

Related papers

Natural Evolutionary Search meets Probabilistic Numerics [24.753011922443513]
We introduce a novel class of algorithms, termed Probabilistic Natural Evolutionary Strategy Algorithms (ProbNES). We show that ProbNES algorithms consistently outperform their non-probabilistic counterparts as well as global sample efficient methods.
arXiv Detail & Related papers (2025-07-09T21:15:50Z)
Understanding Inverse Reinforcement Learning under Overparameterization: Non-Asymptotic Analysis and Global Optimality [52.906438147288256]
We show that our algorithm can identify the globally optimal reward and policy under certain neural network structures. This is the first IRL algorithm with a non-asymptotic convergence guarantee that provably achieves global optimality.
arXiv Detail & Related papers (2025-03-22T21:16:08Z)
Minimax Optimal Reinforcement Learning with Quasi-Optimism [9.410437324336275]
We introduce EQO (Exploration via Quasi-Optimism) as a new type of reinforcement learning algorithm. It avoids reliance on empirical variances and employs a simple bonus term proportional to the inverse of the state-action visit count. It consistently outperforms existing algorithms in both regret performance and computational efficiency.
arXiv Detail & Related papers (2025-03-02T09:32:06Z)
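To make "a bonus term proportional to the inverse of the state-action visit count" concrete, the sketch below compares it with the more common inverse-square-root bonus. The scale `c` and the convention of treating unvisited pairs as visited once are assumptions for illustration; the summary does not give EQO's actual constants.

```python
import math


def inverse_count_bonus(n, c=1.0):
    """Bonus proportional to 1 / N(s, a); n = 0 is treated as n = 1 (assumption)."""
    return c / max(n, 1)


def inverse_sqrt_count_bonus(n, c=1.0):
    """The more common 1 / sqrt(N(s, a)) bonus, shown for comparison."""
    return c / math.sqrt(max(n, 1))


# Print both bonuses for a few visit counts.
for n in (1, 4, 100):
    print(n, inverse_count_bonus(n), inverse_sqrt_count_bonus(n))
```

After 100 visits the 1/N bonus has shrunk to 0.01 while the 1/sqrt(N) bonus is still 0.1, so the inverse-count bonus stops encouraging revisits much sooner.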
Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models. We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z)
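The summary does not define self-certainty. The sketch below assumes it is a per-token divergence of the model's predicted distribution from the uniform distribution (here KL(U || p), averaged over generated tokens), so sharper, more confident distributions score higher, and Best-of-N simply keeps the highest-scoring response. Treat this definition and the helper names as assumptions, not the paper's exact formula.

```python
import numpy as np


def self_certainty(token_probs):
    """Average KL(U || p_t) over the generated tokens of one response.

    token_probs: array of shape (T, V); row t is the next-token distribution
    at step t. All entries must be > 0 because the formula takes log p.
    """
    p = np.asarray(token_probs, dtype=float)
    vocab_size = p.shape[1]
    # KL(U || p) = sum_j (1/V) * log((1/V) / p_j) = -mean_j log(V * p_j)
    kl_per_token = -np.mean(np.log(vocab_size * p), axis=1)
    return float(np.mean(kl_per_token))


def best_of_n(responses, token_probs_per_response):
    """Return the response with the highest self-certainty, plus all scores."""
    scores = [self_certainty(tp) for tp in token_probs_per_response]
    return responses[int(np.argmax(scores))], scores
```

A uniform distribution scores 0 and any sharper distribution scores more, so no external reward model is needed to rank the N candidates.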
Large Language Models Can Help Mitigate Barren Plateaus [2.384873896423002]
Quantum Neural Networks (QNNs) have emerged as a promising approach for various applications, yet their training is often hindered by barren plateaus (BPs). We propose a new Large Language Model (LLM)-driven search framework, AdaInit, that iteratively searches for optimal initial parameters of QNNs to maximize gradient variance and thereby mitigate BPs.
arXiv Detail & Related papers (2025-02-17T05:57:15Z)
Merit-Based Sortition in Decentralized Systems [0.0]
We introduce a simple algorithm for 'merit-based sortition'. We show that our algorithm boosts the quality metric describing the performance of the active set by more than 2 times the intrinsicity. This implies that merit-based sortition ensures a statistically significant performance boost to the drafted, 'active' set.
arXiv Detail & Related papers (2024-11-11T19:00:31Z)
Discovering Preference Optimization Algorithms with and for Large Language Models [50.843710797024805]
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs. We perform objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention. Experiments demonstrate the state-of-the-art performance of DiscoPOP, a novel algorithm that adaptively blends logistic and exponential losses.
arXiv Detail & Related papers (2024-06-12T16:58:41Z)
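One way to "adaptively blend" a logistic and an exponential preference loss is to gate between them with a sigmoid of the log-ratio margin rho between the chosen and rejected responses. The sketch below does that; the gate, `beta`, and `tau` are assumptions for illustration, not DiscoPOP's published definition.

```python
import math


def stable_sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x):
    """log(1 + exp(x)), computed without overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def blended_preference_loss(rho, beta=0.05, tau=0.05):
    """Blend of logistic and exponential losses on a preference margin.

    rho: (log pi(y_w|x) - log ref(y_w|x)) - (log pi(y_l|x) - log ref(y_l|x)),
    i.e. how much more the policy favours the chosen response than the
    reference model does.
    """
    logistic = softplus(-beta * rho)  # -log sigmoid(beta * rho), DPO-style
    exponential = math.exp(-beta * rho)  # exponential loss
    w = stable_sigmoid(rho / tau)  # hypothetical gate: large margins favour the exponential term
    return (1.0 - w) * logistic + w * exponential
```

At rho = 0 the two losses are weighted equally; for large positive margins the gate saturates and the loss reduces to the exponential term.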
Poisson Process for Bayesian Optimization [126.51200593377739]
We propose a ranking-based surrogate model based on the Poisson process and introduce an efficient BO framework, namely Poisson Process Bayesian Optimization (PoPBO). Compared to the classic GP-BO method, PoPBO has lower costs and better robustness to noise, as verified by extensive experiments.
arXiv Detail & Related papers (2024-02-05T02:54:50Z)
Pseudo-Likelihood Inference [16.934708242852558]
Pseudo-Likelihood Inference (PLI) is a new method that brings neural approximation into ABC, making it competitive on challenging Bayesian system identification tasks. PLI allows for optimizing neural posteriors via gradient descent, does not rely on summary statistics, and enables multiple observations as input. The effectiveness of PLI is evaluated on four classical SBI benchmark tasks and on a highly dynamic physical system.
arXiv Detail & Related papers (2023-11-28T10:17:52Z)
Pointer Networks with Q-Learning for Combinatorial Optimization [55.2480439325792]
We introduce the Pointer Q-Network (PQN), a hybrid neural architecture that integrates model-free Q-value policy approximation with Pointer Networks (Ptr-Nets). Our empirical results demonstrate the efficacy of this approach, and the model is also tested in unstable environments.
arXiv Detail & Related papers (2023-11-05T12:03:58Z)
Training-Free Neural Active Learning with Initialization-Robustness Guarantees [27.38525683635627]
We introduce our expected variance with Gaussian processes (EV-GP) criterion for neural active learning. Our EV-GP criterion is training-free, i.e., it does not require any training of the NN during data selection.
arXiv Detail & Related papers (2023-06-07T14:28:42Z)
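A training-free, expected-variance selection rule can be sketched with an ordinary Gaussian process: greedily add the pool point that most lowers the average GP posterior variance over the pool, without fitting any network. The RBF kernel, noise level, and greedy loop are assumptions for illustration; the summary does not state the kernel or optimisation procedure EV-GP actually uses.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of a and b."""
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale**2)


def posterior_variance(pool, selected, noise=1e-2, lengthscale=1.0):
    """GP posterior variance at every pool point after observing `selected`."""
    prior = np.ones(len(pool))  # k(x, x) = 1 for the RBF kernel
    if not selected:
        return prior
    xs = pool[selected]
    k_ss = rbf_kernel(xs, xs, lengthscale) + noise * np.eye(len(selected))
    k_ps = rbf_kernel(pool, xs, lengthscale)  # shape (n, m)
    sol = np.linalg.solve(k_ss, k_ps.T)  # K_ss^{-1} k_sp, shape (m, n)
    return prior - np.sum(k_ps * sol.T, axis=1)


def greedy_expected_variance_selection(pool, budget, noise=1e-2, lengthscale=1.0):
    """Pick `budget` indices, each minimising the mean posterior variance."""
    selected = []
    for _ in range(budget):
        best_i, best_score = None, np.inf
        for i in range(len(pool)):
            if i in selected:
                continue
            score = posterior_variance(pool, selected + [i], noise, lengthscale).mean()
            if score < best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected
```

On a pool with two well-separated clusters, a budget of two picks one point from each cluster, since a second point from the same cluster barely reduces the remaining variance.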
A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks [65.34977803841007]
Predictive coding networks are neuroscience-inspired models with roots in both Bayesian statistics and neuroscience. We show that simply changing the temporal scheduling of the update rule for the synaptic weights leads to an algorithm that is much more efficient and stable than the original one.
arXiv Detail & Related papers (2022-11-16T00:11:04Z)
Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network (NN)-based active learning algorithms for the non-parametric streaming setting. We introduce two regret metrics, based on minimizing the population loss, that are more suitable for active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z)
From Understanding Genetic Drift to a Smart-Restart Mechanism for Estimation-of-Distribution Algorithms [16.904475483445452]
We develop a smart-restart mechanism for Estimation-of-distribution algorithms (EDAs). By stopping runs when the risk of genetic drift is high, it automatically runs the EDA in good parameter regimes. We show that the smart-restart mechanism finds much better values for the population size than those suggested in the literature.
arXiv Detail & Related papers (2022-06-18T02:46:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.