Deep Reinforcement Learning with Weighted Q-Learning
- URL: http://arxiv.org/abs/2003.09280v3
- Date: Mon, 13 Jun 2022 12:45:21 GMT
- Title: Deep Reinforcement Learning with Weighted Q-Learning
- Authors: Andrea Cini, Carlo D'Eramo, Jan Peters, Cesare Alippi
- Abstract summary: Reinforcement learning algorithms based on Q-learning are driving Deep Reinforcement Learning (DRL) research towards solving complex problems.
Q-Learning is known to be positively biased since it learns by using the maximum over noisy estimates of expected values.
We show how our novel Deep Weighted Q-Learning algorithm reduces the bias w.r.t. relevant baselines and provide empirical evidence of its advantages on representative benchmarks.
- Score: 43.823659028488876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning algorithms based on Q-learning are driving Deep
Reinforcement Learning (DRL) research towards solving complex problems and
achieving super-human performance on many of them. Nevertheless, Q-Learning is
known to be positively biased since it learns by using the maximum over noisy
estimates of expected values. Systematic overestimation of the action values
coupled with the inherently high variance of DRL methods can cause errors to
accumulate incrementally, leading learning algorithms to diverge.
Ideally, we would like DRL agents to take into account their own uncertainty
about the optimality of each action, and be able to exploit it to make more
informed estimations of the expected return. In this regard, Weighted
Q-Learning (WQL) effectively reduces bias and shows remarkable results in
stochastic environments. WQL uses a weighted sum of the estimated action
values, where the weights correspond to the probability of each action value
being the maximum; however, the computation of these probabilities is only
practical in the tabular setting. In this work, we provide methodological
advances to benefit from the WQL properties in DRL, by using neural networks
trained with Dropout as an effective approximation of deep Gaussian processes.
In particular, we adopt the Concrete Dropout variant to obtain calibrated
estimates of epistemic uncertainty in DRL. The estimator, then, is obtained by
taking several stochastic forward passes through the action-value network and
computing the weights in a Monte Carlo fashion. Such weights are Bayesian
estimates of the probability of each action value corresponding to the maximum
w.r.t. a posterior probability distribution estimated by Dropout. We show how
our novel Deep Weighted Q-Learning algorithm reduces the bias w.r.t. relevant
baselines and provide empirical evidence of its advantages on representative
benchmarks.
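
To make the estimator described above concrete: instead of bootstrapping on the hard maximum $\max_a \hat{Q}(s', a)$, the target uses the weighted value $\sum_a w_a \hat{Q}(s', a)$, where $w_a$ is the probability that action $a$ attains the maximum, estimated from Dropout samples. Below is a minimal PyTorch sketch of this Monte Carlo weighting step, assuming a dropout-equipped `q_net`; the function name, shapes, and hyperparameters are illustrative, and the Concrete Dropout calibration mentioned in the abstract is not shown.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def weighted_q_target(q_net, next_states, rewards, dones, gamma=0.99, n_samples=30):
    """Monte Carlo Dropout sketch of a weighted Q-learning target.

    `q_net` is assumed to contain dropout layers and to map a batch of states
    to a (batch, n_actions) tensor of action values.
    """
    q_net.train()  # keep dropout active so each pass is a stochastic sample

    # K stochastic estimates of Q(s', .): shape (K, batch, n_actions)
    samples = torch.stack([q_net(next_states) for _ in range(n_samples)])

    # w_a: Monte Carlo estimate of P(a = argmax_a' Q(s', a')) for each state
    argmax = samples.argmax(dim=-1)                             # (K, batch)
    n_actions = samples.shape[-1]
    weights = F.one_hot(argmax, n_actions).float().mean(dim=0)  # (batch, n_actions)

    # Weighted next-state value instead of a hard max over noisy estimates
    q_mean = samples.mean(dim=0)                                # (batch, n_actions)
    next_value = (weights * q_mean).sum(dim=-1)                 # (batch,)

    return rewards + gamma * (1.0 - dones) * next_value
```

Keeping the network in train() mode is what makes each forward pass a distinct sample from the approximate posterior; as the number of passes grows, the weights converge to the estimated probability of each action value being the maximum.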
Related papers
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Q-learning, StochDQN, StochDDQN, all of which integrate this approach for both value-function updates and action selection.
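As a rough illustration of the subset-maximization idea summarized above (a toy sketch, not the paper's actual algorithm), the maximization can be restricted to a random subset of roughly $\log(n)$ actions; the function and variable names below are illustrative.

```python
import math
import random


def stochastic_argmax(q_values, rng=random):
    """Pick the best action among a random subset of about log(n) actions.

    Toy sketch of subset maximization for large discrete action spaces;
    `q_values` is a sequence of action-value estimates for one state.
    """
    n = len(q_values)
    k = max(1, math.ceil(math.log(n)))   # subset size on the order of log(n)
    subset = rng.sample(range(n), k)     # sample k distinct candidate actions
    return max(subset, key=lambda a: q_values[a])
```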
arXiv Detail & Related papers (2024-05-16T17:58:44Z)
- Regularized Q-learning through Robust Averaging [3.4354636842203026]
We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner.
One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance.
We show that 2RA Q-learning converges to the optimal policy and analyze its theoretical mean-squared error.
arXiv Detail & Related papers (2024-05-03T15:57:26Z)
- Exploiting Estimation Bias in Clipped Double Q-Learning for Continous Control Reinforcement Learning Tasks [5.968716050740402]
This paper focuses on addressing and exploiting estimation biases in Actor-Critic methods for continuous control tasks.
We design a Bias Exploiting (BE) mechanism to dynamically select the most advantageous estimation bias during training of the RL agent.
Most state-of-the-art deep RL algorithms can be equipped with the BE mechanism without degrading performance or increasing computational complexity.
arXiv Detail & Related papers (2024-02-14T10:44:03Z)
- Probabilistic Inference in Reinforcement Learning Done Right [37.31057328219418]
A popular perspective in reinforcement learning casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP).
Previous approaches to approximate this quantity can be arbitrarily poor, leading to algorithms that do not implement genuine statistical inference.
We first reveal that this quantity can indeed be used to generate a policy that explores efficiently, as measured by regret.
arXiv Detail & Related papers (2023-11-22T10:23:14Z)
- Equation Discovery with Bayesian Spike-and-Slab Priors and Efficient Kernels [57.46832672991433]
We propose a novel equation discovery method based on Kernel learning and BAyesian Spike-and-Slab priors (KBASS).
We use kernel regression to estimate the target function, which is flexible, expressive, and more robust to data sparsity and noises.
We develop an expectation-propagation expectation-maximization algorithm for efficient posterior inference and function estimation.
arXiv Detail & Related papers (2023-10-09T03:55:09Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving properties of the Q-network during training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
- Estimation Error Correction in Deep Reinforcement Learning for Deterministic Actor-Critic Methods [0.0]
In value-based deep reinforcement learning methods, approximation of value functions induces overestimation bias and leads to suboptimal policies.
We show that in deep actor-critic methods that aim to overcome the overestimation bias, if the reinforcement signals received by the agent have a high variance, a significant underestimation bias arises.
To minimize the underestimation, we introduce a parameter-free, novel deep Q-learning variant.
arXiv Detail & Related papers (2021-09-22T13:49:35Z)
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network.
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.