ConQUR: Mitigating Delusional Bias in Deep Q-learning
- URL: http://arxiv.org/abs/2002.12399v1
- Date: Thu, 27 Feb 2020 19:22:51 GMT
- Title: ConQUR: Mitigating Delusional Bias in Deep Q-learning
- Authors: Andy Su, Jayden Ooi, Tyler Lu, Dale Schuurmans, Craig Boutilier
- Abstract summary: Delusional bias is a fundamental source of error in approximate Q-learning.
We develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class.
- Score: 45.21332566843924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Delusional bias is a fundamental source of error in approximate Q-learning.
To date, the only techniques that explicitly address delusion require
comprehensive search using tabular value estimates. In this paper, we develop
efficient methods to mitigate delusional bias by training Q-approximators with
labels that are "consistent" with the underlying greedy policy class. We
introduce a simple penalization scheme that encourages Q-labels used across
training batches to remain (jointly) consistent with the expressible policy
class. We also propose a search framework that allows multiple Q-approximators
to be generated and tracked, thus mitigating the effect of premature (implicit)
policy commitments. Experimental results demonstrate that these methods can
improve the performance of Q-learning in a variety of Atari games, sometimes
dramatically.
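As an illustration of the penalization idea, here is a minimal sketch of one plausible consistency penalty. The abstract does not give a formula, so everything below is an assumption: the hinge form, the weight `lam`, and the names `consistency_penalty` and `penalized_loss` are hypothetical. The sketch assumes each Q-label was computed by committing to a particular action at the next state, and it penalizes the approximator whenever some other action at that state has a higher Q-value than the committed one.

```python
import numpy as np

def consistency_penalty(q_next, committed_actions):
    """Hypothetical hinge penalty (an assumed form, not taken from the abstract).

    q_next: array of shape (batch, num_actions) holding the approximator's
        Q-values at each next state.
    committed_actions: integer array of shape (batch,) holding the action each
        Q-label assumed the greedy policy takes at that next state.
    Returns the total amount by which any action's Q-value exceeds the
    Q-value of the committed action, summed over the batch.
    """
    q_next = np.asarray(q_next, dtype=float)
    committed_actions = np.asarray(committed_actions, dtype=int)
    rows = np.arange(q_next.shape[0])
    q_committed = q_next[rows, committed_actions][:, None]
    return float(np.maximum(0.0, q_next - q_committed).sum())

def penalized_loss(q_pred, labels, q_next, committed_actions, lam=0.5):
    """Mean squared regression error plus lam times the assumed penalty."""
    q_pred = np.asarray(q_pred, dtype=float)
    labels = np.asarray(labels, dtype=float)
    squared_error = float(np.mean((q_pred - labels) ** 2))
    return squared_error + lam * consistency_penalty(q_next, committed_actions)
```

The penalty is zero exactly when the committed action is already greedy at every next state in the batch, so under these assumptions it only pushes the approximator when the labels and the current greedy policy disagree.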
Related papers
- Legitimate ground-truth-free metrics for deep uncertainty classification scoring [3.9599054392856483]
The use of Uncertainty Quantification (UQ) methods in production remains limited.
This limitation is exacerbated by the challenge of validating UQ methods in the absence of UQ ground truth.
This paper investigates such metrics and proves that they are theoretically well-behaved and actually tied to some uncertainty ground truth.
arXiv Detail & Related papers (2024-10-30T14:14:32Z)
- Regularized Q-learning through Robust Averaging [3.4354636842203026]
We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner.
One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance.
We show that 2RA Q-learning converges to the optimal policy and analyze its theoretical mean-squared error.
arXiv Detail & Related papers (2024-05-03T15:57:26Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Suppressing Overestimation in Q-Learning through Adversarial Behaviors [4.36117236405564]
This paper proposes a new Q-learning algorithm with a dummy adversarial player, which is called dummy adversarial Q-learning (DAQ).
The proposed DAQ unifies several Q-learning variations to control overestimation biases, such as maxmin Q-learning and minmax Q-learning.
The finite-time convergence of DAQ is analyzed from an integrated perspective by adapting adversarial Q-learning.
arXiv Detail & Related papers (2023-10-10T03:46:32Z)
- On the Estimation Bias in Double Q-Learning [20.856485777692594]
Double Q-learning is not fully unbiased and suffers from underestimation bias.
We show that such underestimation bias may lead to multiple non-optimal fixed points under an approximated Bellman operator.
We propose a simple but effective approach as a partial fix for the underestimation bias in double Q-learning.
arXiv Detail & Related papers (2021-09-29T13:41:24Z)
- IQ-Learn: Inverse soft-Q Learning for Imitation [95.06031307730245]
Imitation learning from a small amount of expert data can be challenging in high-dimensional environments with complex dynamics.
Behavioral cloning is a simple method that is widely used due to its simplicity of implementation and stable convergence.
We introduce a method for dynamics-aware IL which avoids adversarial training by learning a single Q-function.
arXiv Detail & Related papers (2021-06-23T03:43:10Z)
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network.
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas in the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy to prevent deterioration of performance.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
- Q-Learning with Differential Entropy of Q-Tables [4.221871357181261]
We conjecture that the reduction in performance during prolonged training sessions of Q-learning is caused by a loss of information.
We introduce Differential Entropy of Q-tables (DE-QT) as an external information loss detector to the Q-learning algorithm.
arXiv Detail & Related papers (2020-06-26T04:37:10Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.