Does DQN Learn?
- URL: http://arxiv.org/abs/2205.13617v5
- Date: Tue, 17 Jun 2025 06:02:59 GMT
- Title: Does DQN Learn?
- Authors: Aditya Gopalan, Gugan Thoppe
- Abstract summary: We numerically show that Deep Q-Network (DQN) often returns a policy that performs worse than the initial one. We offer a theoretical explanation for this phenomenon in linear DQN.
- Score: 16.035744751431114
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: A primary requirement for any reinforcement learning method is that it should produce policies that improve upon the initial guess. In this work, we show that the widely used Deep Q-Network (DQN) fails to satisfy this minimal criterion -- even when it gets to see all possible states and actions infinitely often (a condition under which tabular Q-learning is guaranteed to converge to the optimal Q-value function). Our specific contributions are twofold. First, we numerically show that DQN often returns a policy that performs worse than the initial one. Second, we offer a theoretical explanation for this phenomenon in linear DQN, a simplified version of DQN that uses linear function approximation in place of neural networks while retaining the other key components such as $\epsilon$-greedy exploration, experience replay, and target network. Using tools from differential inclusion theory, we prove that the limit points of linear DQN correspond to fixed points of projected Bellman operators. Crucially, we show that these fixed points need not relate to optimal -- or even near-optimal -- policies, thus explaining linear DQN's sub-optimal behaviors. We also give a scenario where linear DQN always identifies the worst policy. Our work fills a longstanding gap in understanding the convergence behaviors of Q-learning with function approximation and $\epsilon$-greedy exploration.
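Below is a minimal sketch of the linear DQN setup described in the abstract, assuming a small toy MDP with one-hot state-action features: the neural network is replaced by a linear function approximator, while $\epsilon$-greedy exploration, experience replay, and a periodically synced target network are kept. The environment, feature map, and hyperparameters are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal sketch of "linear DQN": DQN with the neural network replaced by a
# linear function approximator, keeping epsilon-greedy exploration, experience
# replay, and a target network. Toy MDP, one-hot features, and hyperparameters
# are illustrative assumptions, not the paper's experimental setup.
import numpy as np

rng = np.random.default_rng(0)

# Toy finite MDP (assumed): S states, A actions, discount gamma.
S, A, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
R = rng.uniform(0.0, 1.0, size=(S, A))       # deterministic reward r(s, a)

def phi(s, a):
    """One-hot feature vector for the state-action pair (s, a)."""
    x = np.zeros(S * A)
    x[s * A + a] = 1.0
    return x

def q(s, a, w):
    """Linear Q-value approximation: Q_w(s, a) = w^T phi(s, a)."""
    return w @ phi(s, a)

w = np.zeros(S * A)          # online weights
w_tgt = w.copy()             # target-network weights
replay = []                  # experience-replay buffer
epsilon, alpha = 0.1, 0.05   # exploration rate, step size
batch, sync_every, buf = 32, 200, 10_000

s = 0
for t in range(5_000):
    # epsilon-greedy action selection w.r.t. the online weights
    if rng.random() < epsilon:
        a = int(rng.integers(A))
    else:
        a = int(np.argmax([q(s, b, w) for b in range(A)]))
    s_next = int(rng.choice(S, p=P[s, a]))
    replay.append((s, a, R[s, a], s_next))
    if len(replay) > buf:
        replay.pop(0)
    s = s_next

    # semi-gradient TD update on a minibatch from the replay buffer,
    # bootstrapping from the frozen target weights
    if len(replay) >= batch:
        for i in rng.integers(len(replay), size=batch):
            si, ai, ri, sni = replay[i]
            target = ri + gamma * max(q(sni, b, w_tgt) for b in range(A))
            w += alpha * (target - q(si, ai, w)) * phi(si, ai)

    # periodic target-network synchronisation
    if t % sync_every == 0:
        w_tgt = w.copy()

greedy = [int(np.argmax([q(st, b, w) for b in range(A)])) for st in range(S)]
print("greedy policy from the learned weights:", greedy)
```

With one-hot features this parameterization is effectively tabular, where convergence to the optimal Q-values is guaranteed under infinite visitation; the paper's negative results concern general linear features, for which the limit points are only fixed points of projected Bellman operators and may induce poor policies.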
Related papers
- On the Convergence and Sample Complexity Analysis of Deep Q-Networks with $\epsilon$-Greedy Exploration [86.71396285956044]
This paper provides a theoretical understanding of Deep Q-Network (DQN) with $\varepsilon$-greedy exploration in deep reinforcement learning.
arXiv Detail & Related papers (2023-10-24T20:37:02Z)
- Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL).
We show an analogous result for vanilla Q-functions under a soft margin condition.
Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z)
- Extreme Q-Learning: MaxEnt RL without Entropy [88.97516083146371]
Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains.
We introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT).
Using EVT, we derive our Extreme Q-Learning framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms.
arXiv Detail & Related papers (2023-01-05T23:14:38Z)
- Elastic Step DQN: A novel multi-step algorithm to alleviate overestimation in Deep Q-Networks [2.781147009075454]
The Deep Q-Network (DQN) algorithm was the first reinforcement learning algorithm to use deep neural networks to surpass human-level performance in a number of Atari learning environments.
The unstable behaviour is often characterised by overestimation in the $Q$-values, commonly referred to as the overestimation bias.
This paper proposes a new algorithm that dynamically varies the step size horizon in multi-step updates based on the similarity of states visited.
arXiv Detail & Related papers (2022-10-07T04:56:04Z)
- Sampling Efficient Deep Reinforcement Learning through Preference-Guided Stochastic Exploration [8.612437964299414]
We propose a preference-guided $\epsilon$-greedy exploration algorithm for the Deep Q-Network (DQN).
We show that preference-guided exploration motivates the DQN agent to take diverse actions: actions with larger Q-values are sampled more frequently, while actions with smaller Q-values still have a chance to be explored, thus encouraging exploration (a minimal illustrative sketch of this sampling idea appears after this list).
arXiv Detail & Related papers (2022-06-20T08:23:49Z)
- Mildly Conservative Q-Learning for Offline Reinforcement Learning [63.2183622958666]
Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment.
Existing approaches, which penalize unseen actions or regularize toward the behavior policy, are too pessimistic.
We propose Mildly Conservative Q-learning (MCQ), where out-of-distribution (OOD) actions are actively trained by assigning them proper pseudo Q-values.
arXiv Detail & Related papers (2022-06-09T19:44:35Z) - Can Q-learning solve Multi Armed Bantids? [0.0]
We show that current reinforcement learning algorithms are not capable of solving Multi-Armed-Bandit problems.
This stems from variance differences between policies, which cause two problems.
We propose the Adaptive Symmetric Reward Noising (ASRN) method, which equalizes the reward variance across different policies.
arXiv Detail & Related papers (2021-10-21T07:08:30Z) - Online Target Q-learning with Reverse Experience Replay: Efficiently
finding the Optimal Policy for Linear MDPs [50.75812033462294]
We bridge the gap between the practical success of Q-learning and pessimistic theoretical results.
We present two novel methods, Q-Rex and Q-RexDaRe.
We show that Q-Rex efficiently finds the optimal policy for linear MDPs.
arXiv Detail & Related papers (2021-10-16T01:47:41Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - A Convergent and Efficient Deep Q Network Algorithm [3.553493344868414]
We show that the deep Q network (DQN) reinforcement learning algorithm can diverge and cease to operate in realistic settings.
We propose a convergent DQN algorithm (C-DQN) by carefully modifying DQN.
It learns robustly in difficult settings and can learn several difficult games in the Atari 2600 benchmark.
arXiv Detail & Related papers (2021-06-29T13:38:59Z) - Self-correcting Q-Learning [14.178899938667161]
We introduce a new way to address the overestimation bias, in the form of a "self-correcting algorithm".
Applying this strategy to Q-learning results in Self-correcting Q-learning.
We show theoretically that this new algorithm enjoys the same convergence guarantees as Q-learning while being more accurate.
arXiv Detail & Related papers (2020-12-02T11:36:24Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
We compute unbiased navigation gradients of the value function, which we use as ascent directions to update the policy.
arXiv Detail & Related papers (2020-10-16T15:15:42Z) - Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum.
arXiv Detail & Related papers (2020-09-29T18:48:21Z)
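For the preference-guided exploration paper listed above, the following is a minimal illustrative sketch of the sampling idea its summary describes: actions are drawn with probabilities that increase with their Q-values (here via a softmax preference), so higher-valued actions are sampled more often while lower-valued actions keep a nonzero chance of being explored. Names and parameters are assumptions for illustration; this is not the cited paper's exact algorithm.

```python
# Illustrative sketch of preference-guided exploration: instead of choosing a
# uniformly random action when exploring (as plain epsilon-greedy does), sample
# actions with probabilities that increase with their Q-values (a softmax
# preference). Names and parameters are illustrative assumptions, not the
# cited paper's exact algorithm.
import numpy as np

rng = np.random.default_rng(1)

def preference_guided_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = np.exp((q_values - np.max(q_values)) / temperature)  # numerically stable softmax
    probs = prefs / prefs.sum()
    return int(rng.choice(len(q_values), p=probs))

q_values = np.array([1.0, 0.5, -0.2])
draws = [preference_guided_action(q_values) for _ in range(1000)]
freqs = np.bincount(draws, minlength=len(q_values)) / len(draws)
print("empirical action frequencies:", freqs)  # higher-Q actions are drawn more often
```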
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.