Does DQN Learn?
- URL: http://arxiv.org/abs/2205.13617v4
- Date: Sat, 21 Sep 2024 04:58:24 GMT
- Title: Does DQN Learn?
- Authors: Aditya Gopalan, Gugan Thoppe
- Abstract summary: We show that the widely used Deep Q-Network (DQN) fails to meet even this basic criterion.
We numerically show that DQN generally has a non-trivial probability of producing a policy worse than the initial one.
- Score: 16.035744751431114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For a reinforcement learning method to be useful, the policy it estimates in the limit must be superior to the initial guess, at least on average. In this work, we show that the widely used Deep Q-Network (DQN) fails to meet even this basic criterion, even when it gets to see all possible states and actions infinitely often (a condition that ensures tabular Q-learning's convergence to the optimal Q-value). Our work's key highlights are as follows. First, we numerically show that DQN generally has a non-trivial probability of producing a policy worse than the initial one. Second, we give a theoretical explanation for this behavior in the context of linear DQN, wherein we replace the neural network with a linear function approximation but retain DQN's other key ideas, such as experience replay, target network, and $\epsilon$-greedy exploration. Our main result is that the tail behaviors of linear DQN are governed by invariant sets of a deterministic differential inclusion, a set-valued generalization of a differential equation. Notably, we show that these invariant sets need not align with locally optimal policies, thus explaining DQN's pathological behaviors, such as convergence to sub-optimal policies and policy oscillation. We also provide a scenario where the limiting policy is always the worst. Our work addresses a longstanding gap in understanding the behaviors of Q-learning with function approximation and $\epsilon$-greedy exploration.
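To make the analyzed setting concrete, below is a minimal sketch of linear DQN in the spirit of the abstract: the neural network is replaced by a linear function approximator $Q(s,a) = \phi(s,a)^\top \theta$, while experience replay, a target network, and $\epsilon$-greedy exploration are retained. The environment, feature map, and hyperparameters are illustrative placeholders, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy problem sizes and a fixed random feature map phi(s, a) (placeholders)
n_states, n_actions, d = 5, 3, 8
phi = rng.normal(size=(n_states, n_actions, d))

theta = 0.1 * rng.normal(size=d)   # online linear weights
theta_target = theta.copy()        # target-network weights
replay = []                        # experience-replay buffer

gamma, alpha, eps = 0.9, 0.01, 0.1
batch_size, target_sync, max_steps = 32, 100, 5000

def q_values(s, w):
    """Linear Q-values for all actions in state s: Q(s, a) = phi(s, a)^T w."""
    return phi[s] @ w

def env_step(s, a):
    """Placeholder MDP: random next state and reward (stands in for a real env)."""
    return int(rng.integers(n_states)), float(rng.normal())

s = int(rng.integers(n_states))
for t in range(1, max_steps + 1):
    # epsilon-greedy exploration
    if rng.random() < eps:
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmax(q_values(s, theta)))
    s_next, r = env_step(s, a)
    replay.append((s, a, r, s_next))

    if len(replay) >= batch_size:
        # sample a minibatch and take semi-gradient steps on the DQN loss,
        # bootstrapping from the target network
        for i in rng.integers(len(replay), size=batch_size):
            si, ai, ri, sni = replay[i]
            target = ri + gamma * np.max(q_values(sni, theta_target))
            td_error = target - phi[si, ai] @ theta
            theta = theta + alpha * td_error * phi[si, ai]

    if t % target_sync == 0:
        theta_target = theta.copy()   # periodic target-network sync
    s = s_next
```

The paper's point is that even this simplified recursion can settle on invariant sets of a differential inclusion that need not correspond to locally optimal policies, which is how it explains convergence to sub-optimal policies and policy oscillation.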
Related papers
- On the Convergence and Sample Complexity Analysis of Deep Q-Networks
with $\epsilon$-Greedy Exploration [86.71396285956044]
This paper provides a theoretical understanding of Deep Q-Network (DQN) with $\varepsilon$-greedy exploration in deep reinforcement learning.
arXiv Detail & Related papers (2023-10-24T20:37:02Z) - Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL).
We show an analogous result for vanilla Q-functions under a soft margin condition.
Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z) - Extreme Q-Learning: MaxEnt RL without Entropy [88.97516083146371]
Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains.
We introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT).
Using EVT, we derive our Extreme Q-Learning framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms.
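As a rough illustration of the EVT-based idea, the sketch below uses a Gumbel-regression (LINEX-style) loss whose minimizer approximates the maximal Q-value from sampled actions alone; this is one reading of the objective behind Extreme Q-Learning, and the numbers and coarse grid search are purely illustrative, not a faithful reimplementation.

```python
import numpy as np

def gumbel_regression_loss(q_minus_v, beta=1.0):
    """LINEX / Gumbel-regression loss: its minimizer over v is a soft maximum
    of the sampled Q-values, computed without querying unseen actions."""
    z = np.asarray(q_minus_v) / beta
    return float(np.mean(np.exp(z) - z - 1.0))

# toy check: the minimizing v approaches max(q) as beta -> 0
q = np.array([0.5, 1.0, 2.0])
vs = np.linspace(0.0, 3.0, 601)
best_v = vs[np.argmin([gumbel_regression_loss(q - v, beta=0.05) for v in vs])]
print(best_v)  # close to max(q) = 2.0
```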
arXiv Detail & Related papers (2023-01-05T23:14:38Z) - Sampling Efficient Deep Reinforcement Learning through Preference-Guided
Stochastic Exploration [8.612437964299414]
We propose a preference-guided $\epsilon$-greedy exploration algorithm for Deep Q-Network (DQN).
We show that preference-guided exploration motivates the DQN agent to take diverse actions: actions with larger Q-values are sampled more frequently, while actions with smaller Q-values still have a chance of being explored, which encourages exploration.
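A minimal sketch of this idea (illustrative, not the paper's exact algorithm): keep the greedy branch of $\epsilon$-greedy, but replace uniform random exploration with sampling from a preference distribution over Q-values, here a softmax chosen purely as an example.

```python
import numpy as np

rng = np.random.default_rng(0)

def preference_guided_epsilon_greedy(q_values, eps=0.1, temperature=1.0):
    """Illustrative action selection: with probability 1 - eps act greedily;
    otherwise sample from a preference distribution that favours actions with
    larger Q-values (a softmax is used here purely as an example)."""
    q = np.asarray(q_values, dtype=float)
    if rng.random() >= eps:
        return int(np.argmax(q))
    prefs = np.exp((q - q.max()) / temperature)   # numerically stable softmax
    prefs /= prefs.sum()
    return int(rng.choice(len(q), p=prefs))

# in the exploration branch, higher-Q actions are chosen more often,
# but no action's probability is driven to zero
counts = np.bincount([preference_guided_epsilon_greedy([1.0, 2.0, 0.5], eps=1.0)
                      for _ in range(10_000)], minlength=3)
print(counts)
```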
arXiv Detail & Related papers (2022-06-20T08:23:49Z) - Mildly Conservative Q-Learning for Offline Reinforcement Learning [63.2183622958666]
Offline reinforcement learning (RL) is the task of learning from a static logged dataset without continually interacting with the environment.
Existing approaches, which penalize unseen actions or regularize with the behavior policy, are too pessimistic.
We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values.
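A toy tabular sketch of that sentence, with the pseudo Q-value target chosen here as the best in-support Q-value; MCQ's actual construction and losses may differ, so this only conveys the shape of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcq_style_penalty(q_table, state, in_support_actions, n_ood_samples=2):
    """Toy sketch: regress the Q-values of out-of-distribution (OOD) actions
    toward a pseudo target, taken here as the best in-support Q-value.
    Illustrative only; MCQ's actual construction differs in its details."""
    q = q_table[state]
    pseudo_target = max(q[a] for a in in_support_actions)
    ood_actions = [a for a in range(len(q)) if a not in in_support_actions]
    sampled = rng.choice(ood_actions, size=min(n_ood_samples, len(ood_actions)),
                         replace=False)
    return float(np.mean([(q[a] - pseudo_target) ** 2 for a in sampled]))

q_table = np.array([[0.2, 1.5, -0.3, 0.9]])           # one state, four actions
print(mcq_style_penalty(q_table, 0, in_support_actions=[1, 3]))
```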
arXiv Detail & Related papers (2022-06-09T19:44:35Z) - Can Q-learning solve Multi Armed Bantids? [0.0]
We show that current reinforcement learning algorithms are not capable of solving Multi-Armed-Bandit problems.
This stems from variance differences between policies, which cause two problems.
We propose the Adaptive Symmetric Reward Noising (ASRN) method, which equalizes the reward variance across different policies.
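A toy sketch of the stated idea: track the empirical reward variance of each arm and add zero-mean noise to the rewards of low-variance arms so that the effective variance is roughly equal across arms. The class below is illustrative and not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

class AdaptiveSymmetricRewardNoising:
    """Toy sketch of the idea summarized above: add zero-mean ("symmetric")
    noise so the effective reward variance is roughly equal across arms.
    The details here are illustrative, not the paper's algorithm."""

    def __init__(self, n_arms):
        self.rewards = [[] for _ in range(n_arms)]

    def noised_reward(self, arm, reward):
        self.rewards[arm].append(reward)
        variances = [np.var(r) if len(r) > 1 else 0.0 for r in self.rewards]
        deficit = max(max(variances) - variances[arm], 0.0)
        return reward + rng.normal(0.0, np.sqrt(deficit))  # top up the variance

asrn = AdaptiveSymmetricRewardNoising(n_arms=2)
noised_arm1 = []
for _ in range(1000):
    asrn.noised_reward(0, rng.normal(1.0, 2.0))     # arm 0: genuinely noisy rewards
    noised_arm1.append(asrn.noised_reward(1, 1.0))  # arm 1: deterministic reward
print(np.std(noised_arm1))  # roughly 2.0, matching arm 0's reward noise
```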
arXiv Detail & Related papers (2021-10-21T07:08:30Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
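The mechanism here (Implicit Q-Learning) relies on expectile regression: with an expectile parameter above 0.5, the value function is fit to an upper expectile of Q over dataset actions only, approximating a maximum without evaluating unseen actions. A minimal sketch with illustrative numbers:

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss used by IQL: for tau > 0.5 its minimizer is an
    upper expectile of Q over dataset actions, so unseen actions are never
    evaluated. diff = Q(s, a) - V(s) for (s, a) drawn from the dataset."""
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))

# toy check: with a high expectile, the fitted V(s) moves toward max_a Q(s, a)
q_samples = np.array([0.0, 0.5, 1.0, 2.0])   # Q-values of dataset actions in one state
vs = np.linspace(-1.0, 3.0, 401)
v_star = vs[np.argmin([expectile_loss(q_samples - v, tau=0.9) for v in vs])]
print(v_star)   # around 1.6, well above the mean 0.875 and approaching the max 2.0
```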
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - A Convergent and Efficient Deep Q Network Algorithm [3.553493344868414]
We show that the deep Q network (DQN) reinforcement learning algorithm can diverge and cease to operate in realistic settings.
We propose a convergent DQN algorithm (C-DQN) by carefully modifying DQN.
It learns robustly in difficult settings and can learn several difficult games in the Atari 2600 benchmark.
arXiv Detail & Related papers (2021-06-29T13:38:59Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
We compute unbiased stochastic estimates of the gradient of the value function, which we use as ascent directions to update the policy.
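As a generic illustration of using unbiased stochastic gradient estimates of the value as ascent directions, here is a REINFORCE-style sketch with a softmax policy; the paper's estimator for continuing, non-stationary tasks is constructed differently, so this is only the basic policy-gradient template, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(theta, features):
    """Softmax policy over actions with linear preferences (toy example)."""
    logits = features @ theta
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def policy_gradient_estimate(theta, features, action, ret):
    """Unbiased stochastic estimate of the policy gradient,
    grad log pi(a|s) * return, used as an ascent direction."""
    p = softmax_policy(theta, features)
    grad_log_pi = features[action] - p @ features
    return ret * grad_log_pi

theta = np.zeros(3)
features = rng.normal(size=(2, 3))       # one feature row per action
for _ in range(200):
    a = int(rng.choice(2, p=softmax_policy(theta, features)))
    ret = 1.0 if a == 0 else 0.0         # toy "return": action 0 is better
    theta += 0.1 * policy_gradient_estimate(theta, features, a, ret)
print(softmax_policy(theta, features))   # probability of action 0 has increased
```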
arXiv Detail & Related papers (2020-10-16T15:15:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.