Is Q-learning an Ill-posed Problem?
- URL: http://arxiv.org/abs/2502.14365v1
- Date: Thu, 20 Feb 2025 08:42:30 GMT
- Title: Is Q-learning an Ill-posed Problem?
- Authors: Philipp Wissmann, Daniel Hein, Steffen Udluft, Thomas Runkler
- Abstract summary: This paper investigates the instability of Q-learning in continuous environments.
We show that even in relatively simple benchmarks, the fundamental task of Q-learning can be inherently ill-posed and prone to failure.
- Score: 2.4424095531386234
- License:
- Abstract: This paper investigates the instability of Q-learning in continuous environments, a challenge frequently encountered by practitioners. Traditionally, this instability is attributed to bootstrapping and regression model errors. Using a representative reinforcement learning benchmark, we systematically examine the effects of bootstrapping and model inaccuracies by incrementally eliminating these potential error sources. Our findings reveal that even in relatively simple benchmarks, the fundamental task of Q-learning - iteratively learning a Q-function from policy-specific target values - can be inherently ill-posed and prone to failure. These insights cast doubt on the reliability of Q-learning as a universal solution for reinforcement learning problems.
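To make the task under discussion concrete, the following minimal sketch (not the authors' code; the names QNet, q_iteration_step, batch and GAMMA are illustrative assumptions) shows the iterative procedure the abstract refers to: repeatedly regressing a Q-function onto bootstrapped, policy-specific target values.
```python
# Minimal sketch of (deep) Q-learning's core loop, for orientation only.
import torch
import torch.nn as nn

GAMMA = 0.99  # assumed discount factor

class QNet(nn.Module):
    """Small Q-function approximator (illustrative architecture)."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, obs):
        return self.net(obs)

def q_iteration_step(q_net, target_net, optimizer, batch):
    """One regression step onto bootstrapped targets.

    batch = (obs, actions, rewards, next_obs, dones), with actions as
    integer indices. The target r + gamma * max_a' Q_target(s', a') is the
    policy-specific, bootstrapped value the abstract refers to.
    """
    obs, actions, rewards, next_obs, dones = batch
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
        targets = rewards + GAMMA * (1.0 - dones) * next_q  # bootstrapping
    q_pred = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_pred, targets)           # regression error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
The paper's finding, as summarized above, is that even when these two usual error sources (the bootstrapped target and the regression model) are incrementally eliminated, the underlying task of iteratively fitting a Q-function to such targets can remain ill-posed.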
Related papers
- Temporal-Difference Variational Continual Learning [89.32940051152782]
A crucial capability of Machine Learning models in real-world applications is the ability to continuously learn new tasks.
In Continual Learning settings, models often struggle to balance learning new tasks with retaining previous knowledge.
We propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations.
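As a hedged illustration of the kind of objective described here (not the paper's exact loss; all names are assumptions), a continual-learning regularizer can integrate several previous posterior estimates by averaging KL terms from the current variational posterior to each stored one:
```python
# Illustrative sketch only: average KL regularization against multiple
# previously estimated Gaussian posteriors.
import torch
import torch.distributions as D

def multi_posterior_regularizer(current_mu, current_logvar, previous_params):
    """previous_params: list of (mu, logvar) tensors from earlier tasks."""
    q_cur = D.Normal(current_mu, torch.exp(0.5 * current_logvar))
    kls = []
    for mu_prev, logvar_prev in previous_params:
        q_prev = D.Normal(mu_prev, torch.exp(0.5 * logvar_prev))
        kls.append(D.kl_divergence(q_cur, q_prev).sum())
    return torch.stack(kls).mean()  # added to the task's likelihood loss
```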
arXiv Detail & Related papers (2024-10-10T10:58:41Z)
- Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering [63.12469700986452]
We introduce the concept of uncertainty-aware curriculum learning (CL).
Here, uncertainty serves as the guiding principle for dynamically adjusting the difficulty.
In practice, we seamlessly integrate the VideoQA model into our framework and conduct comprehensive experiments.
arXiv Detail & Related papers (2024-01-03T02:29:34Z)
- Suppressing Overestimation in Q-Learning through Adversarial Behaviors [4.36117236405564]
This paper proposes a new Q-learning algorithm with a dummy adversarial player, called dummy adversarial Q-learning (DAQ).
The proposed DAQ unifies several Q-learning variations that control overestimation bias, such as maxmin Q-learning and minmax Q-learning.
The finite-time convergence of DAQ is analyzed from an integrated perspective by adapting adversarial Q-learning.
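For orientation, here is a generic sketch of a maxmin-style target of the kind such overestimation-control variants build on; it is not the DAQ algorithm itself, and the names (q_ensemble, dones, gamma) are assumptions.
```python
# Generic maxmin-style bootstrapped target: act greedily with respect to the
# minimum over an ensemble of Q-estimates to counteract the overestimation
# introduced by the max operator in standard Q-learning.
import torch

def maxmin_target(q_ensemble, rewards, next_obs, dones, gamma=0.99):
    with torch.no_grad():
        q_all = torch.stack([q(next_obs) for q in q_ensemble])  # (N, batch, actions)
        q_min = q_all.min(dim=0).values        # pessimistic ensemble estimate
        next_v = q_min.max(dim=1).values       # greedy value w.r.t. the minimum
        return rewards + gamma * (1.0 - dones) * next_v
```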
arXiv Detail & Related papers (2023-10-10T03:46:32Z)
- Offline Reinforcement Learning with On-Policy Q-Function Regularization [57.09073809901382]
We deal with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy.
We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.
arXiv Detail & Related papers (2023-07-25T21:38:08Z)
- Shortcomings of Question Answering Based Factuality Frameworks for Error Localization [51.01957350348377]
We show that question answering (QA)-based factuality metrics fail to correctly identify error spans in generated summaries.
Our analysis reveals a major reason for such poor localization: questions generated by the QG module often inherit errors from non-factual summaries which are then propagated further into downstream modules.
Our experiments conclusively show that there exist fundamental issues with localization using the QA framework which cannot be fixed solely by stronger QA and QG models.
arXiv Detail & Related papers (2022-10-13T05:23:38Z)
- Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs [50.75812033462294]
We bridge the gap between the practical success of Q-learning and pessimistic theoretical results.
We present two novel methods, Q-Rex and Q-RexDaRe.
We show that Q-Rex efficiently finds the optimal policy for linear MDPs.
arXiv Detail & Related papers (2021-10-16T01:47:41Z)
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm, aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network.
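A hedged sketch of one plausible reading of this idea (not the paper's exact algorithm; names are assumptions): keep several parallel Q-networks and, for each target, draw the action-selecting and action-evaluating networks at random so that no single model's errors dominate the bootstrap.
```python
# Illustrative double-Q-style target with a randomly selected evaluator network.
import random
import torch

def random_ensemble_target(q_ensemble, rewards, next_obs, dones, gamma=0.99):
    with torch.no_grad():
        selector, evaluator = random.sample(list(q_ensemble), 2)
        greedy_a = selector(next_obs).argmax(dim=1, keepdim=True)
        next_q = evaluator(next_obs).gather(1, greedy_a).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```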
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
- Q-Learning with Differential Entropy of Q-Tables [4.221871357181261]
We conjecture that the reduction in performance during prolonged training sessions of Q-learning is caused by a loss of information.
We introduce the Differential Entropy of Q-tables (DE-QT) as an external information-loss detector for the Q-learning algorithm.
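As a rough illustration only (the exact DE-QT computation is in the linked paper), the differential entropy of a tabular Q-function can be estimated, for instance, with a histogram estimator over its entries; a sustained drop in such an estimate during training would indicate the conjectured information loss.
```python
# Generic histogram-based differential entropy estimate over Q-table entries;
# not the paper's DE-QT implementation.
import numpy as np

def q_table_entropy(q_table: np.ndarray, bins: int = 32) -> float:
    values = q_table.ravel()
    hist, edges = np.histogram(values, bins=bins, density=True)
    widths = np.diff(edges)
    nonzero = hist > 0
    # H ~ -sum_i p_i * log(p_i) * bin_width_i on the histogram grid
    return float(-np.sum(hist[nonzero] * np.log(hist[nonzero]) * widths[nonzero]))
```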
arXiv Detail & Related papers (2020-06-26T04:37:10Z)
- Self-Imitation Learning via Generalized Lower Bound Q-learning [23.65188248947536]
Self-imitation learning motivated by lower-bound Q-learning is a novel and effective approach for off-policy learning.
We propose an n-step lower bound that generalizes the original return-based lower-bound Q-learning.
We show that n-step lower bound Q-learning is a more robust alternative to return-based self-imitation learning and uncorrected n-step Q-learning.
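For context, a hedged sketch of an n-step lower-bound target in this spirit (not necessarily the paper's exact operator; all names are assumptions): the Q-estimate is pushed up toward the discounted n-step return plus a bootstrapped tail value, but never pushed down by it.
```python
# Illustrative n-step lower-bound target and hinge-style loss.
import torch

def n_step_lower_bound_target(rewards, bootstrap_value, gamma=0.99):
    """rewards: tensor (n,) with r_t ... r_{t+n-1}; bootstrap_value: value at s_{t+n}."""
    n = rewards.shape[0]
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
    return (discounts * rewards).sum() + (gamma ** n) * bootstrap_value

def lower_bound_loss(q_value, lb_target):
    # Penalize only underestimation relative to the lower bound.
    return torch.clamp(lb_target - q_value, min=0.0) ** 2
```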
arXiv Detail & Related papers (2020-06-12T19:52:04Z)
- ConQUR: Mitigating Delusional Bias in Deep Q-learning [45.21332566843924]
Delusional bias is a fundamental source of error in approximate Q-learning.
We develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class.
arXiv Detail & Related papers (2020-02-27T19:22:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all of its content) and is not responsible for any consequences of its use.