Enhancing Q-Value Updates in Deep Q-Learning via Successor-State Prediction
- URL: http://arxiv.org/abs/2511.03836v1
- Date: Wed, 05 Nov 2025 20:04:53 GMT
- Title: Enhancing Q-Value Updates in Deep Q-Learning via Successor-State Prediction
- Authors: Lipeng Zu, Hansong Zhou, Xiaonan Zhang
- Abstract summary: Deep Q-Networks (DQNs) estimate future returns by learning from transitions sampled from a replay buffer. SADQ integrates successor-state distributions into the Q-value estimation process. We provide theoretical guarantees that SADQ maintains unbiased value estimates while reducing training variance.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep Q-Networks (DQNs) estimate future returns by learning from transitions sampled from a replay buffer. However, the target updates in DQN often rely on next states generated by actions from a past, potentially suboptimal, policy. As a result, these states may not provide informative learning signals, introducing high variance into the update process. This issue is exacerbated when the sampled transitions are poorly aligned with the agent's current policy. To address this limitation, we propose the Successor-state Aggregation Deep Q-Network (SADQ), which explicitly models environment dynamics using a stochastic transition model. SADQ integrates successor-state distributions into the Q-value estimation process, enabling more stable and policy-aligned value updates. It also exploits the modeled transition structure for a more efficient action-selection strategy. We provide theoretical guarantees that SADQ maintains unbiased value estimates while reducing training variance. Extensive empirical results across standard RL benchmarks and real-world vector-based control tasks demonstrate that SADQ consistently outperforms DQN variants in both stability and learning efficiency.
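The core idea of aggregating over successor states can be illustrated with a small tabular sketch. Everything below is hypothetical (the table sizes, the Dirichlet-sampled transition model `P`, and the reward table `R` are illustration-only stand-ins for SADQ's learned stochastic model): the standard DQN target bootstraps from the single sampled next state, while a successor-aggregated target takes the expectation of the bootstrap under the modelled successor-state distribution, which leaves the estimate unbiased but removes the sampling variance over next states.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 2
gamma = 0.99
Q = rng.normal(size=(n_states, n_actions))   # current Q-table (toy stand-in for the Q-network)
R = rng.normal(size=(n_states, n_actions))   # reward model (assumed known here)
# Hypothetical learned transition model: P[s, a] is a distribution
# over successor states (each row sums to 1).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def dqn_target(s, a, s_next):
    """Standard DQN target: bootstraps from the one sampled successor."""
    return R[s, a] + gamma * Q[s_next].max()

def aggregated_target(s, a):
    """Successor-state aggregation (sketch): average the bootstrap over
    the modelled successor-state distribution instead of one sample."""
    return R[s, a] + gamma * P[s, a] @ Q.max(axis=1)

# The aggregated target is the expectation of the sampled targets, so
# averaging many single-sample targets should recover it.
samples = [dqn_target(0, 1, s_next)
           for s_next in rng.choice(n_states, size=20000, p=P[0, 1])]
```

Each single-sample target fluctuates with the drawn successor; the aggregated target is a fixed number equal to their expectation, which is the variance-reduction effect the abstract describes.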
Related papers
- In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior [53.21550098214227]
In-context reinforcement learning promises fast adaptation to unseen environments without parameter updates. We introduce SPICE, a Bayesian ICRL method that learns a prior over Q-values via a deep ensemble and updates this prior at test time. We prove that SPICE achieves regret-optimal behaviour in both bandits and finite-horizon MDPs, even when pretrained only on suboptimal trajectories.
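The summary describes fusing a learned prior over Q-values with test-time evidence. A generic precision-weighted Gaussian fusion illustrates the mechanism (this is a textbook Bayesian update, not SPICE's exact rule; the function name and the Gaussian assumption are mine):

```python
def fuse_gaussian(prior_mu, prior_var, obs_mu, obs_var):
    """Precision-weighted fusion of a Gaussian prior over a Q-value
    with in-context evidence: precisions add, means are averaged
    weighted by precision. A generic sketch of a Bayesian Q update."""
    post_precision = 1.0 / prior_var + 1.0 / obs_var
    post_var = 1.0 / post_precision
    post_mu = post_var * (prior_mu / prior_var + obs_mu / obs_var)
    return post_mu, post_var
```

With an equally confident prior and observation, the posterior mean lands halfway between them and the posterior variance halves, which is why such updates tighten as context accumulates.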
arXiv Detail & Related papers (2026-01-06T13:41:31Z)
- Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state-of-the-art in offline reinforcement learning (RL).
QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM).
Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
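The combined objective in the summary, sequence-modeling loss plus an action-value-maximizing term, can be sketched with toy arrays (the shapes, the softmax policy head, and the trade-off weight `alpha` are assumptions; the real QT uses a Transformer policy and a learned critic):

```python
import numpy as np

rng = np.random.default_rng(1)
batch, n_actions = 4, 3

logits = rng.normal(size=(batch, n_actions))              # policy head outputs
dataset_actions = rng.integers(0, n_actions, size=batch)  # actions from offline data
q_values = rng.normal(size=(batch, n_actions))            # critic estimates

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

probs = softmax(logits)
# Conditional-sequence-modeling term: negative log-likelihood of the
# dataset actions (behaviour cloning on trajectories).
csm_loss = -np.log(probs[np.arange(batch), dataset_actions]).mean()
# Q-regularization term: reward putting probability on actions the
# critic scores highly (negated so minimizing the loss maximizes Q).
q_term = -(probs * q_values).sum(axis=-1).mean()
alpha = 0.5                                               # trade-off weight (assumed)
loss = csm_loss + alpha * q_term
```

The `alpha` knob is what balances staying close to the dataset (CSM) against stitching toward higher-value actions (the Q term).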
arXiv Detail & Related papers (2024-05-27T12:12:39Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2022-10-07T04:56:04Z)
- Elastic Step DQN: A novel multi-step algorithm to alleviate overestimation in Deep Q-Networks [2.781147009075454]
The Deep Q-Networks algorithm (DQN) was the first reinforcement learning algorithm to use a deep neural network to surpass human-level performance in a number of Atari learning environments. However, its training can be unstable; the unstable behaviour is often characterised by overestimation of the $Q$-values, commonly referred to as the overestimation bias.
This paper proposes a new algorithm that dynamically varies the step size horizon in multi-step updates based on the similarity of states visited.
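A minimal sketch of a state-similarity-driven multi-step horizon, under one plausible reading of the rule above (the cosine-similarity measure, the threshold, the cap, and the "extend while consecutive states stay similar" criterion are all assumptions, not the paper's exact algorithm):

```python
import numpy as np

gamma = 0.99

def elastic_horizon(states, threshold=0.9, cap=5):
    """Pick the multi-step horizon n: keep extending while consecutive
    state vectors are similar (assumed cosine-similarity rule), up to a cap."""
    n = 1
    while n < cap and n < len(states) - 1:
        a, b = states[n - 1], states[n]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if cos < threshold:
            break
        n += 1
    return n

def n_step_return(rewards, bootstrap_value, n):
    """Standard truncated n-step return with a bootstrap value at step n."""
    g = sum(gamma ** k * rewards[k] for k in range(n))
    return g + gamma ** n * bootstrap_value
```

The point of varying `n` this way is that longer horizons propagate real rewards further (reducing bootstrap bias from overestimated targets) while the similarity check truncates the horizon once the trajectory moves into a different region of state space.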
arXiv Detail & Related papers (2022-10-07T04:56:04Z) - Topological Experience Replay [22.84244156916668]
Deep Q-learning methods update Q-values using state transitions sampled from the experience replay buffer.
We organize the agent's experience into a graph that explicitly tracks the dependency between Q-values of states.
We empirically show that our method is substantially more data-efficient than several baselines on a diverse range of goal-reaching tasks.
arXiv Detail & Related papers (2022-03-29T18:28:20Z) - Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates [110.92598350897192]
Q-Learning has proven effective at learning a policy to perform control tasks.
Estimation noise becomes a bias after the max operator in the policy improvement step.
We present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state Markov Decision Processes.
arXiv Detail & Related papers (2021-10-28T00:07:19Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
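The mechanism that lets this method avoid evaluating out-of-dataset actions is expectile regression on the value function, which can be shown in a few lines (the function name and the example expectile `tau` are mine; the asymmetric-L2 form is the standard implicit Q-learning value loss):

```python
import numpy as np

def expectile_loss(diff, tau=0.9):
    """Asymmetric L2 on residuals diff = Q(s, a) - V(s), computed only on
    dataset actions. Residuals where Q exceeds V get weight tau, the rest
    get 1 - tau, so for tau > 0.5 the fitted V drifts toward an upper
    expectile of in-dataset Q-values without ever querying unseen actions."""
    weight = np.where(diff > 0, tau, 1 - tau)
    return (weight * diff ** 2).mean()
```

At `tau = 0.5` this reduces to a scaled mean-squared error; pushing `tau` toward 1 makes `V` approximate a max over dataset actions, which is what drives policy improvement purely by generalization.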
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network.
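A tabular sketch of that target construction (the ensemble size, table shapes, and the specific select-then-evaluate split are assumptions in the spirit of double Q-learning, not the paper's exact update):

```python
import numpy as np

rng = np.random.default_rng(2)
K, n_states, n_actions = 4, 6, 3
gamma = 0.99
# Ensemble of K independently initialized Q-tables (parallel models).
Qs = rng.normal(size=(K, n_states, n_actions))

def cross_q_target(reward, s_next):
    """Cross-Q-style target (sketch): pick the greedy action with one
    randomly chosen ensemble member and evaluate it with another, so the
    same noise never both selects and scores the action, which damps the
    max-operator overestimation as in double Q-learning."""
    i, j = rng.choice(K, size=2, replace=False)
    a_star = Qs[i, s_next].argmax()
    return reward + gamma * Qs[j, s_next, a_star]
```

Because the evaluating network `j` differs from the selecting network `i`, the bootstrapped value is an ordinary (not max-biased) estimate of the chosen action's value.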
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
- Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors [13.534873779043478]
We present a distributional soft actor-critic (DSAC) algorithm to improve the policy performance by mitigating Q-value overestimations.
We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
arXiv Detail & Related papers (2020-01-09T02:27:18Z)