The Role of Target Update Frequencies in Q-Learning
- URL: http://arxiv.org/abs/2602.03911v1
- Date: Tue, 03 Feb 2026 15:19:20 GMT
- Title: The Role of Target Update Frequencies in Q-Learning
- Authors: Simon Weissmann, Tilman Aach, Benedikt Wille, Sebastian Kassing, Leif Döring
- Abstract summary: The target network update frequency (TUF) is a central stabilization mechanism in (deep) Q-learning. We formulate periodic target updates as a nested optimization scheme in which each outer iteration applies an inexact Bellman optimality operator. We show that the optimal target update frequency increases geometrically over the course of the learning process.
- Score: 4.76285598583384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The target network update frequency (TUF) is a central stabilization mechanism in (deep) Q-learning. However, its selection remains poorly understood and is often treated merely as another tunable hyperparameter rather than as a principled design decision. This work provides a theoretical analysis of target fixing in tabular Q-learning through the lens of approximate dynamic programming. We formulate periodic target updates as a nested optimization scheme in which each outer iteration applies an inexact Bellman optimality operator, approximated by a generic inner loop optimizer. Rigorous theory yields a finite-time convergence analysis for the asynchronous sampling setting, specializing to stochastic gradient descent in the inner loop. Our results deliver an explicit characterization of the bias-variance trade-off induced by the target update period, showing how to optimally set this critical hyperparameter. We prove that constant target update schedules are suboptimal, incurring a logarithmic overhead in sample complexity that is entirely avoidable with adaptive schedules. Our analysis shows that the optimal target update frequency increases geometrically over the course of the learning process.
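The nested scheme is easy to picture in code. Below is a minimal sketch in the tabular setting the paper analyzes: the inner loop takes SGD-style steps toward Bellman backups computed from a frozen target table, and each outer iteration refreshes the target and grows the update period geometrically. The synthetic random MDP and the schedule constants (`k0`, `growth`) are illustrative stand-ins, not the paper's derived values.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random tabular MDP (illustrative stand-in for the paper's setting).
S, A, gamma = 20, 4, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.uniform(0, 1, size=(S, A))           # deterministic rewards

def q_learning(num_steps=200_000, lr=0.1, k0=50, growth=2.0, eps=0.1):
    """Tabular Q-learning with periodic target fixing.

    Inner loop: asynchronous SGD-style updates toward the Bellman backup
    computed from a frozen target table Q_bar. Outer loop: every k steps,
    Q_bar is refreshed and the period k grows geometrically, mimicking the
    paper's adaptive schedule (k0 and growth are illustrative constants).
    """
    Q = np.zeros((S, A))
    Q_bar = Q.copy()                  # frozen target table
    k, since = k0, 0
    s = 0
    for _ in range(num_steps):
        # epsilon-greedy behavior policy (asynchronous sampling)
        a = rng.integers(A) if rng.random() < eps else int(Q[s].argmax())
        s_next = rng.choice(S, p=P[s, a])
        # One inner SGD step on the fixed-target Bellman residual.
        target = R[s, a] + gamma * Q_bar[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])
        s = s_next
        since += 1
        if since >= k:                # outer iteration: inexact Bellman operator applied
            Q_bar = Q.copy()
            k = int(k * growth)       # geometrically growing target update period
            since = 0
    return Q
```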
Related papers
- Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
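As a rough illustration of what "maximizing the Signal-to-Noise Ratio of policy updates" could mean operationally, the sketch below scores candidate pairs by a per-sample gradient SNR proxy and keeps the top-k. The scoring rule and the name `snr_select` are assumptions for illustration; SAGE's actual criterion and curriculum mechanism are not specified in the snippet.

```python
import numpy as np

def snr_select(per_sample_grads: np.ndarray, k: int) -> np.ndarray:
    """Pick the k candidates contributing the highest-SNR update direction.

    per_sample_grads: (N, D) array, one flattened gradient per candidate pair.
    Proxy score: alignment with the mean update direction, normalized by
    the coordinate-wise noise scale around that mean.
    """
    mean = per_sample_grads.mean(axis=0)
    std = per_sample_grads.std(axis=0) + 1e-8
    scores = (per_sample_grads / std) @ (mean / (np.linalg.norm(mean) + 1e-8))
    return np.argsort(scores)[-k:]   # indices of the k highest-scoring candidates
```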
arXiv Detail & Related papers (2026-02-01T12:56:10Z)
- FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer [30.184978506988767]
We introduce FISMO, which incorporates anisotropic geometry information through Fisher information geometry. FISMO achieves superior efficiency and final performance compared to established baselines.
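The two named ingredients, Fisher-structured preconditioning and momentum orthogonalization, can be combined in a simple update like the one below. This is a hedged sketch using a diagonal empirical-Fisher proxy and SVD-based orthogonalization, not FISMO's actual construction.

```python
import numpy as np

def fismo_like_step(W, G, state, lr=3e-4, beta=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update on a weight matrix W with gradient G:
    precondition the momentum by a diagonal empirical-Fisher estimate
    (EMA of squared gradients), then orthogonalize the result."""
    state["m"] = beta * state["m"] + (1 - beta) * G          # momentum
    state["f"] = beta2 * state["f"] + (1 - beta2) * G * G    # diagonal Fisher proxy
    M = state["m"] / np.sqrt(state["f"] + eps)               # preconditioned momentum
    U, _, Vt = np.linalg.svd(M, full_matrices=False)         # polar factor U V^T
    return W - lr * (U @ Vt)
```

Here `state` starts as `{"m": np.zeros_like(W), "f": np.zeros_like(W)}`.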
arXiv Detail & Related papers (2026-01-29T14:05:04Z)
- Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment [61.80228667422234]
VGPO redefines value estimation across both temporal and group dimensions. It transforms the sparse terminal reward into dense, process-aware value estimates. It replaces standard group normalization with a novel process-level scheme based on absolute values to maintain a stable optimization signal.
arXiv Detail & Related papers (2025-12-13T16:31:26Z)
- Reliable Optimization Under Noise in Quantum Variational Algorithms [0.05219568203653522]
We show that the Variational Quantum Eigensolver is severely challenged by finite-shot sampling noise. We identify adaptive metaheuristics as the most effective and resilient strategies.
arXiv Detail & Related papers (2025-11-11T14:21:43Z)
- On the Optimal Construction of Unbiased Gradient Estimators for Zeroth-Order Optimization [57.179679246370114]
A potential limitation of existing methods is the bias inherent in most perturbation-based estimators, which vanishes only as the smoothing stepsize goes to zero. We propose a novel family of unbiased gradient scaling estimators that eliminate this bias while maintaining a favorable construction.
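For context, the classical Gaussian-smoothing two-point estimator below illustrates the stepsize-dependent bias being addressed: it is unbiased only for the smoothed objective, so its bias with respect to the true gradient shrinks with the smoothing stepsize `mu`. The paper's unbiased estimators are constructed differently; this is background, not their method.

```python
import numpy as np

def two_point_estimator(f, x, mu=1e-2, num_dirs=32, rng=None):
    """Gaussian-smoothing two-point gradient estimator.

    Unbiased for the smoothed function f_mu(x) = E[f(x + mu*u)], hence
    biased for grad f itself unless mu -> 0.
    """
    rng = rng or np.random.default_rng()
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(num_dirs):
        u = rng.standard_normal(d)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / num_dirs

# Quick check on a quadratic, where the true gradient is 2x:
x = np.ones(4)
print(two_point_estimator(lambda z: z @ z, x, mu=1e-3, num_dirs=4096))
```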
arXiv Detail & Related papers (2025-10-22T18:25:43Z)
- Beyond the Ideal: Analyzing the Inexact Muon Update [54.70108543057578]
We give the first analysis of the inexact update at Muon's core. We reveal a fundamental coupling between this inexactness and the optimal step size and momentum.
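Muon's core update orthogonalizes the momentum matrix with a few quintic Newton-Schulz iterations, and truncating that iteration is precisely the inexactness at issue. The sketch below follows the coefficients of the public Muon implementation; the step count is the knob trading accuracy for cost.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=5):
    """Approximately map M to its nearest orthogonal factor U V^T via the
    quintic Newton-Schulz iteration used in Muon. Fewer steps give a
    cheaper but less orthogonal (more 'inexact') update."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (np.linalg.norm(M) + 1e-7)    # scale so singular values lie in (0, 1]
    transposed = M.shape[0] > M.shape[1]
    if transposed:                         # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```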
arXiv Detail & Related papers (2025-10-22T18:01:07Z)
- Natural Spectral Fusion: p-Exponent Cyclic Scheduling and Early Decision-Boundary Alignment in First-Order Optimization [11.323131201168572]
We propose Natural Spectral Fusion (NSF): reframing training as controllable spectral coverage and information fusion. NSF has two core principles: treating the optimizer as a spectral controller that dynamically balances low- and high-frequency information, and cyclically scheduling the p-exponent. We show that cyclic scheduling consistently reduces test error and exhibits distinct convergence behavior.
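A p-exponent cyclic schedule can be as simple as a cosine sweep between two exponents, as in the sketch below. The range, period, and which quantity p exponentiates (e.g., a gradient-norm power) are illustrative assumptions, since the snippet does not specify them.

```python
import math

def cyclic_p(step, period=1000, p_min=1.0, p_max=3.0):
    """Cosine-cyclic schedule for an exponent-type hyperparameter p:
    sweeps smoothly from p_min up to p_max and back once per period."""
    phase = (step % period) / period
    return p_min + 0.5 * (p_max - p_min) * (1 - math.cos(2 * math.pi * phase))
```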
arXiv Detail & Related papers (2025-09-05T00:00:00Z)
- Understanding Optimization in Deep Learning with Central Flows [95.5647720254338]
We develop a theory that can describe the dynamics of optimization in a complex regime. Our results suggest that central flows can be a valuable theoretical tool for reasoning about optimization in deep learning.
arXiv Detail & Related papers (2024-10-31T17:58:13Z)
- Fine-Tuning Adaptive Stochastic Optimizers: Determining the Optimal Hyperparameter $ε$ via Gradient Magnitude Histogram Analysis [0.7366405857677226]
We introduce a new framework based on the empirical probability density function of the loss gradient's magnitude, termed the "gradient magnitude histogram".
We propose a novel algorithm using gradient magnitude histograms to automatically estimate a refined and accurate search space for the optimal safeguard hyperparameter $ε$.
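A minimal sketch of the idea: collect gradient magnitudes during a short warm-up run, build a log-scale histogram, and bracket the search space for the safeguard $ε$ around the low tail of observed magnitudes. The percentile choices below are illustrative, not the authors' rule.

```python
import numpy as np

def epsilon_search_space(grad_samples, bins=100):
    """Estimate a search range for an Adam-style epsilon from a gradient
    magnitude histogram.

    grad_samples: array of |g| values collected during a short warm-up run.
    """
    mags = np.abs(np.asarray(grad_samples)).ravel()
    mags = mags[mags > 0]
    hist, edges = np.histogram(np.log10(mags), bins=bins)   # log-scale histogram
    # Bracket epsilon around the low tail of observed gradient magnitudes:
    lo, hi = np.percentile(mags, [0.1, 10.0])
    return lo, hi, (hist, edges)
```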
arXiv Detail & Related papers (2023-11-20T04:34:19Z)
- Gaussian Process Bandit Optimization of the Thermodynamic Variational Objective [36.062939523856066]
This paper introduces a bespoke Gaussian process bandit optimization method for automatically choosing sorted discretization points.
We provide theoretical guarantees that our bandit optimization converges to the regret-minimizing choice of integration points.
Empirical validation of our algorithm is provided in terms of improved learning and inference in Variational Autoencoders and Sigmoid Belief Networks.
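A generic version of the bandit loop is sketched below: fit a one-dimensional Gaussian process to objective evaluations at tried integration points and pick the next point by upper confidence bound. The RBF kernel, UCB constant, and the stand-in objective `f` are assumptions; the paper's kernel and acquisition details may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, ls=0.1):
    # Squared-exponential kernel between 1-D point sets a and b.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_ucb_pick(x_obs, y_obs, grid, beta_ucb=2.0, noise=1e-4):
    """GP posterior mean/variance on observed points, then the UCB-maximizing
    candidate from the grid."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(grid, x_obs)
    alpha = np.linalg.solve(K, y_obs)
    mean = Ks @ alpha
    var = 1.0 - np.einsum('ij,ij->i', Ks @ np.linalg.inv(K), Ks)
    return grid[np.argmax(mean + beta_ucb * np.sqrt(np.maximum(var, 0)))]

# Toy loop: treat the objective at one integration point as a noisy black box.
f = lambda b: -(b - 0.3) ** 2 + 0.01 * rng.standard_normal()   # stand-in objective
grid = np.linspace(0, 1, 101)
xs, ys = [0.5], [f(0.5)]
for _ in range(20):
    x_next = gp_ucb_pick(np.array(xs), np.array(ys), grid)
    xs.append(float(x_next)); ys.append(f(x_next))
print(f"best point ~ {xs[int(np.argmax(ys))]:.2f}")
```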
arXiv Detail & Related papers (2020-10-29T16:57:27Z)
- Stochastic batch size for adaptive regularization in deep network optimization [63.68104397173262]
We propose a first-order optimization algorithm incorporating adaptive regularization, applicable to machine learning problems in the deep learning framework.
We empirically demonstrate the effectiveness of our algorithm using an image classification task based on conventional network models applied to commonly used benchmark datasets.
arXiv Detail & Related papers (2020-04-14T07:54:53Z)