Stabilizing the Q-Gradient Field for Policy Smoothness in Actor-Critic
- URL: http://arxiv.org/abs/2601.22970v1
- Date: Fri, 30 Jan 2026 13:32:52 GMT
- Title: Stabilizing the Q-Gradient Field for Policy Smoothness in Actor-Critic
- Authors: Jeong Woon Lee, Kyoleen Kwak, Daeho Kim, Hyoseok Hwang
- Abstract summary: We argue that policy non-smoothness is governed by the differential geometry of the critic. We introduce PAVE, a critic-centric regularization framework. PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature.
- Score: 7.536387580547838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Policies learned via continuous actor-critic methods often exhibit erratic, high-frequency oscillations, making them unsuitable for physical deployment. Current approaches attempt to enforce smoothness by directly regularizing the policy's output. We argue that this approach treats the symptom rather than the cause. In this work, we theoretically establish that policy non-smoothness is fundamentally governed by the differential geometry of the critic. By applying implicit differentiation to the actor-critic objective, we prove that the sensitivity of the optimal policy is bounded by the ratio of the Q-function's mixed-partial derivative (noise sensitivity) to its action-space curvature (signal distinctness). To empirically validate this theoretical insight, we introduce PAVE (Policy-Aware Value-field Equalization), a critic-centric regularization framework that treats the critic as a scalar field and stabilizes its induced action-gradient field. PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature. Experimental results demonstrate that PAVE achieves smoothness and robustness comparable to policy-side smoothness regularization methods, while maintaining competitive task performance, without modifying the actor.
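As a reading aid, here is a hedged reconstruction of the implicit-differentiation bound the abstract describes; the exact norms and constants are assumptions, not taken from the paper:

```latex
% Greedy-actor first-order condition at state s:
%   \nabla_a Q(s, \pi^*(s)) = 0.
% Implicit differentiation with respect to s gives
%   \nabla_s \nabla_a Q + \nabla_a^2 Q \, \nabla_s \pi^*(s) = 0,
% so, wherever the critic is locally strictly concave in the action
% (-\nabla_a^2 Q \succ 0),
\[
  \nabla_s \pi^*(s) = -\bigl(\nabla_a^2 Q\bigr)^{-1} \nabla_s \nabla_a Q,
  \qquad
  \bigl\lVert \nabla_s \pi^*(s) \bigr\rVert
  \;\le\;
  \frac{\bigl\lVert \nabla_s \nabla_a Q(s, \pi^*(s)) \bigr\rVert}
       {\lambda_{\min}\!\bigl(-\nabla_a^2 Q(s, \pi^*(s))\bigr)} .
\]
```

And a minimal PyTorch-style sketch of a critic-side penalty in the spirit of PAVE, penalizing the volatility of the induced action-gradient field without touching the actor; the function names, perturbation scheme, and how curvature preservation is enforced are all illustrative assumptions:

```python
import torch

def q_gradient_volatility(critic, states, actions, sigma=0.1, n_samples=4):
    """Hypothetical PAVE-style penalty: measure how much the action-gradient
    of Q (the gradient field induced by the scalar critic) changes under
    small action perturbations. Intended to be added to the critic loss."""
    def q_grad(a):
        a = a.detach().requires_grad_(True)
        q = critic(states, a).sum()  # assumed critic signature: Q(s, a)
        return torch.autograd.grad(q, a, create_graph=True)[0]

    g_ref = q_grad(actions)                     # reference gradient field
    penalty = torch.zeros((), device=actions.device)
    for _ in range(n_samples):
        noise = sigma * torch.randn_like(actions)
        penalty = penalty + (q_grad(actions + noise) - g_ref).pow(2).mean()
    return penalty / n_samples
```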
Related papers
- Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
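For reference, a minimal sketch of a signal-to-noise ratio over a batch of per-sample policy gradients that such a framework could maximize; the exact definition SAGE uses is an assumption here:

```python
import torch

def update_snr(per_sample_grads: torch.Tensor) -> torch.Tensor:
    """SNR of a batch of per-sample policy gradients, shape [batch, n_params]:
    squared norm of the mean gradient (signal) over total variance (noise).
    A hedged illustration, not the paper's definition."""
    mean = per_sample_grads.mean(dim=0)
    var = per_sample_grads.var(dim=0, unbiased=True)
    return mean.norm() ** 2 / (var.sum() + 1e-8)
```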
arXiv Detail & Related papers (2026-02-01T12:56:10Z) - Q-learning with Adjoint Matching [58.78551025170267]
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm. QAM sidesteps two challenges by leveraging adjoint matching, a recently proposed technique in generative modeling. It consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.
arXiv Detail & Related papers (2026-01-20T18:45:34Z) - Stochastic Actor-Critic: Mitigating Overestimation via Temporal Aleatoric Uncertainty [0.0]
Off-policy actor-critic methods in reinforcement learning train a critic with temporal-difference updates and use it as a learning signal for the policy (actor). Current methods employ ensembling to quantify the critic's epistemic uncertainty (uncertainty due to limited data and model ambiguity) to scale pessimistic updates. In this work, we propose a new algorithm called Stochastic Actor-Critic (STAC) that incorporates temporal (one-step) aleatoric uncertainty, i.e., uncertainty arising from transitions, rewards, and policy-induced variability in the Bellman targets.
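A hedged sketch of what an aleatoric-uncertainty-scaled pessimistic Bellman target could look like; STAC's actual construction may differ, and all names below are illustrative (done-masking omitted for brevity):

```python
import torch

def pessimistic_td_target(rewards, next_q_samples, gamma=0.99, kappa=1.0):
    """Scale pessimism by the *aleatoric* spread of the one-step Bellman
    target rather than by ensemble (epistemic) disagreement.
    rewards: [batch]; next_q_samples: [n_action_samples, batch], Q-values
    under several next actions sampled from the current policy."""
    target_samples = rewards + gamma * next_q_samples   # broadcast over samples
    mean = target_samples.mean(dim=0)
    std = target_samples.std(dim=0, unbiased=True)
    return mean - kappa * std                           # uncertainty-scaled pessimism
```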
arXiv Detail & Related papers (2026-01-02T16:33:17Z) - Relative Entropy Pathwise Policy Optimization [66.03329137921949]
We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories. We show how to combine stochastic policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning.
arXiv Detail & Related papers (2025-07-15T06:24:07Z) - Accuracy of Discretely Sampled Stochastic Policies in Continuous-time Reinforcement Learning [3.973277434105709]
We rigorously analyze a policy execution framework that samples actions from a policy at discrete time points and implements them as piecewise constant controls. We prove that as the sampling mesh size tends to zero, the controlled state process converges weakly to the dynamics with coefficients aggregated according to the policy. Building on these results, we analyze the bias and variance of various policy gradient estimators based on discrete-time observations.
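A small illustration of the execution scheme being analyzed (sample on a coarse mesh, hold the action piecewise constant, integrate on a finer grid); the callables and step sizes are assumptions:

```python
import numpy as np

def execute_piecewise_constant(policy_sample, dynamics, x0, T=1.0, mesh=0.01, dt=1e-3):
    """Sample an action from the stochastic policy only at mesh points and
    hold it constant in between, integrating the dynamics with Euler steps."""
    x, t = np.asarray(x0, dtype=float), 0.0
    a = policy_sample(x, t)              # sample at the first mesh point
    next_resample = mesh
    while t < T:
        if t >= next_resample:           # resample only on the coarse mesh
            a = policy_sample(x, t)
            next_resample += mesh
        x = x + dt * dynamics(x, a)      # fine-grid Euler integration step
        t += dt
    return x
```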
arXiv Detail & Related papers (2025-03-13T02:35:23Z) - Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization [11.320660946946523]
Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. We show that performing on-policy reinforcement learning with an evidential critic provides both. We name the resulting algorithm Evidential Proximal Policy Optimization (EPPO) due to the integral role of evidential uncertainty quantification in both policy evaluation and policy improvement stages.
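For context, the standard Normal-Inverse-Gamma decomposition an evidential regression head provides; whether EPPO parameterizes its critic exactly this way is not stated in the summary:

```python
def evidential_value_uncertainty(gamma, nu, alpha, beta):
    """Given NIG parameters (gamma, nu, alpha, beta) with alpha > 1, return
    the predicted value plus the usual aleatoric/epistemic variance split
    from deep evidential regression. A hedged reference sketch only."""
    value = gamma
    aleatoric = beta / (alpha - 1.0)          # expected data noise
    epistemic = beta / (nu * (alpha - 1.0))   # uncertainty about the mean
    return value, aleatoric, epistemic
```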
arXiv Detail & Related papers (2025-03-03T12:23:07Z) - Unlearning-based Neural Interpretations [51.99182464831169]
We show that current baselines defined using static functions are biased, fragile and manipulable. We propose UNI to compute an (un)learnable, debiased and adaptive baseline by perturbing the input towards an unlearning direction of steepest ascent.
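A hedged sketch of computing such a baseline by ascending the loss gradient away from the input; the sign-step rule and hyperparameters are illustrative assumptions, not UNI's exact procedure:

```python
import torch

def unlearning_baseline(model, x, target_class, steps=20, lr=0.05):
    """Perturb the input along the direction of steepest *ascent* of the
    loss (away from what the model has learned) to obtain an adaptive
    attribution baseline. `model` takes a batched input and returns logits."""
    baseline = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(
            model(baseline.unsqueeze(0)),
            torch.tensor([target_class]),
        )
        (grad,) = torch.autograd.grad(loss, baseline)
        baseline = (baseline + lr * grad.sign()).detach().requires_grad_(True)
    return baseline.detach()
```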
arXiv Detail & Related papers (2024-10-10T16:02:39Z) - Mollification Effects of Policy Gradient Methods [16.617678267301702]
We develop a rigorous framework for understanding how policy gradient methods mollify non-smooth optimization landscapes.
We demonstrate the equivalence between policy gradient methods and solving backward heat equations.
We make the connection between this limitation and the uncertainty principle in harmonic analysis to understand the effects of exploration with stochastic policies in RL.
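One standard way to write the mollification and its heat-equation reading, as a hedged aid; the paper's precise formulation (e.g., smoothing over actions rather than parameters) may differ:

```latex
% Gaussian smoothing of the RL objective J with kernel width \sigma:
\[
  J_\sigma(\theta) = (J * G_\sigma)(\theta)
                   = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,\,\sigma^2 I)}
                     \bigl[ J(\theta + \epsilon) \bigr].
\]
% Setting u(\theta, t) = J_{\sqrt{2t}}(\theta) makes u a solution of the heat
% equation \partial_t u = \Delta_\theta u with initial data u(\cdot, 0) = J;
% recovering the original landscape J from its smoothed version is the
% ill-posed *backward* heat equation the summary alludes to.
```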
arXiv Detail & Related papers (2024-05-28T05:05:33Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Bounded Robustness in Reinforcement Learning via Lexicographic Objectives [54.00072722686121]
Policy robustness in Reinforcement Learning may not be desirable at any cost.
We study how policies can be maximally robust to arbitrary observational noise.
We propose a robustness-inducing scheme, applicable to any policy algorithm, that trades off expected policy utility for robustness.
arXiv Detail & Related papers (2022-09-30T08:53:18Z) - A comment on stabilizing reinforcement learning [0.0]
We argue that Vamvoudakis et al. made a fallacious assumption on the Hamiltonian under a generic policy.
We show a neural-network convergence result in a setting with continuous weights and time, provided certain conditions on the behavior policy hold.
arXiv Detail & Related papers (2021-11-24T07:58:14Z) - Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
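A hedged sketch of how a kernelized doubly robust estimator can stand in for the nonexistent density ratio; the paper's estimators may differ in exactly how the kernelization enters:

```latex
% For a deterministic target policy \pi and behavior density q(a|s), the
% ratio \pi(a|s)/q(a|s) is undefined, so a kernel K_h(u) = h^{-d} K(u/h)
% localizes the behavior data around the target action \pi(S_i):
\[
  \hat{V}_{\mathrm{DR}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \Bigl[
      \hat{Q}\bigl(S_i, \pi(S_i)\bigr)
      + \frac{K_h\bigl(A_i - \pi(S_i)\bigr)}{\hat{q}(A_i \mid S_i)}
        \bigl( Y_i - \hat{Q}(S_i, A_i) \bigr)
    \Bigr],
\]
% with Y_i the observed outcome (return, or reward plus bootstrapped value);
% the bias vanishes as the bandwidth h shrinks, at the cost of variance.
```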
arXiv Detail & Related papers (2020-06-06T15:52:05Z)