Stochastic Actor-Critic: Mitigating Overestimation via Temporal Aleatoric Uncertainty
- URL: http://arxiv.org/abs/2601.00737v1
- Date: Fri, 02 Jan 2026 16:33:17 GMT
- Title: Stochastic Actor-Critic: Mitigating Overestimation via Temporal Aleatoric Uncertainty
- Authors: Uğurcan Özalp,
- Abstract summary: Off-policy actor-critic methods in reinforcement learning train a critic with temporal-difference updates and use it as a learning signal for the policy (actor)<n>Current methods employ ensembling to quantify the critic's uncertainty-uncertainty due to limited data and model ambiguity-to scale pessimistic updates.<n>In this work, we propose a new algorithm called Actor-C (STAC) that incorporates temporal (one) aleatoric uncertainty-uncertainty arising from transitions, rewards, and policy-induced variability in Bellman.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy actor-critic methods in reinforcement learning train a critic with temporal-difference updates and use it as a learning signal for the policy (actor). This design typically achieves higher sample efficiency than purely on-policy methods. However, critic networks tend to overestimate value estimates systematically. This is often addressed by introducing a pessimistic bias based on uncertainty estimates. Current methods employ ensembling to quantify the critic's epistemic uncertainty-uncertainty due to limited data and model ambiguity-to scale pessimistic updates. In this work, we propose a new algorithm called Stochastic Actor-Critic (STAC) that incorporates temporal (one-step) aleatoric uncertainty-uncertainty arising from stochastic transitions, rewards, and policy-induced variability in Bellman targets-to scale pessimistic bias in temporal-difference updates, rather than relying on epistemic uncertainty. STAC uses a single distributional critic network to model the temporal return uncertainty, and applies dropout to both the critic and actor networks for regularization. Our results show that pessimism based on a distributional critic alone suffices to mitigate overestimation, and naturally leads to risk-averse behavior in stochastic environments. Introducing dropout further improves training stability and performance by means of regularization. With this design, STAC achieves improved computational efficiency using a single distributional critic network.
Related papers
- Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates.<n>SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence.<n> Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
arXiv Detail & Related papers (2026-02-01T12:56:10Z) - Stabilizing the Q-Gradient Field for Policy Smoothness in Actor-Critic [7.536387580547838]
We argue that policy non-smoothness is governed by the differential geometry of the critic.<n>We introduce PAVE, a critic-centric regularization framework.<n>PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature.
arXiv Detail & Related papers (2026-01-30T13:32:52Z) - Mitigating Estimation Bias with Representation Learning in TD Error-Driven Regularization [4.784045060345404]
This work introduces enhanced methods to achieve flexible bias control and stronger representation learning.<n>We propose three convex combination strategies, symmetric and asymmetric, that balance pessimistic estimates to mitigate overestimation and optimistic exploration via double actors.<n>To further improve performance, we integrate augmented state and action representations into the actor and critic networks.
arXiv Detail & Related papers (2025-11-20T06:31:55Z) - TULiP: Test-time Uncertainty Estimation via Linearization and Weight Perturbation [11.334867025651233]
We propose TULiP, a theoretically-driven uncertainty estimator for OOD detection.<n>Our approach considers a hypothetical perturbation applied to the network before convergence.<n>Our method exhibits state-of-the-art performance, particularly for near-distribution samples.
arXiv Detail & Related papers (2025-05-22T17:16:41Z) - Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning [12.490614705930676]
We present a theoretical result demonstrating the strong dependency of suboptimality on the number of Monte Carlo samples taken per Bellman target calculation.<n>Our main contribution is a deterministic approximation to the Bellman target that uses progressive moment matching.<n>We show that it is possible to provide tighter guarantees for the suboptimality of MOMBO than the existing Monte Carlo sampling approaches.
arXiv Detail & Related papers (2024-06-06T13:58:41Z) - Relaxed Quantile Regression: Prediction Intervals for Asymmetric Noise [51.87307904567702]
Quantile regression is a leading approach for obtaining such intervals via the empirical estimation of quantiles in the distribution of outputs.<n>We propose Relaxed Quantile Regression (RQR), a direct alternative to quantile regression based interval construction that removes this arbitrary constraint.<n>We demonstrate that this added flexibility results in intervals with an improvement in desirable qualities.
arXiv Detail & Related papers (2024-06-05T13:36:38Z) - A Case for Validation Buffer in Pessimistic Actor-Critic [1.5022206231191775]
We show that the critic approximation error can be approximated via a fixed-point model similar to that of the Bellman value.
We propose Validation Pessimism Learning (VPL) algorithm to retrieve the conditions under which the pessimistic critic is unbiased.
VPL uses a small validation buffer to adjust the levels of pessimism throughout the agent training, with the pessimism set such that the approximation error of the critic targets is minimized.
arXiv Detail & Related papers (2024-03-01T22:24:11Z) - Toward Reliable Human Pose Forecasting with Uncertainty [51.628234388046195]
We develop an open-source library for human pose forecasting, including multiple models, supporting several datasets.
We devise two types of uncertainty in the problem to increase performance and convey better trust.
arXiv Detail & Related papers (2023-04-13T17:56:08Z) - The Implicit Delta Method [61.36121543728134]
In this paper, we propose an alternative, the implicit delta method, which works by infinitesimally regularizing the training loss of uncertainty.
We show that the change in the evaluation due to regularization is consistent for the variance of the evaluation estimator, even when the infinitesimal change is approximated by a finite difference.
arXiv Detail & Related papers (2022-11-11T19:34:17Z) - Some Supervision Required: Incorporating Oracle Policies in
Reinforcement Learning via Epistemic Uncertainty Metrics [2.56865487804497]
Critic Confidence Guided Exploration takes in the policy's actions as suggestions and incorporates this information into the learning scheme.
We show that CCGE is able to perform competitively against adjacent algorithms that also leverage an oracle policy.
arXiv Detail & Related papers (2022-08-22T18:26:43Z) - Learning Pessimism for Robust and Efficient Off-Policy Reinforcement
Learning [0.0]
Off-policy deep reinforcement learning algorithms compensate for overestimation bias during temporal-difference learning.
In this work, we propose a novel learnable penalty to enact such pessimism.
We also propose to learn the penalty alongside the critic with dual TD-learning, a strategy to estimate and minimize the bias magnitude in the target returns.
arXiv Detail & Related papers (2021-10-07T12:13:19Z) - Quantifying Uncertainty in Deep Spatiotemporal Forecasting [67.77102283276409]
We describe two types of forecasting problems: regular grid-based and graph-based.
We analyze UQ methods from both the Bayesian and the frequentist point view, casting in a unified framework via statistical decision theory.
Through extensive experiments on real-world road network traffic, epidemics, and air quality forecasting tasks, we reveal the statistical computational trade-offs for different UQ methods.
arXiv Detail & Related papers (2021-05-25T14:35:46Z) - Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z) - Evaluating probabilistic classifiers: Reliability diagrams and score
decompositions revisited [68.8204255655161]
We introduce the CORP approach, which generates provably statistically Consistent, Optimally binned, and Reproducible reliability diagrams in an automated way.
Corpor is based on non-parametric isotonic regression and implemented via the Pool-adjacent-violators (PAV) algorithm.
arXiv Detail & Related papers (2020-08-07T08:22:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.