Related papers: SACn: Soft Actor-Critic with n-step Returns

SACn: Soft Actor-Critic with n-step Returns

URL: http://arxiv.org/abs/2512.13165v1
Date: Mon, 15 Dec 2025 10:23:13 GMT
Title: SACn: Soft Actor-Critic with n-step Returns
Authors: Jakub Łyskawa, Jakub Lewandowski, Paweł Wawrzyński,
Abstract summary: Soft Actor-Critic (SAC) is one of the most relevant off-policy online model-free reinforcement learning (RL) methods.<n>SAC is notoriously difficult to combine with n-step returns, since their usual combination introduces bias in off-policy algorithms.<n>In this work, we combine SAC with n-step returns in a way that overcomes this issue.
Score: 3.305353787222645
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Soft Actor-Critic (SAC) is widely used in practical applications and is now one of the most relevant off-policy online model-free reinforcement learning (RL) methods. The technique of n-step returns is known to increase the convergence speed of RL algorithms compared to their 1-step returns-based versions. However, SAC is notoriously difficult to combine with n-step returns, since their usual combination introduces bias in off-policy algorithms due to the changes in action distribution. While this problem is solved by importance sampling, a method for estimating expected values of one distribution using samples from another distribution, importance sampling may result in numerical instability. In this work, we combine SAC with n-step returns in a way that overcomes this issue. We present an approach to applying numerically stable importance sampling with simplified hyperparameter selection. Furthermore, we analyze the entropy estimation approach of Soft Actor-Critic in the context of the n-step maximum entropy framework and formulate the $τ$-sampled entropy estimation to reduce the variance of the learning target. Finally, we formulate the Soft Actor-Critic with n-step returns (SAC$n$) algorithm that we experimentally verify on MuJoCo simulated environments.

Related papers

A Variance-Reduced Cubic-Regularized Newton for Policy Optimization [6.52142708235708]
Existing second-order methods often suffer from suboptimal sample complexity or unrealistic assumptions about importance sampling.<n>To overcome these limitations, we propose VR-CR-PN, a variance-regularized Newton-reduced estimator.<n>As an additional contribution, we introduce a novel horizon for the expected return function, allowing the algorithm to achieve a uniform sample complexity.
arXiv Detail & Related papers (2025-07-14T10:04:02Z)
Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples. However, IS is employed in RL as a passive tool for re-weighting historical samples. We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
Efficiently Escaping Saddle Points for Policy Optimization [43.636107996849375]
Policy gradient (PG) is widely used in reinforcement learning due to its scalability and good performance.<n>We propose a variance-reduced second-order method that uses second-order information in the form of Hessian vector products (HVP) and converges to an approximate second-order stationary point (SOSP) with sample complexity of $tildeO(epsilon-3)$.
arXiv Detail & Related papers (2023-11-15T12:36:45Z)
Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement. For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts. We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z)
Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation. Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective. We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence. We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment. Policy gradients for local search are often obtained from random perturbations. We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
Sparse Bayesian Learning via Stepwise Regression [1.2691047660244335]
We propose a coordinate ascent algorithm for SBL termed Relevance Matching Pursuit (RMP) As its noise variance parameter goes to zero, RMP exhibits a surprising connection to Stepwise Regression. We derive novel guarantees for Stepwise Regression algorithms, which also shed light on RMP.
arXiv Detail & Related papers (2021-06-11T00:20:27Z)
Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model. The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
Incremental Without Replacement Sampling in Nonconvex Optimization [0.0]
Minibatch decomposition methods for empirical risk are commonly analysed in an approximation setting, also known as sampling with replacement. On the other hands modern implementations of such techniques are incremental: they rely on sampling without replacement, for which available analysis are much scarcer. We provide convergence guaranties for the latter variant by analysing a versatile incremental gradient scheme.
arXiv Detail & Related papers (2020-07-15T09:17:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.