DSAC-C: Constrained Maximum Entropy for Robust Discrete Soft-Actor Critic
- URL: http://arxiv.org/abs/2310.17173v2
- Date: Mon, 23 Jun 2025 02:45:04 GMT
- Title: DSAC-C: Constrained Maximum Entropy for Robust Discrete Soft-Actor Critic
- Authors: Dexter Neo, Tsuhan Chen
- Abstract summary: We present a novel extension to the family of Soft Actor-Critic (SAC) algorithms. We argue that, based on the Maximum Entropy Principle, discrete SAC can be further improved via additional statistical constraints derived from a surrogate critic policy.
- Score: 0.20718016474717196
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present a novel extension to the family of Soft Actor-Critic (SAC) algorithms. We argue that, based on the Maximum Entropy Principle, discrete SAC can be further improved via additional statistical constraints derived from a surrogate critic policy. Furthermore, our findings suggest that these constraints provide added robustness against potential domain shifts, which is essential for the safe deployment of reinforcement learning agents in the real world. We provide theoretical analysis and show empirical results in low-data regimes for both in-distribution and out-of-distribution variants of Atari 2600 games.
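The abstract does not spell out the exact form of the constraint, but a minimal sketch of a discrete SAC policy loss augmented with a constraint toward a critic-derived surrogate policy (here a Boltzmann distribution over Q-values) might look like the following; the function names, the KL form of the penalty, and the weight `beta` are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def discrete_sac_policy_loss(logits, q_values, alpha=0.2, beta=0.1):
    """Discrete SAC actor loss plus a hypothetical statistical constraint.

    logits:   (batch, n_actions) actor logits
    q_values: (batch, n_actions) critic estimates Q(s, a)
    alpha:    entropy temperature (standard SAC)
    beta:     weight of the assumed constraint term (not from the paper)
    """
    pi = softmax(logits)                  # actor policy pi(a|s)
    log_pi = np.log(pi + 1e-8)

    # Standard discrete SAC actor objective: E_pi[alpha * log pi - Q]
    sac_loss = np.mean(np.sum(pi * (alpha * log_pi - q_values), axis=-1))

    # Surrogate "critic policy": Boltzmann distribution over the Q-values.
    critic_pi = softmax(q_values / max(alpha, 1e-8))

    # Assumed constraint: keep pi statistically close to the critic policy (KL).
    constraint = np.mean(np.sum(pi * (log_pi - np.log(critic_pi + 1e-8)), axis=-1))

    return sac_loss + beta * constraint

# toy usage with random logits and Q-values
rng = np.random.default_rng(0)
print(discrete_sac_policy_loss(rng.normal(size=(4, 6)), rng.normal(size=(4, 6))))
```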
Related papers
- DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty [21.542065840791683]
Deep reinforcement learning (RL) has achieved significant success, yet its application in real-world scenarios is often hindered by a lack of robustness to environmental uncertainties. We propose Distributionally Robust Soft Actor-Critic (DR-SAC), a novel algorithm designed to enhance the robustness of the state-of-the-art Soft Actor-Critic (SAC) algorithm.
arXiv Detail & Related papers (2025-06-14T20:36:44Z)
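The summary does not state DR-SAC's exact formulation. A common way to make a value backup distributionally robust over a KL ambiguity set around the nominal next-state distribution is the dual form inf_{KL(P||P0)<=delta} E_P[V] = sup_{rho>0} { -rho log E_{P0}[exp(-V/rho)] - rho*delta }. The sketch below estimates this worst-case value from samples; the function name and the grid search over rho are assumptions, not necessarily DR-SAC's update.

```python
import numpy as np

def kl_robust_value(next_values, delta=0.1, rho_grid=np.logspace(-2, 2, 200)):
    """Worst-case expected next-state value over a KL ball of radius delta
    around the empirical next-state distribution, via the standard dual
        inf_{KL(P||P0)<=delta} E_P[V] = max_{rho>0} -rho*log E_{P0}[exp(-V/rho)] - rho*delta.
    A generic robustification sketch, not necessarily DR-SAC's exact backup.
    """
    v = np.asarray(next_values, dtype=float)
    best = -np.inf
    for rho in rho_grid:
        z = -v / rho
        # log-mean-exp computed stably
        lme = np.log(np.mean(np.exp(z - z.max()))) + z.max()
        best = max(best, -rho * lme - rho * delta)
    return best

# toy usage: the robust estimate is never above the plain sample mean
samples = np.random.default_rng(1).normal(loc=1.0, scale=0.5, size=256)
print(kl_robust_value(samples), samples.mean())
```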
- Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.
Current approaches typically address this issue through online sampling from the target policy.
We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z)
- Tighter Performance Theory of FedExProx [85.92481138826949]
We revisit FedExProx - a recently proposed distributed optimization method designed to enhance convergence properties of parallel algorithms via extrapolation.
We develop a novel analysis framework, establishing a tighter linear convergence rate for non-strongly convex quadratic problems.
We extend the applicability of our analysis to general functions satisfying the Polyak-Lojasiewicz condition, outperforming the previous strongly convex analysis.
arXiv Detail & Related papers (2024-10-20T11:53:25Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
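As a rough illustration of an uncertainty Bellman equation of the form U = u + gamma^2 * P^pi U, here is a tabular fixed-point iteration; QU-SAC itself is model-based and deep, so this is only a schematic stand-in with assumed names.

```python
import numpy as np

def uncertainty_bellman_iteration(local_uncertainty, transition, policy,
                                  gamma=0.99, n_iters=200):
    """Tabular fixed-point iteration for an uncertainty Bellman equation (UBE).

    local_uncertainty: (S, A) per-step epistemic uncertainty u(s, a)
    transition:        (S, A, S) transition probabilities P(s' | s, a)
    policy:            (S, A) action probabilities pi(a | s)
    Returns U(s, a), which propagates the local uncertainty through the dynamics.
    """
    U = np.zeros_like(local_uncertainty)
    for _ in range(n_iters):
        # expected next-state uncertainty under the policy
        next_u = np.einsum("sap,pb,pb->sa", transition, policy, U)
        U = local_uncertainty + gamma**2 * next_u
    return U

# toy usage: 3 states, 2 actions, uniform dynamics and policy
S, A = 3, 2
T = np.full((S, A, S), 1.0 / S)
pi = np.full((S, A), 1.0 / A)
u = np.ones((S, A))
print(uncertainty_bellman_iteration(u, T, pi, gamma=0.9))
```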
- Risk-sensitive Markov Decision Process and Learning under General Utility Functions [3.069335774032178]
Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations.
We consider a scenario where the decision-maker seeks to optimize a general utility function of the cumulative reward in the framework of a Markov decision process (MDP).
We propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative reward.
In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy.
arXiv Detail & Related papers (2023-11-22T18:50:06Z)
- SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization [1.3597551064547502]
This study introduces a novel safe reinforcement learning algorithm, Safety Critic Policy Optimization (SCPO).
In this study, we define the safety critic, a mechanism that nullifies rewards obtained through violating safety constraints.
Our theoretical analysis indicates that the proposed algorithm can automatically balance the trade-off between adhering to safety constraints and maximizing rewards.
arXiv Detail & Related papers (2023-11-01T22:12:50Z)
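One schematic reading of a safety critic that "nullifies rewards obtained through violating safety constraints" is a simple reward mask driven by the critic's violation estimate; the threshold rule and names below are assumptions, not the paper's exact mechanism.

```python
def shape_reward(reward, violation_prob, threshold=0.5):
    """Pass the environment reward through unless the safety critic predicts a
    likely constraint violation, in which case the reward is nullified.
    Illustrative sketch only; SCPO's actual safety critic may differ.
    """
    return 0.0 if violation_prob >= threshold else reward

# toy usage
print(shape_reward(1.0, violation_prob=0.1))  # 1.0: a safe step keeps its reward
print(shape_reward(1.0, violation_prob=0.9))  # 0.0: a violating step is nullified
```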
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
- Probabilistic Constraint for Safety-Critical Reinforcement Learning [13.502008069967552]
We consider the problem of learning safe policies for probabilistic-constrained reinforcement learning (RL).
We provide an improved gradient estimator, SPG-Actor-Critic, which has lower variance than SPG-REINFORCE.
We propose a Safe Primal-Dual algorithm that can leverage both SPGs to learn safe policies.
arXiv Detail & Related papers (2023-06-29T19:41:56Z)
- PACER: A Fully Push-forward-based Distributional Reinforcement Learning Algorithm [28.48626438603237]
PACER consists of a distributional critic, an actor and a sample-based encourager.
The push-forward operator is leveraged in both the critic and the actor to model the return distributions and the policies, respectively.
A sample-based utility value policy gradient is established for the push-forward policy update.
arXiv Detail & Related papers (2023-06-11T09:45:31Z)
- Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and replaces the trust-region constraint with a clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
arXiv Detail & Related papers (2022-05-24T06:15:51Z)
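A minimal sketch of a penalized clipped surrogate in the spirit of P3O: the cost constraint becomes an exact-penalty term added to the usual PPO clipped objective. The precise surrogate P3O uses may differ; all names below are illustrative.

```python
import numpy as np

def p3o_style_loss(ratio, adv_r, adv_c, cost_estimate, cost_limit,
                   kappa=10.0, clip_eps=0.2):
    """Penalized clipped surrogate (P3O-style sketch; exact form may differ).

    ratio:         pi_new(a|s) / pi_old(a|s) for sampled state-action pairs
    adv_r, adv_c:  reward and cost advantages
    cost_estimate: estimated expected cost of the current policy, J_C(pi_old)
    cost_limit:    constraint threshold d
    kappa:         penalty weight replacing the explicit constraint
    """
    # standard PPO clipped surrogate on the reward advantage (to be maximized)
    reward_surr = np.minimum(ratio * adv_r,
                             np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv_r).mean()
    # first-order estimate of the constraint value after the update
    cost_surr = cost_estimate + (ratio * adv_c).mean()
    # exact-penalty term: active only when the constraint would be violated
    penalty = kappa * max(0.0, cost_surr - cost_limit)
    return -reward_surr + penalty  # minimized by the optimizer

# toy usage
r = np.array([1.0, 0.9, 1.1]); ar = np.array([0.5, -0.2, 0.1]); ac = np.array([0.1, 0.2, -0.1])
print(p3o_style_loss(r, ar, ac, cost_estimate=0.3, cost_limit=0.5))
```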
- Soft Actor-Critic with Cross-Entropy Policy Optimization [0.45687771576879593]
We propose Soft Actor-Critic with Cross-Entropy Policy Optimization (SAC-CEPO).
SAC-CEPO uses the Cross-Entropy Method (CEM) to optimize the policy network of SAC.
We show that SAC-CEPO achieves competitive performance against the original SAC.
arXiv Detail & Related papers (2021-12-21T11:38:12Z)
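A generic Cross-Entropy Method loop of the kind SAC-CEPO could use to optimize policy parameters, with the score function treated as a black box (in SAC-CEPO the score would come from the SAC critic evaluating the policy); this is a sketch under those assumptions, not the paper's exact procedure.

```python
import numpy as np

def cem_optimize(score_fn, dim, n_iters=50, pop_size=64, elite_frac=0.125, seed=0):
    """Generic Cross-Entropy Method: sample candidate parameter vectors, keep
    the elites under score_fn, and refit the sampling Gaussian to the elites.
    """
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(n_iters):
        candidates = mean + std * rng.standard_normal((pop_size, dim))
        scores = np.array([score_fn(c) for c in candidates])
        elites = candidates[np.argsort(scores)[-n_elite:]]   # highest scores
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

# toy usage: maximize a simple concave score
best = cem_optimize(lambda w: -np.sum((w - 3.0) ** 2), dim=5)
print(best)  # approaches [3, 3, 3, 3, 3]
```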
- Regularizing Variational Autoencoder with Diversity and Uncertainty Awareness [61.827054365139645]
Variational Autoencoder (VAE) approximates the posterior of latent variables based on amortized variational inference.
We propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space.
arXiv Detail & Related papers (2021-10-24T07:58:13Z)
- Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks [59.419152768018506]
We show that any optimal policy necessarily satisfies the k-SP constraint.
We propose a novel cost function that penalizes the policy for violating the SP constraint, instead of completely excluding it.
Our experiments on MiniGrid, DeepMind Lab, Atari, and Fetch show that the proposed method significantly improves proximal policy optimization (PPO).
arXiv Detail & Related papers (2021-07-13T21:39:21Z)
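As an illustration of penalizing rather than excluding shortest-path (SP) violations, the wrapper below subtracts a penalty whenever the agent revisits a state within an episode (a revisit means the realized path cannot be a shortest one). It assumes a gym-style environment with the classic 4-tuple step API and is not the paper's exact cost function.

```python
import numpy as np

class ShortestPathPenaltyWrapper:
    """Soft proxy for an SP constraint: subtract a penalty from the reward on
    state revisits instead of excluding the violating trajectory outright.
    Illustrative sketch only; names and the revisit rule are assumptions.
    """
    def __init__(self, env, penalty=0.1):
        self.env, self.penalty = env, penalty
        self.visited = set()

    def reset(self):
        self.visited.clear()
        obs = self.env.reset()
        self.visited.add(self._key(obs))
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        key = self._key(obs)
        if key in self.visited:
            reward -= self.penalty   # penalize, rather than exclude, violations
        self.visited.add(key)
        return obs, reward, done, info

    @staticmethod
    def _key(obs):
        return tuple(np.asarray(obs).ravel().tolist())
```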
- DSAC: Distributional Soft Actor-Critic for Risk-Sensitive Reinforcement Learning [25.658136792327358]
We present the Distributional Soft Actor-Critic (DSAC) algorithm, which combines the strengths of distributional information of accumulated rewards and entropy-driven exploration. DSAC models the randomness in both actions and rewards, surpassing baseline performances on various continuous control tasks. Experiments demonstrate DSAC's effectiveness in enhancing agent performance for both risk-neutral and risk-sensitive control tasks.
arXiv Detail & Related papers (2020-04-30T02:23:15Z)
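A standard ingredient of such distributional critics is the quantile-regression Huber loss applied to a soft (entropy-regularized) Bellman target; DSAC's exact parameterization is not given in the summary above, so the sketch below is generic and its names are assumptions.

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile-regression Huber loss for a distributional critic.

    pred_quantiles: (N,) predicted quantiles of the return distribution
    target_samples: (M,) samples of the soft Bellman target, e.g.
                    r + gamma * (Z(s', a') - alpha * log pi(a'|s'))
    """
    n = len(pred_quantiles)
    taus = (np.arange(n) + 0.5) / n                          # quantile midpoints
    td = target_samples[None, :] - pred_quantiles[:, None]   # pairwise TD errors
    huber = np.where(np.abs(td) <= kappa,
                     0.5 * td ** 2,
                     kappa * (np.abs(td) - 0.5 * kappa))
    weight = np.abs(taus[:, None] - (td < 0).astype(float))  # asymmetric weights
    return (weight * huber).mean()

# toy usage
rng = np.random.default_rng(0)
print(quantile_huber_loss(np.linspace(-1, 1, 32), rng.normal(size=64)))
```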
This list is automatically generated from the titles and abstracts of the papers on this site.