Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning
- URL: http://arxiv.org/abs/2501.17827v1
- Date: Wed, 29 Jan 2025 18:18:00 GMT
- Title: Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning
- Authors: Haque Ishfaq, Guangyuan Wang, Sami Nur Islam, Doina Precup
- Abstract summary: Langevin Soft Actor Critic (LSAC) prioritizes enhancing critic learning through uncertainty estimation over policy optimization.
LSAC outperforms or matches the performance of mainstream model-free RL algorithms for continuous control tasks.
Notably, LSAC marks the first successful application of LMC-based Thompson sampling in continuous control tasks with continuous action spaces.
- Score: 33.42657871152637
- Abstract: Existing actor-critic algorithms, which are popular for continuous control reinforcement learning (RL) tasks, suffer from poor sample efficiency due to the lack of a principled exploration mechanism. Motivated by the success of Thompson sampling for efficient exploration in RL, we propose a novel model-free RL algorithm, Langevin Soft Actor Critic (LSAC), which prioritizes enhancing critic learning through uncertainty estimation over policy optimization. LSAC employs three key innovations: approximate Thompson sampling through distributional Langevin Monte Carlo (LMC) based $Q$ updates, parallel tempering for exploring multiple modes of the posterior of the $Q$ function, and diffusion synthesized state-action samples regularized with $Q$ action gradients. Our extensive experiments demonstrate that LSAC outperforms or matches the performance of mainstream model-free RL algorithms for continuous control tasks. Notably, LSAC marks the first successful application of LMC-based Thompson sampling in continuous control tasks with continuous action spaces.
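To make the LMC-based $Q$ update concrete, below is a minimal PyTorch-style sketch of one Langevin Monte Carlo step on the critic parameters. The names (`q_net`, `loss_fn`, `eta`, `beta`) are illustrative assumptions, not the authors' implementation, and the paper's distributional loss, parallel tempering, and diffusion-based regularization are omitted.

```python
import torch

def lmc_q_update(q_net, loss_fn, batch, eta=1e-3, beta=1e4):
    """One Langevin Monte Carlo (SGLD-style) step on Q-network parameters.

    A plain gradient step on the critic loss is followed by Gaussian
    noise scaled by sqrt(2 * eta / beta). Iterating this update draws
    approximate samples from the posterior over Q-parameters, so acting
    greedily w.r.t. the sampled Q realizes approximate Thompson sampling.
    """
    loss = loss_fn(q_net, batch)   # e.g. a TD or distributional critic loss
    q_net.zero_grad()
    loss.backward()
    noise_scale = (2.0 * eta / beta) ** 0.5
    with torch.no_grad():
        for p in q_net.parameters():
            p -= eta * p.grad                       # gradient descent step
            p += noise_scale * torch.randn_like(p)  # Langevin noise injection
    return loss.item()
```

Parallel tempering, in this picture, would run several such chains at different temperatures `beta` and occasionally swap their parameters; that machinery is left out of the sketch.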
Related papers
- VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach to computing unbiased Monte Carlo-based value estimates for credit assignment (a minimal sketch follows this entry).
We show that VinePPO consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
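As a rough illustration of VinePPO's Monte Carlo-based estimates, the sketch below averages the returns of a few sampled rollouts from an intermediate state instead of querying a learned value network. The environment interface (`env_reset_to`, `env.step`, `env.observation`) is hypothetical.

```python
from statistics import mean

def mc_value_estimate(env_reset_to, policy, state, num_rollouts=8, gamma=1.0):
    """Unbiased Monte Carlo estimate of V(state) via sampled rollouts."""
    returns = []
    for _ in range(num_rollouts):
        env = env_reset_to(state)   # clone the environment at this state
        done, ret, discount = False, 0.0, 1.0
        while not done:
            action = policy(env.observation())
            reward, done = env.step(action)   # hypothetical step API
            ret += discount * reward
            discount *= gamma
        returns.append(ret)
    return mean(returns)   # average return is an unbiased value estimate
```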
- Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation [62.969796245827006]
Delayed-PSVI is an optimistic value-based algorithm that explores the value function space via noise perturbation with posterior sampling.
We show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3T} + d^2H^2\mathbb{E}[\tau])$ worst-case regret in the presence of unknown delays.
We incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI (a sketch of the underlying perturbation idea follows this entry).
arXiv Detail & Related papers (2023-10-29T06:12:43Z)
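The noise-perturbation idea can be sketched for the linear case: fit regularized least-squares weights for the value function, then sample a perturbed weight vector whose covariance shrinks as data accumulates. This minimal NumPy illustration covers the exploration mechanism only; delayed feedback and the Langevin variant (Delayed-LPSVI) are not modeled, and all names are assumptions.

```python
import numpy as np

def perturbed_value_weights(phi, targets, lam=1.0, sigma=1.0, rng=None):
    """Posterior-sampling-style weights for a linear value function.

    phi: (n, d) feature matrix; targets: (n,) regression targets.
    Returns ridge-regression weights perturbed by Gaussian noise with
    covariance sigma^2 * Lambda^{-1}; greedy behavior under the perturbed
    value function drives exploration.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = phi.shape[1]
    Lambda = phi.T @ phi + lam * np.eye(d)        # regularized Gram matrix
    w_hat = np.linalg.solve(Lambda, phi.T @ targets)
    cov = sigma ** 2 * np.linalg.inv(Lambda)      # uncertainty-scaled noise
    return rng.multivariate_normal(w_hat, cov)
```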
- Distributional Soft Actor-Critic with Three Refinements [47.46661939652862]
Reinforcement learning (RL) has shown remarkable success in solving complex decision-making and control tasks.
Many model-free RL algorithms experience performance degradation due to inaccurate value estimation.
This paper introduces three key refinements to DSACv1 to overcome these limitations and further improve Q-value estimation accuracy.
arXiv Detail & Related papers (2023-10-09T16:52:48Z)
- RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End Robust Estimation [74.47709320443998]
We propose RLSAC, a novel Reinforcement Learning enhanced SAmple Consensus framework for end-to-end robust estimation.
RLSAC employs a graph neural network to utilize both data and memory features to guide exploring directions for sampling the next minimum set.
Our experimental results demonstrate that RLSAC can learn from features to gradually explore a better hypothesis.
arXiv Detail & Related papers (2023-08-10T03:14:19Z)
- Seizing Serendipity: Exploiting the Value of Past Success in Off-Policy Actor-Critic [42.57662196581823]
Learning high-quality $Q$-value functions plays a key role in the success of many modern off-policy deep reinforcement learning (RL) algorithms.
Deviating from the common viewpoint, we observe that $Q$-values are often underestimated in the latter stage of the RL training process.
We propose the Blended Exploitation and Exploration (BEE) operator, a simple yet effective approach that updates the $Q$-value using both historical best-performing actions and the current policy (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-06-05T13:38:14Z)
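A BEE-style blended backup can be sketched as below; the critic interface and the way candidate replay actions are gathered are assumptions for illustration, not the paper's exact operator.

```python
import torch

def bee_target(q_net, reward, next_obs, replay_actions, policy_action,
               gamma=0.99, lam=0.5):
    """Blended Exploitation and Exploration (BEE) style target.

    Mixes an exploitation term, which maxes the critic over historical
    actions drawn from the replay buffer, with the usual exploration term
    that evaluates the current policy's action.
    """
    with torch.no_grad():
        q_hist = torch.stack([q_net(next_obs, a) for a in replay_actions])
        exploit = q_hist.max()                    # best historical action
        explore = q_net(next_obs, policy_action)  # current-policy action
    return reward + gamma * (lam * exploit + (1.0 - lam) * explore)
```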
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the $Q$ function from its posterior distribution by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- Monte Carlo Augmented Actor-Critic for Sparse Reward Deep Reinforcement Learning from Suboptimal Demonstrations [17.08814685657957]
Monte Carlo Augmented Actor Critic (MCAC) is a parameter-free modification to standard actor-critic algorithms.
MCAC computes a modified $Q$-value by taking the maximum of the standard temporal difference (TD) target and a Monte Carlo estimate of the reward-to-go (a minimal sketch follows this entry).
Experiments across $5$ continuous control domains suggest that MCAC can be used to significantly increase learning efficiency across $6$ commonly used RL and RL-from-demonstrations algorithms.
arXiv Detail & Related papers (2022-10-14T00:23:37Z)
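The MCAC target is simple enough to sketch directly: compute the one-step TD target and the Monte Carlo reward-to-go at every step of a trajectory, then take the elementwise maximum. The NumPy sketch below assumes per-trajectory arrays of rewards, bootstrap values `q_next`, and done flags.

```python
import numpy as np

def mcac_targets(rewards, q_next, dones, gamma=0.99):
    """MCAC-style targets for one trajectory (illustrative)."""
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    # Standard one-step TD targets with bootstrapped next-state values.
    td = rewards + gamma * np.asarray(q_next, dtype=float) * (1.0 - dones)

    # Monte Carlo reward-to-go, computed backwards over the trajectory.
    mc = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        mc[t] = running

    return np.maximum(td, mc)   # max of TD target and MC reward-to-go
```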
- Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case (a minimal sketch follows this entry).
We instantiate the resulting methods, Supervised Negative Q-learning (SNQN) and Supervised Advantage Actor-Critic (SA2C), with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
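The sampled advantage described above admits a one-line estimate: the positive item's $Q$-value minus the mean $Q$-value over the sampled negatives. A minimal sketch with a hypothetical critic interface:

```python
import torch

def sampled_advantage(q_net, state, positive_item, negative_items):
    """SA2C-style advantage of the observed item over the average case."""
    q_pos = q_net(state, positive_item)
    q_neg = torch.stack([q_net(state, item) for item in negative_items])
    return q_pos - q_neg.mean()   # advantage over sampled negatives
```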