Sample-Efficient and Safe Deep Reinforcement Learning via Reset Deep
Ensemble Agents
- URL: http://arxiv.org/abs/2310.20287v1
- Date: Tue, 31 Oct 2023 08:59:39 GMT
- Title: Sample-Efficient and Safe Deep Reinforcement Learning via Reset Deep
Ensemble Agents
- Authors: Woojun Kim, Yongjae Shin, Jongeui Park, Youngchul Sung
- Abstract summary: The reset method performs periodic resets of a portion or the entirety of a deep RL agent while preserving the replay buffer.
We propose a new reset-based method that leverages deep ensemble learning to address the limitations of the vanilla reset method.
- Score: 17.96977778655143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep reinforcement learning (RL) has achieved remarkable success in solving
complex tasks through its integration with deep neural networks (DNNs) as
function approximators. However, the reliance on DNNs has introduced a new
challenge called primacy bias, whereby these function approximators tend to
prioritize early experiences, leading to overfitting. To mitigate this primacy
bias, a reset method has been proposed, which performs periodic resets of a
portion or the entirety of a deep RL agent while preserving the replay buffer.
However, the use of the reset method can result in performance collapses after
executing the reset, which can be detrimental from the perspective of safe RL
and regret minimization. In this paper, we propose a new reset-based method
that leverages deep ensemble learning to address the limitations of the vanilla
reset method and enhance sample efficiency. The proposed method is evaluated
through various experiments including those in the domain of safe RL. Numerical
results show its effectiveness in high sample efficiency and safety
considerations.
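The abstract's core mechanism can be sketched in code. The following is a minimal, hypothetical illustration (not the authors' implementation): an ensemble of agents shares one replay buffer, and every `reset_interval` steps a single ensemble member's parameters are re-initialized while the buffer is preserved, so that the remaining members cushion the post-reset performance collapse. All class and parameter names here are invented for illustration.

```python
import random
from collections import deque


class ResetEnsembleAgent:
    """Sketch of reset-based deep RL with an ensemble of agents.

    One shared replay buffer survives every reset; resets are staggered
    across ensemble members so that only one member is re-initialized
    at a time.
    """

    def __init__(self, n_agents=2, reset_interval=1000, buffer_size=100000):
        self.agents = [self._fresh_params() for _ in range(n_agents)]
        self.replay_buffer = deque(maxlen=buffer_size)  # preserved across resets
        self.reset_interval = reset_interval
        self.step_count = 0
        self.next_reset_idx = 0  # which ensemble member is reset next

    def _fresh_params(self):
        # Stand-in for freshly initialized network weights.
        return {"weights": [random.gauss(0.0, 1.0) for _ in range(4)]}

    def store(self, transition):
        # Transitions accumulate regardless of resets.
        self.replay_buffer.append(transition)

    def step(self):
        self.step_count += 1
        if self.step_count % self.reset_interval == 0:
            # Reset only one member; the others retain their learned
            # parameters, mitigating the collapse of the vanilla reset.
            self.agents[self.next_reset_idx] = self._fresh_params()
            self.next_reset_idx = (self.next_reset_idx + 1) % len(self.agents)
```

With `n_agents=2` and `reset_interval=10`, running 25 steps triggers resets at steps 10 and 20, cycling the reset index back to 0 while the buffer keeps all 25 transitions.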
Related papers
- ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling [57.91760520589592]
Scaling network depth has been a central driver behind the success of modern foundation models. This paper revisits the default mechanism for deepening neural networks, namely residual connections. We introduce adaptive neural connection reassignment (ANCRe), a principled and lightweight framework that parameterizes and learns residual connectivities from the data.
arXiv Detail & Related papers (2026-02-09T18:54:18Z) - Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z) - On the Unreasonable Effectiveness of Last-layer Retraining [11.989603982988344]
Last-layer retraining (LLR) methods have garnered interest as an efficient approach to rectify dependence on spurious correlations. LLR has been found to improve worst-group accuracy even when the held-out set is an imbalanced subset of the training set. We show how the recent algorithms CB-LLR and AFR perform implicit group-balancing to elicit a robustness improvement.
arXiv Detail & Related papers (2025-12-01T15:08:43Z) - Value Function Initialization for Knowledge Transfer and Jump-start in Deep Reinforcement Learning [0.0]
We introduce DQInit, a method that adapts value function initialization to deep reinforcement learning. DQInit reuses compact Q-values extracted from previously solved tasks as a transferable knowledge base. It employs a knownness-based mechanism to softly integrate these transferred values into underexplored regions and gradually shift toward the agent's learned estimates.
arXiv Detail & Related papers (2025-08-12T18:32:08Z) - Saffron-1: Safety Inference Scaling [69.61130284742353]
SAFFRON is a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. We publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M).
arXiv Detail & Related papers (2025-06-06T18:05:45Z) - Active Human Feedback Collection via Neural Contextual Dueling Bandits [84.7608942821423]
We propose Neural-ADB, an algorithm for collecting human preference feedback when the underlying latent reward function is non-linear.
We show that when preference feedback follows the Bradley-Terry-Luce model, the worst sub-optimality gap of the policy learned by Neural-ADB decreases at a sub-linear rate as the preference dataset increases.
arXiv Detail & Related papers (2025-04-16T12:16:10Z) - An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking [50.81324768683995]
FIRST is a novel approach that integrates a learning-to-rank objective, leveraging the logits of only the first generated token.
We extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains.
Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality.
arXiv Detail & Related papers (2024-11-08T12:08:17Z) - Posterior Sampling with Delayed Feedback for Reinforcement Learning with
Linear Function Approximation [62.969796245827006]
Delayed-PSVI is an optimistic value-based algorithm that explores the value function space via noise perturbation with posterior sampling.
We show our algorithm achieves $\widetilde{O}(\sqrt{d^3 H^3 T} + d^2 H^2 \mathbb{E}[\tau])$ worst-case regret in the presence of unknown delays.
We incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI.
arXiv Detail & Related papers (2023-10-29T06:12:43Z) - Diverse Priors for Deep Reinforcement Learning [2.8554857235549753]
In Reinforcement Learning (RL), agents aim at maximizing cumulative rewards in a given environment.
We introduce an approach with carefully designed prior neural networks, which incorporate maximal diversity into the initial value functions of RL.
Our method has demonstrated superior performance compared with the random prior approaches in solving classic control problems and general exploration tasks.
arXiv Detail & Related papers (2023-10-23T12:33:59Z) - Deep Learning Meets Adaptive Filtering: A Stein's Unbiased Risk
Estimator Approach [13.887632153924512]
We introduce task-based deep learning frameworks, denoted as Deep RLS and Deep EASI.
These architectures transform the iterations of the original algorithms into layers of a deep neural network, enabling efficient source signal estimation.
To further enhance performance, we propose training these deep unrolled networks utilizing a surrogate loss function grounded on Stein's unbiased risk estimator (SURE).
arXiv Detail & Related papers (2023-07-31T14:26:41Z) - Efficient Exploration via Epistemic-Risk-Seeking Policy Optimization [8.867416300893577]
Exploration remains a key challenge in deep reinforcement learning (RL)
In this paper we propose a new, differentiable optimistic objective that when optimized yields a policy that provably explores efficiently.
Results show significant performance improvements even over other efficient exploration techniques.
arXiv Detail & Related papers (2023-02-18T14:13:25Z) - A Neural-Network-Based Convex Regularizer for Inverse Problems [14.571246114579468]
Deep-learning methods to solve image-reconstruction problems have enabled a significant increase in reconstruction quality.
These new methods often lack reliability and explainability, and there is a growing interest to address these shortcomings.
In this work, we tackle this issue by revisiting regularizers that are the sum of convex-ridge functions.
The gradient of such regularizers is parameterized by a neural network that has a single hidden layer with increasing and learnable activation functions.
arXiv Detail & Related papers (2022-11-22T18:19:10Z) - Reward Uncertainty for Exploration in Preference-based Reinforcement
Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward.
Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z) - Improving the Efficiency of Off-Policy Reinforcement Learning by
Accounting for Past Decisions [20.531576904743282]
Off-policy estimation bias is corrected in a per-decision manner.
Off-policy algorithms such as Tree Backup and Retrace rely on this mechanism.
We propose a multistep operator that permits arbitrary past-dependent traces.
arXiv Detail & Related papers (2021-12-23T00:07:28Z) - Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
arXiv Detail & Related papers (2021-11-05T12:51:15Z) - Robust Deep Reinforcement Learning through Adversarial Loss [74.20501663956604]
Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent's inputs.
We propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against adversarial attacks.
arXiv Detail & Related papers (2020-08-05T07:49:42Z) - Experience Replay with Likelihood-free Importance Weights [123.52005591531194]
We propose to reweight experiences based on their likelihood under the stationary distribution of the current policy.
We apply the proposed approach empirically on two competitive methods, Soft Actor Critic (SAC) and Twin Delayed Deep Deterministic policy gradient (TD3).
arXiv Detail & Related papers (2020-06-23T17:17:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.