MEET: A Monte Carlo Exploration-Exploitation Trade-off for Buffer
Sampling
- URL: http://arxiv.org/abs/2210.13545v2
- Date: Mon, 17 Apr 2023 07:11:03 GMT
- Title: MEET: A Monte Carlo Exploration-Exploitation Trade-off for Buffer
Sampling
- Authors: Julius Ott, Lorenzo Servadei, Jose Arjona-Medina, Enrico Rinaldi,
Gianfranco Mauro, Daniela Sánchez Lopera, Michael Stephan, Thomas
Stadelmayer, Avik Santra, Robert Wille
- Abstract summary: State-of-the-art sampling strategies for the experience replay buffer improve the performance of the Reinforcement Learning agent.
However, they do not incorporate uncertainty in the Q-Value estimation.
This paper proposes a new sampling strategy that leverages the exploration-exploitation trade-off.
- Score: 2.501153467354696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data selection is essential for any data-based optimization technique, such
as Reinforcement Learning. State-of-the-art sampling strategies for the
experience replay buffer improve the performance of the Reinforcement Learning
agent. However, they do not incorporate uncertainty in the Q-Value estimation.
Consequently, they cannot adapt the sampling strategies, including exploration
and exploitation of transitions, to the complexity of the task. To address
this, the paper proposes a new sampling strategy that leverages the
exploration-exploitation trade-off. This is enabled by the uncertainty
estimation of the Q-Value function, which guides the sampling to explore more
significant transitions and, thus, learn a more efficient policy. Experiments
on classical control tasks demonstrate stable results across various
environments. They show that the proposed method outperforms state-of-the-art
sampling strategies for dense rewards w.r.t. convergence and peak performance
by 26% on average.
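The core idea can be illustrated with a minimal sketch (not the authors' implementation): assume an ensemble of Q-networks provides several value estimates per stored transition, use the ensemble spread as the uncertainty signal, and let a temperature parameter (an illustrative assumption) trade off exploitation of high-value transitions against exploration of uncertain ones.

    # Minimal sketch of uncertainty-guided replay-buffer sampling (assumed form,
    # not the paper's code). `q_ensemble` has shape (n_models, n_transitions),
    # one Q-value estimate per ensemble member per stored transition;
    # `temperature` is an illustrative exploration-exploitation knob.
    import numpy as np

    def sample_indices(q_ensemble, batch_size, temperature=1.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        q_mean = q_ensemble.mean(axis=0)      # exploitation: expected value
        q_std = q_ensemble.std(axis=0)        # exploration: epistemic uncertainty
        score = q_mean + temperature * q_std  # UCB-style combination (assumed)
        probs = np.exp(score - score.max())   # numerically stable softmax
        probs /= probs.sum()
        return rng.choice(len(probs), size=batch_size, replace=False, p=probs)

    # Usage: idx = sample_indices(q_ensemble, batch_size=64, temperature=0.5)
    # The returned indices select the mini-batch drawn from the replay buffer.

A higher temperature biases the mini-batch toward uncertain transitions (exploration), while a temperature near zero recovers sampling by expected value alone (exploitation).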
Related papers
- Memory Sequence Length of Data Sampling Impacts the Adaptation of Meta-Reinforcement Learning Agents [1.971759811837406]
We investigate the impact of data sampling strategies on the exploration and adaptability of meta-RL agents.
Our analysis revealed that long-memory and short-memory sequence sampling strategies affect the representation and adaptive capabilities of meta-RL agents.
arXiv Detail & Related papers (2024-06-18T07:41:40Z) - Take the Bull by the Horns: Hard Sample-Reweighted Continual Training
Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z) - Gradient and Uncertainty Enhanced Sequential Sampling for Global Fit [0.0]
This paper proposes a new sampling strategy for global fit called Gradient and Uncertainty Enhanced Sequential Sampling (GUESS).
We show that GUESS achieved on average the highest sample efficiency compared to other surrogate-based strategies on the tested examples.
arXiv Detail & Related papers (2023-09-29T19:49:39Z) - Sampling Through the Lens of Sequential Decision Making [9.101505546901999]
We propose a reward-guided sampling strategy called Adaptive Sample with Reward (ASR).
Our approach adaptively adjusts the sampling process to achieve optimal performance.
Empirical results in information retrieval and clustering demonstrate ASR's superb performance across different datasets.
arXiv Detail & Related papers (2022-08-17T04:01:29Z) - SURF: Semi-supervised Reward Learning with Data Augmentation for
Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z) - Impact of Channel Variation on One-Class Learning for Spoof Detection [5.549602650463701]
Spoofing detection increases the reliability of the ASV system, but its performance degrades significantly under channel variation.
Which data-feeding strategy is optimal for multi-condition training (MCT) is not known in the case of spoof detection.
This study highlights the often-overlooked processes of data-feeding and mini-batching and raises awareness of the need to refine them for better performance.
arXiv Detail & Related papers (2021-09-30T07:56:16Z) - DEALIO: Data-Efficient Adversarial Learning for Imitation from
Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z) - Federated Learning under Importance Sampling [49.17137296715029]
We study the effect of importance sampling and devise schemes for sampling agents and data non-uniformly guided by a performance measure.
We find that in schemes involving sampling without replacement, the performance of the resulting architecture is controlled by two factors related to data variability at each agent.
arXiv Detail & Related papers (2020-12-14T10:08:55Z) - Optimal Importance Sampling for Federated Learning [57.14673504239551]
Federated learning involves a mixture of centralized and decentralized processing tasks.
The sampling of both agents and data is generally uniform; however, in this work we consider non-uniform sampling.
We derive optimal importance sampling strategies for both agent and data selection and show that non-uniform sampling without replacement improves the performance of the original FedAvg algorithm.
arXiv Detail & Related papers (2020-10-26T14:15:33Z) - Learning to Sample with Local and Global Contexts in Experience Replay
Buffer [135.94190624087355]
We propose a new learning-based sampling method that can compute the relative importance of each transition.
We show that our framework can significantly improve the performance of various off-policy reinforcement learning methods.
arXiv Detail & Related papers (2020-07-14T21:12:56Z) - Adaptive Experience Selection for Policy Gradient [8.37609145576126]
Experience replay is a commonly used approach to improve sample efficiency.
However, gradient estimators using past trajectories typically have high variance.
Existing sampling strategies for experience replay like uniform sampling or prioritised experience replay do not explicitly try to control the variance of the gradient estimates.
We propose an online learning algorithm, adaptive experience selection (AES), to adaptively learn an experience sampling distribution that explicitly minimises this variance.
arXiv Detail & Related papers (2020-02-17T13:16:37Z)
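As a companion to the last entry above (Adaptive Experience Selection), the sketch below illustrates the general variance-reduction idea behind non-uniform experience sampling. It is not the AES algorithm itself but a simpler, well-known heuristic: sample experiences in proportion to their per-sample gradient norms and reweight the loss so the mini-batch gradient stays unbiased.

    # Illustrative sketch (assumed; not the AES algorithm referenced above):
    # sample experiences with probability proportional to per-sample gradient
    # norms and apply importance weights so the mini-batch gradient remains an
    # unbiased estimate of the full-batch gradient.
    import numpy as np

    def importance_sample(grad_norms, batch_size, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        probs = grad_norms / grad_norms.sum()          # p_i proportional to ||g_i||
        idx = rng.choice(len(probs), size=batch_size, p=probs)
        weights = 1.0 / (len(probs) * probs[idx])      # unbiasedness correction
        return idx, weights

    # Usage: idx, w = importance_sample(grad_norms, batch_size=32)
    # Weighting the selected per-sample losses by w before averaging keeps the
    # resulting gradient estimate unbiased.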