Adaptive Experience Selection for Policy Gradient
- URL: http://arxiv.org/abs/2002.06946v1
- Date: Mon, 17 Feb 2020 13:16:37 GMT
- Title: Adaptive Experience Selection for Policy Gradient
- Authors: Saad Mohamad and Giovanni Montana
- Abstract summary: Experience replay is a commonly used approach to improve sample efficiency, but gradient estimators using past trajectories typically have high variance.
Existing sampling strategies for experience replay like uniform sampling or prioritised experience replay do not explicitly try to control the variance of the gradient estimates.
We propose an online learning algorithm, adaptive experience selection (AES), to adaptively learn an experience sampling distribution that explicitly minimises this variance.
- Score: 8.37609145576126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy gradient reinforcement learning (RL) algorithms have achieved
impressive performance in challenging learning tasks such as continuous
control, but suffer from high sample complexity. Experience replay is a
commonly used approach to improve sample efficiency, but gradient estimators
using past trajectories typically have high variance. Existing sampling
strategies for experience replay like uniform sampling or prioritised
experience replay do not explicitly try to control the variance of the gradient
estimates. In this paper, we propose an online learning algorithm, adaptive
experience selection (AES), to adaptively learn an experience sampling
distribution that explicitly minimises this variance. Using a regret
minimisation approach, AES iteratively updates the experience sampling
distribution to match the performance of a competitor distribution assumed to
have optimal variance. Sample non-stationarity is addressed by proposing a
dynamic (i.e. time changing) competitor distribution for which a closed-form
solution is proposed. We demonstrate that AES is a low-regret algorithm with
reasonable sample complexity. Empirically, AES has been implemented for deep
deterministic policy gradient and soft actor critic algorithms, and tested on 8
continuous control tasks from the OpenAI Gym library. Our results show that
AES leads to significantly improved performance compared to currently available
experience sampling strategies for policy gradient.
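For intuition, here is a minimal Python sketch of the idea. It relies on the standard fact that the variance-minimising sampling distribution for an importance-weighted gradient estimator is proportional to the per-trajectory gradient norm; the class name, the running-average update, and the uniform mixing below are illustrative choices, not the paper's exact algorithm (which derives the update from regret minimisation against a dynamic competitor distribution).

```python
import numpy as np

class AdaptiveExperienceSelection:
    """Sketch of AES-style experience selection: keep a sampling distribution
    over replay-buffer trajectories and push it toward the variance-minimising
    distribution, which for an importance-weighted average of per-trajectory
    gradient contributions g_i is proportional to ||g_i||."""

    def __init__(self, capacity, lr=0.1, mix=0.1, rng=None):
        self.scores = np.ones(capacity)   # running estimate of ||g_i|| per trajectory
        self.lr = lr                      # step size of the online update
        self.mix = mix                    # uniform mixing keeps every p_i > 0
        self.rng = rng or np.random.default_rng(0)

    def probs(self):
        p = self.scores / self.scores.sum()
        uniform = np.full_like(p, 1.0 / p.size)
        return (1.0 - self.mix) * p + self.mix * uniform

    def sample(self, batch_size):
        p = self.probs()
        idx = self.rng.choice(p.size, size=batch_size, p=p)
        return idx, p[idx]                # p[idx] is needed to importance-weight the gradient

    def update(self, idx, grad_norms):
        # Online step toward the variance-minimising target p_i proportional to ||g_i||,
        # using only the norms observed for the sampled trajectories.
        self.scores[idx] += self.lr * (np.asarray(grad_norms) - self.scores[idx])

# Typical loop inside an off-policy learner (DDPG or SAC in the paper):
aes = AdaptiveExperienceSelection(capacity=1000)
idx, p = aes.sample(batch_size=32)
# ...compute the importance-weighted policy gradient from the trajectories at idx,
# record each trajectory's gradient-contribution norm, then feed the norms back:
aes.update(idx, grad_norms=np.ones(32))   # stand-in values for illustration
```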
Related papers
- Adaptive teachers for amortized samplers [76.88721198565861]
Amortized inference is the task of training a parametric model, such as a neural network, to approximate a distribution with a given unnormalized density where exact sampling is intractable.
Off-policy RL training facilitates the discovery of diverse, high-reward candidates, but existing methods still face challenges in efficient exploration.
We propose an adaptive training distribution (the Teacher) to guide the training of the primary amortized sampler (the Student) by prioritizing high-loss regions; a minimal sketch of this idea follows this entry.
arXiv Detail & Related papers (2024-10-02T11:33:13Z)
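To make the Teacher/Student idea in the adaptive-teachers entry above concrete, here is a hypothetical sketch of a teacher that tracks the student's loss per region of the input space and proposes training inputs in proportion to it. In the paper the Teacher is itself an amortized sampler trained with RL, so this is only an illustration; the class and its update rule are assumptions.

```python
import numpy as np

class LossPrioritisedTeacher:
    """Hypothetical teacher: track the student's recent loss per region of the
    input space and propose training inputs in proportion to it, so high-loss
    regions are visited more often."""

    def __init__(self, n_regions, decay=0.9, rng=None):
        self.loss_est = np.ones(n_regions)   # running loss estimate per region
        self.decay = decay
        self.rng = rng or np.random.default_rng(0)

    def propose(self, batch_size):
        p = self.loss_est / self.loss_est.sum()
        return self.rng.choice(len(p), size=batch_size, p=p)

    def feedback(self, regions, losses):
        # Exponential moving average of the student's loss in each sampled region.
        for r, loss in zip(regions, losses):
            self.loss_est[r] = self.decay * self.loss_est[r] + (1 - self.decay) * loss
```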
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance; a minimal importance-sampling sketch follows this entry.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
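The entry above builds on the standard importance-sampled policy gradient. The toy example below (a one-dimensional Gaussian policy with a synthetic reward, not the paper's method of actively choosing the behavioural policy) shows the estimator and why the behavioural policy drives its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_logp(a, mu, sigma):
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Samples collected from a *behavioural* Gaussian policy beta = N(mu_b, sigma_b)...
mu_b, sigma_b = 0.0, 1.0
actions = rng.normal(mu_b, sigma_b, size=10_000)
rewards = -(actions - 1.5) ** 2              # toy reward, maximised at a = 1.5

# ...reused to estimate the gradient for the *target* policy pi = N(mu, sigma).
mu, sigma = 0.5, 1.0
is_weights = np.exp(gaussian_logp(actions, mu, sigma) - gaussian_logp(actions, mu_b, sigma_b))

# Importance-sampled REINFORCE estimate of d E[R] / d mu.
score = (actions - mu) / sigma ** 2          # d log pi / d mu for a Gaussian policy
per_sample = is_weights * rewards * score
grad_mu = per_sample.mean()

# The estimator's variance depends strongly on where the behavioural policy puts
# its samples, which is exactly what "active" importance sampling optimises.
print(grad_mu, per_sample.var())
```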
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization; a generic reweighting sketch follows this entry.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
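As a rough illustration of instance reweighting in the spirit of the entry above, the snippet below uses the well-known closed form of the KL-regularised DRO inner maximisation, which up-weights high-loss examples. It is a generic sketch, not the paper's exact objective or weighting scheme.

```python
import numpy as np

def dro_instance_weights(losses, tau=1.0):
    """KL-regularised DRO inner step: the worst-case reweighting of the data
    under a KL penalty has the closed form q_i proportional to exp(loss_i / tau).
    Smaller tau emphasises hard samples more aggressively; tau -> infinity
    recovers uniform weighting."""
    z = np.asarray(losses) / tau
    z -= z.max()                     # numerical stability
    q = np.exp(z)
    return q / q.sum()

losses = np.array([0.2, 0.5, 2.0, 0.1])           # per-example training losses
weights = dro_instance_weights(losses, tau=0.5)   # hard examples get large weight
reweighted_loss = float(np.dot(weights, losses))  # objective used for the update
```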
- MEET: A Monte Carlo Exploration-Exploitation Trade-off for Buffer Sampling [2.501153467354696]
State-of-the-art sampling strategies for the experience replay buffer improve the performance of the reinforcement learning agent, but they do not incorporate uncertainty in the Q-value estimation.
This paper proposes a new sampling strategy that leverages the exploration-exploitation trade-off.
arXiv Detail & Related papers (2022-10-24T18:55:41Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions; a minimal over-sampling sketch follows this entry.
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
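The over-sampling idea in the AutoSMOTE entry above can be illustrated with a SMOTE-style interpolation step. The function below is a hypothetical, brute-force sketch; AutoSMOTE instead learns these sampling decisions with hierarchical RL.

```python
import numpy as np

def smote_like_oversample(minority, n_new, k=5, rng=None):
    """SMOTE-style over-sampling: each synthetic point is a random interpolation
    between a minority-class sample and one of its k nearest minority-class
    neighbours (brute-force nearest-neighbour search for clarity)."""
    rng = rng or np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        dists = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip index 0, which is x itself
        x_nn = minority[rng.choice(neighbours)]
        lam = rng.random()                        # interpolation coefficient in [0, 1)
        synthetic.append(x + lam * (x_nn - x))
    return np.array(synthetic)

minority = np.random.default_rng(1).normal(size=(20, 3))   # toy minority class
new_samples = smote_like_oversample(minority, n_new=30)
```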
- Sampling Through the Lens of Sequential Decision Making [9.101505546901999]
We propose a reward-guided sampling strategy called Adaptive Sample with Reward (ASR).
Our approach adjusts the sampling process to achieve optimal performance.
Empirical results in information retrieval and clustering demonstrate ASR's superb performance across different datasets.
arXiv Detail & Related papers (2022-08-17T04:01:29Z)
- Adaptive Client Sampling in Federated Learning via Online Learning with Bandit Feedback [36.05851452151107]
Federated learning (FL) systems need to sample a subset of clients that are involved in each round of training.
Despite its importance, there is limited work on how to sample clients effectively.
We show how our sampling method can improve the convergence speed of optimization algorithms; a bandit-style sampling sketch follows this entry.
arXiv Detail & Related papers (2021-12-28T23:50:52Z)
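A generic way to realise client sampling with bandit feedback, as in the entry above, is an EXP3-style sampler over clients. The sketch below is illustrative (class name, utility signal, and step sizes are assumptions) and does not reproduce the paper's estimator or analysis.

```python
import numpy as np

class Exp3ClientSampler:
    """EXP3-style sketch: the server keeps weights over clients, samples a subset
    each round, and reinforces the clients whose updates proved most useful.
    Only sampled clients reveal their utility (bandit feedback), so the update
    uses an inverse-probability estimate."""

    def __init__(self, n_clients, lr=0.1, explore=0.1, rng=None):
        self.log_w = np.zeros(n_clients)
        self.lr, self.explore = lr, explore
        self.rng = rng or np.random.default_rng(0)

    def probs(self):
        w = np.exp(self.log_w - self.log_w.max())
        p = w / w.sum()
        return (1.0 - self.explore) * p + self.explore / len(p)

    def sample(self, n_selected):
        p = self.probs()
        chosen = self.rng.choice(len(p), size=n_selected, replace=False, p=p)
        return chosen, p

    def update(self, chosen, utilities, p):
        # Inverse-probability-weighted reward estimate, as in EXP3.
        self.log_w[chosen] += self.lr * np.asarray(utilities) / p[chosen]

sampler = Exp3ClientSampler(n_clients=100)
clients, p = sampler.sample(n_selected=10)
# ...run local training on the chosen clients, measure each one's utility
# (e.g. reduction in global loss), then:
sampler.update(clients, utilities=np.ones(10), p=p)   # stand-in utilities
```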
- Variance Reduction based Experience Replay for Policy Optimization [3.0790370651488983]
Variance Reduction Experience Replay (VRER) is a framework for the selective reuse of relevant samples to improve policy gradient estimation.
VRER forms the foundation of our sample efficient off-policy learning algorithm known as Policy Gradient with VRER.
arXiv Detail & Related papers (2021-10-17T19:28:45Z)
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy through interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based ZO algorithm (ZO-RL) that learns a sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimate by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios; a sketch of the standard ZO estimator follows this entry.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
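For context on the entry above, the snippet below shows the standard two-point zeroth-order gradient estimator built from random perturbations. ZO-RL's contribution is to learn the perturbation (sampling) policy rather than drawing the directions i.i.d., which this sketch deliberately does not implement.

```python
import numpy as np

def zo_gradient(f, x, n_dirs=20, mu=1e-3, rng=None):
    """Standard two-point zeroth-order gradient estimate: average finite
    differences of f along random Gaussian directions u."""
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.normal(size=x.size)
        grad += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return grad / n_dirs

f = lambda x: float(np.sum(x ** 2))          # toy objective, true gradient is 2x
x = np.array([1.0, -2.0, 0.5])
print(zo_gradient(f, x))                     # noisy estimate of [2, -4, 1]
```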
- Progressive Multi-Stage Learning for Discriminative Tracking [25.94944743206374]
We propose a joint discriminative learning scheme with the progressive multi-stage optimization policy of sample selection for robust visual tracking.
The proposed scheme presents a novel time-weighted and detection-guided self-paced learning strategy for easy-to-hard sample selection.
Experiments on the benchmark datasets demonstrate the effectiveness of the proposed learning framework.
arXiv Detail & Related papers (2020-04-01T07:01:30Z)