Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking
- URL: http://arxiv.org/abs/2208.10583v1
- Date: Mon, 22 Aug 2022 20:29:20 GMT
- Title: Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking
- Authors: Eshwar S R, Shishir Kolathaya, Gugan Thoppe
- Abstract summary: Evolution Strategy (ES) is a powerful black-box optimization technique based on the idea of natural evolution.
We propose a novel off-policy alternative for ranking, based on a local approximation for the fitness function.
We demonstrate our idea in the context of a state-of-the-art ES method called the Augmented Random Search (ARS).
- Score: 2.8176502405615396
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Evolution Strategy (ES) is a powerful black-box optimization technique based
on the idea of natural evolution. In each of its iterations, a key step entails
ranking candidate solutions based on some fitness score. For an ES method in
Reinforcement Learning (RL), this ranking step requires evaluating multiple
policies. This is presently done via on-policy approaches: each policy's score
is estimated by interacting several times with the environment using that
policy. This leads to a lot of wasteful interactions since, once the ranking is
done, only the data associated with the top-ranked policies is used for
subsequent learning. To improve sample efficiency, we propose a novel
off-policy alternative for ranking, based on a local approximation for the
fitness function. We demonstrate our idea in the context of a state-of-the-art
ES method called the Augmented Random Search (ARS). Simulations in MuJoCo tasks
show that, compared to the original ARS, our off-policy variant has similar
running times for reaching reward thresholds but needs only around 70% as much
data. It also outperforms the recent Trust Region ES. We believe our ideas
should be extendable to other ES methods as well.
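To make the ranking step concrete, below is a minimal, self-contained sketch of one ARS-style iteration in which the perturbed policies are ranked through a local approximation of the fitness rather than fresh evaluations of every perturbation. The toy quadratic fitness (standing in for an episodic return), the first-order approximation, and all hyperparameters are illustrative assumptions; the abstract does not specify the paper's actual off-policy estimator.

```python
# Sketch of one ARS-style iteration with the ranking step done via a local
# approximation of fitness. Toy problem only; not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
dim, n_dirs, top_b, nu, lr = 8, 16, 8, 0.05, 0.02
theta_star = rng.normal(size=dim)          # unknown optimum of the toy fitness


def fitness(theta):
    """Toy stand-in for an episodic return (on-policy ARS would run rollouts here)."""
    return -np.sum((theta - theta_star) ** 2)


def local_score(theta, theta_ref, grad_ref):
    """Hypothetical first-order approximation of fitness around theta_ref,
    playing the role of ranking without fresh environment interaction."""
    return fitness(theta_ref) + grad_ref @ (theta - theta_ref)


theta = np.zeros(dim)
for _ in range(200):
    deltas = rng.normal(size=(n_dirs, dim))
    # Crude finite-difference gradient at theta; it stands in for information
    # reused from earlier data (the paper's actual estimator is not given above).
    grad_ref = np.array([(fitness(theta + 1e-3 * e) - fitness(theta)) / 1e-3
                         for e in np.eye(dim)])
    # Rank each direction by the better of its two perturbations, as in ARS,
    # but using the local approximation instead of fresh rollouts.
    scores = np.array([max(local_score(theta + nu * d, theta, grad_ref),
                           local_score(theta - nu * d, theta, grad_ref))
                       for d in deltas])
    top = np.argsort(scores)[-top_b:]
    # Standard ARS update restricted to the top-ranked directions.
    diffs = np.array([fitness(theta + nu * deltas[i]) - fitness(theta - nu * deltas[i])
                      for i in top])
    theta = theta + lr / (top_b * diffs.std() + 1e-8) * (diffs[:, None] * deltas[top]).sum(axis=0)

print("final toy fitness:", round(fitness(theta), 3))
```

The point of the sketch is only to show where the ranking sits inside an ARS iteration and why replacing fresh per-perturbation rollouts with a reusable local estimate saves environment interactions.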
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), which enables the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
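For context, the passive use of importance sampling that this paper contrasts with looks roughly like the sketch below: samples drawn from a fixed behavior policy are re-weighted to estimate the target policy's gradient. The toy bandit, the softmax target policy, and the fixed behavior policy are illustrative assumptions, not taken from the paper.

```python
# Off-policy REINFORCE-style estimate on a toy 3-armed bandit:
# E_b[ w(a) * r(a) * grad log pi_theta(a) ], with w(a) = pi_theta(a) / b(a).
import numpy as np

rng = np.random.default_rng(1)
rewards = np.array([1.0, 0.2, 0.5])            # mean reward of each arm


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


theta = np.zeros(3)                             # target policy parameters
behavior = np.array([0.5, 0.3, 0.2])            # fixed data-collection policy

grads, n = np.zeros(3), 5000
for _ in range(n):
    a = rng.choice(3, p=behavior)
    r = rewards[a] + 0.1 * rng.normal()
    pi = softmax(theta)
    w = pi[a] / behavior[a]                     # importance weight
    grad_log = -pi
    grad_log[a] += 1.0                          # gradient of log softmax at action a
    grads += w * r * grad_log
grads /= n
print("IS policy-gradient estimate:", np.round(grads, 3))
```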
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline
Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z) - Pessimistic Off-Policy Optimization for Learning to Rank [13.733459243449634]
Off-policy learning is a framework for optimizing policies without deploying them.
In recommender systems, this is especially challenging due to the imbalance in logged data.
We study pessimistic off-policy optimization for learning to rank.
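A rough illustration of the general pessimism idea for off-policy ranking (not this paper's specific estimator, which the blurb does not describe): score items from imbalanced logged feedback with inverse-propensity weighting, then penalize items with little support in the log before ranking them.

```python
# Pessimistic ranking from logged feedback: IPS estimate minus an
# uncertainty penalty. Toy data and penalty form are assumptions.
import numpy as np

rng = np.random.default_rng(2)
n_items, n_logs = 5, 2000
true_ctr = np.array([0.30, 0.25, 0.20, 0.10, 0.05])
logging_probs = np.array([0.05, 0.10, 0.15, 0.30, 0.40])   # imbalanced logging policy

shown = rng.choice(n_items, size=n_logs, p=logging_probs)
clicks = rng.random(n_logs) < true_ctr[shown]

ips_value = np.zeros(n_items)
counts = np.zeros(n_items, dtype=int)
for item, click in zip(shown, clicks):
    ips_value[item] += click / logging_probs[item]
    counts[item] += 1
ips_value /= n_logs                      # IPS estimate of each item's click rate

# Pessimism: shrink the score of items with little support before ranking.
penalty = 1.0 / np.sqrt(np.maximum(counts, 1))
ranking = np.argsort(-(ips_value - penalty))
print("pessimistic ranking:", ranking)
print("true order:        ", np.argsort(-true_ctr))
```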
arXiv Detail & Related papers (2022-06-06T12:58:28Z) - EnTRPO: Trust Region Policy Optimization Method with Entropy
Regularization [1.599072005190786]
Trust Region Policy Optimization (TRPO) is a popular and empirically successful policy search algorithm in reinforcement learning.
In this work, we borrow from the off-policy learning setting by adding a replay buffer to TRPO.
We also add an entropy regularization term to the advantage over the policy pi, accumulated over time steps.
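A small sketch of that entropy-regularized advantage: the usual one-step advantage estimate plus an entropy bonus of the policy at each time step. The toy trajectory, discount, and coefficient are assumptions for illustration only.

```python
# Entropy-regularized advantages over a toy 4-step trajectory.
import numpy as np


def entropy(probs):
    probs = np.clip(probs, 1e-12, 1.0)
    return -np.sum(probs * np.log(probs), axis=-1)


gamma, beta = 0.99, 0.01
rewards = np.array([1.0, 0.5, 0.0, 1.5])                 # toy rewards r_0..r_3
values = np.array([2.0, 1.8, 1.2, 0.9, 0.0])             # V(s_0)..V(s_4), terminal = 0
pi_probs = np.array([[0.7, 0.3], [0.5, 0.5], [0.9, 0.1], [0.6, 0.4]])

# One-step advantages: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)
adv = rewards + gamma * values[1:] - values[:-1]
# Entropy bonus added per time step, as described in the blurb
adv_ent = adv + beta * entropy(pi_probs)
print("advantages:        ", np.round(adv, 3))
print("with entropy bonus:", np.round(adv_ent, 3))
```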
arXiv Detail & Related papers (2021-10-26T03:04:00Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
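The blurb does not say how out-of-dataset actions are avoided; in the full IQL paper this is done by fitting a state-value function with expectile regression on Q-values of dataset actions only. Below is a minimal sketch of the expectile loss itself; the sample Q-values are made up.

```python
# Expectile regression of V(s) onto Q-values of dataset actions (IQL-style idea).
import numpy as np


def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss: overweights positive residuals when tau > 0.5,
    pulling V toward an upper expectile of Q over the logged actions."""
    weight = np.where(diff > 0, tau, 1 - tau)
    return weight * diff ** 2


q_dataset_actions = np.array([1.0, 2.5, 0.3, 1.8])   # Q(s, a) for logged actions only
v_candidates = np.linspace(0.0, 3.0, 301)
losses = [expectile_loss(q_dataset_actions - v).mean() for v in v_candidates]
v_star = v_candidates[int(np.argmin(losses))]
print("expectile-regressed V(s):", round(v_star, 2))  # sits above the mean Q
```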
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with
On-Policy Experience [9.06635747612495]
Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm.
SAC trains a policy by maximizing the trade-off between expected return and entropy.
It has achieved state-of-the-art performance on a range of continuous-control benchmark tasks.
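A minimal sketch of that return-versus-entropy trade-off as it appears in the soft (entropy-augmented) bootstrapped target used to train SAC critics; all numbers below are placeholders, not from the paper.

```python
# Soft TD target: y = r + gamma * (min_i Q_i(s', a') - alpha * log pi(a'|s'))
import numpy as np

gamma, alpha = 0.99, 0.2
r, done = 1.0, False
q1_next, q2_next = 3.2, 3.5          # two critics' estimates at (s', a')
log_pi_next = -1.1                   # log pi(a'|s') for the sampled next action

soft_value = min(q1_next, q2_next) - alpha * log_pi_next
target = r + (0.0 if done else gamma * soft_value)
print("soft TD target:", round(target, 3))
```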
arXiv Detail & Related papers (2021-09-24T06:46:28Z) - Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
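One way to picture the supervised-ranking idea: fit a scoring function so that training policies with known performance are ordered correctly, then rank new policies by their scores. The linear scorer, the synthetic policy "features", and the pairwise logistic loss below are illustrative assumptions, not the paper's model.

```python
# Pairwise ranking of policies: learn w so that better policies score higher.
import numpy as np

rng = np.random.default_rng(3)
n_train, dim = 30, 6
feats = rng.normal(size=(n_train, dim))                 # features of training policies
true_perf = feats @ rng.normal(size=dim) + 0.1 * rng.normal(size=n_train)

w, lr = np.zeros(dim), 0.05
for _ in range(2000):
    i, j = rng.integers(n_train, size=2)
    if true_perf[i] == true_perf[j]:
        continue
    if true_perf[i] < true_perf[j]:
        i, j = j, i                                      # make i the better policy
    margin = (feats[i] - feats[j]) @ w
    # Pairwise logistic ranking loss log(1 + exp(-margin)); one SGD step:
    grad = -(feats[i] - feats[j]) / (1.0 + np.exp(margin))
    w -= lr * grad

scores = feats @ w
corr = np.corrcoef(scores, true_perf)[0, 1]
print("correlation of learned scores with true performance:", round(corr, 3))
```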
arXiv Detail & Related papers (2021-07-03T07:01:23Z) - DDPG++: Striving for Simplicity in Continuous-control Off-Policy
Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy so as to prevent deterioration in performance.
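A small sketch of that third idea: weight replay-buffer transitions by an estimated propensity of the current policy producing the logged action, so that stale transitions contribute less to the update. The Gaussian propensity model and all numbers are assumptions for illustration; the paper's estimator may differ.

```python
# Propensity-weighted sampling of replay transitions (illustrative only).
import numpy as np

rng = np.random.default_rng(4)
n, act_dim = 8, 2
logged_actions = rng.normal(size=(n, act_dim))           # actions stored in the buffer
current_actions = rng.normal(size=(n, act_dim))          # what the current policy would do now
sigma = 0.5                                              # assumed exploration-noise scale

# Propensity of each logged action under a Gaussian centered at the current policy's action
sq_dist = np.sum((logged_actions - current_actions) ** 2, axis=1)
propensity = np.exp(-sq_dist / (2 * sigma ** 2))
weights = propensity / propensity.sum()                  # normalized sampling weights

batch_idx = rng.choice(n, size=4, p=weights, replace=False)
print("transitions selected for the update:", batch_idx)
```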
arXiv Detail & Related papers (2020-06-26T20:21:12Z) - Non-Stationary Off-Policy Optimization [50.41335279896062]
We study the novel problem of off-policy optimization in piecewise-stationary contextual bandits.
In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state.
In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance.
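A toy sketch of the two phases: offline, pick a near-optimal sub-policy per latent segment of the log; online, switch adaptively among the learned sub-policies based on observed reward. The two-segment bandit and the epsilon-greedy switching rule are illustrative assumptions, not the paper's algorithm.

```python
# Offline: one sub-policy (best arm) per latent segment. Online: adaptive switching.
import numpy as np

rng = np.random.default_rng(5)
n_arms = 3
seg_means = [np.array([0.8, 0.4, 0.2]), np.array([0.2, 0.3, 0.9])]  # two stationary segments

sub_policies = []
for means in seg_means:                                  # offline learning phase
    logged = rng.normal(means, 0.1, size=(500, n_arms))
    sub_policies.append(int(np.argmax(logged.mean(axis=0))))

# Online deployment: the environment currently matches the second segment.
true_means = seg_means[1]
value = np.zeros(len(sub_policies))
count = np.zeros(len(sub_policies))
for t in range(300):
    k = rng.integers(len(sub_policies)) if rng.random() < 0.1 else int(np.argmax(value))
    reward = rng.normal(true_means[sub_policies[k]], 0.1)
    count[k] += 1
    value[k] += (reward - value[k]) / count[k]           # running mean per sub-policy

print("sub-policies (best arm per latent state):", sub_policies)
print("estimated online values of each sub-policy:", np.round(value, 2))
```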
arXiv Detail & Related papers (2020-06-15T09:16:09Z) - Statistical Inference of the Value Function for Reinforcement Learning
in Infinite Horizon Settings [0.0]
We construct confidence intervals (CIs) for a policy's value in infinite horizon settings where the number of decision points diverges to infinity.
We show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique.
We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patients' health status.
arXiv Detail & Related papers (2020-01-13T19:42:40Z)