Stable and Efficient Policy Evaluation
- URL: http://arxiv.org/abs/2006.03978v2
- Date: Tue, 28 Dec 2021 03:15:05 GMT
- Title: Stable and Efficient Policy Evaluation
- Authors: Daoming Lyu, Bo Liu, Matthieu Geist, Wen Dong, Saad Biaz, Qi Wang
- Abstract summary: This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method.
The empirical results on various domains validate the effectiveness of the proposed approach.
- Score: 31.04376768927044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy evaluation algorithms are essential to reinforcement learning because they predict the performance of a policy. However, two long-standing issues in this prediction problem remain to be tackled: off-policy stability and on-policy efficiency. The conventional temporal difference (TD) algorithm is known to perform very well in the on-policy setting, yet it is not off-policy stable. On the other hand, the gradient TD and emphatic TD algorithms are off-policy stable, but are not on-policy efficient. This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method. Empirical results on various domains validate the effectiveness of the proposed approach.
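To make the on-policy versus off-policy distinction concrete, the sketch below shows linear semi-gradient TD(0) policy evaluation with an optional importance-sampling ratio for off-policy data. The environment, features, and policies are hypothetical, and the uncorrected off-policy update shown here is exactly the kind of rule that can be unstable, which is what gradient TD, emphatic TD, and the oblique-projection algorithms of this paper aim to remedy.

```python
import numpy as np

# Hypothetical setup: linear value estimate V(s) ~ w . phi(s) for a fixed
# target policy pi, using transitions generated by a behavior policy mu.
rng = np.random.default_rng(0)
n_features = 8

def phi(state):
    """Hypothetical one-hot feature map for an integer state index."""
    v = np.zeros(n_features)
    v[state % n_features] = 1.0
    return v

def td0_update(w, s, r, s_next, rho=1.0, alpha=0.05, gamma=0.99):
    """One semi-gradient TD(0) step on the weight vector w.

    rho = pi(a|s) / mu(a|s) is the importance-sampling ratio; rho = 1
    recovers the on-policy update.  With rho != 1 this plain update is
    not guaranteed to converge, which is the off-policy stability issue
    that gradient TD, emphatic TD, and oblique projections address.
    """
    td_error = r + gamma * w @ phi(s_next) - w @ phi(s)
    return w + alpha * rho * td_error * phi(s)

# Toy usage on a hypothetical random-walk stream of transitions.
w = np.zeros(n_features)
s = 0
for _ in range(1000):
    a = rng.integers(2)                # action from the behavior policy
    s_next = (s + (1 if a == 1 else -1)) % 20
    r = 1.0 if s_next == 0 else 0.0
    rho = 1.0                          # set to pi(a|s)/mu(a|s) for off-policy data
    w = td0_update(w, s, r, s_next, rho=rho)
    s = s_next
```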
Related papers
- Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline [47.16115174891401]
We propose an off-policy policy gradient method with the optimal action-dependent baseline (Off-OAB) to mitigate the high variance of the off-policy policy gradient estimator.
We evaluate the proposed Off-OAB method on six representative tasks from OpenAI Gym and MuJoCo, where it demonstrably surpasses state-of-the-art methods on the majority of these tasks.
arXiv Detail & Related papers (2024-05-04T05:21:28Z)
- Distillation Policy Optimization [5.439020425819001]
We introduce an actor-critic learning framework that harmonizes two data sources for both evaluation and control.
This framework incorporates variance reduction mechanisms, including a unified advantage estimator (UAE) and a residual baseline.
Our results show substantial improvements in sample efficiency for on-policy algorithms, effectively bridging the gap to off-policy approaches.
arXiv Detail & Related papers (2023-02-01T15:59:57Z)
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Proximal Policy Optimization Smoothed Algorithm [0.0]
We present a PPO variant, named Proximal Policy Optimization Smooth Algorithm (PPOS).
Its critical improvement is the use of a functional clipping method instead of a flat clipping method (a sketch of the standard flat clip is shown after this list).
We show that it outperforms the latest PPO variants on both performance and stability in challenging continuous control tasks.
arXiv Detail & Related papers (2020-12-04T07:43:50Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies.
arXiv Detail & Related papers (2020-06-06T15:08:24Z)
- Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
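For reference on the clipping discussion in the PPOS entry above, here is a minimal sketch of the standard PPO flat-clipping surrogate; the functional clipping used by PPOS replaces this flat clip and is not reproduced here. The batch values are purely illustrative.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO 'flat' clipped surrogate objective (to be maximized).

    ratio     = pi_new(a|s) / pi_old(a|s)
    advantage = estimated advantage A(s, a)
    Clipping the ratio to [1 - eps, 1 + eps] flattens the objective outside
    that range; PPOS swaps this flat clip for a functional clipping rule
    (not shown here).
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# Hypothetical usage on a small batch of samples.
ratios = np.array([0.7, 1.0, 1.5])
advantages = np.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(ratios, advantages).mean())
```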