Robust Action Gap Increasing with Clipped Advantage Learning
- URL: http://arxiv.org/abs/2203.11677v1
- Date: Sun, 20 Mar 2022 03:41:26 GMT
- Title: Robust Action Gap Increasing with Clipped Advantage Learning
- Authors: Zhe Zhang, Yaozhong Gan, Xiaoyang Tan
- Abstract summary: We present a novel method, named clipped Advantage Learning (clipped AL), to address the mismatch between the optimal action induced by an approximated value function and the true optimal action.
Our simple clipped AL operator not only enjoys a fast convergence guarantee but also retains proper action gaps, achieving a good balance between large action gaps and fast convergence.
- Score: 20.760987175553645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advantage Learning (AL) seeks to increase the action gap between the optimal
action and its competitors, so as to improve the robustness to estimation
errors. However, the method becomes problematic when the optimal action induced
by the approximated value function does not agree with the true optimal action.
In this paper, we present a novel method, named clipped Advantage Learning
(clipped AL), to address this issue. The method is inspired by our observation
that blindly increasing the action gap for all given samples, without taking
their necessity into account, accumulates more errors in the performance loss
bound and slows value convergence; to avoid this, the advantage value should
be adjusted adaptively. We show that our simple clipped AL operator not only
enjoys a fast convergence guarantee but also retains proper action gaps,
achieving a good balance between large action gaps and fast convergence. The
feasibility and effectiveness of the proposed method are verified empirically
on several RL benchmarks, with promising results.
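
To make the mechanism concrete, here is a minimal tabular sketch of an AL-style backup with a hypothetical clip on the advantage correction. The uncorrected operator form is standard Advantage Learning; the specific clipping rule (a simple cap, `clip`) is our assumption, since the abstract does not spell out how the advantage value is adapted.

```python
import numpy as np

def clipped_al_backup(Q, s, a, r, s_next, gamma=0.99, alpha=0.9, clip=1.0):
    # Standard Bellman target: r + gamma * max_a' Q(s', a').
    bellman_target = r + gamma * np.max(Q[s_next])
    # AL correction: alpha * (V(s) - Q(s, a)) pushes non-greedy actions
    # down, widening the action gap (the gap is zero for the greedy action).
    advantage_gap = np.max(Q[s]) - Q[s, a]
    # Hypothetical clipping (our assumption, not the paper's exact rule):
    # cap the per-sample gap-increasing correction so unnecessarily large
    # gaps do not accumulate error in the performance loss bound.
    return bellman_target - alpha * min(advantage_gap, clip)

# Toy usage: 3 states, 2 actions.
Q = np.zeros((3, 2))
Q[0] = [1.0, 0.2]
target = clipped_al_backup(Q, s=0, a=1, r=0.5, s_next=1)
Q[0, 1] += 0.1 * (target - Q[0, 1])  # TD-style update toward the target
```

Because the greedy action's gap is zero, the clip only limits how hard any single sample can push a non-greedy action down.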
Related papers
- Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment.
We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z)
- Optimal convex $M$-estimation via score matching [6.115859302936817]
We construct a data-driven convex loss function with respect to which empirical risk minimisation yields optimal variance in the downstream estimation of the regression coefficients.
Our semiparametric approach targets the best decreasing approximation of the derivative of the log-density of the noise distribution.
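
For context, a minimal formalisation of the idea as we read it from this summary (the notation is ours): with regression noise density $f$, pick the convex loss whose derivative is the negative of the best decreasing approximation $\hat{\psi}$ of the score $(\log f)'$, and minimise empirical risk:

$$\hat{\beta} \in \arg\min_{\beta} \sum_{i=1}^{n} \ell\big(Y_i - X_i^\top \beta\big), \qquad \ell'(t) = -\,\hat{\psi}(t).$$

Requiring $\hat{\psi}$ to be decreasing makes $\ell'$ nondecreasing, hence $\ell$ convex, which keeps the $M$-estimation problem tractable.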
arXiv Detail & Related papers (2024-03-25T12:23:19Z)
- Smoothing Advantage Learning [20.760987175553645]
We propose a simple variant of Advantage Learning (AL), named Smoothing Advantage Learning (SAL).
The proposed value smoothing technique not only helps stabilize the training procedure of AL by controlling the trade-off between the convergence rate and the upper bound of the approximation errors, but also helps increase the action gap between the optimal and sub-optimal action values.
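
Read alongside the clipped AL operator above, one plausible form of the SAL backup (our formalisation; the exact smoothing scheme is an assumption, not stated in this summary) smooths the value estimates used in the AL correction:

$$\mathcal{T}_{\mathrm{SAL}}Q(s,a) \;=\; \mathcal{T}Q(s,a) \;-\; \alpha\big(\bar{V}(s) - \bar{Q}(s,a)\big), \qquad \bar{Q} \leftarrow (1-\beta)\,\bar{Q} + \beta\,Q,$$

where $\bar{V}(s) = \max_{a'} \bar{Q}(s,a')$, $\mathcal{T}$ is the Bellman optimality operator, and the smoothing rate $\beta \in (0,1]$ controls the trade-off between convergence rate and the approximation-error bound mentioned above.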
arXiv Detail & Related papers (2022-03-20T03:52:32Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
- Direct Advantage Estimation [63.52264764099532]
We show that the expected return may depend on the policy in an undesirable way, which could slow down learning.
We propose Direct Advantage Estimation (DAE), a novel method that can model the advantage function and estimate it directly from data.
If desired, value functions can also be seamlessly integrated into DAE and be updated in a similar way to Temporal Difference Learning.
arXiv Detail & Related papers (2021-09-13T16:09:31Z)
- Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the one-class nature of the problem: only positive feedback is observed.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z)
- Fast Rates for Contextual Linear Optimization [52.39202699484225]
We show that a naive plug-in approach achieves regret convergence rates that are significantly faster than methods that directly optimize downstream decision performance.
Our results are overall positive for practice: predictive models are easy and fast to train using existing tools, simple to interpret, and, as we show, lead to decisions that perform very well.
arXiv Detail & Related papers (2020-11-05T18:43:59Z)
- BERT Loses Patience: Fast and Robust Inference with Early Exit [91.26199404912019]
We propose Patience-based Early Exit as a plug-and-play technique to improve the efficiency and robustness of a pretrained language model.
Our approach improves inference efficiency as it allows the model to make a prediction with fewer layers.
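
A rough sketch of the mechanism the title suggests, under our assumptions (one internal classifier per layer; exit once the predicted class is unchanged for `patience` consecutive layers; the PyTorch setup is illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

def patience_early_exit(x, layers, classifiers, patience=2):
    """Run layers until `patience` consecutive internal classifiers
    agree on the predicted class, then stop, skipping deeper layers."""
    streak, prev_pred, logits = 0, None, None
    for layer, clf in zip(layers, classifiers):
        x = layer(x)
        logits = clf(x)
        pred = logits.argmax(dim=-1)
        # Count consecutive layers whose prediction matches the previous one.
        streak = streak + 1 if prev_pred is not None and torch.equal(pred, prev_pred) else 0
        prev_pred = pred
        if streak >= patience:
            break  # early exit: the prediction has stabilised
    return logits

# Toy usage with 6 identical-width layers (hypothetical dimensions).
layers = nn.ModuleList(nn.Linear(16, 16) for _ in range(6))
classifiers = nn.ModuleList(nn.Linear(16, 3) for _ in range(6))
out = patience_early_exit(torch.randn(1, 16), layers, classifiers)
```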
arXiv Detail & Related papers (2020-06-07T13:38:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.