Direct Advantage Estimation
- URL: http://arxiv.org/abs/2109.06093v1
- Date: Mon, 13 Sep 2021 16:09:31 GMT
- Title: Direct Advantage Estimation
- Authors: Hsiao-Ru Pan, Nico Gürtler, Alexander Neitz, Bernhard Schölkopf
- Abstract summary: We show that the expected return may depend on the policy in an undesirable way, which could slow down learning.
We propose the Direct Advantage Estimation (DAE), a novel method that can model the advantage function and estimate it directly from data.
If desired, value functions can also be seamlessly integrated into DAE and be updated in a similar way to Temporal Difference Learning.
- Score: 63.52264764099532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Credit assignment is one of the central problems in reinforcement learning.
The predominant approach is to assign credit based on the expected return.
However, we show that the expected return may depend on the policy in an
undesirable way, which could slow down learning. Instead, we borrow ideas from
the causality literature and show that the advantage function can be
interpreted as a causal effect, which shares similar properties with causal
representations. Based on this insight, we propose the Direct Advantage
Estimation (DAE), a novel method that can model the advantage function and
estimate it directly from data without requiring the (action-)value function.
If desired, value functions can also be seamlessly integrated into DAE and be
updated in a similar way to Temporal Difference Learning. The proposed method
is easy to implement and can be readily adopted by modern actor-critic methods.
We test DAE empirically on the Atari domain and show that it can achieve
competitive results with the state-of-the-art method for advantage estimation.
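A minimal, hypothetical sketch of how such an objective could look in code is given below, assuming a PyTorch-style setup: an advantage network whose outputs are centered under the current policy (so that the policy-weighted mean advantage is zero in every state) is fit, together with a value head, by regressing a multi-step return onto the discounted sum of advantages plus the initial state value. The names (adv_net, value_net, dae_style_loss) and the exact loss form are assumptions drawn from the abstract, not the authors' implementation.

```python
# Hypothetical DAE-style loss sketch (not the authors' code); assumes a
# discrete-action setting and an on-policy trajectory segment of length T.
import torch


def dae_style_loss(adv_net, value_net, pi_probs, states, actions, rewards,
                   final_state, gamma=0.99):
    """states: [T, obs_dim], actions: int64 [T], rewards: float [T],
    pi_probs: float [T, num_actions], final_state: [1, obs_dim]."""
    raw = adv_net(states)                                  # [T, num_actions]
    # Centering constraint: subtract the policy-weighted mean so that
    # E_{a~pi}[A(s, a)] = 0 holds in every visited state.
    adv = raw - (pi_probs * raw).sum(dim=1, keepdim=True)
    adv_taken = adv.gather(1, actions.unsqueeze(1)).squeeze(1)  # A(s_t, a_t)

    T = rewards.shape[0]
    disc = gamma ** torch.arange(T, dtype=torch.float32)
    # Multi-step return with a detached (TD-style) bootstrapped tail value.
    ret = (disc * rewards).sum() + gamma ** T * value_net(final_state).squeeze().detach()
    # Decompose the return into per-step advantages plus the initial value.
    pred = (disc * adv_taken).sum() + value_net(states[:1]).squeeze()
    return (ret - pred) ** 2
```

Detaching the bootstrapped tail value in the target is one way to read the abstract's remark that value functions are updated in a similar way to Temporal Difference learning; an actor-critic method could then use the centered advantage outputs in place of a conventional advantage estimator.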
Related papers
- Skill or Luck? Return Decomposition via Advantage Functions [15.967056781224102]
Learning from off-policy data is essential for sample-efficient reinforcement learning.
We show that the advantage function can be understood as the causal effect of an action on the return.
This decomposition enables us to naturally extend Direct Advantage Estimation to off-policy settings.
arXiv Detail & Related papers (2024-02-20T10:09:00Z) - Evaluation of Active Feature Acquisition Methods for Time-varying Feature Settings [6.082810456767599]
Machine learning methods often assume that input features are available at no cost.
In domains like healthcare, where acquiring features can be expensive or harmful, it is necessary to balance a feature's acquisition cost against its predictive value.
We present the problem of active feature acquisition performance evaluation (AFAPE).
arXiv Detail & Related papers (2023-12-03T23:08:29Z) - Online non-parametric likelihood-ratio estimation by Pearson-divergence
functional minimization [55.98760097296213]
We introduce a new framework for online non-parametric LRE (OLRE) for the setting where pairs of i.i.d. observations $(x_t \sim p, x'_t \sim q)$ are observed over time.
We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.
arXiv Detail & Related papers (2023-11-03T13:20:11Z) - Explaining Adverse Actions in Credit Decisions Using Shapley
Decomposition [8.003221404049905]
This paper focuses on credit decisions based on a predictive model for probability of default and proposes a methodology for adverse action explanation.
We consider models with low-order interactions and develop a simple and intuitive approach based on first principles.
Unlike other Shapley techniques in the literature for local interpretability of machine learning results, B-Shap is computationally tractable.
arXiv Detail & Related papers (2022-04-26T15:07:15Z) - A Generalized Bootstrap Target for Value-Learning, Efficiently Combining
Value and Feature Predictions [39.17511693008055]
Estimating value functions is a core component of reinforcement learning algorithms.
We focus on bootstrapping targets used when estimating value functions.
We propose a new backup target, the $\eta$-return mixture.
arXiv Detail & Related papers (2022-01-05T21:54:55Z) - Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the one-class nature of the problem: only positive examples are observed.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z) - DEALIO: Data-Efficient Adversarial Learning for Imitation from
Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm
arXiv Detail & Related papers (2021-03-31T23:46:32Z) - Accurate and Robust Feature Importance Estimation under Distribution
Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z) - Value-driven Hindsight Modelling [68.658900923595]
Value estimation is a critical component of the reinforcement learning (RL) paradigm.
Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function.
We develop an approach for representation learning in RL that sits in between these two extremes.
This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function.
arXiv Detail & Related papers (2020-02-19T18:10:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.