Skill or Luck? Return Decomposition via Advantage Functions
- URL: http://arxiv.org/abs/2402.12874v1
- Date: Tue, 20 Feb 2024 10:09:00 GMT
- Title: Skill or Luck? Return Decomposition via Advantage Functions
- Authors: Hsiao-Ru Pan, Bernhard Schölkopf
- Abstract summary: Learning from off-policy data is essential for sample-efficient reinforcement learning.
We show that the advantage function can be understood as the causal effect of an action on the return.
This decomposition enables us to naturally extend Direct Advantage Estimation to off-policy settings.
- Score: 15.967056781224102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from off-policy data is essential for sample-efficient reinforcement
learning. In the present work, we build on the insight that the advantage
function can be understood as the causal effect of an action on the return, and
show that this allows us to decompose the return of a trajectory into parts
caused by the agent's actions (skill) and parts outside of the agent's control
(luck). Furthermore, this decomposition enables us to naturally extend Direct
Advantage Estimation (DAE) to off-policy settings (Off-policy DAE). The
resulting method can learn from off-policy trajectories without relying on
importance sampling techniques or truncating off-policy actions. We draw
connections between Off-policy DAE and previous methods to demonstrate how it
can speed up learning and when the proposed off-policy corrections are
important. Finally, we use the MinAtar environments to illustrate how ignoring
off-policy corrections can lead to suboptimal policy optimization performance.
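The decomposition the abstract describes can be made concrete with a telescoping identity: the one-step TD error splits into the advantage of the chosen action (the part caused by the agent) and a term with zero mean given the state and action (the part caused by the environment), so the discounted return of an episode equals V(s_0) plus a discounted sum of advantages ("skill") plus a discounted sum of zero-mean noise terms ("luck"). The snippet below is a minimal sketch of that identity, not the authors' implementation; the function name, the (s, a, r, s_next, done) tuple layout, and the tabular dictionaries V and Q are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's code): split a logged
# trajectory's discounted return into a "skill" part (advantages of the
# chosen actions) and a "luck" part (zero-mean environment noise), assuming
# V[s] and Q[s][a] hold value estimates for the policy being evaluated.
def decompose_return(trajectory, V, Q, gamma=0.99):
    """trajectory: list of (s, a, r, s_next, done) tuples."""
    ret, skill, luck = 0.0, 0.0, 0.0
    for t, (s, a, r, s_next, done) in enumerate(trajectory):
        ret += gamma ** t * r
        advantage = Q[s][a] - V[s]            # causal effect of the action: "skill"
        next_v = 0.0 if done else V[s_next]
        noise = r + gamma * next_v - Q[s][a]  # zero mean given (s, a): "luck"
        skill += gamma ** t * advantage
        luck += gamma ** t * noise
    # With exact V and Q and a terminated episode, ret == V[s_0] + skill + luck.
    return ret, V[trajectory[0][0]], skill, luck
```

Because the luck terms have zero expectation given each state-action pair, this split is one reasonable reading of how the paper can work with off-policy trajectories without importance sampling, though the exact off-policy correction is detailed only in the paper itself.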
Related papers
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Flow to Control: Offline Reinforcement Learning with Lossless Primitive Discovery [31.49638957903016]
Offline reinforcement learning (RL) enables the agent to effectively learn from logged data.
We show that our method has good representation ability for policies and achieves superior performance in most tasks.
arXiv Detail & Related papers (2022-12-02T11:35:51Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Direct Advantage Estimation [63.52264764099532]
We show that the expected return may depend on the policy in an undesirable way which could slow down learning.
We propose the Direct Advantage Estimation (DAE), a novel method that can model the advantage function and estimate it directly from data.
If desired, value functions can also be seamlessly integrated into DAE and updated in a manner similar to Temporal Difference Learning (a minimal sketch of the zero-mean constraint on advantages appears after this list).
arXiv Detail & Related papers (2021-09-13T16:09:31Z)
- APS: Active Pretraining with Successor Features [96.24533716878055]
We show that by reinterpreting and combining successor features (Hansen et al.) with nonparametric entropy maximization, the intractable mutual information can be efficiently optimized.
The proposed method, Active Pretraining with Successor Features (APS), explores the environment via nonparametric entropy maximization, and the explored data can be efficiently leveraged to learn behaviors.
arXiv Detail & Related papers (2021-08-31T16:30:35Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Self-Imitation Advantage Learning [43.8107780378031]
Self-imitation learning is a Reinforcement Learning method that encourages actions whose returns were higher than expected.
We propose a novel generalization of self-imitation learning for off-policy RL, based on a modification of the Bellman optimality operator.
arXiv Detail & Related papers (2020-12-22T13:21:50Z)
- Faded-Experience Trust Region Policy Optimization for Model-Free Power Allocation in Interference Channel [28.618312473850974]
Policy-based reinforcement learning techniques enable an agent to learn an optimal action policy through interactions with the environment.
Inspired by the human decision-making approach, we work toward enhancing convergence speed by enabling the agent to memorize and reuse recently learned policies.
Results indicate that with FE-TRPO it is possible to almost double the learning speed compared to TRPO.
arXiv Detail & Related papers (2020-08-04T17:12:29Z)
- Data-efficient Hindsight Off-policy Option Learning [20.42535406663446]
We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm.
It robustly trains all policy components off-policy and end-to-end.
The approach outperforms existing option learning methods on common benchmarks.
arXiv Detail & Related papers (2020-07-30T16:52:33Z)
- Off-Policy Adversarial Inverse Reinforcement Learning [0.0]
Adversarial Imitation Learning (AIL) is a class of algorithms in reinforcement learning (RL).
We propose an Off-Policy Adversarial Inverse Reinforcement Learning (Off-policy-AIRL) algorithm which is sample-efficient and achieves good imitation performance.
arXiv Detail & Related papers (2020-05-03T16:51:40Z)
- Efficient Deep Reinforcement Learning via Adaptive Policy Transfer [50.51637231309424]
A Policy Transfer Framework (PTF) is proposed to accelerate reinforcement learning (RL).
Our framework learns when and which source policy is the best to reuse for the target policy and when to terminate it.
Experimental results show it significantly accelerates the learning process and surpasses state-of-the-art policy transfer methods.
arXiv Detail & Related papers (2020-02-19T07:30:57Z)
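Since Direct Advantage Estimation is the method the headline paper extends, one detail worth recalling from the DAE entry above is that advantage functions satisfy the defining property E_{a~pi}[A(s, a)] = 0, and DAE constrains its learned estimates accordingly. The snippet below is a hedged sketch of one common way to enforce such a constraint (centering unconstrained network outputs by their policy-weighted mean); the function name and tensor shapes are assumptions, not the DAE reference implementation.

```python
# Hedged sketch (assumed names and shapes, not the DAE authors' code):
# enforce sum_a pi(a|s) * A_hat(s, a) = 0 by subtracting the policy-weighted
# mean of raw network scores.
import torch

def centered_advantage(raw_scores: torch.Tensor, pi_probs: torch.Tensor) -> torch.Tensor:
    """raw_scores, pi_probs: shape (batch, num_actions); pi_probs sums to 1 per row."""
    baseline = (pi_probs * raw_scores).sum(dim=-1, keepdim=True)  # E_{a~pi}[f(s, a)]
    return raw_scores - baseline  # policy-weighted mean is zero by construction
```

Centering in this way satisfies the zero-mean property for any raw scores, so the constraint does not need to be enforced with an extra penalty term.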
This list is automatically generated from the titles and abstracts of the papers listed on this site.