META-Learning Eligibility Traces for More Sample Efficient Temporal
Difference Learning
- URL: http://arxiv.org/abs/2006.08906v1
- Date: Tue, 16 Jun 2020 03:41:07 GMT
- Title: META-Learning Eligibility Traces for More Sample Efficient Temporal
Difference Learning
- Authors: Mingde Zhao
- Abstract summary: We propose a meta-learning method for adjusting the eligibility trace parameter, in a state-dependent manner.
The adaptation is achieved with the help of auxiliary learners that learn distributional information about the update targets online.
We prove that, under some assumptions, the proposed method improves the overall quality of the update targets, by minimizing the overall target error.
- Score: 2.0559497209595823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal-Difference (TD) learning is a standard and very successful
reinforcement learning approach, at the core of both algorithms that learn the
value of a given policy, as well as algorithms which learn how to improve
policies. TD-learning with eligibility traces provides a way to do temporal
credit assignment, i.e. decide which portion of a reward should be assigned to
predecessor states that occurred at different previous times, controlled by a
parameter $\lambda$. However, tuning this parameter can be time-consuming, and
not tuning it can lead to inefficient learning. To improve the sample
efficiency of TD-learning, we propose a meta-learning method for adjusting the
eligibility trace parameter, in a state-dependent manner. The adaptation is
achieved with the help of auxiliary learners that learn distributional
information about the update targets online, incurring roughly the same
computational complexity per step as the usual value learner. Our approach can
be used both in on-policy and off-policy learning. We prove that, under some
assumptions, the proposed method improves the overall quality of the update
targets, by minimizing the overall target error. This method can be viewed as a
plugin which can also be used to assist prediction with function approximation
by meta-learning feature (observation)-based $\lambda$ online, or even in the
control case to assist policy improvement. Our empirical evaluation
demonstrates significant performance improvements, as well as improved
robustness of the proposed algorithm to learning rate variation.
Related papers
- Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [0.9549646359252346]
We propose dynamic Learning Rate for deep Reinforcement Learning (LRRL)
LRRL is a meta-learning approach that selects the learning rate based on the agent's performance during training.
Our empirical results demonstrate that LRRL can substantially improve the performance of deep RL algorithms.
arXiv Detail & Related papers (2024-10-16T14:15:28Z) - Distillation Policy Optimization [5.439020425819001]
We introduce an actor-critic learning framework that harmonizes two data sources for both evaluation and control.
This framework incorporates variance reduction mechanisms, including a unified advantage estimator (UAE) and a residual baseline.
Our results showcase substantial enhancements in sample efficiency for on-policy algorithms, effectively bridging the gap to the off-policy approaches.
arXiv Detail & Related papers (2023-02-01T15:59:57Z) - Stabilizing Q-learning with Linear Architectures for Provably Efficient
Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
arXiv Detail & Related papers (2022-06-01T23:26:51Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem)
AdaRem adjusts the parameter-wise learning rate according to whether the direction of one parameter changes in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Meta-learning the Learning Trends Shared Across Tasks [123.10294801296926]
Gradient-based meta-learning algorithms excel at quick adaptation to new tasks with limited data.
Existing meta-learning approaches only depend on the current task information during the adaptation.
We propose a 'Path-aware' model-agnostic meta-learning approach.
arXiv Detail & Related papers (2020-10-19T08:06:47Z) - Discovering Reinforcement Learning Algorithms [53.72358280495428]
Reinforcement learning algorithms update an agent's parameters according to one of several possible rules.
This paper introduces a new meta-learning approach that discovers an entire update rule.
It includes both 'what to predict' (e.g. value functions) and 'how to learn from it' by interacting with a set of environments.
arXiv Detail & Related papers (2020-07-17T07:38:39Z) - Meta-Gradient Reinforcement Learning with an Objective Discovered Online [54.15180335046361]
We propose an algorithm based on meta-gradient descent that discovers its own objective, flexibly parameterised by a deep neural network.
Because the objective is discovered online, it can adapt to changes over time.
On the Atari Learning Environment, the meta-gradient algorithm adapts over time to learn with greater efficiency.
arXiv Detail & Related papers (2020-07-16T16:17:09Z) - Gradient Monitored Reinforcement Learning [0.0]
We focus on the enhancement of training and evaluation performance in reinforcement learning algorithms.
We propose an approach to steer the learning in the weight parameters of a neural network based on the dynamic development and feedback from the training process itself.
arXiv Detail & Related papers (2020-05-25T13:45:47Z) - Deep Reinforcement Learning for Adaptive Learning Systems [4.8685842576962095]
We formulate the problem of how to find an individualized learning plan based on learner's latent traits.
We apply a model-free deep reinforcement learning algorithm that can effectively find the optimal learning policy.
We also develop a transition model estimator that emulates the learner's learning process using neural networks.
arXiv Detail & Related papers (2020-04-17T18:04:03Z) - Hierarchical Variational Imitation Learning of Control Programs [131.7671843857375]
We propose a variational inference method for imitation learning of a control policy represented by parametrized hierarchical procedures (PHP)
Our method discovers the hierarchical structure in a dataset of observation-action traces of teacher demonstrations, by learning an approximate posterior distribution over the latent sequence of procedure calls and terminations.
We demonstrate a novel benefit of variational inference in the context of hierarchical imitation learning: in decomposing the policy into simpler procedures, inference can leverage acausal information that is unused by other methods.
arXiv Detail & Related papers (2019-12-29T08:57:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.