META-Learning Eligibility Traces for More Sample Efficient Temporal
Difference Learning
- URL: http://arxiv.org/abs/2006.08906v1
- Date: Tue, 16 Jun 2020 03:41:07 GMT
- Title: META-Learning Eligibility Traces for More Sample Efficient Temporal
Difference Learning
- Authors: Mingde Zhao
- Abstract summary: We propose a meta-learning method for adjusting the eligibility trace parameter, in a state-dependent manner.
The adaptation is achieved with the help of auxiliary learners that learn distributional information about the update targets online.
We prove that, under some assumptions, the proposed method improves the overall quality of the update targets, by minimizing the overall target error.
- Score: 2.0559497209595823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal-Difference (TD) learning is a standard and very successful
reinforcement learning approach, at the core of both algorithms that learn the
value of a given policy, as well as algorithms which learn how to improve
policies. TD-learning with eligibility traces provides a way to do temporal
credit assignment, i.e. decide which portion of a reward should be assigned to
predecessor states that occurred at different previous times, controlled by a
parameter $\lambda$. However, tuning this parameter can be time-consuming, and
not tuning it can lead to inefficient learning. To improve the sample
efficiency of TD-learning, we propose a meta-learning method for adjusting the
eligibility trace parameter, in a state-dependent manner. The adaptation is
achieved with the help of auxiliary learners that learn distributional
information about the update targets online, incurring roughly the same
computational complexity per step as the usual value learner. Our approach can
be used both in on-policy and off-policy learning. We prove that, under some
assumptions, the proposed method improves the overall quality of the update
targets, by minimizing the overall target error. This method can be viewed as a
plugin which can also be used to assist prediction with function approximation
by meta-learning feature (observation)-based $\lambda$ online, or even in the
control case to assist policy improvement. Our empirical evaluation
demonstrates significant performance improvements, as well as improved
robustness of the proposed algorithm to learning rate variation.
Related papers
- Online inductive learning from answer sets for efficient reinforcement learning exploration [52.03682298194168]
We exploit inductive learning of answer set programs to learn a set of logical rules representing an explainable approximation of the agent policy.
We then perform answer set reasoning on the learned rules to guide the exploration of the learning agent at the next batch.
Our methodology produces a significant boost in the discounted return achieved by the agent, even in the first batches of training.
arXiv Detail & Related papers (2025-01-13T16:13:22Z) - Online Reinforcement Learning-Based Dynamic Adaptive Evaluation Function for Real-Time Strategy Tasks [5.115170525117103]
Effective evaluation of real-time strategy tasks requires adaptive mechanisms to cope with dynamic and unpredictable environments.
This study proposes a method to improve evaluation functions for real-time responsiveness to battle-field situation changes.
arXiv Detail & Related papers (2025-01-07T14:36:33Z) - Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [0.9549646359252346]
In deep Reinforcement Learning (RL) models trained using gradient-based techniques, the choice of gradient and its learning rate are crucial to achieving good performance.
We propose dynamic Learning Rate for deep Reinforcement Learning (LRRL), a meta-learning approach that selects the learning rate based on the agent's performance during training.
arXiv Detail & Related papers (2024-10-16T14:15:28Z) - Distillation Policy Optimization [5.439020425819001]
We introduce an actor-critic learning framework that harmonizes two data sources for both evaluation and control.
This framework incorporates variance reduction mechanisms, including a unified advantage estimator (UAE) and a residual baseline.
Our results showcase substantial enhancements in sample efficiency for on-policy algorithms, effectively bridging the gap to the off-policy approaches.
arXiv Detail & Related papers (2023-02-01T15:59:57Z) - Stabilizing Q-learning with Linear Architectures for Provably Efficient
Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
arXiv Detail & Related papers (2022-06-01T23:26:51Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem)
AdaRem adjusts the parameter-wise learning rate according to whether the direction of one parameter changes in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Meta-learning the Learning Trends Shared Across Tasks [123.10294801296926]
Gradient-based meta-learning algorithms excel at quick adaptation to new tasks with limited data.
Existing meta-learning approaches only depend on the current task information during the adaptation.
We propose a 'Path-aware' model-agnostic meta-learning approach.
arXiv Detail & Related papers (2020-10-19T08:06:47Z) - Discovering Reinforcement Learning Algorithms [53.72358280495428]
Reinforcement learning algorithms update an agent's parameters according to one of several possible rules.
This paper introduces a new meta-learning approach that discovers an entire update rule.
It includes both 'what to predict' (e.g. value functions) and 'how to learn from it' by interacting with a set of environments.
arXiv Detail & Related papers (2020-07-17T07:38:39Z) - Deep Reinforcement Learning for Adaptive Learning Systems [4.8685842576962095]
We formulate the problem of how to find an individualized learning plan based on learner's latent traits.
We apply a model-free deep reinforcement learning algorithm that can effectively find the optimal learning policy.
We also develop a transition model estimator that emulates the learner's learning process using neural networks.
arXiv Detail & Related papers (2020-04-17T18:04:03Z) - Hierarchical Variational Imitation Learning of Control Programs [131.7671843857375]
We propose a variational inference method for imitation learning of a control policy represented by parametrized hierarchical procedures (PHP)
Our method discovers the hierarchical structure in a dataset of observation-action traces of teacher demonstrations, by learning an approximate posterior distribution over the latent sequence of procedure calls and terminations.
We demonstrate a novel benefit of variational inference in the context of hierarchical imitation learning: in decomposing the policy into simpler procedures, inference can leverage acausal information that is unused by other methods.
arXiv Detail & Related papers (2019-12-29T08:57:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.