A Generalized Bootstrap Target for Value-Learning, Efficiently Combining
Value and Feature Predictions
- URL: http://arxiv.org/abs/2201.01836v1
- Date: Wed, 5 Jan 2022 21:54:55 GMT
- Title: A Generalized Bootstrap Target for Value-Learning, Efficiently Combining
Value and Feature Predictions
- Authors: Anthony GX-Chen, Veronica Chelu, Blake A. Richards, Joelle Pineau
- Abstract summary: Estimating value functions is a core component of reinforcement learning algorithms.
We focus on bootstrapping targets used when estimating value functions.
We propose a new backup target, the $\eta$-return mixture.
- Score: 39.17511693008055
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Estimating value functions is a core component of reinforcement learning
algorithms. Temporal difference (TD) learning algorithms use bootstrapping,
i.e. they update the value function toward a learning target using value
estimates at subsequent time-steps. Alternatively, the value function can be
updated toward a learning target constructed by separately predicting successor
features (SF)--a policy-dependent model--and linearly combining them with
instantaneous rewards. We focus on bootstrapping targets used when estimating
value functions, and propose a new backup target, the $\eta$-return mixture,
which implicitly combines value-predictive knowledge (used by TD methods) with
(successor) feature-predictive knowledge--with a parameter $\eta$ capturing how
much to rely on each. We illustrate that incorporating predictive knowledge
through an $\eta\gamma$-discounted SF model makes more efficient use of sampled
experience, compared to either extreme, i.e. bootstrapping entirely on the
value function estimate, or bootstrapping on the product of separately
estimated successor features and instantaneous reward models. We empirically
show this approach leads to faster policy evaluation and better control
performance, for tabular and nonlinear function approximations, indicating
scalability and generality.
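To make the idea concrete, below is a minimal tabular sketch of a blended bootstrap target, assuming illustrative names (V for the value table, psi for the successor-feature table, w_r for the learned reward weights). It shows a simple convex interpolation between the two extremes described in the abstract; the paper's actual $\eta\gamma$-discounted SF construction is more involved, so this is a conceptual approximation rather than the authors' exact target.
```python
import numpy as np

def mixed_bootstrap_target(r, s_next, V, psi, w_r, gamma, eta, done):
    """One-step backup target interpolating between two bootstraps:
    eta = 0 -> bootstrap entirely on the value estimate V[s_next] (standard TD),
    eta = 1 -> bootstrap on the SF-based value psi[s_next] @ w_r
               (separately estimated successor features times reward weights).
    All names here are illustrative, not the paper's notation."""
    if done:
        return r
    value_bootstrap = V[s_next]        # value-predictive knowledge (TD)
    sf_bootstrap = psi[s_next] @ w_r   # feature-predictive knowledge (SF x reward model)
    return r + gamma * ((1.0 - eta) * value_bootstrap + eta * sf_bootstrap)

# Toy usage with made-up shapes: 5 states, 3 features.
V = np.zeros(5)
psi = np.zeros((5, 3))
w_r = np.zeros(3)
target = mixed_bootstrap_target(r=1.0, s_next=2, V=V, psi=psi, w_r=w_r,
                                gamma=0.99, eta=0.5, done=False)
# A TD-style tabular update toward the mixed target could then be:
# V[s] += alpha * (target - V[s])
```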
Related papers
- Tractable and Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation [8.378137704007038]
We present a regret analysis for distributional reinforcement learning with general value function approximation.
Our theoretical results show that approximating the infinite-dimensional return distribution with a finite number of moment functionals is the only method that learns the statistical information without bias.
arXiv Detail & Related papers (2024-07-31T00:43:51Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Learning to Rank for Active Learning via Multi-Task Bilevel Optimization [29.207101107965563]
We propose a novel approach for active learning, which aims to select batches of unlabeled instances through a learned surrogate model for data acquisition.
A key challenge in this approach is developing an acquisition function that generalizes well, as the history of data, which forms part of the utility function's input, grows over time.
arXiv Detail & Related papers (2023-10-25T22:50:09Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important for solving sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Model Predictive Control with Self-supervised Representation Learning [13.225264876433528]
We propose the use of a reconstruction function within the TD-MPC framework, so that the agent can reconstruct the original observation.
Our proposed addition of another loss term leads to improved performance on both state- and image-based tasks.
arXiv Detail & Related papers (2023-04-14T16:02:04Z)
- Accelerating Policy Gradient by Estimating Value Function from Prior Computation in Deep Reinforcement Learning [16.999444076456268]
We investigate the use of prior computation to estimate the value function to improve sample efficiency in on-policy policy gradient methods.
In particular, we learn a new value function for the target task while combining it with a value estimate from the prior.
The resulting value function is used as a baseline in the policy gradient method (a generic sketch of this baseline idea appears after this list).
arXiv Detail & Related papers (2023-02-02T20:23:22Z)
- Direct Advantage Estimation [63.52264764099532]
We show that the expected return may depend on the policy in an undesirable way, which could slow down learning.
We propose the Direct Advantage Estimation (DAE), a novel method that can model the advantage function and estimate it directly from data.
If desired, value functions can also be seamlessly integrated into DAE and be updated in a similar way to Temporal Difference Learning.
arXiv Detail & Related papers (2021-09-13T16:09:31Z)
- Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI)
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
arXiv Detail & Related papers (2020-06-11T16:49:23Z)
- Value-driven Hindsight Modelling [68.658900923595]
Value estimation is a critical component of the reinforcement learning (RL) paradigm.
Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function.
We develop an approach for representation learning in RL that sits in between these two extremes.
This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function.
arXiv Detail & Related papers (2020-02-19T18:10:20Z)
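As referenced in the entry on accelerating policy gradient with prior computation, below is a generic, hedged sketch of how a value estimate carried over from prior computation might be combined with a newly learned one and used as a policy-gradient baseline. The combination rule, the weighting beta, and all names (combined_baseline, v_prior, advantage_signal) are illustrative assumptions, not that paper's actual method.
```python
import numpy as np

def combined_baseline(v_new, v_prior, beta=0.5):
    """Blend a newly learned value estimate with a value estimate obtained
    from prior computation. The convex weight 'beta' is an illustrative
    choice, not the combination rule used in the referenced paper."""
    return beta * np.asarray(v_new) + (1.0 - beta) * np.asarray(v_prior)

def advantage_signal(returns, v_new, v_prior):
    """Monte Carlo returns minus the combined baseline; this difference is
    what scales the log-probability gradient in a REINFORCE-style update."""
    return np.asarray(returns) - combined_baseline(v_new, v_prior)

# Toy usage with made-up numbers:
adv = advantage_signal(returns=[1.0, 0.5], v_new=[0.8, 0.4], v_prior=[0.6, 0.3])
```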