A Generalized Bootstrap Target for Value-Learning, Efficiently Combining
Value and Feature Predictions
- URL: http://arxiv.org/abs/2201.01836v1
- Date: Wed, 5 Jan 2022 21:54:55 GMT
- Title: A Generalized Bootstrap Target for Value-Learning, Efficiently Combining
Value and Feature Predictions
- Authors: Anthony GX-Chen, Veronica Chelu, Blake A. Richards, Joelle Pineau
- Abstract summary: Estimating value functions is a core component of reinforcement learning algorithms.
We focus on bootstrapping targets used when estimating value functions.
We propose a new backup target, the $\eta$-return mixture.
- Score: 39.17511693008055
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Estimating value functions is a core component of reinforcement learning
algorithms. Temporal difference (TD) learning algorithms use bootstrapping,
i.e. they update the value function toward a learning target using value
estimates at subsequent time-steps. Alternatively, the value function can be
updated toward a learning target constructed by separately predicting successor
features (SF)--a policy-dependent model--and linearly combining them with
instantaneous rewards. We focus on bootstrapping targets used when estimating
value functions, and propose a new backup target, the $\eta$-return mixture,
which implicitly combines value-predictive knowledge (used by TD methods) with
(successor) feature-predictive knowledge--with a parameter $\eta$ capturing how
much to rely on each. We illustrate that incorporating predictive knowledge
through an $\eta\gamma$-discounted SF model makes more efficient use of sampled
experience, compared to either extreme, i.e. bootstrapping entirely on the
value function estimate, or bootstrapping on the product of separately
estimated successor features and instantaneous reward models. We empirically
show this approach leads to faster policy evaluation and better control
performance, for tabular and nonlinear function approximations, indicating
scalability and generality.
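To make the idea concrete, below is a minimal tabular sketch of a blended bootstrap target, assuming illustrative names (V for the value table, psi for the successor-feature table, w_r for the learned reward weights). It shows a simple convex interpolation between the two extremes described in the abstract; the paper's actual $\eta\gamma$-discounted SF construction is more involved, so this is a conceptual approximation rather than the authors' exact target.
```python
import numpy as np

def mixed_bootstrap_target(r, s_next, V, psi, w_r, gamma, eta, done):
    """One-step backup target interpolating between two bootstraps:
    eta = 0 -> bootstrap entirely on the value estimate V[s_next] (standard TD),
    eta = 1 -> bootstrap on the SF-based value psi[s_next] @ w_r
               (separately estimated successor features times reward weights).
    All names here are illustrative, not the paper's notation."""
    if done:
        return r
    value_bootstrap = V[s_next]        # value-predictive knowledge (TD)
    sf_bootstrap = psi[s_next] @ w_r   # feature-predictive knowledge (SF x reward model)
    return r + gamma * ((1.0 - eta) * value_bootstrap + eta * sf_bootstrap)

# Toy usage with made-up shapes: 5 states, 3 features.
V = np.zeros(5)
psi = np.zeros((5, 3))
w_r = np.zeros(3)
target = mixed_bootstrap_target(r=1.0, s_next=2, V=V, psi=psi, w_r=w_r,
                                gamma=0.99, eta=0.5, done=False)
# A TD-style tabular update toward the mixed target could then be:
# V[s] += alpha * (target - V[s])
```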
Related papers
- Tractable and Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation [8.378137704007038]
We present a regret analysis for distributional reinforcement learning with general value function approximation.
Our theoretical results show that approximating the infinite-dimensional return distribution with a finite number of moment functionals is the only method that learns the statistical information without bias.
arXiv Detail & Related papers (2024-07-31T00:43:51Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Learning to Rank for Active Learning via Multi-Task Bilevel Optimization [29.207101107965563]
We propose a novel approach for active learning, which aims to select batches of unlabeled instances through a learned surrogate model for data acquisition.
A key challenge in this approach is developing an acquisition function that generalizes well, as the history of data, which forms part of the utility function's input, grows over time.
arXiv Detail & Related papers (2023-10-25T22:50:09Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important for solving sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Model Predictive Control with Self-supervised Representation Learning [13.225264876433528]
We propose the use of a reconstruction function within the TD-MPC framework, so that the agent can reconstruct the original observation.
Our proposed addition of another loss term leads to improved performance on both state- and image-based tasks.
arXiv Detail & Related papers (2023-04-14T16:02:04Z)
- Accelerating Policy Gradient by Estimating Value Function from Prior Computation in Deep Reinforcement Learning [16.999444076456268]
We investigate the use of prior computation to estimate the value function to improve sample efficiency in on-policy policy gradient methods.
In particular, we learn a new value function for the target task while combining it with a value estimate from the prior.
The resulting value function is used as a baseline in the policy gradient method (a generic sketch of this baseline idea appears after this list).
arXiv Detail & Related papers (2023-02-02T20:23:22Z)
- Direct Advantage Estimation [63.52264764099532]
We show that the expected return may depend on the policy in an undesirable way, which could slow down learning.
We propose the Direct Advantage Estimation (DAE), a novel method that can model the advantage function and estimate it directly from data.
If desired, value functions can also be seamlessly integrated into DAE and be updated in a similar way to Temporal Difference Learning.
arXiv Detail & Related papers (2021-09-13T16:09:31Z)
- Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI)
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
arXiv Detail & Related papers (2020-06-11T16:49:23Z)
- Value-driven Hindsight Modelling [68.658900923595]
Value estimation is a critical component of the reinforcement learning (RL) paradigm.
Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function.
We develop an approach for representation learning in RL that sits in between these two extremes.
This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function.
arXiv Detail & Related papers (2020-02-19T18:10:20Z)
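As referenced in the entry on accelerating policy gradient with prior computation, below is a generic, hedged sketch of how a value estimate carried over from prior computation might be combined with a newly learned one and used as a policy-gradient baseline. The combination rule, the weighting beta, and all names (combined_baseline, v_prior, advantage_signal) are illustrative assumptions, not that paper's actual method.
```python
import numpy as np

def combined_baseline(v_new, v_prior, beta=0.5):
    """Blend a newly learned value estimate with a value estimate obtained
    from prior computation. The convex weight 'beta' is an illustrative
    choice, not the combination rule used in the referenced paper."""
    return beta * np.asarray(v_new) + (1.0 - beta) * np.asarray(v_prior)

def advantage_signal(returns, v_new, v_prior):
    """Monte Carlo returns minus the combined baseline; this difference is
    what scales the log-probability gradient in a REINFORCE-style update."""
    return np.asarray(returns) - combined_baseline(v_new, v_prior)

# Toy usage with made-up numbers:
adv = advantage_signal(returns=[1.0, 0.5], v_new=[0.8, 0.4], v_prior=[0.6, 0.3])
```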