Parameterized Projected Bellman Operator
- URL: http://arxiv.org/abs/2312.12869v3
- Date: Wed, 6 Mar 2024 15:14:11 GMT
- Title: Parameterized Projected Bellman Operator
- Authors: Théo Vincent, Alberto Maria Metelli, Boris Belousov, Jan Peters,
Marcello Restelli and Carlo D'Eramo
- Abstract summary: Approximate value iteration (AVI) is a family of algorithms for reinforcement learning (RL).
We propose a novel alternative approach based on learning an approximate version of the Bellman operator.
We formulate an optimization problem to learn PBO for generic sequential decision-making problems.
- Score: 64.129598593852
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Approximate value iteration (AVI) is a family of algorithms for reinforcement
learning (RL) that aims to obtain an approximation of the optimal value
function. Generally, AVI algorithms implement an iterated procedure where each
step consists of (i) an application of the Bellman operator and (ii) a
projection step into a considered function space. Notoriously, the Bellman
operator leverages transition samples, which strongly determine its behavior,
as uninformative samples can result in negligible updates or long detours,
whose detrimental effects are further exacerbated by the computationally
intensive projection step. To address these issues, we propose a novel
alternative approach based on learning an approximate version of the Bellman
operator rather than estimating it through samples as in AVI approaches. This
way, we are able to (i) generalize across transition samples and (ii) avoid the
computationally intensive projection step. For this reason, we call our novel
operator projected Bellman operator (PBO). We formulate an optimization problem
to learn PBO for generic sequential decision-making problems, and we
theoretically analyze its properties in two representative classes of RL
problems. Furthermore, we theoretically study our approach under the lens of
AVI and devise algorithmic implementations to learn PBO in offline and online
settings by leveraging neural network parameterizations. Finally, we
empirically showcase the benefits of PBO w.r.t. the regular Bellman operator on
several RL problems.
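To make the contrast concrete, the following is a minimal, illustrative sketch on a toy tabular MDP with a linear (one-hot) Q-function: the AVI iterate computes sampled Bellman targets and projects them back onto the function space by regression, while a PBO-style operator is fitted once and then iterated directly in parameter space. The linear form f_psi(w) = A w + b and all names below are assumptions made for illustration; the paper itself uses neural-network parameterizations of the operator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: n_s states, n_a actions, random dynamics and rewards.
n_s, n_a, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a] = distribution over next states
R = rng.uniform(size=(n_s, n_a))                  # deterministic reward r(s, a)

def phi(s, a):
    """One-hot feature of the (s, a) pair, so Q_w(s, a) = phi(s, a) @ w."""
    f = np.zeros(n_s * n_a)
    f[s * n_a + a] = 1.0
    return f

def bellman_target(w, s, a, r, s_next):
    """Sampled Bellman target: r + gamma * max_b Q_w(s', b)."""
    return r + gamma * max(phi(s_next, b) @ w for b in range(n_a))

def avi_step(w, transitions):
    """One AVI iteration: Bellman targets from samples, then projection onto
    the linear function space via least-squares regression."""
    X = np.stack([phi(s, a) for s, a, _, _ in transitions])
    y = np.array([bellman_target(w, *t) for t in transitions])
    w_next, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_next

def fit_linear_pbo(transitions, n_weights=200):
    """PBO idea (sketch): fit an operator f_psi acting on the Q-parameters,
    f_psi(w) = A w + b, to reproduce the projected Bellman update for a batch
    of randomly drawn weight vectors."""
    ws = rng.normal(size=(n_weights, n_s * n_a))
    targets = np.stack([avi_step(w, transitions) for w in ws])
    ws_aug = np.hstack([ws, np.ones((n_weights, 1))])  # bias column
    Ab, *_ = np.linalg.lstsq(ws_aug, targets, rcond=None)
    A, b = Ab[:-1].T, Ab[-1]
    return lambda w: A @ w + b

# One fixed batch of transitions.
transitions = []
for _ in range(500):
    s, a = rng.integers(n_s), rng.integers(n_a)
    transitions.append((s, a, R[s, a], rng.choice(n_s, p=P[s, a])))

# AVI: one regression per iteration.
w_avi = np.zeros(n_s * n_a)
for _ in range(50):
    w_avi = avi_step(w_avi, transitions)

# PBO: fit the operator once, then iterate it purely in parameter space.
pbo = fit_linear_pbo(transitions)
w_pbo = np.zeros(n_s * n_a)
for _ in range(50):
    w_pbo = pbo(w_pbo)

# A linear f_psi only approximates the (piecewise-linear) projected operator,
# so a small gap between the two fixed points remains.
print("||Q_AVI - Q_PBO||_inf =", np.abs(w_avi - w_pbo).max())
```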
Related papers
- Regularized Q-Learning with Linear Function Approximation [2.765106384328772]
We consider a bi-level optimization formulation of regularized Q-learning with linear function approximation.
We show that, under certain assumptions, the proposed algorithm converges to a stationary point in the presence of Markovian noise.
arXiv Detail & Related papers (2024-01-26T20:45:40Z) - Free from Bellman Completeness: Trajectory Stitching via Model-based
Return-conditioned Supervised Learning [22.287106840756483]
We show how off-policy learning techniques based on return-conditioned supervised learning (RCSL) are able to circumvent challenges of Bellman completeness.
We propose a simple framework called MBRCSL, granting RCSL methods the ability of dynamic programming to stitch together segments from distinct trajectories.
arXiv Detail & Related papers (2023-10-30T07:03:14Z) - Multi-Bellman operator for convergence of $Q$-learning with linear
function approximation [3.6218162133579694]
We study the convergence of $Q$-learning with linear function approximation.
By exploring the properties of a novel multi-Bellman operator, we identify conditions under which the projected multi-Bellman operator becomes contractive.
We demonstrate that this algorithm converges to the fixed-point of the projected multi-Bellman operator, yielding solutions of arbitrary accuracy.
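As a rough illustration of composing the Bellman operator several times before a single projection (the "multi-Bellman" idea referenced above), the sketch below uses an exact tabular operator and an unweighted least-squares projection onto random linear features. It is not the cited paper's algorithm or its contraction conditions; the parameter choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, gamma, m = 4, 2, 0.85, 8                # m = number of Bellman applications
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # transition kernel P[s, a, s']
R = rng.uniform(size=(n_s, n_a))                  # rewards r(s, a)
Phi = rng.normal(size=(n_s * n_a, 3))             # linear features for Q (3 per state-action pair)

def bellman(q):
    """Exact optimal Bellman operator on a tabular Q of shape (n_s, n_a)."""
    return R + gamma * P @ q.max(axis=1)

def project(q):
    """Unweighted least-squares projection of a tabular Q onto span(Phi)."""
    w, *_ = np.linalg.lstsq(Phi, q.reshape(-1), rcond=None)
    return (Phi @ w).reshape(n_s, n_a)

def projected_multi_bellman(q, m):
    """Apply the Bellman operator m times, then project once."""
    for _ in range(m):
        q = bellman(q)
    return project(q)

q = np.zeros((n_s, n_a))
for _ in range(100):
    q_new = projected_multi_bellman(q, m)
    delta = np.abs(q_new - q).max()
    q = q_new
print("sup-norm change at the last iteration:", delta)
```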
arXiv Detail & Related papers (2023-09-28T19:56:31Z) - Parameter and Computation Efficient Transfer Learning for
Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z) - Learning Bellman Complete Representations for Offline Policy Evaluation [51.96704525783913]
Two sufficient conditions for sample-efficient OPE are Bellman completeness and coverage.
We show our representation enables better OPE compared to previous representation learning methods developed for off-policy RL.
arXiv Detail & Related papers (2022-07-12T21:02:02Z) - ES-Based Jacobian Enables Faster Bilevel Optimization [53.675623215542515]
Bilevel optimization (BO) has arisen as a powerful tool for solving many modern machine learning problems.
Existing gradient-based methods require second-order derivative approximations via Jacobian- or/and Hessian-vector computations.
We propose a novel BO algorithm, which adopts Evolution Strategies (ES) based method to approximate the response Jacobian matrix in the hypergradient of BO.
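For context on the ES-based Jacobian approximation mentioned above, the sketch below shows a generic evolution-strategies-style Jacobian estimator built from antithetic forward evaluations only. It is a standard construction given for illustration, not necessarily the estimator or hyperparameters used in the cited paper.

```python
import numpy as np

def es_jacobian(f, x, sigma=1e-2, n_samples=256, rng=None):
    """Evolution-strategies-style estimate of the Jacobian of f at x.

    Uses antithetic Gaussian perturbations:
        J ~= (1 / (2 * sigma * n)) * sum_i (f(x + sigma*eps_i) - f(x - sigma*eps_i)) eps_i^T
    Only forward evaluations of f are needed (no automatic differentiation).
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x, dtype=float)
    J = np.zeros((len(f(x)), len(x)))
    for _ in range(n_samples):
        eps = rng.normal(size=x.shape)
        J += np.outer(f(x + sigma * eps) - f(x - sigma * eps), eps)
    return J / (2 * sigma * n_samples)

# Example: f(x) = A x has Jacobian A.
A = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]])
J_hat = es_jacobian(lambda x: A @ x, x=np.zeros(2), n_samples=4000)
print(np.round(J_hat, 2))  # close to A; some Monte-Carlo noise remains
```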
arXiv Detail & Related papers (2021-10-13T19:36:50Z) - Bayesian Bellman Operators [55.959376449737405]
We introduce a novel perspective on Bayesian reinforcement learning (RL).
Our framework is motivated by the insight that when bootstrapping is introduced, model-free approaches actually infer a posterior over Bellman operators, not value functions.
arXiv Detail & Related papers (2021-06-09T12:20:46Z) - Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z) - Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical
Comparison [17.692408242465763]
We prove performance guarantees of two algorithms for approximating $Q^\star$ in batch reinforcement learning.
One of the algorithms uses a novel and explicit importance-weighting correction to overcome the infamous "double sampling" difficulty in Bellman error estimation.
arXiv Detail & Related papers (2020-03-09T05:12:39Z)