Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning
- URL: http://arxiv.org/abs/2506.05968v1
- Date: Fri, 06 Jun 2025 10:46:20 GMT
- Title: Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning
- Authors: Motoki Omura, Kazuki Ota, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada
- Abstract summary: For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks.
- Score: 47.57615889991631
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality.
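For a concrete picture of the idea in the abstract: the standard Bellman operator backs up $r + \gamma\, Q(s', \pi(s'))$, whereas the Bellman optimality operator backs up $r + \gamma \max_{a'} Q(s', a')$. The Python sketch below is a minimal, hedged illustration of an annealed critic target that blends the two; the linear schedule, the random-sampling approximation of the continuous-action maximum, and all names (`annealed_td_target`, `beta_schedule`) are assumptions made for illustration, not the paper's actual TD3/SAC-based implementation.

```python
import numpy as np


def annealed_td_target(reward, next_state, q_target, policy, gamma,
                       beta, n_candidate_actions=16, action_dim=1, rng=None):
    """Critic target mixing the Bellman optimality operator (max over actions)
    with the standard Bellman operator (action from the current policy).

    beta = 1.0 -> pure optimality-operator target (fast, but overestimation-prone)
    beta = 0.0 -> pure Bellman-operator target (standard TD3/SAC-style target)
    """
    if rng is None:
        rng = np.random.default_rng(0)

    # Bellman operator term: bootstrap with the current policy's action.
    q_pi = q_target(next_state, policy(next_state))

    # Bellman optimality term: approximate max_a' Q(s', a') for continuous
    # actions by evaluating randomly sampled candidate actions
    # (an illustrative choice, not necessarily the paper's scheme).
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidate_actions, action_dim))
    q_max = max(q_target(next_state, a) for a in candidates)

    # Annealed mixture of the two backup targets.
    return reward + gamma * (beta * q_max + (1.0 - beta) * q_pi)


def beta_schedule(step, total_steps):
    """Linear annealing from 1 (optimality operator) to 0 (Bellman operator).
    The linear shape is an assumption, not taken from the paper."""
    return max(0.0, 1.0 - step / total_steps)


# Toy usage with stand-in Q-function and policy (illustrative only).
q_fn = lambda s, a: float(-np.sum((a - 0.3 * s) ** 2))
pi_fn = lambda s: np.clip(0.3 * s, -1.0, 1.0)
state = np.array([0.5])
beta = beta_schedule(step=10_000, total_steps=100_000)
y = annealed_td_target(reward=1.0, next_state=state, q_target=q_fn,
                       policy=pi_fn, gamma=0.99, beta=beta)
print(f"beta={beta:.2f}, target={y:.3f}")
```

In an actor-critic loop, `beta` would start near 1 early in training and decay toward 0, so the targets shift from optimality-style backups (faster value propagation) toward standard policy-evaluation backups as overestimation bias becomes the dominant concern.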
Related papers
- Spectral Bellman Method: Unifying Representation and Exploration in RL [54.71169912483302]
This work introduces Spectral Bellman Representation, a novel framework for learning representations for value-based reinforcement learning. We show that our learned representations enable structured exploration by aligning feature covariance with Bellman dynamics. Our framework naturally extends to powerful multi-step Bellman operators, further broadening its impact.
arXiv Detail & Related papers (2025-07-17T14:50:52Z) - Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning [55.33984461046492]
Policy-based methods currently dominate reinforcement learning pipelines for large language model (LLM) reasoning. We introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis.
arXiv Detail & Related papers (2025-05-21T09:41:53Z) - To bootstrap or to rollout? An optimal and adaptive interpolation [4.755935781862859]
We introduce a class of Bellman operators that interpolate between bootstrapping and rollout methods. Our estimator combines the strengths of the bootstrapping-based temporal difference (TD) estimator and the rollout-based Monte Carlo (MC) method (a textbook instance of such an interpolation is sketched after this list).
arXiv Detail & Related papers (2024-11-14T19:00:00Z) - Linear Bellman Completeness Suffices for Efficient Online Reinforcement Learning with Few Actions [29.69428894587431]
It is assumed that linear Bellman completeness holds, which ensures that these regression problems are well-specified.
We give the first computationally efficient algorithm for online RL under linear Bellman completeness when the number of actions is any constant.
arXiv Detail & Related papers (2024-06-17T15:24:49Z) - Iterated $Q$-Network: Beyond One-Step Bellman Updates in Deep Reinforcement Learning [19.4531905603925]
i-QN is a principled approach that enables multiple consecutive Bellman updates by learning a tailored sequence of action-value functions. We show that i-QN is theoretically grounded and that it can be seamlessly used in value-based and actor-critic methods.
arXiv Detail & Related papers (2024-03-04T15:07:33Z) - Parameterized Projected Bellman Operator [64.129598593852]
Approximate value iteration (AVI) is a family of algorithms for reinforcement learning (RL).
We propose a novel alternative approach based on learning an approximate version of the Bellman operator.
We formulate an optimization problem to learn the projected Bellman operator (PBO) for generic sequential decision-making problems.
arXiv Detail & Related papers (2023-12-20T09:33:16Z) - Multi-Bellman operator for convergence of $Q$-learning with linear function approximation [3.6218162133579694]
We study the convergence of $Q$-learning with linear function approximation.
By exploring the properties of a novel multi-Bellman operator, we identify conditions under which the projected multi-Bellman operator becomes contractive.
We demonstrate that this algorithm converges to the fixed-point of the projected multi-Bellman operator, yielding solutions of arbitrary accuracy.
arXiv Detail & Related papers (2023-09-28T19:56:31Z) - Bayesian Bellman Operators [55.959376449737405]
We introduce a novel perspective on Bayesian reinforcement learning (RL).
Our framework is motivated by the insight that when bootstrapping is introduced, model-free approaches actually infer a posterior over Bellman operators, not value functions.
arXiv Detail & Related papers (2021-06-09T12:20:46Z) - Robust Value Iteration for Continuous Control Tasks [99.00362538261972]
When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well.
We present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain.
We show that robust value iteration is more robust than both deep reinforcement learning algorithms and the non-robust version of the algorithm.
arXiv Detail & Related papers (2021-05-25T19:48:35Z)
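On the "To bootstrap or to rollout?" entry above: the best-known example of an operator that interpolates between one-step bootstrapping (TD) and full rollouts (Monte Carlo) is the textbook $\lambda$-operator, shown below for policy evaluation. Whether that paper's operator class takes exactly this form is not stated in its summary, so treat this purely as hedged background.

```latex
% Textbook lambda-operator: lambda = 0 recovers the one-step Bellman operator
% (TD bootstrapping); lambda -> 1 approaches the Monte Carlo rollout return.
\[
  (\mathcal{T}_\lambda V)(s) \;=\; (1 - \lambda) \sum_{n \ge 1} \lambda^{\,n-1} \, (\mathcal{T}^{\,n} V)(s),
  \qquad \lambda \in [0, 1),
\]
\[
  \text{where } (\mathcal{T} V)(s) \;=\; \mathbb{E}_{a \sim \pi,\; s'}\!\left[ r(s, a) + \gamma \, V(s') \right].
\]
```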