Monotone and Conservative Policy Iteration Beyond the Tabular Case
- URL: http://arxiv.org/abs/2506.07134v2
- Date: Sat, 11 Oct 2025 19:01:41 GMT
- Title: Monotone and Conservative Policy Iteration Beyond the Tabular Case
- Authors: S. R. Eshwar, Gugan Thoppe, Ananyabrata Barua, Aditya Gopalan, Gal Dalal
- Abstract summary: We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI). RPI and CRPI are variants of Policy Iteration (PI) and Conservative PI (CPI). We show that RPI restores the textbook \textit{monotonicity} of value estimates and that these estimates provably \textit{lower-bound} the true return.
- Score: 11.483050048037752
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI), that retain tabular guarantees under function approximation. RPI uses a novel Bellman-constrained optimization for policy evaluation. We show that RPI restores the textbook \textit{monotonicity} of value estimates and that these estimates provably \textit{lower-bound} the true return; moreover, their limit partially satisfies the \textit{unprojected} Bellman equation. CRPI shares RPI's evaluation, but updates policies conservatively by maximizing a new performance-difference \textit{lower bound} that explicitly accounts for function-approximation-induced errors. CRPI inherits RPI's guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants. Our work addresses a foundational gap in RL: popular algorithms such as TRPO and PPO derive from tabular CPI yet are deployed with function approximation, where CPI's guarantees often fail, leading to divergence, oscillations, or convergence to suboptimal policies. By restoring PI/CPI-style guarantees for \textit{arbitrary} function classes, RPI and CRPI provide a principled basis for next-generation RL.
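For intuition, here is one reading of how a Bellman-constrained evaluation can guarantee lower bounds (a minimal sketch under the assumption that the constraint takes the form of a pointwise Bellman inequality; the paper's exact program may differ):
\[
\max_{\theta}\; \mathbb{E}_{s \sim \mu}\big[V_{\theta}(s)\big]
\quad \text{subject to} \quad
V_{\theta}(s) \;\le\; (T^{\pi} V_{\theta})(s)
\;=\; \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim P(\cdot \mid s,a)}\big[r(s,a) + \gamma V_{\theta}(s')\big]
\qquad \text{for all } s.
\]
Because $T^{\pi}$ is monotone and a $\gamma$-contraction, any feasible $V_{\theta}$ satisfies $V_{\theta} \le (T^{\pi})^{k} V_{\theta} \to V^{\pi}$, so the resulting estimate lower-bounds the true return regardless of the function class used for $V_{\theta}$.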
Related papers
- Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for reinforcement learning with Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint (a generic sketch of this clipping-versus-divergence distinction appears after this list). DPPO achieves superior training performance and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - Moments Matter: Stabilizing Policy Optimization using Return Distributions [9.430246534202857]
In continuous control tasks, even small parameter shifts can produce unstable gaits. We propose an alternative that uses return distributions to account for update-induced variability.
arXiv Detail & Related papers (2026-01-05T05:27:11Z) - Reliable Policy Iteration: Performance Robustness Across Architecture and Environment Perturbations [11.044907865485056]
In recent work, we proposed Reliable Policy Iteration (RPI). RPI restores policy iteration's monotonicity-of-value-estimates property in the function approximation setting. We assess the robustness of RPI's empirical performance on two classical control tasks.
arXiv Detail & Related papers (2025-12-12T23:33:06Z) - Improving monotonic optimization in heterogeneous multi-agent reinforcement learning with optimal marginal deterministic policy gradient [18.64288030584699]
In heterogeneous multi-agent reinforcement learning (MARL), we replace the sequentially computed $Q^{*}_{\psi}(s, a_{1:i})$ with an Optimal Marginal Q function $\phi^{*}_{\psi}(s, a_{1:i})$ derived from Q-functions. We introduce a Generalized Q Critic (GQC) as the critic function, employing a pessimistic, uncertainty-constrained loss to optimize the different Q-value estimations.
arXiv Detail & Related papers (2025-07-14T07:16:01Z) - Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning [55.33984461046492]
Policy-based methods currently dominate reinforcement learning pipelines for large language model (LLM) reasoning. We introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis.
arXiv Detail & Related papers (2025-05-21T09:41:53Z) - Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference [15.038210624870656]
Reward inference is a critical intermediate step in the Reinforcement Learning from Human Feedback (RLHF) pipeline. This paper develops two RLHF algorithms without reward inference for general RL problems beyond bandits and deterministic MDPs, and for general preference models beyond the Bradley-Terry model.
arXiv Detail & Related papers (2024-09-25T22:20:11Z) - Confident Natural Policy Gradient for Local Planning in $q_\pi$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives. In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or comparable to PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets [4.126347193869613]
Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions.
Goal-Conditional Supervised Learning (GCSL) optimizes a lower bound on goal-reaching performance.
This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments.
arXiv Detail & Related papers (2022-05-13T12:43:25Z) - COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z) - Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning [85.50033812217254]
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically.
We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle.
arXiv Detail & Related papers (2021-08-19T17:27:29Z) - Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z) - Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, taking a more conservative update, can provide much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z) - Structured Policy Iteration for Linear Quadratic Regulator [40.52288246664592]
We introduce \textit{Structured Policy Iteration} (S-PI) for LQR, a method capable of deriving a structured linear policy.
Such a structured policy with (block) sparsity or low-rank can have significant advantages over the standard LQR policy.
In both the known-model and model-free settings, we prove convergence under a proper choice of parameters.
arXiv Detail & Related papers (2020-07-13T06:03:15Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z) - Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z) - Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors [13.534873779043478]
We present a distributional soft actor-critic (DSAC) algorithm to improve the policy performance by mitigating Q-value overestimations.
We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
arXiv Detail & Related papers (2020-01-09T02:27:18Z)
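As referenced in the first related entry above (on trust regions for LLM reinforcement learning), here is a minimal sketch contrasting the standard PPO clipped surrogate with a generic divergence-penalized alternative; the function names, the KL estimator, and the penalty form are illustrative assumptions, and DPPO's actual constraint may be defined differently.

import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    # Standard PPO clipped surrogate, written as a loss to minimize.
    ratio = torch.exp(logp_new - logp_old)                  # importance ratio pi_new / pi_old
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(unclipped, clipped).mean()

def divergence_penalized_loss(logp_new, logp_old, adv, beta=1.0):
    # Illustrative alternative: drop clipping and add an explicit divergence
    # penalty toward the behavior policy. Samples are drawn from pi_old, so
    # the mean of (logp_old - logp_new) estimates KL(pi_old || pi_new).
    ratio = torch.exp(logp_new - logp_old)
    approx_kl = (logp_old - logp_new).mean()
    return -(ratio * adv).mean() + beta * approx_kl

Both losses take per-action log-probabilities under the new and behavior policies together with advantage estimates; the only difference is how the update is kept close to the behavior policy.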
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.