Sample Complexity Analysis for Constrained Bilevel Reinforcement Learning
- URL: http://arxiv.org/abs/2602.00282v1
- Date: Fri, 30 Jan 2026 20:10:21 GMT
- Title: Sample Complexity Analysis for Constrained Bilevel Reinforcement Learning
- Authors: Naman Saxena, Vaneet Aggarwal
- Abstract summary: We analyse the sample complexity of a constrained bilevel RL algorithm, building on progress in the unconstrained setting. We are the first to analyse a generally parameterized policy-based RL algorithm with a non-smooth objective function.
- Score: 47.66330599017582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Several important problem settings in the reinforcement learning (RL) literature, such as meta-learning, hierarchical learning, and RL from human feedback (RLHF), can be modelled as bilevel RL problems. Much has been achieved empirically in these domains; however, the theoretical analysis of bilevel RL algorithms has received far less attention. In this work, we analyse the sample complexity of a constrained bilevel RL algorithm, building on progress in the unconstrained setting. We obtain an iteration complexity of $O(\epsilon^{-2})$ and a sample complexity of $\tilde{O}(\epsilon^{-4})$ for our proposed algorithm, Constrained Bilevel Subgradient Optimization (CBSO). We use a penalty-based objective function to avoid the issues of the primal-dual gap and the hyper-gradient in the constrained bilevel setting. The penalty-based formulation for handling constraints requires the analysis of non-smooth optimization. We are the first to analyse a generally parameterized policy gradient-based RL algorithm with a non-smooth objective function using the Moreau envelope.
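For orientation, here is a minimal sketch (in our own generic notation, not taken from the paper) of the two ingredients the abstract names: a penalty reformulation that replaces the lower-level optimality constraint with a value-gap penalty, and the Moreau envelope used to define near-stationarity for non-smooth, weakly convex objectives.

```latex
% Penalty reformulation of a generic bilevel problem (illustrative notation):
% upper-level objective f, lower-level objective g, penalty weight \lambda > 0.
\min_{x,\,y}\; f(x, y) \;+\; \lambda \Bigl( g(x, y) - \min_{y'} g(x, y') \Bigr)

% Moreau envelope of a possibly non-smooth F with smoothing parameter \rho > 0;
% a small gradient of F_\rho certifies near-stationarity of F itself:
F_{\rho}(x) \;:=\; \min_{z} \Bigl\{ F(z) + \tfrac{1}{2\rho}\, \lVert z - x \rVert^{2} \Bigr\}
```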
Related papers
- A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning [24.969317765059174]
We study a structured bi-level optimization problem where the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov decision process (MDP). Existing methods for bi-level optimization and RL often require second-order information, impose strong regularization at the lower level, or use samples inefficiently through nested-loop procedures. We propose a single-loop, first-order actor-critic algorithm that optimizes the bi-level objective via a penalty-based reformulation.
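A minimal, self-contained sketch of what a single-loop, first-order penalty update can look like (the toy quadratics f and g and all names below are ours; the paper's actor-critic operates on an MDP):

```python
# Toy single-loop, first-order penalty-based bi-level updates (our sketch).
# Penalized objective: F(x, y) = f(x, y) + lam * (g(x, y) - min_{y'} g(x, y')).

f = lambda x, y: 0.5 * x**2 + x * y          # upper-level objective
g = lambda x, y: 0.5 * (y - x)**2            # lower level: argmin_y g(x, .) = x

grad_x_f = lambda x, y: x + y
grad_y_f = lambda x, y: x
grad_y_pen = lambda x, y: y - x              # d/dy of the penalty term
grad_x_pen = lambda x, y: -(y - x)           # min_{y'} g(x, y') = 0 for all x,
                                             # so only g(x, y) contributes here

x, y, lam, lr = 1.0, -1.0, 10.0, 1e-2
for _ in range(4000):                        # a single loop updates both levels
    y -= lr * (grad_y_f(x, y) + lam * grad_y_pen(x, y))  # lower-level step
    x -= lr * (grad_x_f(x, y) + lam * grad_x_pen(x, y))  # upper-level step
print(round(x, 3), round(y, 3))              # both near 0, the bilevel optimum
```

Note that no second-order information or nested inner loop is needed: both variables take plain gradient steps on the same penalized objective.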
arXiv Detail & Related papers (2026-01-23T02:12:24Z)
- On The Sample Complexity Bounds In Bilevel Reinforcement Learning [49.19950489963245]
Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models. We present the first sample complexity bound of $\mathcal{O}(\epsilon)$ for BRL in continuous state-action settings. Our analysis also improves upon existing bounds of $\mathcal{O}(\epsilon)$.
arXiv Detail & Related papers (2025-03-22T04:22:04Z)
- On The Global Convergence Of Online RLHF With Neural Parametrization [36.239015146313136]
Reinforcement Learning from Human Feedback (RLHF) aims to align large language models with human values. RLHF is a three-stage process comprising supervised fine-tuning, reward learning, and policy learning. We propose a bi-level formulation for AI alignment in parameterized settings and introduce a first-order approach to solve this problem.
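A schematic of the kind of bi-level formulation involved (illustrative notation, ours; the paper's precise objective differs in its details): the upper level fits reward parameters to preference data, while the lower level requires the policy to be optimal for the learned reward.

```latex
% Schematic bilevel RLHF objective (illustrative; \sigma is the logistic
% function, D a dataset of preferred/rejected trajectory pairs):
\max_{\nu}\;\; \mathbb{E}_{(\tau^{+},\,\tau^{-}) \sim D}
    \Bigl[ \log \sigma\bigl( r_{\nu}(\tau^{+}) - r_{\nu}(\tau^{-}) \bigr) \Bigr]
\quad \text{s.t.} \quad
\theta^{*}(\nu) \in \arg\max_{\theta}\;
    \mathbb{E}_{\tau \sim \pi_{\theta}} \bigl[ r_{\nu}(\tau) \bigr]
```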
arXiv Detail & Related papers (2024-10-21T03:13:35Z)
- Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity [4.917399520581689]
Bilevel reinforcement learning (RL) features two intertwined levels of problems. The lack of lower-level convexity is an impediment to developing bilevel optimization methods. The hyper-gradient serves as an integration of exploitation and exploration.
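For context (standard background, not taken from this abstract): when the lower level is strongly convex, the hyper-gradient of $\Phi(x) = f(x, y^{*}(x))$ follows from the implicit function theorem, and it is precisely the Hessian inverse below that ceases to exist without lower-level convexity.

```latex
\nabla \Phi(x) \;=\; \nabla_{x} f\bigl(x, y^{*}(x)\bigr)
  \;-\; \nabla^{2}_{xy} g\bigl(x, y^{*}(x)\bigr)
        \bigl[ \nabla^{2}_{yy} g\bigl(x, y^{*}(x)\bigr) \bigr]^{-1}
        \nabla_{y} f\bigl(x, y^{*}(x)\bigr)
```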
arXiv Detail & Related papers (2024-05-30T05:24:20Z)
- Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF [82.73541793388]
We introduce the first principled algorithmic framework for solving bilevel RL problems through the lens of penalty formulation.
We provide theoretical studies of the problem landscape and its penalty-based (policy) gradient algorithms.
We demonstrate the effectiveness of our algorithms via simulations in the Stackelberg Markov game, RL from human feedback and incentive design.
arXiv Detail & Related papers (2024-02-10T04:54:15Z)
- A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation [67.66904892192794]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), for reinforcement learning. MQL-UCB achieves minimax-optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large, together with near-optimal policy switching cost. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z)
- On the Global Convergence of Fitted Q-Iteration with Two-layer Neural Network Parametrization [33.12181620473604]
We study Fitted Q-Iteration with a two-layer ReLU neural network parametrization and establish sample complexity guarantees for the algorithm.
We show that this approach achieves a sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^2)$, which is order-optimal.
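A minimal numerical sketch of the Fitted Q-Iteration scheme being analysed (ours; the synthetic data, feature encoding, and step sizes are placeholder choices, and the paper's theory prescribes its own sampling and fitting procedure):

```python
import numpy as np

# Fitted Q-Iteration with a two-layer ReLU network (illustrative sketch).
rng = np.random.default_rng(0)

def init_net(d_in, width):
    return {"W1": rng.normal(0.0, 1.0 / np.sqrt(d_in), (width, d_in)),
            "b1": np.zeros(width),
            "w2": rng.normal(0.0, 1.0 / np.sqrt(width), width)}

def q_value(net, x):                         # x: (n, d_in) state-action features
    h = np.maximum(net["W1"] @ x.T + net["b1"][:, None], 0.0)   # ReLU layer
    return net["w2"] @ h                     # one scalar Q-value per row of x

def fqi_iteration(net, s_a, r, s_next, gamma=0.99, lr=1e-2, sgd_steps=300):
    """One FQI round: regress Q onto fixed targets r + gamma * max_a' Q_k."""
    q_next = np.stack([q_value(net, phi) for phi in s_next])    # (A, n)
    target = r + gamma * q_next.max(axis=0)  # targets held fixed this round
    new = {k: v.copy() for k, v in net.items()}
    n = len(target)
    for _ in range(sgd_steps):               # least-squares fit by gradient descent
        h = np.maximum(new["W1"] @ s_a.T + new["b1"][:, None], 0.0)
        err = new["w2"] @ h - target
        g_h = np.outer(new["w2"], err) * (h > 0)                # backprop mask
        new["w2"] -= lr * (h @ err) / n
        new["W1"] -= lr * (g_h @ s_a) / n
        new["b1"] -= lr * g_h.mean(axis=1)
    return new

d, width, n, A = 4, 32, 256, 3               # toy dimensions
net = init_net(d, width)
s_a = rng.normal(size=(n, d))                # features of sampled (s, a) pairs
r = rng.normal(size=n)                       # observed rewards
s_next = [rng.normal(size=(n, d)) for _ in range(A)]  # (s', a') features per a'
net = fqi_iteration(net, s_a, r, s_next)     # one application of the FQI map
```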
arXiv Detail & Related papers (2022-11-14T19:00:24Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls needed to solve for a policy.
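A toy illustration of the low-switching idea (our sketch; the paper's framework sub-samples data via an information-gain measure for general function classes, while the linear special case below uses the simpler determinant-doubling trigger common in this line of work):

```python
import numpy as np

# Re-solve for a new policy only when the log-determinant of the feature
# covariance has grown by log(factor); the number of such triggers scales
# polylogarithmically in K rather than linearly.
def count_policy_updates(features, lam=1.0, factor=2.0):
    d = features.shape[1]
    cov = lam * np.eye(d)                     # regularized feature covariance
    last = np.linalg.slogdet(cov)[1]
    updates = 0
    for phi in features:                      # one feature vector per episode
        cov += np.outer(phi, phi)
        logdet = np.linalg.slogdet(cov)[1]
        if logdet - last >= np.log(factor):   # information doubled: update
            updates += 1
            last = logdet
    return updates

K, d = 10_000, 8
phis = np.random.default_rng(1).normal(size=(K, d))
print(count_policy_updates(phis), "policy updates over", K, "episodes")
```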
arXiv Detail & Related papers (2021-06-14T07:36:25Z)