Sample Complexity Analysis for Constrained Bilevel Reinforcement Learning
- URL: http://arxiv.org/abs/2602.00282v1
- Date: Fri, 30 Jan 2026 20:10:21 GMT
- Title: Sample Complexity Analysis for Constrained Bilevel Reinforcement Learning
- Authors: Naman Saxena, Vaneet Aggarwal
- Abstract summary: We analyse the sample complexity of a constrained bilevel RL algorithm, building on progress in the unconstrained setting. We are the first to analyse a generally parameterized policy-based RL algorithm with a non-smooth objective function.
- Score: 47.66330599017582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Several important problem settings in the reinforcement learning (RL) literature, such as meta-learning, hierarchical learning, and RL from human feedback (RLHF), can be modelled as bilevel RL problems. Much has been achieved empirically in these domains; however, the theoretical analysis of bilevel RL algorithms has received far less attention. In this work, we analyse the sample complexity of a constrained bilevel RL algorithm, building on progress in the unconstrained setting. We obtain an iteration complexity of $O(\epsilon^{-2})$ and a sample complexity of $\tilde{O}(\epsilon^{-4})$ for our proposed algorithm, Constrained Bilevel Subgradient Optimization (CBSO). We use a penalty-based objective function to avoid the issues of the primal-dual gap and the hyper-gradient in the constrained bilevel setting. The penalty-based formulation for handling constraints requires the analysis of non-smooth optimization. We are the first to analyse a generally parameterized policy gradient-based RL algorithm with a non-smooth objective function using the Moreau envelope.
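For orientation, here is a minimal sketch (in our own generic notation, not taken from the paper) of the two ingredients the abstract names: a penalty reformulation that replaces the lower-level optimality constraint with a value-gap penalty, and the Moreau envelope used to define near-stationarity for non-smooth, weakly convex objectives.

```latex
% Penalty reformulation of a generic bilevel problem (illustrative notation):
% upper-level objective f, lower-level objective g, penalty weight \lambda > 0.
\min_{x,\,y}\; f(x, y) \;+\; \lambda \Bigl( g(x, y) - \min_{y'} g(x, y') \Bigr)

% Moreau envelope of a possibly non-smooth F with smoothing parameter \rho > 0;
% a small gradient of F_\rho certifies near-stationarity of F itself:
F_{\rho}(x) \;:=\; \min_{z} \Bigl\{ F(z) + \tfrac{1}{2\rho}\, \lVert z - x \rVert^{2} \Bigr\}
```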
Related papers
- A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning [24.969317765059174]
We study a structured bi-level optimization problem where the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov decision process (MDP). Existing methods for bi-level optimization and RL often require second-order information, impose strong regularization at the lower level, or use samples inefficiently through nested-loop procedures. We propose a single-loop, first-order actor-critic algorithm that optimizes the bi-level objective via a penalty-based reformulation.
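A minimal, self-contained sketch of what a single-loop, first-order penalty update can look like (the toy quadratics f and g and all names below are ours; the paper's actor-critic operates on an MDP):

```python
# Toy single-loop, first-order penalty-based bi-level updates (our sketch).
# Penalized objective: F(x, y) = f(x, y) + lam * (g(x, y) - min_{y'} g(x, y')).

f = lambda x, y: 0.5 * x**2 + x * y          # upper-level objective
g = lambda x, y: 0.5 * (y - x)**2            # lower level: argmin_y g(x, .) = x

grad_x_f = lambda x, y: x + y
grad_y_f = lambda x, y: x
grad_y_pen = lambda x, y: y - x              # d/dy of the penalty term
grad_x_pen = lambda x, y: -(y - x)           # min_{y'} g(x, y') = 0 for all x,
                                             # so only g(x, y) contributes here

x, y, lam, lr = 1.0, -1.0, 10.0, 1e-2
for _ in range(4000):                        # a single loop updates both levels
    y -= lr * (grad_y_f(x, y) + lam * grad_y_pen(x, y))  # lower-level step
    x -= lr * (grad_x_f(x, y) + lam * grad_x_pen(x, y))  # upper-level step
print(round(x, 3), round(y, 3))              # both near 0, the bilevel optimum
```

Note that no second-order information or nested inner loop is needed: both variables take plain gradient steps on the same penalized objective.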
arXiv Detail & Related papers (2026-01-23T02:12:24Z)
- On The Sample Complexity Bounds In Bilevel Reinforcement Learning [49.19950489963245]
Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models. We present the first sample complexity bound of $\mathcal{O}(\epsilon)$ for BRL in continuous state-action settings. Our analysis also improves upon existing bounds of $\mathcal{O}(\epsilon)$.
arXiv Detail & Related papers (2025-03-22T04:22:04Z)
- On The Global Convergence Of Online RLHF With Neural Parametrization [36.239015146313136]
Reinforcement Learning from Human Feedback (RLHF) aims to align large language models with human values. RLHF is a three-stage process comprising supervised fine-tuning, reward learning, and policy learning. We propose a bi-level formulation for AI alignment in parameterized settings and introduce a first-order approach to solve this problem.
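A schematic of the kind of bi-level formulation involved (illustrative notation, ours; the paper's precise objective differs in its details): the upper level fits reward parameters to preference data, while the lower level requires the policy to be optimal for the learned reward.

```latex
% Schematic bilevel RLHF objective (illustrative; \sigma is the logistic
% function, D a dataset of preferred/rejected trajectory pairs):
\max_{\nu}\;\; \mathbb{E}_{(\tau^{+},\,\tau^{-}) \sim D}
    \Bigl[ \log \sigma\bigl( r_{\nu}(\tau^{+}) - r_{\nu}(\tau^{-}) \bigr) \Bigr]
\quad \text{s.t.} \quad
\theta^{*}(\nu) \in \arg\max_{\theta}\;
    \mathbb{E}_{\tau \sim \pi_{\theta}} \bigl[ r_{\nu}(\tau) \bigr]
```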
arXiv Detail & Related papers (2024-10-21T03:13:35Z)
- Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity [4.917399520581689]
Bilevel reinforcement learning (RL) features two intertwined levels of problems. The lack of lower-level convexity is an impediment to developing bilevel optimization methods. The hyper-gradient serves as an integration of exploitation and exploration.
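For context (standard background, not taken from this abstract): when the lower level is strongly convex, the hyper-gradient of $\Phi(x) = f(x, y^{*}(x))$ follows from the implicit function theorem, and it is precisely the Hessian inverse below that ceases to exist without lower-level convexity.

```latex
\nabla \Phi(x) \;=\; \nabla_{x} f\bigl(x, y^{*}(x)\bigr)
  \;-\; \nabla^{2}_{xy} g\bigl(x, y^{*}(x)\bigr)
        \bigl[ \nabla^{2}_{yy} g\bigl(x, y^{*}(x)\bigr) \bigr]^{-1}
        \nabla_{y} f\bigl(x, y^{*}(x)\bigr)
```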
arXiv Detail & Related papers (2024-05-30T05:24:20Z)
- Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF [82.73541793388]
We introduce the first principled algorithmic framework for solving bilevel RL problems through the lens of penalty formulation.
We provide theoretical studies of the problem landscape and its penalty-based (policy) gradient algorithms.
We demonstrate the effectiveness of our algorithms via simulations in the Stackelberg Markov game, RL from human feedback and incentive design.
arXiv Detail & Related papers (2024-02-10T04:54:15Z)
- A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation [67.66904892192794]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), for reinforcement learning. MQL-UCB achieves minimax-optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large, together with near-optimal policy switching cost. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z)
- On the Global Convergence of Fitted Q-Iteration with Two-layer Neural Network Parametrization [33.12181620473604]
We study Fitted Q-Iteration with a two-layer ReLU neural network parametrization and establish sample complexity guarantees for the algorithm.
We show that this approach achieves a sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^2)$, which is order-optimal.
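A minimal numerical sketch of the Fitted Q-Iteration scheme being analysed (ours; the synthetic data, feature encoding, and step sizes are placeholder choices, and the paper's theory prescribes its own sampling and fitting procedure):

```python
import numpy as np

# Fitted Q-Iteration with a two-layer ReLU network (illustrative sketch).
rng = np.random.default_rng(0)

def init_net(d_in, width):
    return {"W1": rng.normal(0.0, 1.0 / np.sqrt(d_in), (width, d_in)),
            "b1": np.zeros(width),
            "w2": rng.normal(0.0, 1.0 / np.sqrt(width), width)}

def q_value(net, x):                         # x: (n, d_in) state-action features
    h = np.maximum(net["W1"] @ x.T + net["b1"][:, None], 0.0)   # ReLU layer
    return net["w2"] @ h                     # one scalar Q-value per row of x

def fqi_iteration(net, s_a, r, s_next, gamma=0.99, lr=1e-2, sgd_steps=300):
    """One FQI round: regress Q onto fixed targets r + gamma * max_a' Q_k."""
    q_next = np.stack([q_value(net, phi) for phi in s_next])    # (A, n)
    target = r + gamma * q_next.max(axis=0)  # targets held fixed this round
    new = {k: v.copy() for k, v in net.items()}
    n = len(target)
    for _ in range(sgd_steps):               # least-squares fit by gradient descent
        h = np.maximum(new["W1"] @ s_a.T + new["b1"][:, None], 0.0)
        err = new["w2"] @ h - target
        g_h = np.outer(new["w2"], err) * (h > 0)                # backprop mask
        new["w2"] -= lr * (h @ err) / n
        new["W1"] -= lr * (g_h @ s_a) / n
        new["b1"] -= lr * g_h.mean(axis=1)
    return new

d, width, n, A = 4, 32, 256, 3               # toy dimensions
net = init_net(d, width)
s_a = rng.normal(size=(n, d))                # features of sampled (s, a) pairs
r = rng.normal(size=n)                       # observed rewards
s_next = [rng.normal(size=(n, d)) for _ in range(A)]  # (s', a') features per a'
net = fqi_iteration(net, s_a, r, s_next)     # one application of the FQI map
```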
arXiv Detail & Related papers (2022-11-14T19:00:24Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls needed to solve for a policy.
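A toy illustration of the low-switching idea (our sketch; the paper's framework sub-samples data via an information-gain measure for general function classes, while the linear special case below uses the simpler determinant-doubling trigger common in this line of work):

```python
import numpy as np

# Re-solve for a new policy only when the log-determinant of the feature
# covariance has grown by log(factor); the number of such triggers scales
# polylogarithmically in K rather than linearly.
def count_policy_updates(features, lam=1.0, factor=2.0):
    d = features.shape[1]
    cov = lam * np.eye(d)                     # regularized feature covariance
    last = np.linalg.slogdet(cov)[1]
    updates = 0
    for phi in features:                      # one feature vector per episode
        cov += np.outer(phi, phi)
        logdet = np.linalg.slogdet(cov)[1]
        if logdet - last >= np.log(factor):   # information doubled: update
            updates += 1
            last = logdet
    return updates

K, d = 10_000, 8
phis = np.random.default_rng(1).normal(size=(K, d))
print(count_policy_updates(phis), "policy updates over", K, "episodes")
```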
arXiv Detail & Related papers (2021-06-14T07:36:25Z)