Related papers: A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning

URL: http://arxiv.org/abs/2601.16399v2
Date: Mon, 26 Jan 2026 05:27:01 GMT
Title: A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning
Authors: Sihan Zeng, Sujay Bhatt, Sumitra Ganesh, Alec Koppel
Abstract summary: We study a structured bi-level optimization problem where the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov decision process (MDP). Existing methods for bi-level optimization and RL often require second-order information, impose strong regularization at the lower level, or inefficiently use samples through nested-loop procedures. We propose a single-loop, first-order actor-critic algorithm that optimizes the bi-level objective via a penalty-based reformulation.
Score: 24.969317765059174
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study a structured bi-level optimization problem where the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov decision process (MDP). The upper-level decision variable parameterizes the reward of the lower-level MDP, and the upper-level objective depends on the optimal induced policy. Existing methods for bi-level optimization and RL often require second-order information, impose strong regularization at the lower level, or inefficiently use samples through nested-loop procedures. In this work, we propose a single-loop, first-order actor-critic algorithm that optimizes the bi-level objective via a penalty-based reformulation. We introduce into the lower-level RL objective an attenuating entropy regularization, which enables asymptotically unbiased upper-level hyper-gradient estimation without solving the unregularized RL problem exactly. We establish the finite-time and finite-sample convergence of the proposed algorithm to a stationary point of the original, unregularized bi-level optimization problem through a novel lower-level residual analysis under a special type of Polyak-Lojasiewicz condition. We validate the performance of our method through experiments on a GridWorld goal position problem and on happy tweet generation through reinforcement learning from human feedback (RLHF).
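The role of the attenuating entropy regularization and of the penalty term can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a single-state MDP (a bandit) with a fixed reward vector, and the names `soft_optimal_policy`, `regularized_return`, and `penalty_gap` are hypothetical helpers introduced only for illustration. It shows that the entropy-regularized lower-level solution is a softmax with temperature `tau`, that it approaches the greedy policy as `tau` shrinks, and that the lower-level optimality gap (the kind of quantity a penalty-based reformulation would penalize) is zero at the regularized optimum.

```python
import math

def soft_optimal_policy(rewards, tau):
    """Maximizer of sum_a pi[a]*r[a] + tau*H(pi): a softmax with temperature tau."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / tau) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def regularized_return(policy, rewards, tau):
    """Expected reward plus tau times the Shannon entropy of the policy."""
    expected = sum(p * r for p, r in zip(policy, rewards))
    entropy = -sum(p * math.log(p) for p in policy if p > 0.0)
    return expected + tau * entropy

def penalty_gap(policy, rewards, tau):
    """Lower-level optimality gap of `policy` under regularization weight tau."""
    best = regularized_return(soft_optimal_policy(rewards, tau), rewards, tau)
    return best - regularized_return(policy, rewards, tau)

# Assumed toy data: three actions, one state.
rewards = [1.0, 0.5, 0.0]
uniform = [1 / 3, 1 / 3, 1 / 3]

# Attenuating tau: the regularized optimum concentrates on the best action.
for tau in (1.0, 0.1, 0.01):
    pi = soft_optimal_policy(rewards, tau)
    gap = penalty_gap(uniform, rewards, tau)
    print(f"tau={tau}: policy={[round(p, 3) for p in pi]}, uniform gap={gap:.3f}")
```

In this toy setting the gap is non-negative for every policy and vanishes only at the softmax policy, which is why driving it to zero while reducing `tau` recovers the unregularized optimum in the limit.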

Related papers

Efficient Penalty-Based Bilevel Methods: Improved Analysis, Novel Updates, and Flatness Condition [51.22672287601796]
Penalty-based methods have become popular for solving bilevel optimization (BLO) problems. They often require inner-loop iteration to solve the lower-level (LL) problem and small outer-loop step sizes to handle the increased smoothness induced by large penalty terms. This work considers the general BLO problems with coupled constraints (CCs) and leverages a novel penalty reformulation that decouples the upper- and lower-level variables.
arXiv Detail & Related papers (2025-11-20T20:48:14Z)
Towards a Unified Analysis of Neural Networks in Nonparametric Instrumental Variable Regression: Optimization and Generalization [66.08522228989634]
We establish the first global convergence result of neural networks for the two-stage least squares (2SLS) approach in nonparametric instrumental variable regression (NPIV). This is achieved by adopting a lifted perspective through mean-field Langevin dynamics (MFLD).
arXiv Detail & Related papers (2025-11-18T17:51:17Z)
Provably Faster Algorithms for Bilevel Optimization via Without-Replacement Sampling [96.47086913559289]
Gradient-based algorithms are widely used in bilevel optimization. We introduce a without-replacement sampling based algorithm which achieves a faster convergence rate. We validate our algorithms over both synthetic and real-world applications.
arXiv Detail & Related papers (2024-11-07T17:05:31Z)
Fast Two-Time-Scale Stochastic Gradient Method with Applications in Reinforcement Learning [5.325297567945828]
We propose a new method for two-time-scale optimization that achieves significantly faster convergence than the prior art. We show how the proposed algorithm specializes to novel online sample-based methods that surpass or match the performance of the existing state of the art.
arXiv Detail & Related papers (2024-05-15T19:03:08Z)
PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning. Our framework addresses these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable. Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z)
Optimal Algorithms for Stochastic Bilevel Optimization under Relaxed Smoothness Conditions [9.518010235273785]
We present a novel fully single-loop, Hessian-inversion-free algorithmic framework for bilevel optimization. We show that, with a slight modification, our approach can handle a more general multi-objective robust bilevel optimization problem.
arXiv Detail & Related papers (2023-06-21T07:32:29Z)
A Generalized Alternating Method for Bilevel Learning under the Polyak-Łojasiewicz Condition [63.66516306205932]
Bilevel optimization has recently regained interest owing to its applications in emerging machine learning fields. Recent results have shown that simple alternating gradient-based algorithms can match the convergence rate of single-level gradient descent for bilevel problems with a strongly convex lower-level objective.
arXiv Detail & Related papers (2023-06-04T17:54:11Z)
On Penalty-based Bilevel Gradient Descent Method [35.83102074785861]
Bilevel optimization enjoys a wide range of applications in emerging machine learning and signal processing problems. Recent progress on bilevel algorithms mainly focuses on bilevel optimization problems through the lens of the implicit-gradient method. In this work, we tackle a challenging class of bilevel problems through the lens of the penalty method.
arXiv Detail & Related papers (2023-02-10T11:30:19Z)
Enhanced Bilevel Optimization via Bregman Distance [104.96004056928474]
We propose a bilevel optimization method based on Bregman functions. We also propose an accelerated version of the SBiO-BreD method (ASBiO-BreD) that uses a variance-reduction technique.
arXiv Detail & Related papers (2021-07-26T16:18:43Z)
High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms achieve a small objective residual with high probability. Existing methods for non-smooth convex optimization have complexity bounds that depend on the confidence level. We propose novel stepsize rules for two methods with gradient clipping.
arXiv Detail & Related papers (2021-06-10T17:54:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
