Related papers: Predictive Lagrangian Optimization for Constrained Reinforcement Learning

Predictive Lagrangian Optimization for Constrained Reinforcement Learning

URL: http://arxiv.org/abs/2501.15217v1
Date: Sat, 25 Jan 2025 13:39:45 GMT
Title: Predictive Lagrangian Optimization for Constrained Reinforcement Learning
Authors: Tianqi Zhang, Puzhen Yuan, Guojian Zhan, Ziyu Lin, Yao Lyu, Zhenzhi Qin, Jingliang Duan, Liping Zhang, Shengbo Eben Li,
Abstract summary: Constrained optimization is popularly seen in reinforcement learning for addressing complex control tasks.<n>In this paper, we propose a more generic equivalence framework to build the connection between constrained optimization and feedback control system.
Score: 15.082498910832529
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Constrained optimization is popularly seen in reinforcement learning for addressing complex control tasks. From the perspective of dynamic system, iteratively solving a constrained optimization problem can be framed as the temporal evolution of a feedback control system. Classical constrained optimization methods, such as penalty and Lagrangian approaches, inherently use proportional and integral feedback controllers. In this paper, we propose a more generic equivalence framework to build the connection between constrained optimization and feedback control system, for the purpose of developing more effective constrained RL algorithms. Firstly, we define that each step of the system evolution determines the Lagrange multiplier by solving a multiplier feedback optimal control problem (MFOCP). In this problem, the control input is multiplier, the state is policy parameters, the dynamics is described by policy gradient descent, and the objective is to minimize constraint violations. Then, we introduce a multiplier guided policy learning (MGPL) module to perform policy parameters updating. And we prove that the resulting optimal policy, achieved through alternating MFOCP and MGPL, aligns with the solution of the primal constrained RL problem, thereby establishing our equivalence framework. Furthermore, we point out that the existing PID Lagrangian is merely one special case within our framework that utilizes a PID controller. We also accommodate the integration of other various feedback controllers, thereby facilitating the development of new algorithms. As a representative, we employ model predictive control (MPC) as the feedback controller and consequently propose a new algorithm called predictive Lagrangian optimization (PLO). Numerical experiments demonstrate its superiority over the PID Lagrangian method, achieving a larger feasible region up to 7.2% and a comparable average reward.

Related papers

Multi-Objective Reward and Preference Optimization: Theory and Algorithms [3.316593788543852]
This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models.<n>ACPO, e-COP, warmPref-PS, PSPL, and MOPO advance RL across average-cost, episodic, and preference-driven paradigms.<n> Collectively, the thesis unifies RL across average-cost, episodic, and preference-driven paradigms, delivering theoretical advances and practical tools for safe and aligned decision-making.
arXiv Detail & Related papers (2025-12-11T12:51:21Z)
Intersection of Reinforcement Learning and Bayesian Optimization for Intelligent Control of Industrial Processes: A Safe MPC-based DPG using Multi-Objective BO [0.0]
Model Predictive Control (MPC)-based Reinforcement Learning (RL) offers a structured and interpretable alternative to Deep Neural Network (DNN)-based RL methods.<n>Standard MPC-RL approaches often suffer from slow convergence, suboptimal policy learning due to limited parameterization, and safety issues during online adaptation.<n>We propose a novel framework that integrates MPC-RL with Multi-Objective Bayesian Optimization (MOBO)
arXiv Detail & Related papers (2025-07-14T02:31:52Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.<n>To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.<n>Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
SOMTP: Self-Supervised Learning-Based Optimizer for MPC-Based Safe Trajectory Planning Problems in Robotics [13.129654942805846]
Model Predictive Control (MP)-based trajectory planning has been widely used in, and Control Barrier (CBF) can improve its constraints. In this paper, we propose a self-supervised learning algorithm for CBF-MPC trajectory planning.
arXiv Detail & Related papers (2024-05-15T09:38:52Z)
Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning [132.7040981721302]
We study the Constrained Convex Decision Process (MDP), where the goal is to minimize a convex functional of the visitation measure. Design algorithms for a constrained convex MDP faces several challenges, including handling the large state space.
arXiv Detail & Related papers (2024-02-16T16:35:18Z)
PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, textsfPARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning. Our framework addressed these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable. Our empirical results substantiate that the proposed textsfPARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z)
Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates. We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change. We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z)
Towards a Theoretical Foundation of Policy Optimization for Learning Control Policies [26.04704565406123]
Gradient-based methods have been widely used for system design and optimization in diverse application domains. Recently, there has been a renewed interest in studying theoretical properties of these methods in the context of control and reinforcement learning. This article surveys some of the recent developments on policy optimization, a gradient-based iterative approach for feedback control synthesis.
arXiv Detail & Related papers (2022-10-10T16:13:34Z)
Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new algorithm for a policy gradient in TMDPs by a simple extension of the proximal policy optimization (PPO) algorithm. We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
On the Optimization Landscape of Dynamic Output Feedback: A Case Study for Linear Quadratic Regulator [12.255864026960403]
We show how the dLQR cost varies with the coordinate transformation of the dynamic controller and then derive the optimal transformation for a given observable stabilizing controller. These results shed light on designing efficient algorithms for general decision-making problems with partially observed information.
arXiv Detail & Related papers (2022-09-12T06:43:35Z)
Multi-Agent Deep Reinforcement Learning in Vehicular OCC [14.685237010856953]
We introduce a spectral efficiency optimization approach in vehicular OCC. We model the optimization problem as a Markov decision process (MDP) to enable the use of solutions that can be applied online. We verify the performance of our proposed scheme through extensive simulations and compare it with various variants of our approach and a random method.
arXiv Detail & Related papers (2022-05-05T14:25:54Z)
Recurrent Model Predictive Control [19.047059454849897]
We propose an off-line algorithm, called Recurrent Model Predictive Control (RMPC), to solve general nonlinear finite-horizon optimal control problems. Our algorithm employs a recurrent function to approximate the optimal policy, which maps the system states and reference values directly to the control inputs.
arXiv Detail & Related papers (2021-02-23T15:01:36Z)
Escaping from Zero Gradient: Revisiting Action-Constrained Reinforcement Learning via Frank-Wolfe Policy Optimization [5.072893872296332]
Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications. We propose a learning algorithm that decouples the action constraints from the policy parameter update. We show that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
arXiv Detail & Related papers (2021-02-22T14:28:03Z)
Optimization-driven Deep Reinforcement Learning for Robust Beamforming in IRS-assisted Wireless Communications [54.610318402371185]
Intelligent reflecting surface (IRS) is a promising technology to assist downlink information transmissions from a multi-antenna access point (AP) to a receiver. We minimize the AP's transmit power by a joint optimization of the AP's active beamforming and the IRS's passive beamforming. We propose a deep reinforcement learning (DRL) approach that can adapt the beamforming strategies from past experiences.
arXiv Detail & Related papers (2020-05-25T01:42:55Z)
Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot Locomotion [78.46388769788405]
We introduce guided constrained policy optimization (GCPO), an RL framework based upon our implementation of constrained policy optimization (CPPO) We show that guided constrained RL offers faster convergence close to the desired optimum resulting in an optimal, yet physically feasible, robotic control behavior without the need for precise reward function tuning.
arXiv Detail & Related papers (2020-02-22T10:15:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.