LPPG-RL: Lexicographically Projected Policy Gradient Reinforcement Learning with Subproblem Exploration
- URL: http://arxiv.org/abs/2511.08339v1
- Date: Wed, 12 Nov 2025 01:54:03 GMT
- Title: LPPG-RL: Lexicographically Projected Policy Gradient Reinforcement Learning with Subproblem Exploration
- Authors: Ruiyu Qiu, Rui Wang, Guanghui Yang, Xiang Li, Zhijiang Shao,
- Abstract summary: Lexicographic multi-objective problems are common in real-world applications. Traditional Safe RL and Multi-Objective RL (MORL) methods have difficulty enforcing priority orderings efficiently. We propose Lexicographically Projected Policy Gradient RL (LPPG-RL) to overcome these limitations.
- Score: 6.456410901431744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lexicographic multi-objective problems, which consist of multiple conflicting subtasks with explicit priorities, are common in real-world applications. Despite the advantages of Reinforcement Learning (RL) in single tasks, extending conventional RL methods to multiple prioritized objectives remains challenging. In particular, traditional Safe RL and Multi-Objective RL (MORL) methods have difficulty enforcing priority orderings efficiently. Therefore, Lexicographic Multi-Objective RL (LMORL) methods have been developed to address these challenges. However, existing LMORL methods either rely on heuristic threshold tuning with prior knowledge or are restricted to discrete domains. To overcome these limitations, we propose Lexicographically Projected Policy Gradient RL (LPPG-RL), a novel LMORL framework which leverages sequential gradient projections to identify feasible policy update directions, thereby making LPPG-RL broadly compatible with all policy gradient algorithms in continuous spaces. LPPG-RL reformulates the projection step as an optimization problem, and utilizes Dykstra's projection rather than generic solvers to deliver substantial speedups, especially for small- to medium-scale instances. In addition, LPPG-RL introduces Subproblem Exploration (SE) to prevent gradient vanishing, accelerate convergence and enhance stability. We provide theoretical guarantees for convergence and establish a lower bound on policy improvement. Finally, through extensive experiments in a 2D navigation environment, we demonstrate the effectiveness of LPPG-RL, showing that it outperforms existing state-of-the-art continuous LMORL methods.
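To make the projection step concrete: the abstract does not spell out the feasible set, so as an illustrative assumption take the feasible update directions to be the intersection of half-spaces {d : g_i·d >= 0} defined by the gradients g_i of higher-priority objectives, onto which the lower-priority gradient is projected with Dykstra's algorithm. A minimal NumPy sketch under that assumption (function names are illustrative, not taken from the paper's code):

```python
import numpy as np

def project_halfspace(x, g):
    """Euclidean projection of x onto the half-space {d : g.d >= 0}."""
    v = g @ x
    if v >= 0.0:
        return x
    return x - (v / (g @ g)) * g

def dykstra_lexicographic_direction(grad_low, grads_high, n_iters=100, tol=1e-8):
    """Project the low-priority policy gradient onto the intersection of
    half-spaces {d : g_i.d >= 0} given by higher-priority gradients g_i,
    using Dykstra's alternating projections with correction terms."""
    x = np.asarray(grad_low, dtype=float).copy()
    corrections = [np.zeros_like(x) for _ in grads_high]
    for _ in range(n_iters):
        x_prev = x.copy()
        for i, g in enumerate(grads_high):
            y = project_halfspace(x + corrections[i], g)
            corrections[i] = x + corrections[i] - y
            x = y
        if np.linalg.norm(x - x_prev) < tol:
            break
    return x

# Toy usage: two higher-priority gradients constrain the update direction.
g1 = np.array([1.0, 0.0, 0.0])
g2 = np.array([0.0, 1.0, 0.0])
g_low = np.array([-1.0, -1.0, 1.0])
print(dykstra_lexicographic_direction(g_low, [g1, g2]))  # roughly [0, 0, 1]
```

Because each half-space projection is closed-form, Dykstra's iteration needs no generic QP solver, which is consistent with the speedup the abstract claims for small- to medium-scale instances.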
Related papers
- Performative Policy Gradient: Optimality in Performative Reinforcement Learning [13.777823115521665]
Post-deployment machine learning algorithms often influence the environments they act in. We introduce the Performative Policy Gradient algorithm (PePG). PePG converges to performatively optimal policies, i.e. policies that remain optimal under the distribution shifts induced by themselves.
arXiv Detail & Related papers (2025-12-23T18:20:06Z) - Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models [22.50153462109328]
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs). We introduce a Risk-Sensitive Reinforcement Learning framework. Our approach employs a risk-seeking objective that interpolates between mean and maximum rewards, leading to a novel algorithm, RS-GRPO. Remarkably, RS-GRPO is simple to implement, requiring only minor code modifications.
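The summary does not give the exact form of the risk-seeking objective; one standard way to interpolate between mean and maximum rewards is the exponential-utility (log-sum-exp) objective J_beta = (1/beta) * log E[exp(beta * R)], which recovers the mean as beta -> 0 and approaches the maximum as beta -> infinity. A small sketch of that assumed form (not necessarily the paper's exact objective):

```python
import numpy as np

def risk_seeking_objective(rewards, beta):
    """Exponential-utility objective (1/beta) * log E[exp(beta * R)].
    beta -> 0 recovers the mean reward; large beta approaches the max."""
    rewards = np.asarray(rewards, dtype=float)
    if beta == 0.0:
        return rewards.mean()
    # log-sum-exp over samples, minus log(N) to turn the sum into an expectation
    return (np.logaddexp.reduce(beta * rewards) - np.log(len(rewards))) / beta

rewards = np.array([0.0, 0.0, 1.0, 0.0])
for beta in [0.0, 1.0, 10.0, 100.0]:
    print(beta, risk_seeking_objective(rewards, beta))
# values climb from the mean (0.25) toward the maximum (1.0) as beta grows
```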
arXiv Detail & Related papers (2025-09-29T04:12:20Z) - On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning [59.11784194183928]
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). The Regularized Policy Gradient (RPG) view shows that the widely-used $k_3$ penalty is exactly the unnormalized KL divergence. RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO.
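For background on the $k_3$ penalty mentioned above: $k_3$ commonly denotes the low-variance per-sample KL estimator $(r - 1) - \log r$ with likelihood ratio $r$. A minimal sketch of the three standard per-sample estimators follows; the paper's exact normalization may differ:

```python
import numpy as np

def kl_penalty_estimators(logp_theta, logp_ref):
    """Per-sample estimators of KL(pi_theta || pi_ref) for samples drawn from pi_theta.
    With r = pi_ref / pi_theta:  k1 = -log r,  k2 = 0.5*(log r)**2,  k3 = (r - 1) - log r."""
    log_r = np.asarray(logp_ref) - np.asarray(logp_theta)
    k1 = -log_r
    k2 = 0.5 * log_r ** 2
    k3 = np.exp(log_r) - 1.0 - log_r   # always >= 0; the "k3" penalty referenced above
    return k1, k2, k3
```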
arXiv Detail & Related papers (2025-05-23T06:01:21Z) - Direct Preference Optimization for Primitive-Enabled Hierarchical Reinforcement Learning [75.9729413703531]
DIPPER is a novel HRL framework that formulates hierarchical policy learning as a bi-level optimization problem. We show that DIPPER achieves up to 40% improvement over state-of-the-art baselines in sparse reward scenarios.
arXiv Detail & Related papers (2024-11-01T04:58:40Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for Last-Iterate Convergence in Constrained MDPs [31.663072540757643]
Reinforcement Learning (RL) has been applied to real-world problems with increasing success.
We introduce Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD)
arXiv Detail & Related papers (2023-02-02T18:05:27Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
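The double-sampling issue mentioned above arises because the squared-expectation term in Var(X) = E[X^2] - (E[X])^2 needs two independent samples for an unbiased gradient. The Fenchel-dual rewriting Var(X) = min_nu E[(X - nu)^2] removes it by introducing an auxiliary variable nu; the sketch below illustrates this textbook identity rather than OVAR's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=100_000)

# Dual form of the variance: Var(X) = min_nu E[(X - nu)^2], minimizer nu* = E[X].
# Optimizing nu jointly avoids the double sampling needed for the (E[X])^2 term.
nu, lr = 0.0, 0.05
for _ in range(2000):
    x = rng.choice(samples, size=64)          # one minibatch, single sampling
    nu -= lr * np.mean(2.0 * (nu - x))        # gradient step on E[(X - nu)^2] w.r.t. nu
dual_var = np.mean((samples - nu) ** 2)
print(nu, dual_var, samples.var())            # nu ~ 2.0, both variance estimates ~ 2.25
```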
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - A Policy Efficient Reduction Approach to Convex Constrained Deep Reinforcement Learning [2.811714058940267]
We propose a new variant of the conditional gradient (CG) type algorithm, which generalizes the minimum norm point (MNP) method.
Our method reduces the memory costs by an order of magnitude, and achieves better performance, demonstrating both its effectiveness and efficiency.
arXiv Detail & Related papers (2021-08-29T20:51:32Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based zeroth-order algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
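As background for the summary above, a generic two-point zeroth-order gradient estimator is sketched below; the direction sampler is the component ZO-RL proposes to learn with RL, and the names here are illustrative rather than taken from the paper:

```python
import numpy as np

def zo_gradient(f, x, sample_direction, mu=1e-3, n_queries=20):
    """Two-point zeroth-order gradient estimate. `sample_direction()` supplies
    the perturbation directions; ZO-RL's idea is to learn this sampler with RL
    instead of drawing directions uniformly at random."""
    g = np.zeros_like(x)
    for _ in range(n_queries):
        u = sample_direction()
        u = u / np.linalg.norm(u)
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / n_queries

# Toy usage on a quadratic, with a plain Gaussian sampler standing in for the learned policy.
f = lambda z: float(np.sum(z ** 2))
x = np.array([1.0, -2.0, 0.5])
random_sampler = lambda: np.random.randn(3)
print(zo_gradient(f, x, random_sampler))   # roughly proportional to the true gradient 2*x
```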
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.