Learning General Policies with Policy Gradient Methods
- URL: http://arxiv.org/abs/2512.19366v1
- Date: Mon, 22 Dec 2025 13:08:58 GMT
- Title: Learning General Policies with Policy Gradient Methods
- Authors: Simon Ståhlberg, Blai Bonet, Hector Geffner
- Abstract summary: Provably correct policies that generalize over all instances of a given domain have been learned using combinatorial methods. The aim of this work is to bring these two research threads together to illuminate the conditions under which (deep) reinforcement learning approaches can be used. We draw on lessons learned from previous combinatorial and deep learning approaches, and extend them in a convenient way.
- Score: 11.393603788068775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While reinforcement learning methods have delivered remarkable results in a number of settings, generalization, i.e., the ability to produce policies that generalize in a reliable and systematic way, has remained a challenge. The problem of generalization has been addressed formally in classical planning where provably correct policies that generalize over all instances of a given domain have been learned using combinatorial methods. The aim of this work is to bring these two research threads together to illuminate the conditions under which (deep) reinforcement learning approaches, and in particular, policy optimization methods, can be used to learn policies that generalize like combinatorial methods do. We draw on lessons learned from previous combinatorial and deep learning approaches, and extend them in a convenient way. From the former, we model policies as state transition classifiers, as (ground) actions are not general and change from instance to instance. From the latter, we use graph neural networks (GNNs) adapted to deal with relational structures for representing value functions over planning states, and in our case, policies. With these ingredients in place, we find that actor-critic methods can be used to learn policies that generalize almost as well as those obtained using combinatorial approaches while avoiding the scalability bottleneck and the use of feature pools. Moreover, the limitations of the DRL methods on the benchmarks considered have little to do with deep learning or reinforcement learning algorithms, and result from the well-understood expressive limitations of GNNs, and the tradeoff between optimality and generalization (general policies cannot be optimal in some domains). Both of these limitations are addressed without changing the basic DRL methods by adding derived predicates and an alternative cost structure to optimize.
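The abstract names three concrete ingredients: policies modeled as classifiers over state transitions (ground actions do not transfer across instances), a GNN over the relational structure of planning states, and actor-critic policy-gradient training. The following is a minimal sketch of how these pieces could fit together, assuming a PyTorch setting; the names `RelationalGNN` and `transition_policy`, the toy graph encoding, and the REINFORCE-style update are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a general policy as a classifier
# over state transitions. The network scores each successor state s' of the
# current state s, and the policy is a softmax over those scores, so nothing
# in the model refers to a particular instance's objects or ground actions.
import torch
import torch.nn as nn

class RelationalGNN(nn.Module):
    """Toy message-passing encoder over a planning state viewed as a graph:
    nodes are objects, edges stand for the atoms true in the state.
    (A faithful version would distinguish predicates via edge types.)"""
    def __init__(self, dim: int = 32, rounds: int = 3):
        super().__init__()
        self.dim, self.rounds = dim, rounds
        self.msg = nn.Linear(dim, dim)      # message function
        self.upd = nn.GRUCell(dim, dim)     # node-state update
        self.readout = nn.Linear(dim, 1)    # state score after aggregation

    def forward(self, num_objects: int, edges: torch.Tensor) -> torch.Tensor:
        # edges: (E, 2) long tensor of object-index pairs for the state's atoms
        h = torch.zeros(num_objects, self.dim)
        for _ in range(self.rounds):
            m = torch.zeros_like(h)
            m.index_add_(0, edges[:, 1], self.msg(h[edges[:, 0]]))  # aggregate
            h = self.upd(m, h)
        return self.readout(h.sum(dim=0))   # permutation-invariant readout

def transition_policy(net, num_objects, successor_edge_sets):
    """Softmax distribution over successor states, i.e., over transitions."""
    scores = torch.stack([net(num_objects, e).squeeze()
                          for e in successor_edge_sets])
    return torch.distributions.Categorical(logits=scores)

# One REINFORCE-style policy-gradient step on a sampled transition; an
# actor-critic variant would replace `ret` by a critic's advantage estimate.
net = RelationalGNN()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
succs = [torch.tensor([[0, 1]]), torch.tensor([[1, 2]])]  # two toy successors
dist = transition_policy(net, num_objects=3, successor_edge_sets=succs)
choice = dist.sample()
ret = torch.tensor(1.0)                     # placeholder return / advantage
loss = -dist.log_prob(choice) * ret
opt.zero_grad()
loss.backward()
opt.step()
```

The design point carried over from the abstract is that the softmax ranges over successor states rather than ground actions, so the same trained weights apply unchanged to any instance of the domain.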
Related papers
- Learning Branching Policies for MILPs with Proximal Policy Optimization [0.0]
Branch-and-Bound (B&B) is the dominant exact solution method for Mixed-Integer Linear Programs (MILPs).
Current approaches rely on Imitation Learning (IL), which tends to overfit to expert demonstrations and struggles to generalize to structurally diverse or unseen instances.
In this work, we propose Tree-Gate Proximal Policy Optimization, a novel framework that employs Proximal Policy Optimization (PPO), a Reinforcement Learning (RL) algorithm, to train a branching policy.
arXiv Detail & Related papers (2025-11-17T05:16:14Z)
- Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning [66.4260157478436]
We study reinforcement learning in the agnostic policy learning setting.
The goal is to find a policy whose performance is competitive with the best policy in a given class of interest.
arXiv Detail & Related papers (2025-07-06T14:40:05Z)
- Unraveling the Rainbow: can value-based methods schedule? [41.94295877935867]
Broadly, deep reinforcement learning methods fall into two categories: policy-based and value-based.
We show that several value-based approaches can match or even outperform the widely adopted policy optimization algorithm.
arXiv Detail & Related papers (2025-05-06T08:51:17Z)
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Learning Generalized Policies for Fully Observable Non-Deterministic Planning Domains [12.730070122798459]
General policies represent reactive strategies for solving large families of planning problems.
We extend the formulations and the resulting methods for learning general policies over fully observable, non-deterministic domains.
arXiv Detail & Related papers (2024-04-03T06:25:42Z)
- Policy Optimization over General State and Action Spaces [3.722665817361884]
Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging.
We first present a substantial generalization of the recently developed policy mirror descent method to deal with general state and action spaces.
We introduce new approaches to incorporate function approximation into this method, so that we do not need to use explicit policy parameterization at all.
arXiv Detail & Related papers (2022-11-30T03:44:44Z)
- Enforcing the consensus between Trajectory Optimization and Policy Learning for precise robot control [75.28441662678394]
Reinforcement learning (RL) and trajectory optimization (TO) present strong complementary advantages.
We propose several improvements on top of these approaches to learn global control policies quicker.
arXiv Detail & Related papers (2022-09-19T13:32:09Z)
- Learning Generalized Policies Without Supervision Using GNNs [20.322992960599255]
We consider the problem of learning generalized policies for classical planning domains using graph neural networks.
We use a simple and general GNN architecture and aim at obtaining crisp experimental results.
We exploit the relation established between the expressive power of GNNs and the $C_2$ fragment of first-order logic.
arXiv Detail & Related papers (2022-05-12T10:28:46Z)
- Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)
- Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
- State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods (see the sketch after this list).
arXiv Detail & Related papers (2021-02-23T21:07:35Z)
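To make the state-augmentation idea in the last entry above concrete, here is a hedged toy sketch: the Lagrange multipliers are appended to the state the policy conditions on, and a primal-dual loop alternates policy improvement with a projected dual update. The environment dynamics, thresholds, and step sizes below are invented for illustration and are not taken from the paper.

```python
# Hedged toy sketch (not the paper's implementation) of state-augmented
# constrained RL: the policy observes (s, lam), and a primal-dual loop
# alternates policy updates with projected updates of the multipliers lam.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_constraints = 4, 2, 1
thresholds = np.array([0.5])          # required constraint-reward rates
lam = np.zeros(n_constraints)         # Lagrange multipliers (dual variables)
eta_dual = 0.05                       # dual step size

def policy(s, lam):
    # Placeholder for a learned policy pi(a | s, lam): prefer the
    # constraint-friendly action whenever any multiplier is large.
    return 0 if lam.max() > 1.0 else int(rng.integers(n_actions))

def rollout(policy, lam):
    """Toy rollout returning (average main reward, average constraint rewards).
    The policy observes the augmented state (s, lam)."""
    s, R, C = 0, 0.0, np.zeros(n_constraints)
    for _ in range(20):
        a = policy(s, lam)
        R += rng.normal(0.1 * (a + 1))              # toy main reward
        C += rng.normal(0.6 if a == 0 else 0.4,     # toy constraint reward
                        size=n_constraints)
        s = (s + a + 1) % n_states
    return R / 20, C / 20

for _ in range(200):
    R, C = rollout(policy, lam)
    # Primal step: update pi to maximize R + lam @ C (omitted here; the
    # placeholder policy already reacts to lam). Projected dual step:
    lam = np.maximum(0.0, lam - eta_dual * (C - thresholds))

print("final multipliers:", lam)
```

Each dual step raises a multiplier whenever its constraint return falls below the threshold, which steers the multiplier-conditioned policy toward constraint-satisfying behavior without hand-tuning reward weights.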