Related papers: Planning with a Learned Policy Basis to Optimally Solve Complex Tasks

Planning with a Learned Policy Basis to Optimally Solve Complex Tasks

URL: http://arxiv.org/abs/2403.15301v2
Date: Mon, 3 Jun 2024 14:56:28 GMT
Title: Planning with a Learned Policy Basis to Optimally Solve Complex Tasks
Authors: Guillermo Infante, David Kuric, Anders Jonsson, Vicenç Gómez, Herke van Hoof,
Abstract summary: We propose to use successor features to learn a policy basis so that each (sub)policy in it solves a well-defined subproblem. In a task described by a finite state automaton (FSA) that involves the same set of subproblems, the combination of these (sub)policies can then be used to generate an optimal solution without additional learning.
Score: 26.621462241759133
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Conventional reinforcement learning (RL) methods can successfully solve a wide range of sequential decision problems. However, learning policies that can generalize predictably across multiple tasks in a setting with non-Markovian reward specifications is a challenging problem. We propose to use successor features to learn a policy basis so that each (sub)policy in it solves a well-defined subproblem. In a task described by a finite state automaton (FSA) that involves the same set of subproblems, the combination of these (sub)policies can then be used to generate an optimal solution without additional learning. In contrast to other methods that combine (sub)policies via planning, our method asymptotically attains global optimality, even in stochastic environments.

Related papers

Collab-Solver: Collaborative Solving Policy Learning for Mixed-Integer Linear Programming [57.44900640134789]
We propose a novel multi-agent-based policy learning framework for MILP solving as a Stackelberg game.<n>Specifically, we formulate the collaboration of cut selection and branching in MILP solving as a Stackelberg game.<n>The jointly learned policy significantly improves the solving performance on both synthetic and large-scale real-world MILP datasets.
arXiv Detail & Related papers (2025-08-05T03:16:04Z)
Constructing an Optimal Behavior Basis for the Option Keyboard [15.595163824752769]
Generalized Policy Improvement (GPI) addresses this by combining a set of base policies to produce a new one that is at least as good.<n>The Option Keyboard (OK) improves upon GPI by producing policies that are at least as good -- and often better.<n>This raises a key question: is there an optimal set of base policies that enables zero-shot identification of optimal solutions for any linear tasks?<n>We show that it significantly reduces the number of base policies needed to ensure optimality in new tasks.
arXiv Detail & Related papers (2025-05-01T18:32:21Z)
Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action [10.219627570276689]
We develop a framework for a class of Markov Decision Processes with general state and spaces. We show that gradient methods converge to the globally optimal policy with a nonasymptomatic condition. Our result establishes first complexity for multi-period inventory systems.
arXiv Detail & Related papers (2024-09-25T17:56:02Z)
Federated Reinforcement Learning with Constraint Heterogeneity [22.79217297480751]
We study a Federated Reinforcement Learning (FedRL) problem with constraint heterogeneity. We show that FedNPG achieves global convergence with an $tildeO (1/sqrtT)$ rate, and FedPPO efficiently solves complicated learning tasks with the use of deep neural networks.
arXiv Detail & Related papers (2024-05-06T07:44:50Z)
Natural Policy Gradient and Actor Critic Methods for Constrained Multi-Task Reinforcement Learning [13.908826484332282]
Multi-task reinforcement learning (RL) aims to find a single policy that effectively solves multiple tasks at the same time. This paper presents a constrained formulation for multi-task RL where the goal is to maximize the average performance of the policy across tasks subject to bounds on the performance in each task.
arXiv Detail & Related papers (2024-05-03T19:43:30Z)
Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems. In common practice, convergence (hyper)policies are learned only to deploy their deterministic version. We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Methods [52.0617030129699]
We introduce a novel theoretical framework for analyzing the effectiveness of DeepMatching Networks and Reinforcement Learning methods. Our main contribution holds for a broad class of problems including Max-and Min-Cut, Max-$k$-Bipartite-Bi, Maximum-Weight-Bipartite-Bi, and Traveling Salesman Problem. As a byproduct of our analysis we introduce a novel regularization process over vanilla descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.
arXiv Detail & Related papers (2023-10-08T23:39:38Z)
Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces. We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories. We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization [8.836422771217084]
Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks.
arXiv Detail & Related papers (2023-01-18T20:54:40Z)
Multi-Task Off-Policy Learning from Bandit Feedback [54.96011624223482]
We propose a hierarchical off-policy optimization algorithm (HierOPO), which estimates the parameters of the hierarchical model and then acts pessimistically with respect to them. We prove per-task bounds on the suboptimality of the learned policies, which show a clear improvement over not using the hierarchical model. Our theoretical and empirical results show a clear advantage of using the hierarchy over solving each task independently.
arXiv Detail & Related papers (2022-12-09T08:26:27Z)
Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new algorithm for a policy gradient in TMDPs by a simple extension of the proximal policy optimization (PPO) algorithm. We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer [7.970144204429356]
We introduce an SF-based extension of the Optimistic Linear Support algorithm to learn a set of policies whose SFs form a convex coverage set. We prove that policies in this set can be combined via generalized policy improvement to construct optimal behaviors for any new linearly-expressible tasks.
arXiv Detail & Related papers (2022-06-22T19:00:08Z)
Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks. We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.