Stepsize Learning for Policy Gradient Methods in Contextual Markov Decision Processes
- URL: http://arxiv.org/abs/2306.07741v1
- Date: Tue, 13 Jun 2023 12:58:12 GMT
- Title: Stepsize Learning for Policy Gradient Methods in Contextual Markov Decision Processes
- Authors: Luca Sabbioni, Francesco Corda, Marcello Restelli
- Abstract summary: Policy-based algorithms are among the most widely adopted techniques in model-free RL.
They tend to struggle when asked to accomplish a series of heterogeneous tasks.
We introduce a new formulation, known as meta-MDP, that can be used to solve any hyperparameter selection problem in RL.
- Score: 35.889129338603446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Policy-based algorithms are among the most widely adopted techniques in
model-free RL, thanks to their strong theoretical groundings and good
properties in continuous action spaces. Unfortunately, these methods require
precise and problem-specific hyperparameter tuning to achieve good performance,
and tend to struggle when asked to accomplish a series of heterogeneous tasks.
In particular, the selection of the step size has a crucial impact on their
ability to learn a highly performing policy, affecting the speed and the
stability of the training process, and often being the main culprit for poor
results. In this paper, we tackle these issues with a Meta Reinforcement
Learning approach, by introducing a new formulation, known as meta-MDP, that
can be used to solve any hyperparameter selection problem in RL with contextual
processes. After providing a theoretical Lipschitz bound to the difference of
performance in different tasks, we adopt the proposed framework to train a
batch RL algorithm to dynamically recommend the most adequate step size for
different policies and tasks. In conclusion, we present an experimental
campaign to show the advantages of selecting an adaptive learning rate in
heterogeneous environments.
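As a purely schematic reading (not the paper's precise statement), the Lipschitz result can be pictured as bounding the performance gap of a fixed policy across task contexts by the distance between those contexts, roughly |J_ω(θ) - J_ω'(θ)| ≤ L · d(ω, ω'); the exact constants, metric, and assumptions are given in the paper. The snippet below is a minimal sketch of how a dynamic step-size recommendation of this kind could be wired into a plain policy-gradient loop. It is an illustration under assumptions, not the authors' implementation: the names StepSizeController, make_features, and estimate_policy_gradient, the candidate step-size grid, and the heuristic fallback scorer are all hypothetical placeholders.

    # Illustrative sketch only (not the authors' code): an outer, meta-learned controller
    # recommends the step size for each inner policy-gradient update, conditioned on
    # features of the current policy and of the task context ("omega"). All names here
    # are hypothetical placeholders.
    import numpy as np

    class StepSizeController:
        """Toy stand-in for a batch-RL-trained recommender: score each candidate
        step size for the current meta-state and return the highest-scoring one."""
        def __init__(self, candidate_stepsizes, q_model=None):
            self.candidates = np.asarray(candidate_stepsizes, dtype=float)
            # q_model: callable (features, stepsize) -> estimated meta-return.
            # The fallback is a fixed heuristic favoring ~1e-2, purely for illustration.
            self.q_model = q_model or (lambda feats, a: -abs(np.log10(a) + 2.0))

        def recommend(self, features):
            scores = [self.q_model(features, a) for a in self.candidates]
            return float(self.candidates[int(np.argmax(scores))])

    def make_features(theta, grad, context):
        """Hypothetical meta-state: summary statistics of the policy parameters, the
        current gradient estimate, and the task context of the contextual MDP."""
        return np.concatenate([[np.linalg.norm(theta), np.linalg.norm(grad)],
                               np.atleast_1d(context).astype(float)])

    def train_with_adaptive_stepsize(theta0, context, estimate_policy_gradient,
                                     controller, n_iters=100):
        """Inner loop: plain policy-gradient ascent, except that the learning rate is
        chosen per iteration by the meta controller instead of being fixed in advance."""
        theta = np.array(theta0, dtype=float)
        for _ in range(n_iters):
            grad = estimate_policy_gradient(theta, context)  # e.g. a REINFORCE/GPOMDP estimate
            alpha = controller.recommend(make_features(theta, grad, context))
            theta = theta + alpha * grad  # the meta-action is the step size
        return theta

Per the abstract, the recommendation model itself would be obtained by training a batch RL algorithm on the meta-MDP; the heuristic fallback above only keeps the sketch self-contained.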
Related papers
- HarmoDT: Harmony Multi-Task Decision Transformer for Offline Reinforcement Learning [72.25707314772254]
We introduce the Harmony Multi-Task Decision Transformer (HarmoDT), a novel solution designed to identify an optimal harmony subspace of parameters for each task.
The upper level of this framework is dedicated to learning a task-specific mask that delineates the harmony subspace, while the inner level focuses on updating parameters to enhance the overall performance of the unified policy.
arXiv Detail & Related papers (2024-05-28T11:41:41Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Reinforcement Learning Methods for Wordle: A POMDP/Adaptive Control Approach [0.3093890460224435]
We address the solution of the popular Wordle puzzle using new reinforcement learning methods.
For the Wordle puzzle, these methods yield online solution strategies that are very close to optimal at relatively modest computational cost.
arXiv Detail & Related papers (2022-11-15T03:46:41Z)
- Online Control of Adaptive Large Neighborhood Search using Deep Reinforcement Learning [4.374837991804085]
We introduce a Deep Reinforcement Learning-based approach called DR-ALNS that selects operators, adjusts parameters, and controls the acceptance criterion throughout the search.
We evaluate the proposed method on an orienteering problem with stochastic weights and time windows, as presented in an IJCAI competition.
The results show that our approach outperforms vanilla ALNS, ALNS tuned with Bayesian optimization, and two state-of-the-art DRL approaches.
arXiv Detail & Related papers (2022-11-01T21:33:46Z)
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained as a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
- Episodic Policy Gradient Training [43.62408764384791]
We introduce Episodic Policy Gradient Training (EPGT), a novel training procedure for policy gradient methods wherein episodic memory is used to optimize the hyperparameters of reinforcement learning algorithms on-the-fly.
Experimental results on both continuous and discrete environments demonstrate the advantage of using the proposed method in boosting the performance of various policy gradient algorithms.
arXiv Detail & Related papers (2021-12-03T11:15:32Z)
- Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z)
- Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning [83.66080019570461]
We propose two environment-agnostic, algorithm-agnostic quantitative metrics for task difficulty.
We show that these metrics have higher correlations with normalized task solvability scores than a variety of alternatives.
These metrics can also be used for fast and compute-efficient optimizations of key design parameters.
arXiv Detail & Related papers (2021-03-23T17:49:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.