Jointly Learning Environments and Control Policies with Projected
Stochastic Gradient Ascent
- URL: http://arxiv.org/abs/2006.01738v4
- Date: Thu, 6 Jan 2022 12:25:26 GMT
- Title: Jointly Learning Environments and Control Policies with Projected
Stochastic Gradient Ascent
- Authors: Adrien Bolland, Ioannis Boukas, Mathias Berger, Damien Ernst
- Abstract summary: We introduce a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve the joint design and control problem.
In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation.
We show that DEPS performs at least as well as, or better than, a state-of-the-art benchmark in all three environments, consistently yielding solutions with higher returns in fewer iterations.
- Score: 3.118384520557952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the joint design and control of discrete-time stochastic
dynamical systems over a finite time horizon. We formulate the problem as a
multi-step optimization problem under uncertainty seeking to identify a system
design and a control policy that jointly maximize the expected sum of rewards
collected over the time horizon considered. The transition function, the reward
function and the policy are all parametrized, assumed known and differentiable
with respect to their parameters. We then introduce a deep reinforcement
learning algorithm combining policy gradient methods with model-based
optimization techniques to solve this problem. In essence, our algorithm
iteratively approximates the gradient of the expected return via Monte-Carlo
sampling and automatic differentiation and takes projected gradient ascent
steps in the space of environment and policy parameters. This algorithm is
referred to as Direct Environment and Policy Search (DEPS). We assess the
performance of our algorithm in three environments concerned with the design
and control of a mass-spring-damper system, a small-scale off-grid power system
and a drone, respectively. In addition, our algorithm is benchmarked against a
state-of-the-art deep reinforcement learning algorithm used to tackle joint
design and control problems. We show that DEPS performs at least as well as, or better
than, this benchmark in all three environments, consistently yielding solutions with higher
returns in fewer iterations. Finally, solutions produced by our algorithm are
also compared with solutions produced by an algorithm that does not jointly
optimize environment and policy parameters, highlighting the fact that higher
returns can be achieved when joint optimization is performed.
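To make the update scheme concrete, the following minimal Python/PyTorch sketch illustrates projected stochastic gradient ascent over joint environment (design) and policy parameters: Monte-Carlo rollouts through a differentiable simulator, automatic differentiation of the averaged return, and projection of the design parameters back onto their feasible set. The toy dynamics, reward, bounds, and hyper-parameters below are illustrative assumptions and are not taken from the paper.

# Minimal sketch of DEPS-style projected stochastic gradient ascent.
# The environment model, reward, bounds, and constants are placeholders.
import torch

T, N = 50, 32                                        # horizon, Monte-Carlo rollouts
psi = torch.tensor([1.0, 0.5], requires_grad=True)   # environment (design) parameters
theta = torch.zeros(2, requires_grad=True)           # linear state-feedback policy gains

def rollout(psi, theta):
    """Simulate one noisy trajectory; return its differentiable cumulative reward."""
    x = torch.tensor([1.0, 0.0])                     # initial state: position, velocity
    ret = torch.tensor(0.0)
    for _ in range(T):
        u = theta @ x                                # policy action (deterministic here)
        acc = -psi[0] * x[0] - psi[1] * x[1] + u     # toy stiffness/damping dynamics
        x = x + 0.05 * torch.stack([x[1], acc]) + 0.01 * torch.randn(2)
        ret = ret - (x[0] ** 2 + 0.1 * u ** 2)       # reward = negative quadratic cost
    return ret

opt = torch.optim.SGD([psi, theta], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    J = torch.stack([rollout(psi, theta) for _ in range(N)]).mean()  # Monte-Carlo return
    (-J).backward()                                  # ascent on J via automatic differentiation
    opt.step()
    with torch.no_grad():
        psi.clamp_(0.1, 5.0)                         # projection onto the feasible design set

The sketch uses a deterministic policy and reparametrized process noise for brevity; the paper's formulation also covers stochastic parametrized policies.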
Related papers
- A Simulation-Free Deep Learning Approach to Stochastic Optimal Control [12.699529713351287]
We propose a simulation-free algorithm for the solution of generic problems in stochastic optimal control (SOC).
Unlike existing methods, our approach does not require the solution of an adjoint problem.
arXiv Detail & Related papers (2024-10-07T16:16:53Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - A Robust Policy Bootstrapping Algorithm for Multi-objective
Reinforcement Learning in Non-stationary Environments [15.794728813746397]
Multi-objective reinforcement learning methods fuse the reinforcement learning paradigm with multi-objective optimization techniques.
One major drawback of these methods is the lack of adaptability to non-stationary dynamics in the environment.
We propose a novel multi-objective reinforcement learning algorithm that can robustly evolve a convex coverage set of policies in an online manner in non-stationary environments.
arXiv Detail & Related papers (2023-08-18T02:15:12Z) - Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained via a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective
Reinforcement Learning [17.916366827429034]
We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions.
We propose an Anchor-changing Regularized Natural Policy Gradient framework, which can incorporate ideas from well-performing first-order methods.
arXiv Detail & Related papers (2022-06-10T21:09:44Z) - Policy Optimization for Stochastic Shortest Path [43.2288319750466]
We study policy optimization for the stochastic shortest path (SSP) problem.
We propose a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model.
For most settings, our algorithm is shown to achieve a near-optimal regret bound.
arXiv Detail & Related papers (2022-02-07T16:25:14Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order (ZO) algorithm, ZO-RL, which learns the sampling policy used to generate the perturbations in ZO optimization instead of relying on random sampling (a baseline random-perturbation ZO estimator is sketched after this list).
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimate by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Escaping from Zero Gradient: Revisiting Action-Constrained Reinforcement
Learning via Frank-Wolfe Policy Optimization [5.072893872296332]
Action-constrained reinforcement learning (RL) is a widely used approach in various real-world applications.
We propose a learning algorithm that decouples the action constraints from the policy parameter update.
We show that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
arXiv Detail & Related papers (2021-02-22T14:28:03Z) - Robust Reinforcement Learning with Wasserstein Constraint [49.86490922809473]
We show the existence of optimal robust policies, provide a sensitivity analysis for the perturbations, and then design a novel robust learning algorithm.
The effectiveness of the proposed algorithm is verified in the Cart-Pole environment.
arXiv Detail & Related papers (2020-06-01T13:48:59Z)
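For context on the ZO-RL entry above, the following short sketch shows a standard two-point zeroth-order gradient estimator with random Gaussian perturbations, i.e. the baseline sampling scheme that ZO-RL replaces with a learned sampling policy. The objective, number of directions, and step sizes are illustrative assumptions.

# Minimal sketch of a two-point zeroth-order (ZO) gradient estimator with random
# Gaussian perturbations; ZO-RL replaces this random sampling with a learned policy.
import numpy as np

def f(x):
    return float(np.sum(x ** 2))             # placeholder black-box objective

def zo_gradient(f, x, num_dirs=20, mu=1e-2):
    """Estimate the gradient of f at x from function values only."""
    grad = np.zeros_like(x)
    for _ in range(num_dirs):
        u = np.random.randn(*x.shape)         # random perturbation direction
        grad += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return grad / num_dirs

x = np.ones(5)
for _ in range(100):
    x = x - 0.05 * zo_gradient(f, x)          # plain ZO gradient descent step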
This list is automatically generated from the titles and abstracts of the papers on this site.