Jointly Learning Environments and Control Policies with Projected
Stochastic Gradient Ascent
- URL: http://arxiv.org/abs/2006.01738v4
- Date: Thu, 6 Jan 2022 12:25:26 GMT
- Title: Jointly Learning Environments and Control Policies with Projected
Stochastic Gradient Ascent
- Authors: Adrien Bolland, Ioannis Boukas, Mathias Berger, Damien Ernst
- Abstract summary: We introduce a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve the joint design and control problem.
In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation.
We show that DEPS performs at least as well as, or better than, a state-of-the-art benchmark in all three environments, consistently yielding solutions with higher returns in fewer iterations.
- Score: 3.118384520557952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the joint design and control of discrete-time stochastic
dynamical systems over a finite time horizon. We formulate the problem as a
multi-step optimization problem under uncertainty seeking to identify a system
design and a control policy that jointly maximize the expected sum of rewards
collected over the time horizon considered. The transition function, the reward
function and the policy are all parametrized, assumed known and differentiable
with respect to their parameters. We then introduce a deep reinforcement
learning algorithm combining policy gradient methods with model-based
optimization techniques to solve this problem. In essence, our algorithm
iteratively approximates the gradient of the expected return via Monte-Carlo
sampling and automatic differentiation and takes projected gradient ascent
steps in the space of environment and policy parameters. This algorithm is
referred to as Direct Environment and Policy Search (DEPS). We assess the
performance of our algorithm in three environments concerned with the design
and control of a mass-spring-damper system, a small-scale off-grid power system
and a drone, respectively. In addition, our algorithm is benchmarked against a
state-of-the-art deep reinforcement learning algorithm used to tackle joint
design and control problems. We show that DEPS performs at least as well as, or better
than, this benchmark in all three environments, consistently yielding solutions with higher
returns in fewer iterations. Finally, solutions produced by our algorithm are
also compared with solutions produced by an algorithm that does not jointly
optimize environment and policy parameters, highlighting the fact that higher
returns can be achieved when joint optimization is performed.
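To make the update scheme concrete, the following minimal Python/PyTorch sketch illustrates projected stochastic gradient ascent over joint environment (design) and policy parameters: Monte-Carlo rollouts through a differentiable simulator, automatic differentiation of the averaged return, and projection of the design parameters back onto their feasible set. The toy dynamics, reward, bounds, and hyper-parameters below are illustrative assumptions and are not taken from the paper.

# Minimal sketch of DEPS-style projected stochastic gradient ascent.
# The environment model, reward, bounds, and constants are placeholders.
import torch

T, N = 50, 32                                        # horizon, Monte-Carlo rollouts
psi = torch.tensor([1.0, 0.5], requires_grad=True)   # environment (design) parameters
theta = torch.zeros(2, requires_grad=True)           # linear state-feedback policy gains

def rollout(psi, theta):
    """Simulate one noisy trajectory; return its differentiable cumulative reward."""
    x = torch.tensor([1.0, 0.0])                     # initial state: position, velocity
    ret = torch.tensor(0.0)
    for _ in range(T):
        u = theta @ x                                # policy action (deterministic here)
        acc = -psi[0] * x[0] - psi[1] * x[1] + u     # toy stiffness/damping dynamics
        x = x + 0.05 * torch.stack([x[1], acc]) + 0.01 * torch.randn(2)
        ret = ret - (x[0] ** 2 + 0.1 * u ** 2)       # reward = negative quadratic cost
    return ret

opt = torch.optim.SGD([psi, theta], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    J = torch.stack([rollout(psi, theta) for _ in range(N)]).mean()  # Monte-Carlo return
    (-J).backward()                                  # ascent on J via automatic differentiation
    opt.step()
    with torch.no_grad():
        psi.clamp_(0.1, 5.0)                         # projection onto the feasible design set

The sketch uses a deterministic policy and reparametrized process noise for brevity; the paper's formulation also covers stochastic parametrized policies.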
Related papers
- A Simulation-Free Deep Learning Approach to Stochastic Optimal Control [12.699529713351287]
We propose a simulation-free algorithm for the solution of generic problems in stochastic optimal control (SOC).
Unlike existing methods, our approach does not require the solution of an adjoint problem.
arXiv Detail & Related papers (2024-10-07T16:16:53Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - A Robust Policy Bootstrapping Algorithm for Multi-objective
Reinforcement Learning in Non-stationary Environments [15.794728813746397]
Multi-objective reinforcement learning methods fuse the reinforcement learning paradigm with multi-objective optimization techniques.
One major drawback of these methods is the lack of adaptability to non-stationary dynamics in the environment.
We propose a novel multi-objective reinforcement learning algorithm that can robustly evolve a convex coverage set of policies in an online manner in non-stationary environments.
arXiv Detail & Related papers (2023-08-18T02:15:12Z) - Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained via a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective
Reinforcement Learning [17.916366827429034]
We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions.
We propose an Anchor-changing Regularized Natural Policy Gradient framework, which can incorporate ideas from well-performing first-order methods.
arXiv Detail & Related papers (2022-06-10T21:09:44Z) - Policy Optimization for Stochastic Shortest Path [43.2288319750466]
We study policy optimization for the stochastic shortest path (SSP) problem.
We propose a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model.
For most settings, our algorithm is shown to achieve a near-optimal regret bound.
arXiv Detail & Related papers (2022-02-07T16:25:14Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order (ZO) algorithm, ZO-RL, which learns the sampling policy used to generate the perturbations in ZO optimization instead of relying on random sampling (a baseline random-perturbation ZO estimator is sketched after this list).
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimate by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Escaping from Zero Gradient: Revisiting Action-Constrained Reinforcement
Learning via Frank-Wolfe Policy Optimization [5.072893872296332]
Action-constrained reinforcement learning (RL) is a widely used approach in various real-world applications.
We propose a learning algorithm that decouples the action constraints from the policy parameter update.
We show that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
arXiv Detail & Related papers (2021-02-22T14:28:03Z) - Robust Reinforcement Learning with Wasserstein Constraint [49.86490922809473]
We show the existence of optimal robust policies, provide a sensitivity analysis for the perturbations, and then design a novel robust learning algorithm.
The effectiveness of the proposed algorithm is verified in the Cart-Pole environment.
arXiv Detail & Related papers (2020-06-01T13:48:59Z)
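For context on the ZO-RL entry above, the following short sketch shows a standard two-point zeroth-order gradient estimator with random Gaussian perturbations, i.e. the baseline sampling scheme that ZO-RL replaces with a learned sampling policy. The objective, number of directions, and step sizes are illustrative assumptions.

# Minimal sketch of a two-point zeroth-order (ZO) gradient estimator with random
# Gaussian perturbations; ZO-RL replaces this random sampling with a learned policy.
import numpy as np

def f(x):
    return float(np.sum(x ** 2))             # placeholder black-box objective

def zo_gradient(f, x, num_dirs=20, mu=1e-2):
    """Estimate the gradient of f at x from function values only."""
    grad = np.zeros_like(x)
    for _ in range(num_dirs):
        u = np.random.randn(*x.shape)         # random perturbation direction
        grad += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return grad / num_dirs

x = np.ones(5)
for _ in range(100):
    x = x - 0.05 * zo_gradient(f, x)          # plain ZO gradient descent step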
This list is automatically generated from the titles and abstracts of the papers on this site.