Related papers: Improper Learning with Gradient-based Policy Optimization

URL: http://arxiv.org/abs/2102.08201v1
Date: Tue, 16 Feb 2021 14:53:55 GMT
Title: Improper Learning with Gradient-based Policy Optimization
Authors: Mohammadi Zaki, Avinash Mohan, Aditya Gopalan and Shie Mannor
Abstract summary: We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process. We propose a gradient-based approach that operates over a class of improper mixtures of the controllers.
Score: 62.50997487685586
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. We propose a gradient-based approach that operates over a class of improper mixtures of the controllers. The value function of the mixture and its gradient may not be available in closed form; however, we show that we can employ rollouts and simultaneous perturbation stochastic approximation (SPSA) for explicit gradient descent optimization. We derive convergence and convergence rate guarantees for the approach assuming access to a gradient oracle. Numerical results on a challenging constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when each constituent policy at its disposal is unstable.
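The abstract describes the method only at a high level (an improper mixture of M base controllers, rollout-based estimates, SPSA gradients, gradient descent), so what follows is a minimal Python sketch under stated assumptions, not the authors' implementation. It assumes a softmax-parameterized mixture that, at every step, samples base controller k with probability softmax(θ)_k and applies that controller's action, and it minimizes an average cost estimated from rollouts. The two-queue environment, the two "always serve queue i" base controllers, and every name and constant below (`mixture_cost`, `spsa_gradient`, `optimize`, `sample_index`, `ARRIVAL`, `SERVICE`, `HORIZON`, the step and perturbation sizes) are hypothetical. The toy is chosen so that each base controller used alone lets one queue grow without bound, while any mixture that serves each queue with probability between 1/3 and 2/3 gives both queues a service rate above their arrival rate.

```python
import math
import random

# --- Hypothetical toy environment (an assumption, NOT the paper's benchmark) ---
# Two queues share one server. Each step, queue i gets a job with probability
# ARRIVAL[i]; the served queue loses one job with probability SERVICE if non-empty.
ARRIVAL = (0.3, 0.3)
SERVICE = 0.9
HORIZON = 200

# M = 2 hypothetical base controllers: each ignores the state q and always
# serves one fixed queue, so used alone it lets the other queue grow unboundedly.
BASE_CONTROLLERS = [lambda q: 0, lambda q: 1]


def softmax(theta):
    m = max(theta)
    z = [math.exp(t - m) for t in theta]
    s = sum(z)
    return [v / s for v in z]


def sample_index(weights, u):
    """Inverse-CDF sampling: smallest k with u < weights[0] + ... + weights[k]."""
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if u < acc:
            return k
    return len(weights) - 1  # guard against floating-point round-off


def mixture_cost(theta, seed, n_rollouts=5):
    """Average per-step total queue length of the mixture softmax(theta)."""
    rng = random.Random(seed)
    weights = softmax(theta)
    total = 0.0
    for _ in range(n_rollouts):
        q = [0, 0]
        for _ in range(HORIZON):
            # Always draw the same 4 numbers per step, so two runs with the same
            # seed but different theta share their random streams.
            u = rng.random()
            arrivals = [rng.random() < a for a in ARRIVAL]
            success = rng.random() < SERVICE
            served = BASE_CONTROLLERS[sample_index(weights, u)](q)
            if q[served] > 0 and success:
                q[served] -= 1
            for i in range(2):
                q[i] += arrivals[i]
            total += sum(q)
    return total / (n_rollouts * HORIZON)


def spsa_gradient(theta, rng, c=0.1):
    """Two-sided SPSA: one Rademacher perturbation, two cost estimates."""
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    seed = rng.getrandbits(32)  # common random numbers for both evaluations
    plus = mixture_cost([t + c * d for t, d in zip(theta, delta)], seed)
    minus = mixture_cost([t - c * d for t, d in zip(theta, delta)], seed)
    return [(plus - minus) / (2.0 * c * d) for d in delta]


def optimize(theta0, steps=60, lr=0.01, seed=0):
    """Plain gradient descent on the estimated cost using SPSA gradients."""
    rng = random.Random(seed)
    theta = list(theta0)
    for _ in range(steps):
        grad = spsa_gradient(theta, rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    theta0 = [1.5, -1.5]  # leans heavily on controller 0, so queue 2 is unstable
    theta = optimize(theta0)
    print("weights before:", softmax(theta0), "after:", softmax(theta))
    print("cost before:", mixture_cost(theta0, seed=1),
          "after:", mixture_cost(theta, seed=1))
```

Reusing one seed for the θ + cΔ and θ − cΔ evaluations (common random numbers) is a standard variance-reduction choice for SPSA that the abstract does not mention; without it, the difference of two noisy rollout averages would need many more rollouts. Each SPSA step needs only two cost estimates regardless of M, which is the usual reason to prefer it over coordinate-wise finite differences.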

Related papers

A learning-based approach to stochastic optimal control under reach-avoid constraint [7.036452261968767]
We develop a model-free approach to optimally control Markovian systems subject to a reach-avoid constraint. We prove that under suitable assumptions, the policy parameters converge to the optimal parameters, while ensuring that the system trajectories satisfy the reach-avoid constraint with high probability.
(2024-12-21T10:07:40Z)
Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems. In common practice, convergent (hyper)policies are learned only to deploy their deterministic version. We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
(2024-05-03T16:45:15Z)
Learning to Boost the Performance of Stable Nonlinear Systems [0.0]
We tackle the performance-boosting problem with closed-loop stability guarantees. Our methods enable learning over arbitrarily deep neural network classes of performance-boosting controllers for stable nonlinear systems.
(2024-05-01T21:11:29Z)
Actively Learning Reinforcement Learning: A Stochastic Optimal Control Approach [3.453622106101339]
We propose a framework towards achieving two intertwined objectives: (i) equipping reinforcement learning with active exploration and deliberate information gathering, and (ii) overcoming the computational intractability of the optimal control law. We approach both objectives by using reinforcement learning to compute the optimal control law. Unlike a fixed balance between exploration and exploitation, caution and probing are employed automatically by the controller in real time, even after the learning process is terminated.
(2023-09-18T18:05:35Z)
Optimal Control of Nonlinear Systems with Unknown Dynamics [4.551160285910024]
This paper presents a data-driven method for finding a closed-loop optimal controller. It minimizes a specified infinite-horizon cost function for systems with unknown dynamics given any arbitrary initial state.
(2023-05-24T14:27:22Z)
Policy Gradient for Rectangular Robust Markov Decision Processes [62.397882389472564]
We introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs). Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent.
(2023-01-31T12:40:50Z)
Actor-Critic based Improper Reinforcement Learning [61.430513757337486]
We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process. We propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic scheme and a Natural Actor-Critic scheme.
(2022-07-19T05:55:02Z)
Learning Stochastic Optimal Policies via Gradient Descent [17.9807134122734]
We systematically develop a learning-based treatment of stochastic optimal control (SOC). We propose a derivation of adjoint sensitivity results for differential equations through direct application of variational calculus. We verify the performance of the proposed approach on a continuous-time, finite-horizon portfolio optimization with proportional transaction costs.
(2021-06-07T16:43:07Z)
Closing the Closed-Loop Distribution Shift in Safe Imitation Learning [80.05727171757454]
We treat safe optimization-based control strategies as experts in an imitation learning problem. We train a learned policy that can be cheaply evaluated at run-time and that provably satisfies the same safety guarantees as the expert.
(2021-02-18T05:11:41Z)
Learning Constrained Adaptive Differentiable Predictive Control Policies With Guarantees [1.1086440815804224]
We present differentiable predictive control (DPC), a method for learning constrained neural control policies for linear systems. We employ automatic differentiation to obtain direct policy gradients by backpropagating the model predictive control (MPC) loss function and constraints penalties through a differentiable closed-loop system dynamics model.
(2020-04-23T14:24:44Z)
Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in applications of batch reinforcement learning such as education and healthcare. We develop an approach that estimates bounds on the value of a given policy. We prove convergence to the sharp bounds as we collect more confounded data.
(2020-02-11T16:18:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.