Constraint-Generation Policy Optimization (CGPO): Nonlinear Programming
for Policy Optimization in Mixed Discrete-Continuous MDPs
- URL: http://arxiv.org/abs/2401.12243v1
- Date: Sat, 20 Jan 2024 07:12:57 GMT
- Title: Constraint-Generation Policy Optimization (CGPO): Nonlinear Programming
for Policy Optimization in Mixed Discrete-Continuous MDPs
- Authors: Michael Gimelfarb, Ayal Taitler, Scott Sanner
- Abstract summary: CGPO provides bounded policy error guarantees over an infinite range of initial states for many DC-MDPs with expressive nonlinear dynamics.
CGPO can generate worst-case state trajectories to diagnose policy deficiencies and provide counterfactual explanations of optimal actions.
We experimentally demonstrate the applicability of CGPO in diverse domains, including inventory control, management of a system of water reservoirs, and physics control.
- Score: 23.87856533426793
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Constraint-Generation Policy Optimization (CGPO) for optimizing
policy parameters within compact and interpretable policy classes for mixed
discrete-continuous Markov Decision Processes (DC-MDPs). CGPO is not only able
to provide bounded policy error guarantees over an infinite range of initial
states for many DC-MDPs with expressive nonlinear dynamics, but it can also
provably derive optimal policies in cases where it terminates with zero error.
Furthermore, CGPO can generate worst-case state trajectories to diagnose policy
deficiencies and provide counterfactual explanations of optimal actions. To
achieve such results, CGPO proposes a bi-level mixed-integer nonlinear
optimization framework for optimizing policies within defined expressivity
classes (i.e. piecewise (non)-linear) and reduces it to an optimal constraint
generation methodology that adversarially generates worst-case state
trajectories. Furthermore, leveraging modern nonlinear optimizers, CGPO can
obtain solutions with bounded optimality gap guarantees. We handle stochastic
transitions through explicit marginalization (where applicable) or
chance-constraints, providing high-probability policy performance guarantees.
We also present a road-map for understanding the computational complexities
associated with different expressivity classes of policy, reward, and
transition dynamics. We experimentally demonstrate the applicability of CGPO in
diverse domains, including inventory control, management of a system of water
reservoirs, and physics control. In summary, we provide a solution for deriving
structured, compact, and explainable policies with bounded performance
guarantees, enabling worst-case scenario generation and counterfactual policy
diagnostics.
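To make the constraint-generation idea concrete, here is a minimal sketch of the bi-level loop on a toy deterministic inventory problem. It is not the authors' implementation: CGPO formulates both levels as mixed-integer nonlinear programs solved with bounded-gap solvers, whereas this sketch substitutes scipy's general-purpose nonlinear optimizer, and the dynamics, cost coefficients, policy class (a base-stock rule), and all function names are illustrative assumptions.

```python
# A toy constraint-generation loop (illustrative assumptions throughout):
# deterministic inventory dynamics, a base-stock policy class, and scipy's
# general-purpose NLP routines standing in for the MINLP solvers used by CGPO.
import numpy as np
from scipy.optimize import minimize

H = 5                        # planning horizon
DEMAND = 6.0                 # deterministic per-step demand (assumed)
A_MAX = 10.0                 # maximum order quantity per step
S_MIN, S_MAX = -20.0, 20.0   # range of initial inventory levels to certify over

def step_cost(s_next, a):
    # holding cost + backlog penalty + per-unit ordering cost (assumed numbers)
    return 1.0 * max(s_next, 0.0) + 4.0 * max(-s_next, 0.0) + 0.5 * a

def policy_cost(theta, s0):
    """Cumulative cost of the base-stock policy a = clip(theta - s, 0, A_MAX)."""
    s, total = float(s0), 0.0
    for _ in range(H):
        a = float(np.clip(theta[0] - s, 0.0, A_MAX))
        s = s + a - DEMAND
        total += step_cost(s, a)
    return total

def open_loop_cost(actions, s0):
    """Cumulative cost of an arbitrary open-loop action sequence from s0."""
    s, total = float(s0), 0.0
    for a in np.clip(actions, 0.0, A_MAX):
        s = s + a - DEMAND
        total += step_cost(s, a)
    return total

def policy_error(theta, s0):
    """Gap between the policy's cost and the best open-loop plan from s0."""
    best = minimize(open_loop_cost, x0=np.full(H, DEMAND), args=(s0,),
                    bounds=[(0.0, A_MAX)] * H)
    return policy_cost(theta, s0) - best.fun

# Constraint generation: alternate the outer policy fit with an inner adversary.
witnesses = [0.0]            # initial set of "witness" initial states
theta = np.array([5.0])      # initial base-stock level
for it in range(10):
    # Outer level: minimize the worst error over the current witness set.
    outer = minimize(lambda th: max(policy_error(th, s) for s in witnesses),
                     x0=theta, bounds=[(0.0, S_MAX)])
    theta = outer.x
    # Inner level: search for an initial state with larger policy error
    # (multi-start local search stands in for a global MINLP adversary).
    best_s, best_err = None, -np.inf
    for s_start in np.linspace(S_MIN, S_MAX, 9):
        inner = minimize(lambda s: -policy_error(theta, s[0]),
                         x0=[s_start], bounds=[(S_MIN, S_MAX)])
        if -inner.fun > best_err:
            best_err, best_s = -inner.fun, float(inner.x[0])
    print(f"iter {it}: base-stock level {theta[0]:.2f}, "
          f"worst-case error {best_err:.3f}")
    if best_err <= max(outer.fun, 0.0) + 1e-3:
        break                # no violated constraint found: error is certified
    witnesses.append(best_s) # add the worst-case state as a new constraint
```

The loop mirrors the structure described in the abstract: the outer problem fits the policy parameters against a growing set of witness initial states, while the inner adversary searches for an initial state whose worst-case trajectory violates the current error bound; when no such state exists, the worst-case policy error is certified over the whole initial-state range.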
Related papers
- Convergence of Policy Mirror Descent Beyond Compatible Function Approximation [66.4260157478436]
We develop theory for PMD over general policy classes, assuming only a strictly weaker variational dominance condition, and obtain convergence to the best-in-class policy.
Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure.
arXiv Detail & Related papers (2025-02-16T08:05:46Z) - Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action [10.219627570276689]
We develop a framework for a class of Markov Decision Processes with general state and action spaces.
We show that gradient methods converge to the globally optimal policy at a nonasymptotic rate under suitable conditions.
Our results establish the first complexity bounds for multi-period inventory systems.
arXiv Detail & Related papers (2024-09-25T17:56:02Z) - Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form [26.01796404477275]
This paper presents the first algorithm guaranteed to identify a near-optimal policy in a robust constrained MDP (RCMDP).
An optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments.
arXiv Detail & Related papers (2024-08-29T06:37:16Z) - Policy Gradient Algorithms Implicitly Optimize by Continuation [7.351769270728942]
We argue that exploration in policy-gradient algorithms amounts to optimizing a continuation of the return of the current policy, and that to maximize the return, policies should be history-dependent.
arXiv Detail & Related papers (2023-05-11T14:50:20Z) - Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is explicitly free of any trust-region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z) - COptiDICE: Offline Constrained Reinforcement Learning via Stationary
Distribution Correction Estimation [73.17078343706909]
We study the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z) - Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs [113.8752163061151]
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs).
We propose the Periodically Restarted Optimistic Policy Optimization algorithm (PROPO).
PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement.
arXiv Detail & Related papers (2021-10-18T02:33:20Z) - Improper Learning with Gradient-based Policy Optimization [62.50997487685586]
We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process.
We propose a gradient-based approach that operates over a class of improper mixtures of the controllers.
arXiv Detail & Related papers (2021-02-16T14:53:55Z) - Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z)