COptiDICE: Offline Constrained Reinforcement Learning via Stationary
Distribution Correction Estimation
- URL: http://arxiv.org/abs/2204.08957v1
- Date: Tue, 19 Apr 2022 15:55:47 GMT
- Title: COptiDICE: Offline Constrained Reinforcement Learning via Stationary
Distribution Correction Estimation
- Authors: Jongmin Lee, Cosmin Paduraru, Daniel J. Mankowitz, Nicolas Heess,
Doina Precup, Kee-Eung Kim, Arthur Guez
- Abstract summary: We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
- Score: 73.17078343706909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the offline constrained reinforcement learning (RL) problem, in
which the agent aims to compute a policy that maximizes expected return while
satisfying given cost constraints, learning only from a pre-collected dataset.
This problem setting is appealing in many real-world scenarios, where direct
interaction with the environment is costly or risky, and where the resulting
policy should comply with safety constraints. However, it is challenging to
compute a policy that guarantees satisfying the cost constraints in the offline
RL setting, since off-policy evaluation inherently has estimation errors.
In this paper, we present an offline constrained RL algorithm that optimizes
the policy in the space of the stationary distribution. Our algorithm,
COptiDICE, directly estimates the stationary distribution corrections of the
optimal policy with respect to returns, while constraining the cost upper
bound, with the goal of yielding a cost-conservative policy for actual
constraint satisfaction. Experimental results show that COptiDICE attains
better policies in terms of constraint satisfaction and return-maximization,
outperforming baseline algorithms.
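For reference, the "optimization in the space of the stationary distribution" can be written in a standard form (this formulation is a sketch in our own notation, following the usual OptiDICE-style setup rather than quoting the paper; the f-divergence regularizer and symbols are assumptions):

\max_{d \ge 0} \; \mathbb{E}_{(s,a)\sim d}\big[R(s,a)\big] - \alpha\, D_f\!\left(d \,\|\, d^{\mathcal{D}}\right)
\quad \text{s.t.} \quad \mathbb{E}_{(s,a)\sim d}\big[C_k(s,a)\big] \le \hat{c}_k \;\; \forall k,
\qquad \sum_a d(s,a) = (1-\gamma)\, p_0(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \;\; \forall s,

where d^{\mathcal{D}} is the dataset's state-action distribution, D_f is an f-divergence keeping the optimized distribution close to the data, \hat{c}_k are the cost thresholds, p_0 is the initial-state distribution, and the last constraint is the Bellman flow condition. In this view, COptiDICE estimates the correction ratios d(s,a)/d^{\mathcal{D}}(s,a) of the optimal solution while using a conservative (upper-bound) estimate of the constrained cost.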
Related papers
- Resilient Constrained Reinforcement Learning [87.4374430686956]
We study a class of constrained reinforcement learning (RL) problems in which multiple constraint specifications are not identified before training.
It is challenging to identify appropriate constraint specifications due to the undefined trade-off between the reward-maximization objective and constraint satisfaction.
We propose a new constrained RL approach that searches for policy and constraint specifications together.
arXiv Detail & Related papers (2023-12-28T18:28:23Z) - Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper level) and the value function (lower level).
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z) - Distributional constrained reinforcement learning for supply chain
optimization [0.0]
We introduce Distributional Constrained Policy Optimization (DCPO), a novel approach for reliable constraint satisfaction in reinforcement learning.
We show that DCPO improves the rate at which the RL policy converges and ensures reliable constraint satisfaction by the end of training.
arXiv Detail & Related papers (2023-02-03T13:43:02Z) - Optimal Conservative Offline RL with General Function Approximation via
Augmented Lagrangian [18.2080757218886]
Offline reinforcement learning (RL) refers to decision-making from a previously collected dataset of interactions.
We present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability.
arXiv Detail & Related papers (2022-11-01T19:28:48Z) - Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
arXiv Detail & Related papers (2022-05-24T06:15:51Z) - Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z) - OptiDICE: Offline Policy Optimization via Stationary Distribution
Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z) - Off-Policy Optimization of Portfolio Allocation Policies under
Constraints [0.8848340429852071]
The dynamic portfolio optimization problem in finance frequently requires learning policies that adhere to various constraints, driven by investor preferences and risk.
We motivate this problem of finding an allocation policy within a sequential decision making framework and study the effects of: (a) using data collected under previously employed policies, which may be sub-optimal and constraint-violating, and (b) imposing desired constraints while computing near-optimal policies with this data.
arXiv Detail & Related papers (2020-12-21T22:22:04Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with convergence guarantees to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)