Randomized Policy Optimization for Optimal Stopping
- URL: http://arxiv.org/abs/2203.13446v1
- Date: Fri, 25 Mar 2022 04:33:15 GMT
- Title: Randomized Policy Optimization for Optimal Stopping
- Authors: Xinyi Guan, Velibor V. Mišić
- Abstract summary: We propose a new methodology for optimal stopping based on randomized linear policies.
We show that our approach can substantially outperform state-of-the-art methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optimal stopping is the problem of determining when to stop a stochastic
system in order to maximize reward, which is of practical importance in domains
such as finance, operations management and healthcare. Existing methods for
high-dimensional optimal stopping that are popular in practice produce
deterministic linear policies -- policies that deterministically stop based on
the sign of a weighted sum of basis functions -- but are not guaranteed to find
the optimal policy within this policy class given a fixed basis function
architecture. In this paper, we propose a new methodology for optimal stopping
based on randomized linear policies, which choose to stop with a probability
that is determined by a weighted sum of basis functions. We motivate these
policies by establishing that under mild conditions, given a fixed basis
function architecture, optimizing over randomized linear policies is equivalent
to optimizing over deterministic linear policies. We formulate the problem of
learning randomized linear policies from data as a smooth non-convex sample
average approximation (SAA) problem. We theoretically prove the almost sure
convergence of our randomized policy SAA problem and establish bounds on the
out-of-sample performance of randomized policies obtained from our SAA problem
based on Rademacher complexity. We also show that the SAA problem is in general
NP-Hard, and consequently develop a practical heuristic for solving our
randomized policy problem. Through numerical experiments on a benchmark family
of option pricing problem instances, we show that our approach can
substantially outperform state-of-the-art methods.
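As a rough illustration of the kind of policy the abstract describes (not the authors' exact formulation), the Python sketch below assumes a logistic link and hypothetical basis functions: at each exercise period the policy stops with probability given by the logistic transform of a weighted sum of basis function values, and a sample-average estimate of the expected reward over simulated trajectories plays the role of the smooth non-convex SAA objective described above.

    import numpy as np

    def stop_probability(w, phi_xt):
        # Randomized linear policy: stop with probability determined by a
        # weighted sum of basis functions (logistic link assumed here).
        return 1.0 / (1.0 + np.exp(-np.dot(w, phi_xt)))

    def saa_objective(w, phi, rewards):
        # Sample-average estimate of expected reward under the randomized policy.
        # phi[i, t, :]  : hypothetical basis function values, trajectory i, period t
        # rewards[i, t] : discounted payoff from stopping trajectory i at period t
        n, T, _ = phi.shape
        total = 0.0
        for i in range(n):
            survive = 1.0  # probability the policy has not stopped before period t
            for t in range(T):
                p_stop = stop_probability(w, phi[i, t])
                total += survive * p_stop * rewards[i, t]
                survive *= 1.0 - p_stop
        return total / n

Maximizing this objective over the weight vector w (for example, with a generic gradient-based optimizer) corresponds to the paper's SAA problem, which the authors show is NP-Hard in general and therefore address with a practical heuristic.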
Related papers
- Strongly-polynomial time and validation analysis of policy gradient methods [3.722665817361884]
This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDP) and reinforcement learning (RL).
By incorporating this advantage gap function into the design of step size rules, we derive a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy.
This is the first time that such strong convergence properties have been established for policy gradient methods.
arXiv Detail & Related papers (2024-09-28T18:56:48Z) - Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
arXiv Detail & Related papers (2023-09-27T16:42:10Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Sample Complexity of Policy-Based Methods under Off-Policy Sampling and
Linear Function Approximation [8.465228064780748]
Off-policy sampling and linear function approximation are employed for policy evaluation.
Various policy update rules, including natural policy gradient (NPG), are considered for policy update.
We establish for the first time an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for finding an optimal policy.
arXiv Detail & Related papers (2022-08-05T15:59:05Z) - Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - A Reinforcement Learning Approach to the Stochastic Cutting Stock
Problem [0.0]
We propose a formulation of the cutting stock problem as a discounted infinite-horizon decision process.
An optimal solution corresponds to a policy that associates each state with a decision and minimizes the expected total cost.
arXiv Detail & Related papers (2021-09-20T14:47:54Z) - Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic
Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)