Batch Policy Learning in Average Reward Markov Decision Processes
- URL: http://arxiv.org/abs/2007.11771v3
- Date: Sat, 17 Sep 2022 17:56:24 GMT
- Title: Batch Policy Learning in Average Reward Markov Decision Processes
- Authors: Peng Liao, Zhengling Qi, Runzhe Wan, Predrag Klasnja, Susan Murphy
- Abstract summary: Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward.
We develop an optimization algorithm to compute the optimal policy in a parameterized policy class.
The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy.
- Score: 3.9023554886892438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the batch (off-line) policy learning problem in the infinite
horizon Markov Decision Process. Motivated by mobile health applications, we
focus on learning a policy that maximizes the long-term average reward. We
propose a doubly robust estimator for the average reward and show that it
achieves semiparametric efficiency. Further, we develop an optimization
algorithm to compute the optimal policy in a parameterized stochastic policy
class. The performance of the estimated policy is measured by the difference
between the optimal average reward in the policy class and the average reward
of the estimated policy, and we establish a finite-sample regret guarantee. The
performance of the method is illustrated by simulation studies and an analysis
of a mobile health study promoting physical activity.
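For concreteness, the performance criterion described in the abstract can be written out as follows (the notation is ours, paraphrasing the abstract rather than quoting the paper): $\eta(\pi)$ is the long-term average reward of a policy $\pi$, and the regret of an estimated policy $\hat{\pi}_n$ is measured against the best policy in the class $\Pi$:
$$
\eta(\pi) \;=\; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} R_t\right],
\qquad
\operatorname{Regret}(\hat{\pi}_n) \;=\; \max_{\pi \in \Pi} \eta(\pi) \;-\; \eta(\hat{\pi}_n).
$$
As an illustration of the doubly robust idea for the average reward, the sketch below is a minimal tabular example under our own assumptions (the paper's exact estimator, nuisance models, and conditions may differ). It combines a hypothetical estimated stationary density ratio `omega_hat` with a hypothetical estimated relative action-value function `q_hat`:
```python
import numpy as np

def dr_average_reward(states, actions, rewards, next_states,
                      omega_hat, q_hat, pi):
    """Illustrative doubly robust estimate of the long-term average reward.

    Hypothetical inputs (not the paper's notation):
      omega_hat[s, a] -- estimated ratio of the target policy's stationary
                         state-action distribution to the data distribution
      q_hat[s, a]     -- estimated relative action-value function
      pi[s, a]        -- target policy probabilities pi(a | s)
    states, actions, rewards, next_states are 1-D arrays of transitions.
    """
    w = omega_hat[states, actions]                        # density-ratio weights
    # V(s') = sum_a pi(a | s') * Q(s', a) under the target policy
    v_next = (pi[next_states] * q_hat[next_states]).sum(axis=1)
    residual = rewards + v_next - q_hat[states, actions]
    # Solve the empirical moment condition
    #   E_n[ w * (R - eta + V(S') - Q(S, A)) ] = 0   for eta.
    return np.sum(w * residual) / np.sum(w)
```
The doubly robust structure typically yields a consistent estimate if either nuisance estimate ($\hat{\omega}$ or $\hat{Q}$) is accurate; see the paper for the precise estimator and its semiparametric efficiency analysis.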
Related papers
- Actor-Critic Reinforcement Learning with Phased Actor [10.577516871906816]
We propose a novel phased actor in actor-critic (PAAC) method to improve policy gradient estimation.
PAAC accounts for both $Q$ value and TD error in its actor update.
Results show that PAAC leads to significant performance improvement measured by total cost, learning variance, robustness, learning speed and success rate.
arXiv Detail & Related papers (2024-04-18T01:27:31Z) - On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
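The translation from iterate convergence to regret follows the standard summation argument (our gloss, not taken from that paper): if the optimality gap of the $t$-th iterate is $O(1/t)$, then the cumulative regret over $T$ iterations satisfies
$$
\sum_{t=1}^{T} \bigl(\eta^{*} - \eta(\pi_t)\bigr) \;\le\; \sum_{t=1}^{T} \frac{C}{t} \;=\; O(\log T).
$$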
arXiv Detail & Related papers (2024-03-11T15:25:03Z) - Off-Policy Average Reward Actor-Critic with Deterministic Policy Search [3.551625533648956]
We present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion.
We also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm.
We evaluate the average reward performance of the proposed ARO-DDPG and observe better empirical performance than state-of-the-art on-policy average reward actor-critic algorithms on MuJoCo-based environments.
arXiv Detail & Related papers (2023-05-20T17:13:06Z) - Stochastic first-order methods for average-reward Markov decision processes [10.023632561462712]
We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation.
By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models.
arXiv Detail & Related papers (2022-05-11T23:02:46Z) - Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Robust Batch Policy Learning in Markov Decision Processes [0.0]
We study the offline data-driven sequential decision-making problem in the framework of the Markov decision process (MDP).
We propose to evaluate each policy by a set of average rewards with respect to distributions centered at the policy-induced stationary distribution.
arXiv Detail & Related papers (2020-11-09T04:41:21Z) - Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z) - Efficient Evaluation of Natural Stochastic Policies in Offline
Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies.
arXiv Detail & Related papers (2020-06-06T15:08:24Z) - Efficient Policy Learning from Surrogate-Loss Classification Reductions [65.91730154730905]
We consider the estimation problem given by a weighted surrogate-loss classification reduction of policy learning.
We show that, under a correct specification assumption, the weighted classification formulation need not be efficient for policy parameters.
We propose an estimation approach based on generalized method of moments, which is efficient for the policy parameters.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)