Simple and optimal methods for stochastic variational inequalities, II:
Markovian noise and policy evaluation in reinforcement learning
- URL: http://arxiv.org/abs/2011.08434v4
- Date: Fri, 13 Aug 2021 23:14:14 GMT
- Title: Simple and optimal methods for stochastic variational inequalities, II:
Markovian noise and policy evaluation in reinforcement learning
- Authors: Georgios Kotsalis and Guanghui Lan and Tianjiao Li
- Abstract summary: This paper focuses on stochastic variational inequalities (VI) under Markovian noise.
A prominent application of our algorithmic developments is the policy evaluation problem in reinforcement learning.
- Score: 9.359939442911127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The focus of this paper is on stochastic variational inequalities (VI) under
Markovian noise. A prominent application of our algorithmic developments is the
stochastic policy evaluation problem in reinforcement learning. Prior
investigations in the literature focused on temporal difference (TD) learning
by employing nonsmooth finite time analysis motivated by stochastic subgradient
descent leading to certain limitations. These encompass the requirement of
analyzing a modified TD algorithm that involves projection to an a-priori
defined Euclidean ball, achieving a non-optimal convergence rate and no clear
way of deriving the beneficial effects of parallel implementation. Our approach
remedies these shortcomings in the broader context of stochastic VIs and in
particular when it comes to stochastic policy evaluation. We develop a
variety of simple TD learning type algorithms, motivated by the original
version of TD, that maintain its simplicity while offering distinct advantages
from a non-asymptotic analysis point of view. We first provide an improved analysis of
the standard TD algorithm that can benefit from parallel implementation. Then
we present versions of a conditional TD algorithm (CTD), that involves periodic
updates of the stochastic iterates, which reduce the bias and therefore exhibit
improved iteration complexity. This brings us to the fast TD (FTD) algorithm
which combines elements of CTD and the stochastic operator extrapolation method
of the companion paper. For a novel index resetting policy, FTD exhibits the
best known convergence rate. We also devise a robust version of the algorithm
that is particularly suitable for discount factors close to 1.
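To make the baseline concrete, below is a minimal sketch of the standard TD(0) update for policy evaluation with linear function approximation under Markovian sampling, i.e. the "original version" of TD that the algorithms above build on. The environment hooks `sample_transition` and `features` are hypothetical placeholders, and the sketch only illustrates the classical update, not the paper's CTD or FTD algorithms.

```python
import numpy as np

def td0_policy_evaluation(sample_transition, features, dim, gamma=0.99,
                          step_size=0.01, num_steps=10_000, seed=0):
    """Plain TD(0) with linear function approximation, V(s) ~= theta^T phi(s).

    sample_transition(state, rng) -> (reward, next_state) under the fixed
    policy being evaluated; features(state) -> d-dimensional vector phi(s).
    Both hooks are assumed, illustrative interfaces.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    state = 0  # assumed initial state of the Markov chain
    for _ in range(num_steps):
        reward, next_state = sample_transition(state, rng)
        phi, phi_next = features(state), features(next_state)
        td_error = reward + gamma * theta @ phi_next - theta @ phi
        theta += step_size * td_error * phi  # standard TD(0) update
        state = next_state  # samples stay correlated (Markovian noise)
    return theta
```

A parallel implementation, one of the benefits the abstract attributes to the standard TD scheme, would roughly amount to averaging the increments `step_size * td_error * phi` across several independently sampled trajectories before applying them.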
Related papers
- Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP [1.0923877073891446]
We consider the problem of policy evaluation for variance in a discounted reward Markov decision process.
For this problem, a temporal difference (TD) type learning algorithm with linear function approximation (LFA) exists in the literature.
We derive finite sample bounds that hold (i) in the mean-squared sense; and (ii) with high probability, when tail iterate averaging is employed.
arXiv Detail & Related papers (2024-06-12T05:49:53Z)
- Variance reduction techniques for stochastic proximal point algorithms [5.374800961359305]
In this work, we propose the first unified study of variance reduction techniques for proximal point algorithms.
We introduce a generic proximal-based algorithm that can be specified to give the proximal version of SVRG, SAGA, and some of their variants.
Our experiments demonstrate the advantages of the proximal variance reduction methods over their gradient counterparts.
arXiv Detail & Related papers (2023-08-18T05:11:50Z)
- The Stochastic Proximal Distance Algorithm [5.3315823983402755]
We propose and analyze a class of iterative optimization methods that recover a desired constrained estimation problem as the penalty parameter grows large.
We extend recent theoretical devices to establish finite error bounds and a complete characterization of convergence rates.
We validate our analysis via a thorough empirical study, also showing that, unsurprisingly, the proposed method outpaces batch versions on popular learning tasks.
arXiv Detail & Related papers (2022-10-21T22:07:28Z)
- Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
In this setting, bootstrapping is necessary and, along with off-policy learning and function approximation (FA), results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Byzantine-Resilient Non-Convex Stochastic Gradient Descent [61.6382287971982]
We study adversary-resilient distributed optimization, in which machines can independently compute gradients and cooperate.
Our algorithm is based on a new concentration technique, and its sample complexity matches the best known bounds in the distributed setting when no Byzantine machines are present.
It is very practical: it improves upon the performance of all prior methods.
arXiv Detail & Related papers (2020-12-28T17:19:32Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity [40.73281056650241]
We introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true gradient temporal difference learning algorithms.
We show how gradient TD reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function (see the gradient TD sketch after this list).
arXiv Detail & Related papers (2020-06-06T21:04:21Z)
- Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$ error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z)
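For the "Proximal Gradient Temporal Difference Learning" entry above, a hedged sketch of a GTD2-style update follows: it is one standard member of the gradient TD family that the entry derives from a primal-dual saddle-point objective. The hooks `sample_transition` and `features` are the same hypothetical interfaces as in the earlier sketch, and this is a generic illustration rather than a transcription of that paper's algorithm.

```python
import numpy as np

def gtd2_policy_evaluation(sample_transition, features, dim, gamma=0.99,
                           alpha=0.005, beta=0.05, num_steps=10_000, seed=0):
    """GTD2-style two-timescale update for on-policy evaluation.

    theta is the primal (value) parameter; w is an auxiliary/dual parameter
    that tracks the expected TD error in feature space. Both hooks are assumed.
    """
    rng = np.random.default_rng(seed)
    theta, w = np.zeros(dim), np.zeros(dim)
    state = 0  # assumed initial state
    for _ in range(num_steps):
        reward, next_state = sample_transition(state, rng)
        phi, phi_next = features(state), features(next_state)
        td_error = reward + gamma * theta @ phi_next - theta @ phi
        w += beta * (td_error - phi @ w) * phi                  # fast (dual) step
        theta += alpha * (phi - gamma * phi_next) * (phi @ w)   # slow (primal) step
        state = next_state
    return theta
```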