A Boosting Approach to Reinforcement Learning
- URL: http://arxiv.org/abs/2108.09767v1
- Date: Sun, 22 Aug 2021 16:00:45 GMT
- Title: A Boosting Approach to Reinforcement Learning
- Authors: Nataly Brukhim, Elad Hazan, Karan Singh
- Abstract summary: We study efficient algorithms for reinforcement learning in Markov decision processes whose complexity is independent of the number of states.
We give an efficient algorithm that is capable of improving the accuracy of weak learning methods until global optimality is reached.
- Score: 59.46285581748018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study efficient algorithms for reinforcement learning in Markov decision
processes whose complexity is independent of the number of states. This
formulation succinctly captures large scale problems, but is also known to be
computationally hard in its general form. Previous approaches attempt to
circumvent the computational hardness by assuming structure in either the
transition function or the value function, or by relaxing the solution
guarantee to a local optimality condition.
We consider the methodology of boosting, borrowed from supervised learning,
for converting weak learners into an accurate policy. The notion of weak
learning we study is that of sample-based approximate optimization of linear
functions over policies. Under this assumption of weak learnability, we give an
efficient algorithm that is capable of improving the accuracy of such weak
learning methods until global optimality is reached. We prove sample complexity
and running time bounds for our method that are polynomial in the natural
parameters of the problem: approximation guarantee, discount factor,
distribution mismatch and number of actions. In particular, our bound does not
depend on the number of states.
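As an illustrative aside, the weak-learning notion above (sample-based approximate maximization of linear functions over policies) can be pictured as an oracle with roughly the following interface. This is a minimal sketch, not the paper's code; the function names and the deliberately weak constant-action policy class are assumptions made for illustration only.
```python
# Illustrative sketch of a weak-learning oracle: given sampled states and a
# linear objective over state-action pairs, return a policy from a restricted
# class that approximately maximizes the objective. All names are assumptions.
import numpy as np
from typing import Callable

# A policy maps a state to a probability distribution over actions.
Policy = Callable[[np.ndarray], np.ndarray]


def weak_learner(
    linear_objective: Callable[[np.ndarray, int], float],
    sampled_states: np.ndarray,
    num_actions: int,
) -> Policy:
    """Approximately maximize E_s[ sum_a pi(a|s) * linear_objective(s, a) ]
    over a (deliberately weak) policy class. Here the class is just the
    constant-action policies; a real weak learner would fit a classifier."""
    avg_score = np.array([
        np.mean([linear_objective(s, a) for s in sampled_states])
        for a in range(num_actions)
    ])
    best_action = int(np.argmax(avg_score))

    def policy(state: np.ndarray) -> np.ndarray:
        probs = np.zeros(num_actions)
        probs[best_action] = 1.0
        return probs

    return policy
```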
A technical difficulty in applying previous boosting results is that the
value function over policy space is not convex. We show how to use a non-convex
variant of the Frank-Wolfe method, coupled with recent advances in gradient
boosting that allow incorporating a weak learner with multiplicative
approximation guarantee, to overcome the non-convexity and attain global
convergence.
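To make this concrete, the following is a hedged sketch of a Frank-Wolfe-style boosting loop that repeatedly calls such a weak-learning oracle and mixes its output into the current policy. It is not the paper's reference implementation: sample_states and estimate_q_values are hypothetical helpers standing in for roll-outs under the current policy and sampled Q-value estimates, and the per-state mixture of action distributions is just one simple way to combine base policies.
```python
# Hedged sketch of a Frank-Wolfe-style boosting loop over policies. It reuses
# the weak_learner() sketch above; sample_states() and estimate_q_values() are
# hypothetical helpers standing in for roll-outs and sampled Q-value estimates.
import numpy as np
from typing import Callable, List

Policy = Callable[[np.ndarray], np.ndarray]


def mix_policies(policies: List[Policy], weights: List[float]) -> Policy:
    """Convex combination of base policies (the boosted ensemble).
    Mixing per-state action distributions is an illustrative choice; the
    paper's notion of a policy mixture may differ."""
    def mixed(state: np.ndarray) -> np.ndarray:
        return sum(w * p(state) for w, p in zip(weights, policies))
    return mixed


def boost_policy(
    weak_learner: Callable[..., Policy],
    sample_states: Callable[[Policy], np.ndarray],          # assumption: roll-out sampler
    estimate_q_values: Callable[[Policy, np.ndarray, int], float],  # assumption: sampled Q-estimates
    num_actions: int,
    rounds: int = 50,
) -> Policy:
    # Start from the uniform policy and boost for a fixed number of rounds.
    policies: List[Policy] = [lambda s: np.ones(num_actions) / num_actions]
    weights: List[float] = [1.0]

    for t in range(1, rounds + 1):
        current = mix_policies(policies, weights)
        states = sample_states(current)

        # Linear functional over policies: expected Q-value under the current
        # state distribution; this plays the role of the Frank-Wolfe gradient.
        def linear_objective(s, a, current=current):
            return estimate_q_values(current, s, a)

        direction = weak_learner(linear_objective, states, num_actions)

        eta = 2.0 / (t + 2)   # standard Frank-Wolfe step-size schedule
        weights = [(1.0 - eta) * w for w in weights] + [eta]
        policies = policies + [direction]

    return mix_policies(policies, weights)
```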
Related papers
- Sample-Efficient Agnostic Boosting [19.15484761265653]
Empirical Risk Minimization (ERM) outstrips the agnostic boosting methodology, being quadratically more sample efficient than all known boosting algorithms.
A key feature of our algorithm is that it leverages the ability to reuse samples across multiple rounds of boosting, while guaranteeing a generalization error strictly better than those obtained by blackbox applications of uniform convergence arguments.
arXiv Detail & Related papers (2024-10-31T04:50:29Z) - Statistical Inference for Temporal Difference Learning with Linear Function Approximation [62.69448336714418]
Temporal Difference (TD) learning, arguably the most widely used algorithm for policy evaluation, serves as a natural framework for such statistical inference.
In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results.
arXiv Detail & Related papers (2024-10-21T15:34:44Z) - Offline Reinforcement Learning with Differentiable Function
Approximation is Provably Efficient [65.08966446962845]
Offline reinforcement learning, which aims at optimizing decision-making strategies with historical data, has been extensively applied in real-life settings.
We take a step forward by considering offline reinforcement learning with differentiable function class approximation (DFA).
Most importantly, we show offline differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning algorithm.
arXiv Detail & Related papers (2022-10-03T07:59:42Z) - Stochastic convex optimization for provably efficient apprenticeship
learning [1.0609815608017066]
We consider large-scale Markov decision processes (MDPs) with an unknown cost function.
We employ convex optimization tools to address the problem of imitation learning, which consists of learning a policy from a finite set of expert demonstrations.
arXiv Detail & Related papers (2021-12-31T19:47:57Z) - Uniform-PAC Bounds for Reinforcement Learning with Linear Function
Approximation [92.3161051419884]
We study reinforcement learning with linear function approximation.
Existing algorithms only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees.
We propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability.
arXiv Detail & Related papers (2021-06-22T08:48:56Z) - Parameter-free Gradient Temporal Difference Learning [3.553493344868414]
We develop gradient-based temporal difference algorithms for reinforcement learning.
Our algorithms run in linear time and achieve high-probability convergence guarantees matching those of GTD2 up to $\log$ factors.
Our experiments demonstrate that our methods maintain high prediction performance relative to fully-tuned baselines, with no tuning whatsoever.
arXiv Detail & Related papers (2021-05-10T06:07:05Z) - Meta-Regularization: An Approach to Adaptive Choice of the Learning Rate
in Gradient Descent [20.47598828422897]
We propose Meta-Regularization, a novel approach for the adaptive choice of the learning rate in first-order descent methods.
Our approach modifies the objective function by adding a regularization term, and casts the updating of the parameters and the learning rate as a joint process.
arXiv Detail & Related papers (2021-04-12T13:13:34Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov decision processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z) - Augmenting GAIL with BC for sample efficient imitation learning [5.199454801210509]
We present a simple and elegant method to combine behavior cloning and GAIL to enable stable and sample efficient learning.
Our algorithm is very simple to implement and integrates with different policy gradient algorithms.
We demonstrate the effectiveness of the algorithm in low dimensional control tasks, gridworlds and in high dimensional image-based tasks.
arXiv Detail & Related papers (2020-01-21T22:28:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.