Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison
- URL: http://arxiv.org/abs/2003.03924v4
- Date: Mon, 24 Aug 2020 04:09:22 GMT
- Title: Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison
- Authors: Tengyang Xie, Nan Jiang
- Abstract summary: We prove performance guarantees of two algorithms for approximating $Q^\star$ in batch reinforcement learning.
One of the algorithms uses a novel and explicit importance-weighting correction to overcome the infamous "double sampling" difficulty in Bellman error estimation.
- Score: 17.692408242465763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We prove performance guarantees of two algorithms for approximating $Q^\star$
in batch reinforcement learning. Compared to classical iterative methods such
as Fitted Q-Iteration---whose performance loss incurs quadratic dependence on
horizon---these methods estimate (some forms of) the Bellman error and enjoy
linear-in-horizon error propagation, a property established for the first time
for algorithms that rely solely on batch data and output stationary policies.
One of the algorithms uses a novel and explicit importance-weighting correction
to overcome the infamous "double sampling" difficulty in Bellman error
estimation, and does not use any squared losses. Our analyses reveal its
distinct characteristics and potential advantages compared to classical
algorithms.
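The "double sampling" difficulty the abstract refers to is easy to see numerically: with stochastic transitions, the average squared TD error over a batch overestimates the squared Bellman error by a conditional-variance term, and removing that term requires two independent samples of the next state. A minimal sketch of this bias (the toy MDP and all names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Toy problem: one state-action pair with reward 1 and two possible
# next states, each reached with probability 1/2.
r = 1.0
v_next = np.array([0.0, 10.0])   # value estimates V(s') at the two next states
q_sa = 5.0                       # current estimate Q(s,a)

# True Bellman error: r + gamma * E[V(s')] - Q(s,a)
true_be = r + gamma * v_next.mean() - q_sa
print("squared Bellman error:", true_be ** 2)        # 0.25

# Naive batch estimate: average of squared TD errors, each computed
# from a single sampled transition.
s_next = rng.integers(0, 2, size=100_000)
td = r + gamma * v_next[s_next] - q_sa
print("mean squared TD error:", (td ** 2).mean())    # about 20.5

# The gap is gamma^2 * Var[V(s')]: estimating (E[td])^2 via E[td^2]
# from single samples picks up this variance term.
print("bias term gamma^2 * Var:", gamma ** 2 * v_next.var())  # 20.25
```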
Related papers
- Bregman-divergence-based Arimoto-Blahut algorithm [53.64687146666141]
We generalize the Arimoto-Blahut algorithm to a general function defined over a Bregman-divergence system.
We propose a convex-optimization-free algorithm that can be applied to classical and quantum rate-distortion theory.
arXiv Detail & Related papers (2024-08-10T06:16:24Z)
- Regularized Q-Learning with Linear Function Approximation [2.765106384328772]
We consider a bi-level optimization formulation of regularized Q-learning with linear function approximation.
We show that, under certain assumptions, the proposed algorithm converges to a stationary point in the presence of Markovian noise.
arXiv Detail & Related papers (2024-01-26T20:45:40Z)
- Parameterized Projected Bellman Operator [64.129598593852]
Approximate value iteration (AVI) is a family of algorithms for reinforcement learning (RL).
We propose a novel alternative approach based on learning an approximate version of the Bellman operator.
We formulate an optimization problem to learn the projected Bellman operator (PBO) for generic sequential decision-making problems.
arXiv Detail & Related papers (2023-12-20T09:33:16Z)
- A Log-linear Gradient Descent Algorithm for Unbalanced Binary Classification using the All Pairs Squared Hinge Loss [0.0]
We propose a new functional representation of the square loss and squared hinge loss, which results in algorithms that compute the gradient in either linear or log-linear time (see the loss-computation sketch after this list).
Our new algorithm can achieve higher test AUC values on imbalanced data sets than previous algorithms, and make use of larger batch sizes than were previously feasible.
arXiv Detail & Related papers (2023-02-21T23:35:00Z)
- Dictionary and prior learning with unrolled algorithms for unsupervised inverse problems [12.54744464424354]
We study Dictionary and Prior learning from degraded measurements as a bi-level problem.
We take advantage of unrolled algorithms to solve approximate formulations of Synthesis and Analysis.
arXiv Detail & Related papers (2021-06-11T12:21:26Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in the form of simultaneous primal- and dual-side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Accelerated Message Passing for Entropy-Regularized MAP Inference [89.15658822319928]
Maximum a posteriori (MAP) inference in discrete-valued random fields is a fundamental problem in machine learning.
Due to the difficulty of this problem, linear programming (LP) relaxations are commonly used to derive specialized message passing algorithms.
We present randomized methods for accelerating these algorithms by leveraging techniques that underlie classical accelerated gradient methods.
arXiv Detail & Related papers (2020-07-01T18:43:32Z)
- Learning and Planning in Average-Reward Markov Decision Processes [15.586087060535398]
We introduce learning and planning algorithms for average-reward MDPs.
All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward (see the differential-TD sketch after this list).
arXiv Detail & Related papers (2020-06-29T19:03:24Z)
- SONIA: A Symmetric Blockwise Truncated Optimization Algorithm [2.9923891863939938]
This work presents a new algorithm for empirical risk minimization.
The algorithm bridges the gap between first- and second-order search methods by computing a second-order search-type update in one subspace, coupled with a scaled steepest descent step in the orthogonal complement.
arXiv Detail & Related papers (2020-06-06T19:28:14Z)
- Meta-learning with Stochastic Linear Bandits [120.43000970418939]
We consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector (see the biased-ridge sketch after this list).
We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
arXiv Detail & Related papers (2020-05-18T08:41:39Z)
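The log-linear trick mentioned under "A Log-linear Gradient Descent Algorithm for Unbalanced Binary Classification" can be illustrated on the loss itself: after sorting the negative-class scores, suffix sums reduce the all-pairs sum to one binary search per positive example. A hedged sketch of this idea, not the paper's exact functional representation (the gradient admits the same treatment):

```python
import numpy as np

def all_pairs_sq_hinge(pos, neg, margin=1.0):
    """Sum over all (i, j) of max(0, margin - (pos[i] - neg[j]))**2,
    computed in O((P + N) log N) time via sorting and suffix sums."""
    q = np.sort(neg)
    # Suffix sums: s1[k] = sum(q[k:]), s2[k] = sum(q[k:] ** 2).
    s1 = np.append(np.cumsum(q[::-1])[::-1], 0.0)
    s2 = np.append(np.cumsum((q ** 2)[::-1])[::-1], 0.0)
    total = 0.0
    for p in pos:
        # Only pairs with q_j > p - margin contribute a nonzero term.
        k = int(np.searchsorted(q, p - margin, side="right"))
        m = len(q) - k
        c = margin - p
        # sum over active j of (c + q_j)^2 = m*c^2 + 2c*sum(q_j) + sum(q_j^2)
        total += m * c * c + 2.0 * c * s1[k] + s2[k]
    return total

# Sanity check against the naive O(P * N) double loop.
pos, neg = np.array([0.8, 2.5]), np.array([0.0, 1.0, 2.0])
naive = sum(max(0.0, 1.0 - (p - n)) ** 2 for p in pos for n in neg)
assert abs(all_pairs_sq_hinge(pos, neg) - naive) < 1e-9
```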
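For "Learning and Planning in Average-Reward Markov Decision Processes", the update the summary highlights (driving the average-reward estimate by the TD error rather than by the conventional error r - r_bar) looks roughly as follows in the tabular prediction setting; the names and step sizes are illustrative, not the authors' code:

```python
from collections import defaultdict

def differential_td_step(v, r_bar, s, r, s_next, alpha=0.1, beta=0.01):
    """One step of tabular differential TD(0) for average-reward prediction.

    The TD error subtracts the current average-reward estimate r_bar,
    and r_bar itself is updated with the TD error delta (the idea the
    summary highlights), not with the conventional error r - r_bar.
    """
    delta = r - r_bar + v[s_next] - v[s]   # differential TD error
    v[s] += alpha * delta                  # update value estimate
    r_bar += beta * delta                  # update average-reward estimate
    return r_bar

# Usage: a two-state toy chain that alternates rewards 0 and 2.
v, r_bar = defaultdict(float), 0.0
transitions = [(0, 0.0, 1), (1, 2.0, 0)] * 500
for s, r, s_next in transitions:
    r_bar = differential_td_step(v, r_bar, s, r, s_next)
print(round(r_bar, 2))  # approaches the average reward 1.0
```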
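For "Meta-learning with Stochastic Linear Bandits", the regularizer the summary describes, a squared Euclidean distance to a bias vector h, gives the per-task least-squares estimate a simple closed form. A minimal sketch under that reading (h, lam, and the data are illustrative):

```python
import numpy as np

def biased_ridge(X, y, h, lam):
    """argmin_theta ||X @ theta - y||^2 + lam * ||theta - h||^2.

    Setting the gradient to zero gives the normal equations
    (X^T X + lam * I) theta = X^T y + lam * h.
    As lam grows the estimate collapses to the bias vector h, so a
    good h (learned across tasks) helps when per-task data is scarce.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * h)

rng = np.random.default_rng(1)
theta_true = np.array([1.0, -2.0])
X = rng.normal(size=(5, 2))                    # only 5 observations
y = X @ theta_true + 0.1 * rng.normal(size=5)
h = np.array([0.9, -1.9])                      # bias vector near the truth
print(biased_ridge(X, y, h, lam=10.0))         # pulled toward h
```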
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.