Posterior Sampling with Delayed Feedback for Reinforcement Learning with
Linear Function Approximation
- URL: http://arxiv.org/abs/2310.18919v2
- Date: Sat, 4 Nov 2023 01:38:40 GMT
- Title: Posterior Sampling with Delayed Feedback for Reinforcement Learning with
Linear Function Approximation
- Authors: Nikki Lijing Kuang, Ming Yin, Mengdi Wang, Yu-Xiang Wang, Yi-An Ma
- Abstract summary: Delayed-PSVI is an optimistic value-based algorithm that explores the value function space via noise perturbation with posterior sampling.
We show our algorithm achieves $widetildeO(sqrtd3H3 T + d2H2 E[tau]$ worst-case regret in the presence of unknown delays.
We incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI.
- Score: 62.969796245827006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies in reinforcement learning (RL) have made significant progress
by leveraging function approximation to alleviate the sample complexity hurdle
for better performance. Despite the success, existing provably efficient
algorithms typically rely on the accessibility of immediate feedback upon
taking actions. The failure to account for the impact of delay in observations
can significantly degrade the performance of real-world systems due to the
regret blow-up. In this work, we tackle the challenge of delayed feedback in RL
with linear function approximation by employing posterior sampling, which has
been shown to empirically outperform the popular UCB algorithms in a wide range
of regimes. We first introduce Delayed-PSVI, an optimistic value-based
algorithm that effectively explores the value function space via noise
perturbation with posterior sampling. We provide the first analysis for
posterior sampling algorithms with delayed feedback in RL and show our
algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 E[\tau])$ worst-case
regret in the presence of unknown stochastic delays. Here $E[\tau]$ is the
expected delay. To further improve its computational efficiency and to expand
its applicability in high-dimensional RL problems, we incorporate a
gradient-based approximate sampling scheme via Langevin dynamics for
Delayed-LPSVI, which maintains the same order-optimal regret guarantee with
$\widetilde{O}(dHK)$ computational cost. Empirical evaluations are performed to
demonstrate the statistical and computational efficacy of our algorithms.
Related papers
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z) - OptEx: Expediting First-Order Optimization with Approximately Parallelized Iterations [12.696136981847438]
We introduce first-order optimization expedited with approximately parallelized iterations (OptEx)
OptEx is the first framework that enhances the efficiency of FOO by leveraging parallel computing to mitigate its iterative bottleneck.
We provide theoretical guarantees for the reliability of our kernelized gradient estimation and the complexity of SGD-based OptEx.
arXiv Detail & Related papers (2024-02-18T02:19:02Z) - High-dimensional Contextual Bandit Problem without Sparsity [8.782204980889077]
We propose an explore-then-commit (EtC) algorithm to address this problem and examine its performance.
We derive the optimal rate of the ETC algorithm in terms of $T$ and show that this rate can be achieved by balancing exploration and exploitation.
We introduce an adaptive explore-then-commit (AEtC) algorithm that adaptively finds the optimal balance.
arXiv Detail & Related papers (2023-06-19T15:29:32Z) - Improved Regret for Efficient Online Reinforcement Learning with Linear
Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z) - Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z) - Revisiting State Augmentation methods for Reinforcement Learning with
Stochastic Delays [10.484851004093919]
This paper formally describes the notion of Markov Decision Processes (MDPs) with delays.
We show that delayed MDPs can be transformed into equivalent standard MDPs (without delays) with significantly simplified cost structure.
We employ this equivalence to derive a model-free Delay-Resolved RL framework and show that even a simple RL algorithm built upon this framework achieves near-optimal rewards in environments with delays in actions and observations.
arXiv Detail & Related papers (2021-08-17T10:45:55Z) - Non-asymptotic estimates for TUSLA algorithm for non-convex learning
with applications to neural networks with ReLU activation function [3.5044892799305956]
We provide a non-asymptotic analysis for the tamed un-adjusted Langevin algorithm (TUSLA) introduced in Lovas et al.
In particular, we establish non-asymptotic error bounds for the TUSLA algorithm in Wassersteinstein-1-2.
We show that the TUSLA algorithm converges rapidly to the optimal solution.
arXiv Detail & Related papers (2021-07-19T07:13:02Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous gradient descent (DASGD)
We show that DASGD converges to a global optimal implementation model under same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based ZO algorithm (ZO-RL) with learning the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that our ZO-RL algorithm can effectively reduce the variances of ZO gradient by learning a sampling policy, and converge faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Stochastic Gradient Langevin with Delayed Gradients [29.6870062491741]
We show that the rate of convergence in measure is not significantly affected by the error caused by the delayed gradient information used for computation.
We show that the rate of convergence in measure is not significantly affected by the error caused by the delayed gradient information used for computation, suggesting significant potential for speedup in wall clock time.
arXiv Detail & Related papers (2020-06-12T17:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.