Replicability in Reinforcement Learning
- URL: http://arxiv.org/abs/2305.19562v2
- Date: Fri, 27 Oct 2023 23:53:10 GMT
- Title: Replicability in Reinforcement Learning
- Authors: Amin Karbasi, Grigoris Velegkas, Lin F. Yang, Felix Zhou
- Abstract summary: We focus on the fundamental setting of discounted tabular MDPs with access to a generative model.
Inspired by Impagliazzo et al. [2022], we say that an RL algorithm is replicable if, with high probability, it outputs the exact same policy after two executions.
- Score: 46.89386344741442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We initiate the mathematical study of replicability as an algorithmic
property in the context of reinforcement learning (RL). We focus on the
fundamental setting of discounted tabular MDPs with access to a generative
model. Inspired by Impagliazzo et al. [2022], we say that an RL algorithm is
replicable if, with high probability, it outputs the exact same policy after
two executions on i.i.d. samples drawn from the generator when its internal
randomness is the same. We first provide an efficient $\rho$-replicable
algorithm for $(\varepsilon, \delta)$-optimal policy estimation with sample and
time complexity $\widetilde
O\left(\frac{N^3\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$,
where $N$ is the number of state-action pairs. Next, for the subclass of
deterministic algorithms, we provide a lower bound of order
$\Omega\left(\frac{N^3}{(1-\gamma)^3\cdot\varepsilon^2\cdot\rho^2}\right)$.
Then, we study a relaxed version of replicability proposed by Kalavasis et al.
[2023] called TV indistinguishability. We design a computationally efficient TV
indistinguishable algorithm for policy estimation whose sample complexity is
$\widetilde
O\left(\frac{N^2\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$.
At the cost of $\exp(N)$ running time, we transform these TV indistinguishable
algorithms to $\rho$-replicable ones without increasing their sample
complexity. Finally, we introduce the notion of approximate-replicability where
we only require that two outputted policies are close under an appropriate
statistical divergence (e.g., Renyi) and show an improved sample complexity of
$\widetilde
O\left(\frac{N\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$.
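The abstract's notion of replicability (same output policy across two executions on independent samples, given shared internal randomness) can be illustrated with the standard randomized-rounding mechanism from Impagliazzo et al. [2022]: snap an empirical estimate to a randomly shifted grid drawn from the shared internal randomness. This is a minimal sketch of that mechanism for mean estimation, not the paper's RL algorithm; all names and parameter values here are illustrative.

```python
import random

def replicable_mean(samples, shared_seed, grid_width=0.5):
    """Estimate a mean, then round it to a randomly shifted grid.

    The grid offset comes from the algorithm's *internal* randomness
    (shared_seed). Two executions that share this seed use the same
    grid, so whenever their empirical means land in the same grid cell
    they output the exact same value -- the replicability property.
    """
    rng = random.Random(shared_seed)       # shared internal randomness
    offset = rng.uniform(0.0, grid_width)  # random shift of the grid
    est = sum(samples) / len(samples)      # empirical mean
    # snap the estimate to the nearest point of the shifted grid
    return offset + grid_width * round((est - offset) / grid_width)

# Two executions: same internal randomness, independent i.i.d. samples.
gen1, gen2 = random.Random(1), random.Random(2)
data1 = [gen1.random() for _ in range(10_000)]
data2 = [gen2.random() for _ in range(10_000)]
out1 = replicable_mean(data1, shared_seed=42)
out2 = replicable_mean(data2, shared_seed=42)
```

With enough samples the two empirical means are close, so with high probability over the shared offset they fall in the same cell and `out1 == out2` exactly; making the grid finer ($\rho$ smaller) costs more samples, which is the $1/\rho^2$ factor in the bounds above.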
Related papers
- Finding good policies in average-reward Markov Decision Processes without prior knowledge [19.89784209009327]
We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs).
By relying on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification.
arXiv Detail & Related papers (2024-05-27T12:24:14Z) - Truncated Variance Reduced Value Iteration [23.282578280033622]
We provide faster randomized algorithms for computing an $\epsilon$-optimal policy in a discounted Markov decision process.
Our results take a step to closing the sample-complexity gap between model-free and model-based methods.
arXiv Detail & Related papers (2024-05-21T17:28:06Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Near-Optimal Bounds for Learning Gaussian Halfspaces with Random Classification Noise [50.64137465792738]
We show that any efficient SQ algorithm for the problem requires sample complexity at least $\Omega\big(d^{1/2}/(\max\{p, \epsilon\})^2\big)$.
Our lower bound suggests that this quadratic dependence on $1/\epsilon$ is inherent for efficient algorithms.
arXiv Detail & Related papers (2023-07-13T18:59:28Z) - Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov Decision Processes (MDPs).
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z) - Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm [46.36534144138337]
In this paper, we investigate the sample complexity of policy evaluation in offline reinforcement learning.
Under the low distribution shift assumption, we show that there is an algorithm that needs at most $O\left(\max\left\{\frac{\left\Vert \theta^{\pi}\right\Vert_2^4}{\varepsilon^4}\log\frac{d}{\delta},\, \frac{1}{\varepsilon^2}\left(d+\log\frac{1}{\delta}\right)\right\}\right)$ samples to approximate the
arXiv Detail & Related papers (2021-03-17T18:18:57Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an algorithm with improved sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.