Policy Finetuning: Bridging Sample-Efficient Offline and Online
Reinforcement Learning
- URL: http://arxiv.org/abs/2106.04895v1
- Date: Wed, 9 Jun 2021 08:28:55 GMT
- Title: Policy Finetuning: Bridging Sample-Efficient Offline and Online
Reinforcement Learning
- Authors: Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, Yu Bai
- Abstract summary: This paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy" $\mu$ close to the optimal policy in a certain sense.
We first design a sharp offline reduction algorithm that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes.
We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the environment.
- Score: 59.02541753781001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent theoretical work studies sample-efficient reinforcement learning (RL)
extensively in two settings: learning interactively in the environment (online
RL), or learning from an offline dataset (offline RL). However, existing
algorithms and theories for learning near-optimal policies in these two
settings are rather different and disconnected. Towards bridging this gap, this
paper initiates the theoretical study of policy finetuning, that is, online RL
where the learner has additional access to a "reference policy" $\mu$ close to
the optimal policy $\pi_\star$ in a certain sense. We consider the policy
finetuning problem in episodic Markov Decision Processes (MDPs) with $S$
states, $A$ actions, and horizon length $H$. We first design a sharp offline
reduction algorithm -- which simply executes $\mu$ and runs offline policy
optimization on the collected dataset -- that finds an $\varepsilon$
near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes,
where $C^\star$ is the single-policy concentrability coefficient between $\mu$
and $\pi_\star$. This offline result is the first that matches the sample
complexity lower bound in this setting, and resolves a recent open question in
offline RL. We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$
sample complexity lower bound for any policy finetuning algorithm, including
those that can adaptively explore the environment. This implies that -- perhaps
surprisingly -- the optimal policy finetuning algorithm is either offline
reduction or a purely online RL algorithm that does not use $\mu$. Finally, we
design a new hybrid offline/online algorithm for policy finetuning that
achieves better sample complexity than both vanilla offline reduction and
purely online RL algorithms, in a relaxed setting where $\mu$ only satisfies
concentrability partially up to a certain time step.
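As a concrete illustration of the offline-reduction step described above, the sketch below rolls out the reference policy $\mu$ in a tabular episodic MDP, builds an empirical model from the collected episodes, and then runs pessimistic value iteration with a penalty on rarely visited state-action pairs. This is a minimal sketch under assumptions not stated in the abstract: the simulator hook `env_sample(h, s, a) -> (reward, next_state)`, rewards in $[0, 1]$, a fixed initial state, and the constant `c_bonus` are illustrative choices, and the Hoeffding-style penalty is a generic one rather than the paper's exact construction, whose sharper analysis yields the $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ rate (with $C^\star$ the single-policy concentrability coefficient, commonly defined as $\max_{h,s,a} d_h^{\pi_\star}(s,a)/d_h^{\mu}(s,a)$).

```python
import numpy as np

def offline_reduction(env_sample, mu, S, A, H, num_episodes, c_bonus=1.0):
    """Illustrative offline reduction for policy finetuning:
    (1) execute the reference policy mu to collect a dataset;
    (2) run pessimistic value iteration on the empirical model.
    env_sample(h, s, a) -> (reward, next_state) is an assumed simulator hook;
    mu has shape (H, S, A) and mu[h, s] is a distribution over actions."""
    counts = np.zeros((H, S, A))       # visit counts N_h(s, a)
    trans = np.zeros((H, S, A, S))     # empirical transition counts
    rew_sum = np.zeros((H, S, A))      # accumulated rewards

    # Phase 1: data collection by rolling out mu (no exploration beyond mu).
    for _ in range(num_episodes):
        s = 0                          # assume a fixed initial state
        for h in range(H):
            a = np.random.choice(A, p=mu[h, s])
            r, s_next = env_sample(h, s, a)
            counts[h, s, a] += 1
            trans[h, s, a, s_next] += 1
            rew_sum[h, s, a] += r
            s = s_next

    # Phase 2: pessimistic value iteration on the empirical model.
    n = np.maximum(counts, 1)
    P_hat = trans / n[..., None]       # empirical P_h(s' | s, a)
    r_hat = rew_sum / n                # empirical r_h(s, a)
    # Hoeffding-style penalty: shrink the value of rarely visited (s, a) pairs.
    bonus = c_bonus * H * np.sqrt(np.log(2.0 * S * A * H * num_episodes) / n)

    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r_hat[h] + P_hat[h] @ V[h + 1] - bonus[h]
        Q = np.clip(Q, 0.0, H)         # values stay in [0, H] for rewards in [0, 1]
        pi[h] = Q.argmax(axis=1)       # greedy (deterministic) policy at step h
        V[h] = Q.max(axis=1)
    return pi                          # pi[h, s] = action to take at step h in state s
```

Because the value estimates are penalized, the returned policy is only credited for behavior the dataset actually covers, which is why this style of reduction can depend on the single-policy concentrability $C^\star$ between $\mu$ and $\pi_\star$ rather than on uniform coverage of all policies.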
Related papers
- A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs [18.449996575976993]
We propose a primal-dual algorithm for offline RL with linear MDPs in the infinite-horizon discounted setting.
Our algorithm is the first computationally efficient algorithm in this setting to achieve a sample complexity of $O(\epsilon^{-2})$ under a partial data coverage assumption.
We extend our algorithm to work in the offline constrained RL setting that enforces constraints on additional reward signals.
arXiv Detail & Related papers (2024-02-07T00:33:11Z)
- Offline Primal-Dual Reinforcement Learning for Linear MDPs [16.782625445546273]
Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy.
This paper proposes a primal-dual optimization method based on the linear programming formulation of RL.
arXiv Detail & Related papers (2023-05-22T11:45:23Z)
- On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation [80.86358123230757]
We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI).
Under a partial data coverage assumption, BCP-VI yields a fast rate of $\tilde{\mathcal{O}}(1/K)$ for offline RL when there is a positive gap in the optimal Q-value functions.
These are the first $\tilde{\mathcal{O}}(1/K)$ bound and absolute zero sub-optimality bound, respectively, for offline RL with linear function approximation from adaptive data.
arXiv Detail & Related papers (2022-11-23T18:50:44Z)
- Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path [106.37656068276902]
We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem.
We derive complexity bounds when the learner has access to a generative model.
We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_\star$.
arXiv Detail & Related papers (2022-10-10T18:34:32Z)
- Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
Offline reinforcement learning (RL) learns using pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z)
- Nearly Optimal Policy Optimization with Stable at Any Time Guarantee [53.155554415415445]
The regret of the policy-based method in Shani et al. (2020) leaves a $\sqrt{SH}$ gap from the information-theoretic lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes; the proposed algorithm improves this to $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$ regret.
arXiv Detail & Related papers (2021-12-21T01:54:17Z)
- Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated a number of times proportional to $\operatorname{polylog}(K)$.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z)
- Characterizing Uniform Convergence in Offline Policy Evaluation via model-based approach: Offline Learning, Task-Agnostic and Reward-Free [34.54294677335518]
We study the statistical limits of uniform convergence for offline policy evaluation problems (uniform OPE for short) with model-based methods under the episodic MDP setting.
Our main result establishes an episode complexity of $\tilde{O}(H^2/(d_m\epsilon^2))$ for near-empirically optimal policies in MDPs with stationary transitions.
arXiv Detail & Related papers (2021-05-13T01:36:34Z)
- Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning [43.61029925616256]
Offline policy evaluation in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications.
We address this problem by simultaneously evaluating all policies in a policy class $\Pi$ -- uniform convergence in OPE.
Our results imply that model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/(d_m\epsilon^2))$ in identifying an $\epsilon$-optimal policy.
arXiv Detail & Related papers (2020-07-07T19:44:14Z)