Near-Optimal Reward-Free Exploration for Linear Mixture MDPs with
Plug-in Solver
- URL: http://arxiv.org/abs/2110.03244v2
- Date: Fri, 8 Oct 2021 01:49:27 GMT
- Title: Near-Optimal Reward-Free Exploration for Linear Mixture MDPs with
Plug-in Solver
- Authors: Xiaoyu Chen, Jiachen Hu, Lin F. Yang, Liwei Wang
- Abstract summary: We provide approaches to learn an RL model efficiently without the guidance of a reward signal.
In particular, we take a plug-in solver approach, where we focus on learning a model in the exploration phase.
We show that, by establishing a novel exploration algorithm, the plug-in approach learns a model by taking $\tilde{O}(d^2H^3/\epsilon^2)$ interactions with the environment.
- Score: 32.212146650873194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although model-based reinforcement learning (RL) approaches are considered
more sample efficient, existing algorithms usually rely on sophisticated
planning algorithms that couple tightly with the model-learning procedure. Hence
the learned models may lack the ability to be re-used with more specialized
planners. In this paper we address this issue and provide approaches to learn
an RL model efficiently without the guidance of a reward signal. In particular,
we take a plug-in solver approach, where we focus on learning a model in the
exploration phase and demand that \emph{any planning algorithm} on the learned
model can give a near-optimal policy. Specifically, we focus on the linear
mixture MDP setting, where the probability transition matrix is an (unknown)
convex combination of a set of existing models. We show that, by establishing a
novel exploration algorithm, the plug-in approach learns a model by taking
$\tilde{O}(d^2H^3/\epsilon^2)$ interactions with the environment and \emph{any}
$\epsilon$-optimal planner on the model gives an $O(\epsilon)$-optimal policy
on the original model. This sample complexity matches lower bounds for
non-plug-in approaches and is \emph{statistically optimal}. We achieve this
result by leveraging a careful maximum total-variance bound using the Bernstein
inequality and properties specific to linear mixture MDPs.
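As a hedged illustration of the setting (not the paper's code), a linear mixture MDP has a transition kernel that is an unknown convex combination of $d$ known base models. The sketch below uses illustrative names and toy dimensions, all of which are assumptions:

```python
import numpy as np

# Sketch of a linear mixture MDP transition kernel:
#   P(s' | s, a) = sum_i theta[i] * P_i(s' | s, a),
# where the P_i are known base models and theta is an unknown weight
# vector on the probability simplex. All names and sizes here
# (n_states, n_actions, d) are illustrative, not from the paper.

rng = np.random.default_rng(0)
n_states, n_actions, d = 4, 2, 3

# d known base models, each a valid transition kernel:
# base[i, s, a, :] is a distribution over next states.
base = rng.dirichlet(np.ones(n_states), size=(d, n_states, n_actions))

# Unknown mixing weights on the simplex (here drawn at random).
theta = rng.dirichlet(np.ones(d))

def transition(s, a):
    """Mixture transition distribution over next states for (s, a)."""
    return np.einsum("i,ij->j", theta, base[:, s, a, :])

p = transition(0, 1)
assert np.isclose(p.sum(), 1.0) and (p >= 0).all()
```

Because each base kernel is a valid distribution and the weights lie on the simplex, the mixture is automatically a valid transition distribution, which is what the final assertion checks.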
Related papers
- The Plug-in Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis [6.996002801232415]
We study the sample complexity of the plug-in approach for learning $\varepsilon$-optimal policies in average-reward Markov decision processes.
Despite representing arguably the simplest algorithm for this problem, the plug-in approach has never been theoretically analyzed.
arXiv Detail & Related papers (2024-10-10T05:08:14Z)
- Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called \texttt{MEX}.
\texttt{MEX} integrates estimation and planning components while balancing exploration and exploitation automatically.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL)
We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
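The Langevin Monte Carlo step described above, which draws Q-function parameters from an approximate posterior rather than maintaining it in closed form, can be sketched as follows. This is a generic illustration under assumed names and toy data, not the paper's implementation:

```python
import numpy as np

# Illustrative Langevin Monte Carlo (unadjusted Langevin) sampler for
# parameters w of a linear Q-function, targeting exp(-L(w)) for a
# least-squares-style loss L. The features, targets, and step size are
# toy assumptions, not taken from the paper.

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 5))                     # toy features phi(s, a)
y = X @ np.ones(5) + 0.1 * rng.normal(size=32)   # toy regression targets

def grad_loss(w):
    """Gradient of the least-squares loss plus a small L2 prior term."""
    return X.T @ (X @ w - y) / len(y) + 0.01 * w

def lmc_sample(w, step=0.05, n_steps=200):
    """Langevin update: w <- w - step * grad L(w) + sqrt(2 * step) * noise."""
    for _ in range(n_steps):
        w = w - step * grad_loss(w) + np.sqrt(2 * step) * rng.normal(size=w.shape)
    return w

w = lmc_sample(np.zeros(5))
```

The Gaussian noise term is what turns gradient descent into an (approximate) posterior sampler; each call to `lmc_sample` yields a different plausible Q-parameter, which is how Thompson-sampling-style exploration is driven.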
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear Markov decision processes (MDPs)
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z)
- Online Model Selection for Reinforcement Learning with Function Approximation [50.008542459050155]
We present a meta-algorithm that adapts to the optimal complexity with $\tilde{O}(L^{5/6} T^{2/3})$ regret.
We also show that the meta-algorithm automatically admits significantly improved instance-dependent regret bounds.
arXiv Detail & Related papers (2020-11-19T10:00:54Z)
- Is Plug-in Solver Sample-Efficient for Feature-based Reinforcement Learning? [30.065091907118827]
This work considers the sample complexity of finding an $\epsilon$-optimal policy in a Markov decision process (MDP)
We solve this problem via a plug-in solver approach, which builds an empirical model and plans in this empirical model via an arbitrary plug-in solver.
We show that a plug-in approach can be sample efficient as well, providing a flexible approach to design model-based algorithms for reinforcement learning.
arXiv Detail & Related papers (2020-10-12T13:13:01Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
- A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret [44.374427255708135]
We propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs)
EE-QL assumes that an online concentrating approximation of the optimal average reward is available.
This is the first model-free learning algorithm that achieves $O(\sqrt{T})$ regret without the ergodic assumption, and matches the lower bound in terms of $T$ up to logarithmic factors.
arXiv Detail & Related papers (2020-06-08T05:09:32Z)
- Learning Gaussian Graphical Models via Multiplicative Weights [54.252053139374205]
We adapt an algorithm of Klivans and Meka based on the method of multiplicative weight updates.
The algorithm enjoys a sample complexity bound that is qualitatively similar to others in the literature.
It has a low runtime $O(mp^2)$ in the case of $m$ samples and $p$ nodes, and can trivially be implemented in an online manner.
arXiv Detail & Related papers (2020-02-20T10:50:58Z)
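As a generic, hedged illustration of the multiplicative-weight-update method underlying the last entry (a textbook Hedge update, not the Klivans–Meka graphical-model algorithm itself):

```python
import numpy as np

# Textbook multiplicative weights (Hedge) update: maintain a weight per
# expert and shrink each weight exponentially in its observed loss.
# Generic sketch only; names and the learning rate eta are assumptions.

def hedge(losses, eta=0.5):
    """losses: (T, n) array of per-round expert losses in [0, 1]."""
    n = losses.shape[1]
    w = np.ones(n)
    for loss in losses:
        w = w * np.exp(-eta * loss)   # multiplicative update
        w = w / w.sum()               # renormalize to a distribution
    return w

# Weight mass concentrates on the expert with the smallest total loss
# (expert 1 here, whose per-round loss 0.1 is the lowest).
losses = np.array([[0.9, 0.1, 0.5]] * 10)
w = hedge(losses)
assert w.argmax() == 1
```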
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.