Regularization and Variance-Weighted Regression Achieves Minimax
Optimality in Linear MDPs: Theory and Practice
- URL: http://arxiv.org/abs/2305.13185v1
- Date: Mon, 22 May 2023 16:13:05 GMT
- Title: Regularization and Variance-Weighted Regression Achieves Minimax
Optimality in Linear MDPs: Theory and Practice
- Authors: Toshinori Kitamura, Tadashi Kozuno, Yunhao Tang, Nino Vieillard,
Michal Valko, Wenhao Yang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi
Azar, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári,
Wataru Kumagai, Yutaka Matsuo
- Abstract summary: Mirror descent value iteration (MDVI) is an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL)
We study MDVI with linear function approximation through its sample complexity required to identify an $\varepsilon$-optimal policy.
We present Variance-Weighted Least-Squares MDVI, the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs.
- Score: 79.48432795639403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mirror descent value iteration (MDVI), an abstraction of Kullback-Leibler
(KL) and entropy-regularized reinforcement learning (RL), has served as the
basis for recent high-performing practical RL algorithms. However, despite the
use of function approximation in practice, the theoretical understanding of
MDVI has been limited to tabular Markov decision processes (MDPs). We study
MDVI with linear function approximation through its sample complexity required
to identify an $\varepsilon$-optimal policy with probability $1-\delta$ under
the settings of an infinite-horizon linear MDP, generative model, and G-optimal
design. We demonstrate that least-squares regression weighted by the variance
of an estimated optimal value function of the next state is crucial to
achieving minimax optimality. Based on this observation, we present
Variance-Weighted Least-Squares MDVI (VWLS-MDVI), the first theoretical
algorithm that achieves nearly minimax optimal sample complexity for
infinite-horizon linear MDPs. Furthermore, we propose a practical VWLS
algorithm for value-based deep RL, Deep Variance Weighting (DVW). Our
experiments demonstrate that DVW improves the performance of popular
value-based deep RL algorithms on a set of MinAtar benchmarks.
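The core technical ingredient described above is least-squares regression in which each sample is reweighted by the (inverse) variance of the estimated optimal value function at the next state. As a rough illustration only, the sketch below shows a generic variance-weighted ridge regression of this form in NumPy; the function name, the regularization strength, the variance floor, and the way the variance estimates are obtained are illustrative assumptions, not the paper's VWLS-MDVI or DVW procedures.

```python
import numpy as np

def variance_weighted_lstsq(phi, targets, next_value_var, reg=1e-3, var_floor=1e-2):
    """Variance-weighted ridge regression for a linear value function (illustrative sketch).

    phi            : (n, d) feature matrix for the sampled state-action pairs
    targets        : (n,) regression targets, e.g. r + gamma * V_hat(next state)
    next_value_var : (n,) estimated variance of V_hat at the next state
    reg            : ridge regularization strength (illustrative default)
    var_floor      : lower bound on the variance so the weights stay finite
    """
    # Down-weight samples whose next-state value estimate is noisy.
    weights = 1.0 / np.maximum(next_value_var, var_floor)   # (n,)
    wphi = phi * weights[:, None]                            # (n, d)
    # Solve (Phi^T W Phi + reg * I) theta = Phi^T W y.
    gram = phi.T @ wphi + reg * np.eye(phi.shape[1])
    theta = np.linalg.solve(gram, wphi.T @ targets)
    return theta

# Toy usage with random data, purely to show the shapes involved.
rng = np.random.default_rng(0)
phi = rng.normal(size=(100, 8))
targets = rng.normal(size=100)
next_value_var = rng.uniform(0.1, 2.0, size=100)
theta = variance_weighted_lstsq(phi, targets, next_value_var)
```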
Related papers
- A Variance Minimization Approach to Temporal-Difference Learning [12.026021568207206]
This paper introduces a variance minimization (VM) approach for value-based RL instead of error minimization.
Based on this approach, we propose two objectives: the Variance of Bellman Error (VBE) and the Variance of Projected Bellman Error (VPBE).
arXiv Detail & Related papers (2024-11-10T08:56:16Z) - Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL)
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay between exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves a sample complexity for learning an $\epsilon$-optimal CVaR policy that is polynomial in $H$, the length of each episode, $A$, the capacity of the action space, and $d$, the dimension of the representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - Value-Biased Maximum Likelihood Estimation for Model-based Reinforcement
Learning in Discounted Linear MDPs [16.006893624836554]
We propose to solve linear MDPs through the lens of Value-Biased Maximum Likelihood Estimation (VBMLE)
VBMLE is computationally more efficient as it only requires solving one optimization problem in each time step.
In our regret analysis, we offer a generic convergence result of MLE in linear MDPs through a novel supermartingale construct.
arXiv Detail & Related papers (2023-10-17T18:27:27Z) - Provably Efficient Algorithm for Nonstationary Low-Rank MDPs [48.92657638730582]
We make the first effort to investigate nonstationary RL under episodic low-rank MDPs, where both transition kernels and rewards may vary over time.
We propose a parameter-dependent policy optimization algorithm called PORTAL, and further improve PORTAL to its parameter-free version, Ada-PORTAL.
For both algorithms, we provide upper bounds on the average dynamic suboptimality gap, which show that as long as the nonstationarity is not significantly large, PORTAL and Ada-PORTAL are sample-efficient and can achieve an arbitrarily small average dynamic suboptimality gap with polynomial sample complexity.
arXiv Detail & Related papers (2023-08-10T09:52:44Z) - Maximize to Explore: One Objective Function Fusing Estimation, Planning,
and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while balancing exploration and exploitation automatically.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - Sparse high-dimensional linear regression with a partitioned empirical
Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression.
Minimal prior assumptions on the parameters are used through the use of plug-in empirical Bayes estimates.
The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z) - Nearly Minimax Optimal Offline Reinforcement Learning with Linear
Function Approximation: Single-Agent MDP and Markov Game [34.69723238900705]
Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected dataset without further interactions with the environment.
We propose two new algorithms for offline single-agent MDPs and two-player zero-sum Markov games (MGs)
To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.
arXiv Detail & Related papers (2022-05-31T02:50:17Z) - KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal [70.15267479220691]
We consider and analyze the sample complexity of model-free reinforcement learning with a generative model.
Our analysis shows that it is nearly minimax-optimal for finding an $\varepsilon$-optimal policy when $\varepsilon$ is sufficiently small.
arXiv Detail & Related papers (2022-05-27T19:39:24Z)