Related papers: Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning

Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning

URL: http://arxiv.org/abs/2304.07278v2
Date: Thu, 23 May 2024 13:16:25 GMT
Title: Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning
Authors: Gen Li, Yuling Yan, Yuxin Chen, Jianqing Fan,
Abstract summary: This paper studies reward-agnostic exploration in reinforcement learning (RL) Consider a finite-horizon inhomogeneous decision process with $S$ states, $A$ actions, and a horizon length $H$. Our algorithm is able to yield $varepsilon$ accuracy for arbitrarily many reward functions.
Score: 17.239062061431646
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper studies reward-agnostic exploration in reinforcement learning (RL) -- a scenario where the learner is unware of the reward functions during the exploration stage -- and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of \begin{align*} \frac{SAH^3}{\varepsilon^2} \text{ sample episodes (up to log factor)} \end{align*} without guidance of the reward information, our algorithm is able to find $\varepsilon$-optimal policies for all these reward functions, provided that $\varepsilon$ is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to log factor), our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed as ``reward-free exploration.'' The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning paradigm leverages ideas from sample-optimal offline RL paradigms.

Related papers

Uncertainty-Aware Reward-Free Exploration with General Function Approximation [69.27868448449755]
In this paper, we propose a reward-free reinforcement learning algorithm called alg. The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment. Experiment results show that GFA-RFE outperforms or is comparable to the performance of state-of-the-art unsupervised RL algorithms.
arXiv Detail & Related papers (2024-06-24T01:37:18Z)
Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning [26.067411894141863]
An appropriate reward function is of paramount importance in specifying a task in reinforcement learning (RL) Human-in-the-loop (HiL) RL allows humans to communicate complex goals to the RL agent by providing various types of feedback. We provide an active-learning-based RL algorithm that first explores the environment without specifying a reward function.
arXiv Detail & Related papers (2023-04-18T12:36:09Z)
Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear Markov decision processes (MDPs) At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward. We show that our algorithm only needs to explore $tilde O( d2varepsilon-2)$ episodes to find an $varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z)
Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED) PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value based on the learned proxy reward. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. We tackle this problem under the context of function approximation, leveraging powerful function approximators. We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm. Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
On Reward-Free Reinforcement Learning with Linear Function Approximation [144.4210285338698]
Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest. In this work, we give both positive and negative results for reward-free RL with linear function approximation.
arXiv Detail & Related papers (2020-06-19T17:59:36Z)
Task-agnostic Exploration in Reinforcement Learning [35.403304641170386]
We present an efficient task-agnostic reinforcement learning algorithm, textscUCBZero. It finds $epsilon$-optimal policies for $N$ arbitrary tasks after at most $tilde O(log(N)H5SA/epsilon2)$ exploration episodes. We also provide an $Omega(log (N)H2SA/epsilon2)$ lower bound, showing that the $log$ dependency on $N$ is unavoidable.
arXiv Detail & Related papers (2020-06-16T20:23:41Z)
Adaptive Reward-Free Exploration [48.98199700043158]
We show that our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994. We further investigate the relative complexities of reward-free exploration and best-policy identification.
arXiv Detail & Related papers (2020-06-11T09:58:03Z)
Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration. We give an efficient algorithm that conducts $tildemathcalO(S2Amathrmpoly(H)/epsilon2)$ episodes of exploration. We also give a nearly-matching $Omega(S2AH2/epsilon2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.