Accommodating Picky Customers: Regret Bound and Exploration Complexity
for Multi-Objective Reinforcement Learning
- URL: http://arxiv.org/abs/2011.13034v3
- Date: Wed, 27 Oct 2021 20:28:04 GMT
- Title: Accommodating Picky Customers: Regret Bound and Exploration Complexity
for Multi-Objective Reinforcement Learning
- Authors: Jingfeng Wu, Vladimir Braverman, Lin F. Yang
- Abstract summary: We consider multi-objective reinforcement learning where the objectives are balanced using preferences.
We formalize this problem as an episodic learning problem on a Markov decision process.
We provide a model-based algorithm with a nearly minimax optimal regret bound $\widetilde{\mathcal{O}}\bigl(\sqrt{\min\{d,S\}\cdot H^2 SAK}\bigr)$ and a preference-free exploration algorithm with nearly optimal trajectory complexity $\widetilde{\mathcal{O}}\bigl(\min\{d,S\}\cdot H^3 SA/\epsilon^2\bigr)$.
- Score: 43.75491612671571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we consider multi-objective reinforcement learning where the
objectives are balanced using preferences. In practice, the preferences are
often given in an adversarial manner, e.g., customers can be picky in many
applications. We formalize this problem as an episodic learning problem on a
Markov decision process, where transitions are unknown and a reward function is
the inner product of a preference vector with pre-specified multi-objective
reward functions. We consider two settings. In the online setting, the agent
receives an (adversarial) preference every episode and proposes policies to
interact with the environment. We provide a model-based algorithm that achieves
a nearly minimax optimal regret bound
$\widetilde{\mathcal{O}}\bigl(\sqrt{\min\{d,S\}\cdot H^2 SAK}\bigr)$, where $d$
is the number of objectives, $S$ is the number of states, $A$ is the number of
actions, $H$ is the length of the horizon, and $K$ is the number of episodes.
Furthermore, we consider preference-free exploration, i.e., the agent first
interacts with the environment without specifying any preference and then is
able to accommodate an arbitrary preference vector up to $\epsilon$ error. Our
proposed algorithm is provably efficient with a nearly optimal trajectory
complexity $\widetilde{\mathcal{O}}\bigl({\min\{d,S\}\cdot H^3
SA}/{\epsilon^2}\bigr)$. This result partly resolves an open problem raised by
\citet{jin2020reward}.
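To make the formulation concrete, below is a minimal Python sketch (not the paper's algorithm) of the preference-scalarized episodic MDP and the online protocol in which a preference is revealed each episode. The state/action sizes, the `scalarize` and `value_iteration` helpers, and the use of known transitions for the benchmark are illustrative assumptions only.

```python
# Minimal sketch of the problem setup described above (illustrative only):
# an episodic tabular MDP with d-dimensional rewards, where in episode k an
# (possibly adversarial) preference w_k scalarizes the reward as
# r(s, a; w_k) = <w_k, R[s, a]>.
import numpy as np

S, A, H, d = 4, 2, 5, 3                     # states, actions, horizon, objectives
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] = next-state distribution
R = rng.random((S, A, d))                   # R[s, a] = vector of d objective rewards

def scalarize(R, w):
    """Scalar reward r(s, a; w) = <w, R[s, a]> for a preference w on the simplex."""
    return R @ w                            # shape (S, A)

def value_iteration(P, r, H):
    """Finite-horizon value iteration for a known MDP with scalar reward r."""
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r + P @ V                       # Q[s, a] = r[s, a] + E[V(s')]
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return V, pi

# Online setting: each episode the environment reveals a preference w_k, the
# learner commits to a policy, and regret is measured against the optimal
# policy *for that preference*.
K = 10
for k in range(K):
    w_k = rng.dirichlet(np.ones(d))         # stand-in for an adversarial preference
    r_k = scalarize(R, w_k)
    V_star, _ = value_iteration(P, r_k, H)  # benchmark (transitions known here)
    # A learning agent would instead plan with estimated transitions plus an
    # optimism bonus; its per-episode regret is V_star[0] minus its own value.
    print(f"episode {k}: optimal value at s0 = {V_star[0]:.3f}")
```

The regret bound quoted in the abstract sums these per-episode gaps over the $K$ episodes.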
Related papers
- Sharper Model-free Reinforcement Learning for Average-reward Markov
Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov Decision Processes (MDPs).
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm that achieves near-optimal regret $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary
Environments [40.027926921772355]
We study dynamic regret for goal-oriented reinforcement learning.
The different roles of $\Delta_c$ and $\Delta_P$ in this lower bound inspire us to design algorithms that estimate costs and transitions separately.
arXiv Detail & Related papers (2022-05-25T20:29:01Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy (a minimal sketch of this explore-then-plan protocol appears after this list).
We show that, to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d(H+d)\epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}\bigl(1/\epsilon \cdot (H^3SA/\rho + H^4 S^2 A)\bigr)$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits
with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning (SSTP), which interacts with the environment at most $O\left(\frac{S^2A}{\epsilon^2}\,\mathrm{polylog}\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes.
arXiv Detail & Related papers (2020-10-12T17:51:19Z)
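Several of the reward-free entries above, and the preference-free setting in the main abstract, share the same two-phase structure: a reward-free exploration phase followed by a planning phase for whatever reward (or scalarizing preference) arrives later. The sketch below illustrates only that protocol; the uniform-random explorer, the trajectory budget, and the smoothing constant are placeholder assumptions, not any paper's method.

```python
# Illustrative two-phase reward-free / preference-free protocol: explore once
# without any reward signal, then answer arbitrary reward queries from the
# empirical model. Placeholder choices throughout; not a paper's algorithm.
import numpy as np

S, A, H = 4, 2, 5
rng = np.random.default_rng(1)
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # unknown true transitions

def step(s, a):
    return rng.choice(S, p=P_true[s, a])

# --- Exploration phase: collect trajectories, no reward observed. ---
counts = np.zeros((S, A, S))
for _ in range(2000):                 # trajectory budget (the bounds above scale
    s = 0                             # like H^3*S*A/eps^2 up to other factors)
    for _ in range(H):
        a = rng.integers(A)           # placeholder for a strategic explorer
        s_next = step(s, a)
        counts[s, a, s_next] += 1
        s = s_next

P_hat = (counts + 1e-6) / (counts + 1e-6).sum(axis=2, keepdims=True)

# --- Planning phase: given any scalar reward (e.g. a preference-scalarized
# one), plan on the empirical model without touching the environment again.
def plan(r):
    V = np.zeros(S)
    for _ in range(H):
        V = (r + P_hat @ V).max(axis=1)
    return V[0]

r_query = rng.random((S, A))          # an arbitrary reward handed in afterwards
print(f"estimated optimal value for the queried reward: {plan(r_query):.3f}")
```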