The Benefits of Being Distributional: Small-Loss Bounds for
Reinforcement Learning
- URL: http://arxiv.org/abs/2305.15703v3
- Date: Sat, 23 Sep 2023 03:31:55 GMT
- Title: The Benefits of Being Distributional: Small-Loss Bounds for
Reinforcement Learning
- Authors: Kaiwen Wang and Kevin Zhou and Runzhe Wu and Nathan Kallus and Wen Sun
- Abstract summary: This paper explains the benefits of distributional reinforcement learning (DistRL) through the lens of small-loss bounds.
In online RL, we propose a DistRL algorithm that constructs confidence sets using maximum likelihood estimation.
In offline RL, we show that pessimistic DistRL enjoys small-loss PAC bounds that are novel to the offline setting and are more robust to bad single-policy coverage.
- Score: 43.9624940128166
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While distributional reinforcement learning (DistRL) has been empirically
effective, the question of when and why it is better than vanilla,
non-distributional RL has remained unanswered. This paper explains the benefits
of DistRL through the lens of small-loss bounds, which are instance-dependent
bounds that scale with optimal achievable cost. Particularly, our bounds
converge much faster than those from non-distributional approaches if the
optimal cost is small. As warmup, we propose a distributional contextual bandit
(DistCB) algorithm, which we show enjoys small-loss regret bounds and
empirically outperforms the state-of-the-art on three real-world tasks. In
online RL, we propose a DistRL algorithm that constructs confidence sets using
maximum likelihood estimation. We prove that our algorithm enjoys novel
small-loss PAC bounds in low-rank MDPs. As part of our analysis, we introduce
the $\ell_1$ distributional eluder dimension which may be of independent
interest. Then, in offline RL, we show that pessimistic DistRL enjoys
small-loss PAC bounds that are novel to the offline setting and are more robust
to bad single-policy coverage.
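As an illustration of the maximum-likelihood construction mentioned above, here is a minimal sketch (not the paper's algorithm; the discretized return support, the candidate model class, and the slack value are all assumptions made for the example): score each candidate return-distribution model by its log-likelihood on observed returns and keep every model within an additive slack of the best one.

import numpy as np

def mle_confidence_set(candidate_models, observed_returns, support, slack):
    # Log-likelihood of each candidate categorical model on the observed returns,
    # after snapping each return to the nearest point of the discretized support.
    idx = np.abs(support[None, :] - observed_returns[:, None]).argmin(axis=1)
    log_liks = np.array([np.log(p[idx] + 1e-12).sum() for p in candidate_models])
    # Keep every model whose log-likelihood is within `slack` of the best one.
    best = log_liks.max()
    return [m for m, ll in zip(candidate_models, log_liks) if ll >= best - slack]

# Toy usage: three candidate distributions over returns in [0, 1]; data comes from models[0].
rng = np.random.default_rng(0)
support = np.linspace(0.0, 1.0, 11)
models = [rng.dirichlet(np.ones(11)) for _ in range(3)]
data = rng.choice(support, size=200, p=models[0])
kept = mle_confidence_set(models, data, support, slack=5.0)

In this toy run only models[0] generated the data, so with enough samples the retained set concentrates on models close to it; a principled choice of the slack is exactly what a theoretical analysis of such confidence sets provides.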
Related papers
- Order-Optimal Instance-Dependent Bounds for Offline Reinforcement Learning with Preference Feedback [56.6950165117658]
We consider offline reinforcement learning with preference feedback in which the implicit reward is a linear function of an unknown parameter.
We propose an algorithm, RL with Locally Optimal Weights (RL-LOW), which yields an exponentially small simple regret.
We observe that the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating the order-wise optimality of RL-LOW.
arXiv Detail & Related papers (2024-06-18T02:03:12Z) - More Benefits of Being Distributional: Second-Order Bounds for
Reinforcement Learning [58.626683114119906]
We show that Distributional Reinforcement Learning (DistRL) can obtain second-order bounds in both online and offline RL.
Our results are the first second-order bounds for low-rank MDPs and for offline RL.
arXiv Detail & Related papers (2024-02-11T13:25:53Z) - Bridging Distributional and Risk-sensitive Reinforcement Learning with
Provable Regret Bounds [24.571530193140916]
We consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return.
We propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one.
We prove that they both attain an $\tilde{\mathcal{O}}\left(\frac{\exp(|\beta| H)-1}{|\beta|} H\sqrt{S^2 A K}\right)$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, the number of actions, the number of episodes, and the time horizon, respectively.
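For context (this definition is standard and not taken from the abstract above), the entropic risk measure of a return $Z$ with risk parameter $\beta \neq 0$ is $\mathrm{EntRM}_{\beta}(Z) = \frac{1}{\beta}\log \mathbb{E}\left[\exp(\beta Z)\right]$; it recovers the expected return $\mathbb{E}[Z]$ as $\beta \to 0$ and tilts toward risk-seeking ($\beta > 0$) or risk-averse ($\beta < 0$) behavior otherwise.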
arXiv Detail & Related papers (2022-10-25T14:30:48Z) - Optimistic PAC Reinforcement Learning: the Instance-Dependent View [24.256960622176305]
We present an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available.
While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap.
In MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal.
arXiv Detail & Related papers (2022-07-12T21:35:03Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED)
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
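To make the least-squares redistribution step concrete, here is a minimal sketch under the illustrative assumption that per-step rewards are linear in known step features (a generic stand-in, not the PARTED algorithm itself): regress the trajectory returns on the summed step features, then read off per-step proxy rewards from the fitted model.

import numpy as np

def redistribute_rewards(step_features, traj_returns):
    # Fit return ~ <w, sum_h phi_h> by least squares over all trajectories,
    # then use <w, phi_h> as the proxy reward of step h.
    X = np.stack([phi.sum(axis=0) for phi in step_features])   # shape (n, d)
    w, *_ = np.linalg.lstsq(X, traj_returns, rcond=None)
    return [phi @ w for phi in step_features]

# Toy usage: 50 trajectories of horizon 3 with 4-dimensional step features.
rng = np.random.default_rng(1)
feats = [rng.normal(size=(3, 4)) for _ in range(50)]
true_w = np.array([1.0, -0.5, 0.2, 0.0])
returns = np.array([phi.sum(axis=0) @ true_w for phi in feats])
proxy_rewards = redistribute_rewards(feats, returns)

In this toy example the proxy rewards approximate the per-step contributions implied by true_w, illustrating how a single trajectory-level signal can be spread across individual steps.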
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Settling the Communication Complexity for Distributed Offline
Reinforcement Learning [10.315054389907031]
We study a novel setting in offline reinforcement learning (RL) where a number of distributed machines jointly cooperate to solve the problem.
There is a budget constraint on the total amount of information (in terms of bits) that each machine can send out.
For value function prediction in contextual bandits, and both episodic and non-episodic MDPs, we establish information-theoretic lower bounds on the minimax risk.
arXiv Detail & Related papers (2022-02-10T06:27:07Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - Online Sub-Sampling for Reinforcement Learning with General Function
Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
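As a rough illustration of updating the policy only when enough new information has accumulated (a generic doubling-style trigger under assumed linear features, not the sub-sampling framework of this paper), one can monitor the log-determinant of a regularized feature covariance and recompute the policy only when it grows by a constant.

import numpy as np

class InfoGainTrigger:
    # Fire a policy update only when the log-determinant of the regularized
    # feature covariance (a proxy for accumulated information) grows by `threshold`.
    def __init__(self, dim, threshold=np.log(2.0), ridge=1.0):
        self.cov = ridge * np.eye(dim)
        self.last_logdet = np.linalg.slogdet(self.cov)[1]
        self.threshold = threshold

    def observe(self, feature):
        # Add one observation; return True when the policy should be recomputed.
        self.cov += np.outer(feature, feature)
        logdet = np.linalg.slogdet(self.cov)[1]
        if logdet - self.last_logdet >= self.threshold:
            self.last_logdet = logdet
            return True
        return False

# Toy usage: over 10,000 observations the trigger fires only a handful of times.
rng = np.random.default_rng(2)
trigger = InfoGainTrigger(dim=5)
num_updates = sum(trigger.observe(rng.normal(size=5)) for _ in range(10_000))

Because the log-determinant grows only logarithmically in the number of data points, the trigger fires at most $O(d \log K)$ times over $K$ observations, which is the flavor of the $\operatorname{polylog}(K)$ update count quoted above.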
arXiv Detail & Related papers (2021-06-14T07:36:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.