Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation
- URL: http://arxiv.org/abs/2405.17061v3
- Date: Thu, 16 Jan 2025 14:45:52 GMT
- Title: Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation
- Authors: Long-Fei Li, Yu-Jie Zhang, Peng Zhao, Zhi-Hua Zhou
- Abstract summary: We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space.
Despite its significant benefits, incorporating the non-linear function raises substantial challenges in both statistical and computational efficiency.
We propose an algorithm that achieves a regret of $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$, eliminating the dependence on $\kappa^{-1}$ in the dominant term for the first time.
- Score: 67.8414514524356
- License:
- Abstract: We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space. Despite its significant benefits, incorporating the non-linear function raises substantial challenges in both statistical and computational efficiency. The best-known result of Hwang and Oh [2023] has achieved an $\widetilde{\mathcal{O}}(\kappa^{-1}dH^2\sqrt{K})$ regret upper bound, where $\kappa$ is a problem-dependent quantity, $d$ is the feature dimension, $H$ is the episode length, and $K$ is the number of episodes. However, we observe that $\kappa^{-1}$ exhibits polynomial dependence on the number of reachable states, which can be as large as the state space size in the worst case and thus undermines the motivation for function approximation. Additionally, their method requires storing all historical data and the time complexity scales linearly with the episode count, which is computationally expensive. In this work, we propose a statistically efficient algorithm that achieves a regret of $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$, eliminating the dependence on $\kappa^{-1}$ in the dominant term for the first time. We then address the computational challenges by introducing an enhanced algorithm that achieves the same regret guarantee but with only constant cost. Finally, we establish the first lower bound for this problem, justifying the optimality of our results in $d$ and $K$.
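For intuition, here is a minimal sketch (not taken from the paper) of how an MNL/softmax parameterization turns linear scores over features into a valid transition distribution over candidate next states; the feature map `phi` and parameter vector `theta` below are illustrative placeholders, and the notation may differ from the paper's.

```python
import numpy as np

def mnl_transition_probs(phi, theta):
    """Softmax over the linear scores phi(s, a, s')^T theta, one score per
    candidate next state s'. The output is non-negative and sums to one,
    which is the 'valid probability distribution' property the abstract
    refers to."""
    logits = phi @ theta
    logits -= logits.max()        # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy example: d = 3 features, 4 reachable next states (illustrative sizes).
rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3))     # one feature row per candidate next state
theta = rng.normal(size=3)        # shared parameter vector to be learned
p = mnl_transition_probs(phi, theta)
print(p, p.sum())                 # probabilities over next states; sum is 1.0
```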
Related papers
- Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism [1.4999444543328293]
We propose a model-based algorithm based on novel cost and reward function estimators.
Our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar{C} - \bar{C}_b)^{-1}H^{2.5} S\sqrt{AK})$ where $\bar{C}$ is the cost budget for an episode.
arXiv Detail & Related papers (2024-10-14T04:51:06Z) - Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation [8.274693573069442]
We study reinforcement learning with multinomial logistic (MNL) function approximation.
We propose provably efficient algorithms with randomized exploration having frequentist regret guarantees.
Numerical experiments demonstrate the superior performance of the proposed algorithms.
arXiv Detail & Related papers (2024-05-30T15:39:19Z) - Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds [26.277745106128197]
In this work, we address the challenge of such rewards in reinforcement learning with linear function approximation.
We first design an algorithm, Heavy-OFUL, for heavy-tailed linear bandits, achieving an instance-dependent $T$-round regret bound.
Our result is achieved via a novel self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
arXiv Detail & Related papers (2023-06-12T02:56:09Z) - Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve upon the best existing $\tilde{\mathcal{O}}(K^{2/3})$ regret in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - TURF: A Two-factor, Universal, Robust, Fast Distribution Learning Algorithm [64.13217062232874]
One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d$ polynomial.
We provide a method that estimates this number near-optimally, hence helps approach the best possible approximation.
arXiv Detail & Related papers (2022-02-15T03:49:28Z) - Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al. (2020).
We show the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation [92.3161051419884]
We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial rewards and full-information feedback.
We show that the proposed algorithm achieves $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z)