Demystifying Linear MDPs and Novel Dynamics Aggregation Framework
- URL: http://arxiv.org/abs/2410.24089v1
- Date: Thu, 31 Oct 2024 16:21:41 GMT
- Title: Demystifying Linear MDPs and Novel Dynamics Aggregation Framework
- Authors: Joongkyu Lee, Min-hwan Oh,
- Abstract summary: In linear MDPs, the feature dimension $d$ is lower bounded by $S/U$ in order to aptly represent transition probabilities.
We propose a novel structural aggregation framework based on dynamics, named as the "dynamics aggregation"
Our proposed algorithm exhibits statistical efficiency, achieving a regret of $ tildeO ( d_psi3/2 H3/2sqrt T)$, where $d_psi$ represents the feature dimension of aggregated subMDPs.
- Score: 8.087699764574788
- License:
- Abstract: In this work, we prove that, in linear MDPs, the feature dimension $d$ is lower bounded by $S/U$ in order to aptly represent transition probabilities, where $S$ is the size of the state space and $U$ is the maximum size of directly reachable states. Hence, $d$ can still scale with $S$ depending on the direct reachability of the environment. To address this limitation of linear MDPs, we propose a novel structural aggregation framework based on dynamics, named as the "dynamics aggregation". For this newly proposed framework, we design a provably efficient hierarchical reinforcement learning algorithm in linear function approximation that leverages aggregated sub-structures. Our proposed algorithm exhibits statistical efficiency, achieving a regret of $ \tilde{O} ( d_{\psi}^{3/2} H^{3/2}\sqrt{ N T} )$, where $d_{\psi}$ represents the feature dimension of aggregated subMDPs and $N$ signifies the number of aggregated subMDPs. We establish that the condition $d_{\psi}^3 N \ll d^{3}$ is readily met in most real-world environments with hierarchical structures, enabling a substantial improvement in the regret bound compared to LSVI-UCB, which enjoys a regret of $ \tilde{O} (d^{3/2} H^{3/2} \sqrt{ T})$. To the best of our knowledge, this work presents the first HRL algorithm with linear function approximation that offers provable guarantees.
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $widetildeO(dsqrtHS3K + sqrtHSAK)$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best identification in discounted linear Markov+Delta Decision in the fixed confidence setting under a generative model.
The lower bound as the solution of an intricate non- optimization program can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Nearly Minimax Optimal Reinforcement Learning with Linear Function
Approximation [25.60689712525918]
We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear.
We propose a novel-efficient algorithm, LSVI-UCB$+$, which achieves an $widetildeO(HdsqrtT)$ regret bound where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps.
arXiv Detail & Related papers (2022-06-23T06:04:21Z) - Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement
Learning with Latent Low-Rank Structure [9.759209713196718]
We consider a class of MDPs for which the associated optimal $Q*$ function is low rank, where the latent features are unknown.
We show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $tildeOleft((|S|+|A|)mathrmpoly(d,H)/epsilon2right)$ for a rank
arXiv Detail & Related papers (2022-06-07T20:39:51Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study the model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs)
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $epsilon$-optimal policy for arbitrary reward function, UCRL-RFE needs to sample at most $tilde O(H4d(H + d)epsilon-2)$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs
with a Generative Model [3.749193647980305]
This paper considers a Markov decision process (MDP) that admits a set of state-action features.
We show that a model-based approach (resp.$$Q-learning) provably learns an $varepsilon$-optimal policy with high probability.
arXiv Detail & Related papers (2021-05-28T17:49:39Z) - Logarithmic Regret for Reinforcement Learning with Linear Function
Approximation [99.59319332864129]
We show that logarithmic regret is attainable under two recently proposed linear MDP assumptions.
To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation.
arXiv Detail & Related papers (2020-11-23T17:25:00Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Decision (MDP)
We propose a novel algorithm that makes use of the feature mapping and obtains a $tilde O(dsqrtT/ (1-gamma)2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $ (1-gamma)-0.5$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.