Efficient Reinforcement Learning in Factored MDPs with Application to
Constrained RL
- URL: http://arxiv.org/abs/2008.13319v3
- Date: Wed, 10 Mar 2021 01:58:03 GMT
- Title: Efficient Reinforcement Learning in Factored MDPs with Application to
Constrained RL
- Authors: Xiaoyu Chen, Jiachen Hu, Lihong Li, Liwei Wang
- Abstract summary: Reinforcement learning in episodic, factored Markov decision processes (FMDPs) is studied.
We propose an algorithm called FMDP-BF, which leverages the factorization structure of FMDP.
- As an application, we study a new formulation of constrained RL, known as RL with knapsack constraints (RLwK).
- Score: 25.119552984253882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) in episodic, factored Markov decision processes
(FMDPs) is studied. We propose an algorithm called FMDP-BF, which leverages the
factorization structure of FMDP. The regret of FMDP-BF is shown to be
exponentially smaller than that of optimal algorithms designed for non-factored
MDPs, and improves on the best previous result for FMDPs~\citep{osband2014near}
by a factor of $\sqrt{H|\mathcal{S}_i|}$, where $|\mathcal{S}_i|$ is the
cardinality of the factored state subspace and $H$ is the planning horizon. To
show the optimality of our bounds, we also provide a lower bound for FMDPs, which
indicates that our algorithm is near-optimal w.r.t. the number of timesteps $T$, the horizon
$H$, and the factored state-action subspace cardinalities. Finally, as an application,
we study a new formulation of constrained RL, known as RL with knapsack
constraints (RLwK), and provide the first sample-efficient algorithm based on
FMDP-BF.
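For context, the factorization structure that FMDP-BF exploits is the standard FMDP transition factorization; the display below is a generic sketch using conventional scope notation $Z_i$, not taken verbatim from the abstract:
$$ P(s' \mid s, a) \;=\; \prod_{i=1}^{m} P_i\big(s'_i \,\big|\, (s,a)[Z_i]\big), $$
where each factor $P_i$ depends only on a small scope $(s,a)[Z_i]$ of the state-action pair, so regret and sample-complexity bounds can scale with the factored cardinalities $|\mathcal{S}_i|$ rather than with the full state-space size $|\mathcal{S}| = \prod_i |\mathcal{S}_i|$.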
Related papers
- Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL)
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay among exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves a sample-complexity bound for obtaining an $\epsilon$-optimal CVaR, stated in terms of the episode length $H$, the capacity $A$ of the action space, and the representation dimension $d$.
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - Regularization and Variance-Weighted Regression Achieves Minimax
Optimality in Linear MDPs: Theory and Practice [79.48432795639403]
Mirror descent value iteration (MDVI) is an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL)
We study MDVI with linear function approximation through the sample complexity required to identify an $\varepsilon$-optimal policy.
We present Variance-Weighted Least-Squares MDVI, the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs.
arXiv Detail & Related papers (2023-05-22T16:13:05Z) - Model-Based Reinforcement Learning with Multinomial Logistic Function Approximation [10.159501412046508]
We study model-based reinforcement learning (RL) for episodic Markov decision processes (MDPs)
We establish a provably efficient RL algorithm for MDPs whose state transitions are given by a multinomial logistic model (a generic form of this model is sketched after this list).
To the best of our knowledge, this is the first model-based RL algorithm with multinomial logistic function approximation with provable guarantees.
arXiv Detail & Related papers (2022-12-27T16:25:09Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement
Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes.
Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap.
Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - Improved Exploration in Factored Average-Reward MDPs [23.096751699592133]
We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP)
We introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function (a generic sketch of such a confidence width appears after this list).
We show that for a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds in terms of the dependencies on the sizes of the $\mathcal{S}_i$'s and the involved diameter-related terms.
arXiv Detail & Related papers (2020-09-09T21:15:01Z) - Towards Minimax Optimal Reinforcement Learning in Factored Markov
Decision Processes [53.72166325215299]
We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs)
We present two algorithms: the first achieves minimax optimal regret guarantees for a rich class of factored structures, while the second enjoys better computational complexity at the cost of a slightly worse regret.
arXiv Detail & Related papers (2020-06-24T00:50:17Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov decision processes (MDPs)
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z) - Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and
Tighter Regret Bounds for the Non-Episodic Setting [24.90164851620799]
We study reinforcement learning in non-episodic factored Markov decision processes (FMDPs)
We propose two near-optimal and oracle-efficient algorithms for FMDPs.
Our oracle-efficient algorithms outperform previously proposed near-optimal algorithms on computer network administration simulations.
arXiv Detail & Related papers (2020-02-06T15:19:53Z)
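The multinomial logistic transition model referenced above (Model-Based Reinforcement Learning with Multinomial Logistic Function Approximation) typically takes a softmax form; the feature map $\varphi$ and parameter vector $\theta$ below are conventional notation assumed for illustration, not taken from that abstract:
$$ P_{\theta}(s' \mid s, a) \;=\; \frac{\exp\big(\varphi(s, a, s')^{\top} \theta\big)}{\sum_{s'' \in \mathcal{S}_{s,a}} \exp\big(\varphi(s, a, s'')^{\top} \theta\big)}, $$
where $\mathcal{S}_{s,a}$ denotes the set of states reachable from the state-action pair $(s, a)$.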
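The Bernstein-type confidence sets referenced in the DBN-UCRL entry shrink with the empirical variance of each estimated transition probability. The following is a minimal, hypothetical Python sketch of an empirical-Bernstein confidence width for one estimated probability; the function name and constants are illustrative and not taken from any paper listed here.

import math

def bernstein_confidence_width(p_hat: float, n: int, delta: float) -> float:
    # Empirical-Bernstein-style width for one estimated transition probability.
    # p_hat: empirical estimate, n: sample count, delta: confidence parameter.
    if n <= 1:
        return 1.0  # probabilities lie in [0, 1], so 1.0 is a trivial width
    log_term = math.log(2.0 / delta)
    variance_term = math.sqrt(2.0 * p_hat * (1.0 - p_hat) * log_term / n)
    correction_term = 7.0 * log_term / (3.0 * (n - 1))
    return min(1.0, variance_term + correction_term)

An optimistic model-based learner would then treat any transition probability within this width of the empirical estimate as plausible; for example, p_hat = 0.3 with n = 200 samples and delta = 0.05 gives a width of roughly 0.13.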