Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting
- URL: http://arxiv.org/abs/2002.02302v2
- Date: Sat, 6 Jun 2020 03:30:47 GMT
- Title: Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting
- Authors: Ziping Xu and Ambuj Tewari
- Abstract summary: We study reinforcement learning in non-episodic factored Markov decision processes (FMDPs).
We propose two near-optimal and oracle-efficient algorithms for FMDPs.
Our oracle-efficient algorithms outperform previously proposed near-optimal algorithms on computer network administration simulations.
- Score: 24.90164851620799
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reinforcement learning in non-episodic factored Markov decision
processes (FMDPs). We propose two near-optimal and oracle-efficient algorithms
for FMDPs. Assuming oracle access to an FMDP planner, they enjoy a Bayesian and
a frequentist regret bound respectively, both of which reduce to the
near-optimal bound $\widetilde{O}(DS\sqrt{AT})$ for standard non-factored MDPs.
We propose a tighter connectivity measure, factored span, for FMDPs and prove a
lower bound that depends on the factored span rather than the diameter $D$. In
order to decrease the gap between lower and upper bounds, we propose an
adaptation of the REGAL.C algorithm whose regret bound depends on the factored
span. Our oracle-efficient algorithms outperform previously proposed
near-optimal algorithms on computer network administration simulations.
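To make the factored structure concrete, here is a minimal sketch (illustrative only, not the paper's model: the factor scopes, the `local_prob` conditional, and all constants are made up) of how an FMDP writes the joint transition kernel as a product of local conditionals, each depending only on a small scope of state factors:

```python
import numpy as np

# Illustrative FMDP transition model: a state is a tuple of binary factors
# (x_0, ..., x_{n-1}); factor i's next value depends only on the parents in
# scopes[i] plus the action, so P(s' | s, a) factorizes across factors.
n_factors = 3
scopes = {0: [0], 1: [0, 1], 2: [2]}  # hypothetical parent sets

def local_prob(i, parent_vals, action):
    """P(x_i' = 1 | parents, action); an arbitrary illustrative conditional."""
    return 1.0 / (1.0 + np.exp(-(sum(parent_vals) + action - 1)))

def transition_prob(state, action, next_state):
    """Joint transition probability as a product of local conditionals."""
    p = 1.0
    for i in range(n_factors):
        parents = [state[j] for j in scopes[i]]
        p1 = local_prob(i, parents, action)
        p *= p1 if next_state[i] == 1 else 1.0 - p1
    return p

print(transition_prob(state=(1, 0, 1), action=1, next_state=(1, 1, 0)))
```

Each local conditional needs a number of parameters exponential only in its scope size, not in the total number of factors, which is what lets regret scale with the factored dimensions instead of the full state space. For context on the connectivity measures: REGAL-style analyses replace the diameter $D$ with the span seminorm $sp(h^*) = \max_s h^*(s) - \min_s h^*(s)$ of the optimal bias function; the factored span proposed here is a tighter, structure-aware analogue (see the paper for its precise definition).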
Related papers
- Efficient Exploration in Average-Reward Constrained Reinforcement Learning: Achieving Near-Optimal Regret With Posterior Sampling [14.776559457850624]
We present a new algorithm based on posterior sampling for learning in Constrained Markov Decision Processes (CMDPs); a generic posterior-sampling loop is sketched after this list.
The algorithm achieves near-optimal regret bounds while being empirically advantageous compared to existing algorithms.
arXiv Detail & Related papers (2024-05-29T11:59:56Z)
- Logarithmic Switching Cost in Reinforcement Learning beyond Linear MDPs [31.673857053336352]
We propose an algorithm that achieves near-optimal regret with a switching cost logarithmic in the number of episodes and linear in the time horizon $H$.
We also show that the "doubling trick" used in ELEANOR-LowSwitching can be further leveraged for generalized linear function approximation.
arXiv Detail & Related papers (2023-02-24T05:14:27Z)
- Large-scale Optimization of Partial AUC in a Range of False Positive Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximate gradient descent method based on a recent practical envelope smoothing technique.
Our proposed algorithm can also be used to minimize the sum of ranked range loss, which likewise lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z)
- Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits [8.136957953239254]
We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B.
The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken.
Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy.
arXiv Detail & Related papers (2021-09-20T21:40:12Z)
- Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning [50.44564503645015]
We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes.
We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs.
arXiv Detail & Related papers (2021-07-02T20:36:05Z)
- Lower Bounds and Optimal Algorithms for Smooth and Strongly Convex Decentralized Optimization Over Time-Varying Networks [79.16773494166644]
We consider the task of minimizing the sum of smooth and strongly convex functions stored in a decentralized manner across the nodes of a communication network.
We establish lower bounds for this setting and design two optimal algorithms that attain them.
We corroborate the theoretical efficiency of these algorithms by performing an experimental comparison with existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-08T15:54:44Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Oracle-Efficient Regret Minimization in Factored MDPs with Unknown Structure [57.90236104782219]
We study regret in non-episodic factored Markov decision processes (FMDPs).
All existing algorithms make the strong assumption that the factored structure of the FMDP is known to the learner in advance.
We provide the first algorithm that learns the structure of the FMDP while minimizing the regret.
arXiv Detail & Related papers (2020-09-13T12:30:35Z)
- Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL [25.119552984253882]
Reinforcement learning in episodic, factored Markov decision processes (FMDPs) is studied.
We propose an algorithm called FMDP-BF, which leverages the factored structure of the FMDP.
As an application, we study a new formulation of constrained RL, known as RL with knapsack constraints (RLwK).
arXiv Detail & Related papers (2020-08-31T02:20:41Z)
- Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes [53.72166325215299]
We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs) and present two algorithms.
The first achieves minimax optimal regret guarantees for a rich class of factored structures.
The second enjoys better computational complexity at the cost of slightly worse regret.
arXiv Detail & Related papers (2020-06-24T00:50:17Z)
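For the posterior-sampling papers above (and the Bayesian-regret algorithm in the main paper, which pairs posterior sampling with an FMDP planning oracle), here is a generic, minimal PSRL-style loop. It is a sketch only: the tabular dynamics, Dirichlet posteriors, toy `plan_optimal_policy` oracle, and `step` environment are all assumptions, and it omits the constraints and average-reward setting of the CMDP paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_episodes, horizon = 5, 2, 100, 20

# Dirichlet posterior over each (s, a) transition row, from a uniform prior;
# the counts are updated with every observed transition.
counts = np.ones((n_states, n_actions, n_states))

def plan_optimal_policy(P):
    """Stand-in planning oracle: finite-horizon value iteration on the
    sampled model P with a fixed illustrative reward r(s) = s."""
    r = np.arange(n_states, dtype=float)
    V = np.zeros(n_states)
    for _ in range(horizon):
        Q = r[:, None] + P @ V      # Q[s, a] on the sampled model
        V = Q.max(axis=1)
    return Q.argmax(axis=1)         # greedy policy per state

def step(s, a):
    """Stand-in environment with deterministic dynamics s' = min(s+a, S-1),
    unknown to the learner."""
    return int(min(s + a, n_states - 1))

for _ in range(n_episodes):
    # Sample a model from the posterior, plan on it, act, update the posterior.
    P = np.array([[rng.dirichlet(counts[s, a]) for a in range(n_actions)]
                  for s in range(n_states)])
    policy = plan_optimal_policy(P)
    s = 0
    for _ in range(horizon):
        a = policy[s]
        s2 = step(s, a)
        counts[s, a, s2] += 1
        s = s2
```

Concentration of the posterior around the true dynamics is what drives Bayesian regret bounds; a factored variant would maintain a posterior per local conditional, so concentration rates would be governed by the scope sizes rather than the full state space.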
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.