A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs
- URL: http://arxiv.org/abs/2106.13013v1
- Date: Thu, 24 Jun 2021 13:46:09 GMT
- Title: A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs
- Authors: Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric
- Abstract summary: We derive a novel problem-dependent lower-bound for regret in finite-horizon Markov Decision Processes (MDPs).
We show that, in certain "simple" MDPs, our lower-bound is considerably smaller than in the general case and does not scale with the minimum action gap at all.
We show that this last result is attainable (up to $poly(H)$ terms, where $H$ is the horizon) by providing a regret upper-bound based on policy gaps for an optimistic algorithm.
- Score: 117.82903457289584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We derive a novel asymptotic problem-dependent lower-bound for regret
minimization in finite-horizon tabular Markov Decision Processes (MDPs). While,
similar to prior work (e.g., for ergodic MDPs), the lower-bound is the solution
to an optimization problem, our derivation reveals the need for an additional
constraint on the visitation distribution over state-action pairs that
explicitly accounts for the dynamics of the MDP. We provide a characterization
of our lower-bound through a series of examples illustrating how different MDPs
may have significantly different complexity. 1) We first consider a "difficult"
MDP instance, where the novel constraint based on the dynamics leads to a
larger lower-bound (i.e., a larger regret) compared to the classical analysis.
2) We then show that our lower-bound recovers results previously derived for
specific MDP instances. 3) Finally, we show that, in certain "simple" MDPs, the
lower bound is considerably smaller than in the general case and it does not
scale with the minimum action gap at all. We show that this last result is
attainable (up to $poly(H)$ terms, where $H$ is the horizon) by providing a
regret upper-bound based on policy gaps for an optimistic algorithm.
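To make the structure of such an optimization-based bound concrete, here is a minimal, hypothetical sketch of a gap-weighted allocation program with a visitation (flow) constraint reflecting the dynamics. This is not the paper's exact program: the information constraint is a simplified bandit-style surrogate `eta * kl >= 1`, and `P`, `mu0`, `gaps`, and `kl` are illustrative placeholder inputs; `cvxpy` is used only as a convenient LP solver.

```python
# Illustrative sketch (NOT the paper's exact lower-bound program): minimize the
# gap-weighted visitation allocation subject to (i) a simplified bandit-style
# information constraint on suboptimal state-action pairs and (ii) flow
# constraints forcing the allocation to be consistent with the MDP dynamics.
import numpy as np
import cvxpy as cp

H, S, A = 3, 4, 2                                 # horizon, states, actions (toy sizes)
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(S), size=(H, S, A))     # P[h, s, a, :] = next-state distribution
mu0 = np.full(S, 1.0 / S)                         # initial-state distribution
gaps = rng.uniform(0.0, 1.0, size=(H, S, A))      # action gaps Delta_h(s, a)
gaps[..., 0] = 0.0                                # pretend action 0 is optimal everywhere
kl = np.full((H, S, A), 0.5)                      # placeholder per-pair information terms

eta = cp.Variable((H, S * A), nonneg=True)        # visitation allocation, flattened over (s, a)
t = cp.Variable(nonneg=True)                      # free total mass at stage 0

constraints = []
# (i) Information constraints: suboptimal pairs must be explored enough.
for h in range(H):
    for s in range(S):
        for a in range(A):
            if gaps[h, s, a] > 0:
                constraints.append(eta[h, s * A + a] * kl[h, s, a] >= 1)
# (ii) Dynamics constraints: stage-0 mass follows mu0, later stages follow the flow.
for s in range(S):
    constraints.append(cp.sum(eta[0, s * A:(s + 1) * A]) == t * mu0[s])
for h in range(1, H):
    P_prev = P[h - 1].reshape(S * A, S)
    for s2 in range(S):
        inflow = eta[h - 1, :] @ P_prev[:, s2]
        constraints.append(cp.sum(eta[h, s2 * A:(s2 + 1) * A]) == inflow)

objective = cp.Minimize(cp.sum(cp.multiply(eta, gaps.reshape(H, S * A))))
value = cp.Problem(objective, constraints).solve()
print(f"value of the illustrative optimization: {value:.3f}")
```

The flow constraints are the ingredient the abstract highlights: without them, each allocation $\eta_h(s,a)$ could be chosen independently, which corresponds to the classical, dynamics-agnostic relaxation.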
Related papers
- Towards Instance-Optimality in Online PAC Reinforcement Learning [28.156332484814616]
We propose the first instance-dependent lower bound on the sample complexity required for the PAC identification of a near-optimal policy.
We demonstrate that the sample complexity of the PEDEL algorithm of Wagenmaker et al. closely approaches this lower bound.
arXiv Detail & Related papers (2023-10-31T19:26:36Z)
- Lower Bounds for Learning in Revealing POMDPs [88.23337313766355]
This paper studies the fundamental limits of reinforcement learning (RL) in the challenging partially observable setting.
For multi-step revealing POMDPs, we show that the latent state-space dependence is at least $\Omega(S^{1.5})$ in the sample complexity.
arXiv Detail & Related papers (2023-02-02T18:59:30Z)
- Near-Optimal Sample Complexity Bounds for Constrained MDPs [25.509556551558834]
We provide minimax upper and lower bounds on the sample complexity for learning near-optimal policies in a discounted CMDP.
We show that learning CMDPs is as easy as MDPs when small constraint violations are allowed, but inherently more difficult when we demand zero constraint violation.
arXiv Detail & Related papers (2022-06-13T15:58:14Z)
- Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes.
Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap.
Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z)
- Under-Approximating Expected Total Rewards in POMDPs [68.8204255655161]
We consider the optimal expected total reward to reach a goal state in a partially observable Markov decision process (POMDP).
We use mixed-integer linear programming (MILP) to find such minimal probability shifts and experimentally show that our techniques scale quite well.
arXiv Detail & Related papers (2022-01-21T16:43:03Z)
- RL for Latent MDPs: Regret Guarantees and a Lower Bound [74.41782017817808]
We consider the regret problem for reinforcement learning in latent Markov Decision Processes (LMDP).
In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent.
We show that the key link is a notion of separation between the MDP system dynamics.
arXiv Detail & Related papers (2021-02-09T16:49:58Z)
- Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited [46.682733513730334]
We propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs.
Our main contribution is a novel lower bound of $\Omega((H^3SA/\varepsilon^2)\log(1/\delta))$ on the sample complexity of an $(\varepsilon,\delta)$-PAC algorithm for best policy identification in a non-stationary MDP (a numeric illustration of this scaling appears after this list).
arXiv Detail & Related papers (2020-10-07T17:23:01Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodically constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
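As a side note on the $\Omega((H^3SA/\varepsilon^2)\log(1/\delta))$ bound quoted in the minimax lower-bounds entry above, the following toy calculation only illustrates how that scaling grows with $H$, $S$, $A$, $\varepsilon$, and $\delta$; the instance sizes are arbitrary placeholders and the universal constant hidden in $\Omega(\cdot)$ is ignored.

```python
# Purely illustrative: evaluate the H^3 * S * A / eps^2 * log(1/delta) scaling
# of the quoted PAC sample-complexity lower bound for a few toy instance sizes.
# The hidden universal constant is unknown, so only the growth pattern matters.
import math

def pac_lower_bound_scaling(H, S, A, eps, delta):
    """Return H^3 * S * A / eps^2 * log(1/delta), ignoring the hidden constant."""
    return (H ** 3) * S * A / eps ** 2 * math.log(1.0 / delta)

for H, S, A in [(5, 10, 4), (10, 10, 4), (10, 100, 4)]:
    n = pac_lower_bound_scaling(H, S, A, eps=0.1, delta=0.05)
    print(f"H={H:>3} S={S:>3} A={A}: ~{n:,.0f} samples (up to a constant factor)")
```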
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.