Model approximation in MDPs with unbounded per-step cost
- URL: http://arxiv.org/abs/2402.08813v1
- Date: Tue, 13 Feb 2024 21:36:30 GMT
- Title: Model approximation in MDPs with unbounded per-step cost
- Authors: Berk Bozkurt, Aditya Mahajan, Ashutosh Nayyar, Yi Ouyang
- Abstract summary: We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process $mathcalM$ when we only have access to an approximate model $hatmathcalM$.
How well does an optimal policy $hatpistar$ of the approximate model perform when used in the original model $mathcalM$?
We provide upper bounds that explicitly depend on the weighted distance between cost functions and weighted distance between transition kernels of the original and approximate models.
- Score: 3.456139143869137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of designing a control policy for an infinite-horizon
discounted cost Markov decision process $\mathcal{M}$ when we only have access
to an approximate model $\hat{\mathcal{M}}$. How well does an optimal policy
$\hat{\pi}^{\star}$ of the approximate model perform when used in the original
model $\mathcal{M}$? We answer this question by bounding a weighted norm of the
difference between the value function of $\hat{\pi}^\star $ when used in
$\mathcal{M}$ and the optimal value function of $\mathcal{M}$. We then extend
our results and obtain potentially tighter upper bounds by considering affine
transformations of the per-step cost. We further provide upper bounds that
explicitly depend on the weighted distance between cost functions and weighted
distance between transition kernels of the original and approximate models. We
present examples to illustrate our results.
Related papers
- Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation [3.2703356989962518]
We study model-based reinforcement learning with non-linear function approximation.
We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings.
arXiv Detail & Related papers (2024-06-19T15:29:14Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue
Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting.
Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function.
We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP)
We prove an upper bound of $widetilde O(H varepsilon-3 ln frac1delta)$ samples per state-action pair, where $H := sp(h*)$ is the span of bias of any optimal policy, $varepsilon$ is the accuracy and $delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs)
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Faster Sampling from Log-Concave Distributions over Polytopes via a
Soft-Threshold Dikin Walk [28.431572772564518]
We consider the problem of sampling from a $d$-dimensional log-concave distribution $pi(theta) propto e-f(theta)$ constrained to a polytope $K$ defined by $m$ inequalities.
Our main result is a "soft-warm'' variant of the Dikin walk Markov chain that requires at most $O((md + d L2 R2) times MDomega-1) log(fracwdelta)$ arithmetic operations to sample from $pi$
arXiv Detail & Related papers (2022-06-19T11:33:07Z) - Model Selection with Near Optimal Rates for Reinforcement Learning with
General Model Classes [27.361399036211694]
We address the problem of model selection for the finite horizon episodic Reinforcement Learning (RL) problem.
In the model selection framework, instead of $mathcalP*$, we are given $M$ nested families of transition kernels.
We show that textttARL-GEN obtains a regret of $TildemathcalO(d_mathcalE*H2+sqrtd_mathcalE* mathbbM* H2 T)$
arXiv Detail & Related papers (2021-07-13T05:00:38Z) - Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs
with a Generative Model [3.749193647980305]
This paper considers a Markov decision process (MDP) that admits a set of state-action features.
We show that a model-based approach (resp.$$Q-learning) provably learns an $varepsilon$-optimal policy with high probability.
arXiv Detail & Related papers (2021-05-28T17:49:39Z) - Convergence of Sparse Variational Inference in Gaussian Processes
Regression [29.636483122130027]
We show that a method with an overall computational cost of $mathcalO(log N)2D(loglog N)2)$ can be used to perform inference.
arXiv Detail & Related papers (2020-08-01T19:23:34Z) - Model-Based Reinforcement Learning with Value-Targeted Regression [48.92439657407732]
We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $mathcalP$.
We derive a bound on the regret, which, in the special case of linear mixtures, the regret bound takes the form $tildemathcalO(dsqrtH3T)$.
arXiv Detail & Related papers (2020-06-01T17:47:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.