Online Reinforcement Learning in Markov Decision Process Using Linear
Programming
- URL: http://arxiv.org/abs/2304.00155v3
- Date: Mon, 11 Mar 2024 00:38:26 GMT
- Title: Online Reinforcement Learning in Markov Decision Process Using Linear
Programming
- Authors: Vincent Leon, S. Rasoul Etesami
- Abstract summary: We consider online reinforcement learning in episodic Markov decision process (MDP) with unknown transition function and rewards drawn from some fixed but unknown distribution.
We devise a simple and efficient model-based algorithm that achieves $widetildeO(LXsqrtTA)$ regret with high probability.
- Score: 1.0878040851638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider online reinforcement learning in episodic Markov decision process
(MDP) with unknown transition function and stochastic rewards drawn from some
fixed but unknown distribution. The learner aims to learn the optimal policy
and minimize their regret over a finite time horizon through interacting with
the environment. We devise a simple and efficient model-based algorithm that
achieves $\widetilde{O}(LX\sqrt{TA})$ regret with high probability, where $L$
is the episode length, $T$ is the number of episodes, and $X$ and $A$ are the
cardinalities of the state space and the action space, respectively. The
proposed algorithm, which is based on the concept of ``optimism in the face of
uncertainty", maintains confidence sets of transition and reward functions and
uses occupancy measures to connect the online MDP with linear programming. It
achieves a tighter regret bound compared to the existing works that use a
similar confidence set framework and improves computational effort compared to
those that use a different framework but with a slightly tighter regret bound.
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $widetildemathcalO(d sqrtH3 K + sqrtHK(H + barP_K$)$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z) - Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization [11.11876897168701]
We propose an algorithm that achieves a regret bound of order $tildemathcalO(mathrmpoly(H)sqrtSAT)$.
The proposed algorithm and analysis completely avoid the typical tool given by occupancy measures.
arXiv Detail & Related papers (2024-07-08T08:06:45Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $tildeO(dHsqrtT)$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $tildeOmega(dHsqrtT)$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(sqrtK)$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Efficient Learning in Non-Stationary Linear Markov Decision Processes [17.296084954104415]
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs)
We propose OPT-WLSVI an optimistic model-free algorithm based on weighted least squares value which uses exponential weights to smoothly forget data that are far in the past.
We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $widetildemathcalO(d5/4H2 Delta1/4 K3/4)$ where $d$
arXiv Detail & Related papers (2020-10-24T11:02:45Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Decision (MDP)
We propose a novel algorithm that makes use of the feature mapping and obtains a $tilde O(dsqrtT/ (1-gamma)2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $ (1-gamma)-0.5$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z) - A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs
with Near-optimal Regret [44.374427255708135]
We propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs)
EE-QL assumes that an online concentrating approximation of the optimal average reward is available.
This is the first model-free learning algorithm that achieves $O(sqrt T)$ regret without the ergodic assumption, and matches the lower bound in terms of $T$ except for logarithmic factors.
arXiv Detail & Related papers (2020-06-08T05:09:32Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodically constrained Markov decision processes (CMDPs)
We propose a new emphupper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.