Causal Markov Decision Processes: Learning Good Interventions
Efficiently
- URL: http://arxiv.org/abs/2102.07663v1
- Date: Mon, 15 Feb 2021 16:48:54 GMT
- Title: Causal Markov Decision Processes: Learning Good Interventions
Efficiently
- Authors: Yangyi Lu, Amirhossein Meisami, Ambuj Tewari
- Abstract summary: We introduce causal Markov Decision Processes (C-MDPs), a new formalism for sequential decision making.
Many contemporary and emerging application areas such as digital healthcare and digital marketing can benefit from modeling with C-MDPs.
- Score: 24.58691421788476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce causal Markov Decision Processes (C-MDPs), a new formalism for
sequential decision making which combines the standard MDP formulation with
causal structures over state transition and reward functions. Many contemporary
and emerging application areas such as digital healthcare and digital marketing
can benefit from modeling with C-MDPs due to the causal mechanisms underlying
the relationship between interventions and states/rewards. We propose the
causal upper confidence bound value iteration (C-UCBVI) algorithm that exploits
the causal structure in C-MDPs and improves the performance of standard
reinforcement learning algorithms that do not take causal knowledge into
account. We prove that C-UCBVI satisfies an $\tilde{O}(HS\sqrt{ZT})$ regret
bound, where $T$ is the total number of time steps, $H$ is the episodic horizon, and
$S$ is the cardinality of the state space. Notably, our regret bound does not
scale with the size of actions/interventions ($A$), but only scales with a
causal graph dependent quantity $Z$ which can be exponentially smaller than
$A$. By extending C-UCBVI to the factored MDP setting, we propose the causal
factored UCBVI (CF-UCBVI) algorithm, which further reduces the regret
exponentially in terms of $S$. Furthermore, we show that RL algorithms for
linear MDP problems can also be incorporated in C-MDPs. We empirically show the
benefit of our causal approaches in various settings to validate our algorithms
and theoretical results.
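
The key mechanism behind the $Z$-dependent (rather than $A$-dependent) regret can be illustrated with a small optimistic value-iteration sketch. The snippet below is a minimal, hypothetical illustration, not the authors' C-UCBVI pseudocode: it assumes a known mapping from interventions to a small number of causal equivalence classes (the role played by $Z$ in the abstract), and the names (`causal_ucbvi_planning`, `intervention_to_class`) and bonus constants are illustrative choices rather than quantities taken from the paper.

```python
# Illustrative sketch (not the authors' exact C-UCBVI): UCBVI-style backward
# induction in which the exploration bonus is driven by visit counts over
# "causal equivalence classes" of interventions, of which there are Z << A.
import numpy as np


def causal_ucbvi_planning(P_hat, R_hat, counts_sz, H, T, delta=0.05):
    """One optimistic planning pass.

    P_hat     : (Z, S, S) estimated transition kernel per (class, state)
    R_hat     : (Z, S)    estimated mean reward per (class, state)
    counts_sz : (Z, S)    visit counts per (class, state) -- indexed by the
                          causal class z, not by the raw intervention a
    Returns optimistic Q-values of shape (H, S, Z).
    """
    Z, S, _ = P_hat.shape
    Q = np.zeros((H + 1, S, Z))
    V = np.zeros((H + 1, S))
    # Hoeffding-style bonus; the scaling is chosen for readability only.
    bonus = H * np.sqrt(np.log(S * Z * H * T / delta) / np.maximum(counts_sz, 1))
    for h in range(H - 1, -1, -1):
        for z in range(Z):
            Q[h, :, z] = R_hat[z] + P_hat[z] @ V[h + 1] + bonus[z]
        Q[h] = np.minimum(Q[h], H)  # optimistic values never exceed the horizon
        V[h] = Q[h].max(axis=1)
    return Q[:H]


def act(Q_h, state, intervention_to_class):
    """Pick the intervention whose causal class has the highest optimistic value."""
    class_values = Q_h[state]                                      # shape (Z,)
    values_per_intervention = class_values[intervention_to_class]  # shape (A,)
    return int(np.argmax(values_per_intervention))
```

Because counts, rewards, and transitions are estimated per causal class rather than per raw intervention, every intervention mapped to the same class shares statistics, so the exploration bonus shrinks at a rate governed by $Z$; this is the intuition behind the $\tilde{O}(HS\sqrt{ZT})$ bound not scaling with $A$.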
Related papers
- Monte Carlo Planning for Stochastic Control on Constrained Markov Decision Processes [1.445706856497821]
This work defines an MDP framework, the SD-MDP, which disentangles the causal structure of the MDP's transition and reward dynamics.
We derive theoretical guarantees on the estimation error of the value function under an optimal policy by allowing independent value estimation from Monte Carlo sampling.
arXiv Detail & Related papers (2024-06-23T16:22:40Z) - Settling Constant Regrets in Linear Markov Decision Processes [57.34287648914407]
We study the constant regret guarantees in reinforcement learning (RL)
We introduce an algorithm, Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs)
For an MDP characterized by a minimal suboptimality gap $\Delta$, Cert-LSVI-UCB has a cumulative regret of $\tilde{\mathcal{O}}(d^3H^5/\Delta)$ with high probability, provided that the misspecification level $\zeta$ is below $\tilde{\mathcal{O}}(\Delta / (\sqrt{d}\,\cdots))$.
arXiv Detail & Related papers (2024-04-16T17:23:19Z) - Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL)
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay among exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves a polynomial sample complexity for learning an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of the representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - Provably Efficient UCB-type Algorithms For Learning Predictive State
Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs)
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z) - A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs [117.82903457289584]
We derive a novel problem-dependent lower bound on the regret in finite-horizon Markov Decision Processes (MDPs).
We show that our lower bound is considerably smaller than in the general case and that it does not scale with the minimum action gap at all.
We show that this last result is attainable (up to $\mathrm{poly}(H)$ terms, where $H$ is the horizon) by providing a regret upper bound based on policy gaps for an optimistic algorithm.
arXiv Detail & Related papers (2021-06-24T13:46:09Z) - Improved Exploration in Factored Average-Reward MDPs [23.096751699592133]
We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP)
We introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function.
We show that for a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds in terms of the dependencies on the sizes of the $\mathcal{S}_i$'s and the involved diameter-related terms.
arXiv Detail & Related papers (2020-09-09T21:15:01Z) - Efficient Reinforcement Learning in Factored MDPs with Application to
Constrained RL [25.119552984253882]
We study reinforcement learning in episodic, factored Markov decision processes (FMDPs).
We propose an algorithm called FMDP-BF, which leverages the factorization structure of FMDP.
As an application, we study a new formulation of constrained RL, known as RL with knapsack constraints (RLwK)
arXiv Detail & Related papers (2020-08-31T02:20:41Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs).
We propose a novel algorithm that makes use of the feature mapping and obtains an $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs)
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z) - Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and
Tighter Regret Bounds for the Non-Episodic Setting [24.90164851620799]
We study reinforcement learning in non-episodic factored Markov decision processes (FMDPs)
We propose two near-optimal and oracle-efficient algorithms for FMDPs.
Our oracle-efficient algorithms outperform previously proposed near-optimal algorithms on computer network administration simulations.
arXiv Detail & Related papers (2020-02-06T15:19:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.