Intervention Efficient Algorithm for Two-Stage Causal MDPs
- URL: http://arxiv.org/abs/2111.00886v1
- Date: Mon, 1 Nov 2021 12:22:37 GMT
- Title: Intervention Efficient Algorithm for Two-Stage Causal MDPs
- Authors: Rahul Madhavan, Aurghya Maiti, Gaurav Sinha and Siddharth Barman
- Abstract summary: We study Markov Decision Processes (MDP) wherein states correspond to causal graphs that generate rewards.
In this setup, the learner's goal is to identify atomic interventions that lead to high rewards by intervening on variables at each state.
Generalizing the recent causal-bandit framework, the current work develops (simple) regret minimization guarantees for two-stage causal MDPs.
- Score: 15.838256272508357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study Markov Decision Processes (MDP) wherein states correspond to causal
graphs that stochastically generate rewards. In this setup, the learner's goal
is to identify atomic interventions that lead to high rewards by intervening on
variables at each state. Generalizing the recent causal-bandit framework, the
current work develops (simple) regret minimization guarantees for two-stage
causal MDPs, with a parallel causal graph at each state. We propose an algorithm
that achieves an instance-dependent regret bound. A key feature of our
algorithm is that it utilizes convex optimization to address the exploration
problem. We identify classes of instances wherein our regret guarantee is
essentially tight, and experimentally validate our theoretical results.
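To make the setup concrete, here is a minimal sketch (not the paper's algorithm) of evaluating atomic interventions in a single parallel causal graph: each intervention do(X_i = v) is tried with a uniform per-intervention sample budget and the empirically best one is returned. All variable names, probabilities, and the reward model below are illustrative assumptions; per the abstract, the paper instead allocates the exploration budget across the two stages via a convex program.

    import numpy as np

    rng = np.random.default_rng(0)

    n_vars = 4                                 # parent variables X_1..X_n of the reward node
    p_obs = rng.uniform(0.2, 0.8, n_vars)      # P(X_i = 1) with no intervention (assumed)
    w = rng.uniform(0.0, 1.0, n_vars)          # hidden weights mapping parents to reward probability (assumed)

    def sample_reward(do_index=None, do_value=None):
        """Draw the parents (applying one atomic intervention) and a Bernoulli reward."""
        x = (rng.uniform(size=n_vars) < p_obs).astype(float)
        if do_index is not None:
            x[do_index] = do_value             # atomic intervention do(X_i = v)
        p_reward = float(np.clip(w @ x / n_vars, 0.0, 1.0))
        return float(rng.uniform() < p_reward)

    # Enumerate the atomic interventions {do(X_i = 0), do(X_i = 1)} and estimate each
    # intervention's mean reward with a fixed per-intervention budget (uniform
    # exploration; the paper allocates this budget via a convex program instead).
    interventions = [(i, v) for i in range(n_vars) for v in (0, 1)]
    budget_per_arm = 200
    means = {
        (i, v): np.mean([sample_reward(i, v) for _ in range(budget_per_arm)])
        for (i, v) in interventions
    }

    best = max(means, key=means.get)
    print(f"estimated best atomic intervention: do(X_{best[0]} = {best[1]}) "
          f"with mean reward ~ {means[best]:.3f}")

Per the abstract, replacing this uniform budget with an allocation chosen by convex optimization over interventions is what yields the instance-dependent simple-regret guarantee.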
Related papers
- Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling [51.38330727868982]
Bidirectional Decoding (BID) is a test-time inference algorithm that bridges action chunking with closed-loop operations.
We show that BID boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z)
- Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives [16.101435842520473]
This paper studies the challenging yet important problem in POMDPs known as the (indefinite-horizon) Maximal Reachability Probability Problem.
Inspired by the success of point-based methods developed for discounted problems, we study their extensions to MRPP.
We present a novel algorithm that leverages the strengths of these techniques for efficient exploration of the belief space.
arXiv Detail & Related papers (2024-06-05T02:33:50Z)
- Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs [9.58750210024265]
We consider (stochastic) softmax policy gradient (PG) methods for bandits and Markov decision processes (MDPs).
We show that the proposed algorithm offers theoretical guarantees similar to state-of-the-art results, but does not require knowledge of oracle-like quantities.
For the multi-armed bandit setting, our techniques result in a theoretically principled PG algorithm that does not require explicit exploration, or knowledge of the reward gap, the reward distributions, or the noise.
arXiv Detail & Related papers (2024-05-21T18:12:39Z)
- Confounded Budgeted Causal Bandits [28.199741662190203]
We study the problem of learning 'good' interventions in an environment modeled by its underlying causal graph.
Good interventions refer to interventions that maximize rewards.
We propose an algorithm to minimize the cumulative regret in general causal graphs.
arXiv Detail & Related papers (2024-01-15T10:26:13Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z)
- Parallel Stochastic Mirror Descent for MDPs [72.75921150912556]
We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs).
A variant of Stochastic Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals.
We analyze this algorithm in a general case and obtain an estimate of the convergence rate that does not accumulate errors during the operation of the method.
arXiv Detail & Related papers (2021-02-27T19:28:39Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.