A Graphical Approach to State Variable Selection in Off-policy Learning
- URL: http://arxiv.org/abs/2501.00854v1
- Date: Wed, 01 Jan 2025 14:37:35 GMT
- Title: A Graphical Approach to State Variable Selection in Off-policy Learning
- Authors: Joakim Blach Andersen, Qingyuan Zhao,
- Abstract summary: We provide a set of graphical identification criteria in general decision processes.
We discuss how our results relate to the often implicit causal assumptions made in the dynamic treatment regimes and offline reinforcement learning literatures.
We present a realistic simulation study for the dynamic pricing problem encountered in container logistics.
- Score: 0.0
- License:
- Abstract: Sequential decision problems are widely studied across many areas of science. A key challenge when learning policies from historical data - a practice commonly referred to as off-policy learning - is how to ``identify'' the impact of a policy of interest when the observed data are not randomized. Off-policy learning has mainly been studied in two settings: dynamic treatment regimes (DTRs), where the focus is on controlling confounding in medical problems with short decision horizons, and offline reinforcement learning (RL), where the focus is on dimension reduction in closed systems such as games. The gap between these two well studied settings has limited the wider application of off-policy learning to many real-world problems. Using the theory for causal inference based on acyclic directed mixed graph (ADMGs), we provide a set of graphical identification criteria in general decision processes that encompass both DTRs and MDPs. We discuss how our results relate to the often implicit causal assumptions made in the DTR and RL literatures and further clarify several common misconceptions. Finally, we present a realistic simulation study for the dynamic pricing problem encountered in container logistics, and demonstrate how violations of our graphical criteria can lead to suboptimal policies.
Related papers
- An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models [20.314426291330278]
In traditional statistical learning, data points are usually assumed to be independently and identically distributed (i.i.d.)
This paper presents a contrasting viewpoint, perceiving data points as interconnected and employing a Markov reward process (MRP) for data modeling.
We reformulate the typical supervised learning as an on-policy policy evaluation problem within reinforcement learning (RL), introducing a generalized temporal difference (TD) learning algorithm as a resolution.
arXiv Detail & Related papers (2024-04-23T21:02:58Z) - Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline Data [17.991833729722288]
We propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL)
Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function.
We provide theoretical guarantees for the algorithms we propose, and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.
arXiv Detail & Related papers (2024-03-18T14:51:19Z) - Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline
Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also out-performs competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z) - Solving the flexible job-shop scheduling problem through an enhanced
deep reinforcement learning approach [1.565361244756411]
This paper introduces a new DRL method for solving the flexible job-shop scheduling problem, particularly for large instances.
The approach is based on the use of heterogeneous graph neural networks to a more informative graph representation of the problem.
arXiv Detail & Related papers (2023-10-24T10:35:08Z) - Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - The RL Perceptron: Generalisation Dynamics of Policy Learning in High Dimensions [13.774600272141761]
Reinforcement learning algorithms have proven transformative in a range of domains.
Much theory of RL has focused on discrete state spaces or worst-case analysis.
We propose a solvable high-dimensional model of RL that can capture a variety of learning protocols.
arXiv Detail & Related papers (2023-06-17T18:16:51Z) - GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP,
and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making.
We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation.
We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR.
arXiv Detail & Related papers (2022-11-03T16:42:40Z) - Offline Reinforcement Learning with Instrumental Variables in Confounded
Markov Decision Processes [93.61202366677526]
We study the offline reinforcement learning (RL) in the face of unmeasured confounders.
We propose various policy learning methods with the finite-sample suboptimality guarantee of finding the optimal in-class policy.
arXiv Detail & Related papers (2022-09-18T22:03:55Z) - Stateful Offline Contextual Policy Evaluation and Learning [88.9134799076718]
We study off-policy evaluation and learning from sequential data.
We formalize the relevant causal structure of problems such as dynamic personalized pricing.
We show improved out-of-sample policy performance in this class of relevant problems.
arXiv Detail & Related papers (2021-10-19T16:15:56Z) - Policy Information Capacity: Information-Theoretic Measure for Task
Complexity in Deep Reinforcement Learning [83.66080019570461]
We propose two environment-agnostic, algorithm-agnostic quantitative metrics for task difficulty.
We show that these metrics have higher correlations with normalized task solvability scores than a variety of alternatives.
These metrics can also be used for fast and compute-efficient optimizations of key design parameters.
arXiv Detail & Related papers (2021-03-23T17:49:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.