Minimax Weight Learning for Absorbing MDPs
- URL: http://arxiv.org/abs/2301.03183v2
- Date: Tue, 5 Sep 2023 11:11:14 GMT
- Title: Minimax Weight Learning for Absorbing MDPs
- Authors: Fengyin Li, Yuqiang Li, Xianyi Wu
- Abstract summary: We study undiscounted off-policy policy evaluation for absorbing MDPs.
We propose the MWLA algorithm, which directly estimates the expected return via the importance ratio of the state-action occupancy measure.
- Score: 0.276240219662896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning policy evaluation problems are often modeled as finite-horizon or discounted/average-reward infinite-horizon MDPs. In this paper, we study undiscounted off-policy policy evaluation for absorbing MDPs. Given a dataset consisting of i.i.d. episodes with a given truncation level, we propose the MWLA algorithm to directly estimate the expected return via the importance ratio of the state-action occupancy measure. A Mean Square Error (MSE) bound for the MWLA method is derived, and the dependence of the statistical error on the data size and the truncation level is analyzed. Computational experiments in an episodic taxi environment illustrate the performance of the MWLA algorithm.
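The abstract does not spell out the estimator, so the following is only a schematic Python sketch of the general minimax-weight-learning recipe it refers to (in the spirit of MWL-type estimators), not the paper's implementation. Given occupancy-ratio weights w[s, a] ≈ d_pi(s, a) / d_b(s, a), observed rewards are simply reweighted; the weights themselves are fit by driving a flow loss toward zero against a class of critics f. All function names and the exact loss form are assumptions.
```python
import numpy as np

def mwla_return_estimate(episodes, w):
    """Plug-in return estimate: reweight every observed reward by the
    learned occupancy ratio w[s, a] ~ d_pi(s, a) / d_b(s, a).
    episodes: truncated trajectories [(s, a, r), ...] collected i.i.d.
    under the behavior policy."""
    total = sum(w[s, a] * r for ep in episodes for s, a, r in ep)
    return total / len(episodes)  # undiscounted return per episode

def flow_loss(w, f, pi, episodes, n_actions):
    """Sketch of an MWL-style objective: a correct w makes this vanish for
    every critic f, since the reweighted data then satisfies the occupancy
    flow equation of the target policy pi. Absorption/truncation is treated
    as a transition to a terminal state with f = 0 (a simplification)."""
    f_pi = lambda s: sum(pi[s, b] * f[s, b] for b in range(n_actions))
    val, n = 0.0, len(episodes)
    for ep in episodes:
        val += f_pi(ep[0][0]) / n                    # initial-state term
        for t, (s, a, _) in enumerate(ep):
            f_next = f_pi(ep[t + 1][0]) if t + 1 < len(ep) else 0.0
            val += w[s, a] * (f_next - f[s, a]) / n  # flow-residual term
    return val  # fit w by minimizing max over f of flow_loss(...)**2
```
In the tabular case the inner maximization over critics can be solved in closed form, reducing the fit of w to a linear system; with function approximation the tables above would be replaced by parameterized classes.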
Related papers
- Scalable Policy-Based RL Algorithms for POMDPs [6.2229686397601585]
We consider an approach that solves a Partially Observable Reinforcement Learning (PORL) problem by approximating the POMDP with a finite-state Markov Decision Process (MDP) built from a finite history of recent observations.
We show that the approximation error decreases exponentially with the length of this history.
To the best of our knowledge, our finite-time bounds are the first to explicitly quantify the error introduced when applying standard TD learning to a setting where the true dynamics are not Markovian.
arXiv Detail & Related papers (2025-10-08T00:33:38Z)
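The history-based approximation in the POMDP paper above can be illustrated with a small wrapper: treat the last k observations as the state of a derived MDP and run standard TD learning on it. This is a sketch under an assumed environment interface (`reset() -> obs`, `step(a) -> (obs, reward, done)`), not the paper's code.
```python
from collections import deque

class FiniteHistoryWrapper:
    """Expose a POMDP as an approximate finite-state MDP whose state is the
    tuple of the last k observations; longer histories shrink the
    approximation error (exponentially, per the paper's bound)."""
    def __init__(self, env, k):
        self.env = env
        self.hist = deque(maxlen=k)

    def reset(self):
        self.hist.clear()
        self.hist.append(self.env.reset())
        return tuple(self.hist)            # history tuple = derived MDP state

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.hist.append(obs)
        return tuple(self.hist), reward, done
```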
- Context-Action Embedding Learning for Off-Policy Evaluation in Contextual Bandits [3.5219188193742563]
Inverse Propensity Score (IPS) weighting suffers from significant variance when the action space is large or when some parts of the context-action space are underexplored.
Recently introduced Marginalized IPS (MIPS) estimators mitigate this issue by leveraging action embeddings.
We introduce Context-Action Embedding Learning for MIPS, which learns context-action embeddings from offline data to minimize the MSE of the MIPS estimator.
arXiv Detail & Related papers (2025-08-31T00:55:55Z)
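To see why MIPS mitigates the IPS variance problem described above, compare the two estimators on logged bandit data. A minimal sketch, assuming a fixed, discrete action embedding (`embed[a]` maps each action to a cluster id); the paper instead learns context-action embeddings, which is not shown here.
```python
import numpy as np

def ips(actions, rewards, pi_t, pi_b):
    """Vanilla IPS: per-action ratios blow up on rarely logged actions.
    pi_t, pi_b: (n, n_actions) action probabilities per logged context."""
    idx = np.arange(len(actions))
    return np.mean(pi_t[idx, actions] / pi_b[idx, actions] * rewards)

def mips(actions, rewards, pi_t, pi_b, embed):
    """MIPS with a fixed discrete embedding: weight by the marginal
    probability of the logged action's embedding cluster, pooling all
    actions that share it, which shrinks the worst-case ratios."""
    w = np.empty(len(actions))
    for i, a in enumerate(actions):
        cluster = (embed == embed[a])      # actions sharing a's embedding
        w[i] = pi_t[i, cluster].sum() / pi_b[i, cluster].sum()
    return np.mean(w * rewards)
```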
- Near-Optimal Learning and Planning in Separated Latent MDPs [70.88315649628251]
We study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs).
In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs.
arXiv Detail & Related papers (2024-06-12T06:41:47Z)
- Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base queries on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the least disagree metric (LDM), defined as the smallest probability of disagreement of the predicted label.
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
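One crude reading of the LDM idea: a sample whose predicted label flips easily under small hypothesis perturbations lies close to the decision boundary, which corresponds to a small LDM. The Monte Carlo sketch below estimates a flip rate at a single perturbation scale as a rough proxy; the paper's actual LDM estimator is more refined, and the `predict`/`params` interface is hypothetical.
```python
import numpy as np

def flip_rate(predict, params, x, sigma=0.01, n_samples=200, rng=None):
    """Estimate how often the predicted label of x flips when the model
    parameters are perturbed by Gaussian noise of scale sigma. Under an
    LDM-style strategy, unlabeled samples whose labels flip most easily
    (i.e. with the smallest LDM) are queried first."""
    if rng is None:
        rng = np.random.default_rng()
    base = predict(params, x)
    flips = 0
    for _ in range(n_samples):
        noisy = [p + sigma * rng.standard_normal(p.shape) for p in params]
        flips += predict(noisy, x) != base
    return flips / n_samples
```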
- Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z)
- Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
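The variance-aware reweighting in VA-OPE can be pictured as weighted least squares inside Fitted Q-Iteration: Bellman targets with high estimated variance are down-weighted. The sketch below shows one illustrative iteration with linear features; how `var_hat` is estimated, and the paper's exact weighting scheme, are not reproduced here.
```python
import numpy as np

def va_fqi_step(theta, phi, rewards, phi_next, var_hat, gamma=0.99, reg=1e-3):
    """One variance-weighted Fitted Q-Iteration step with linear features.
    phi:      (n, d) features of logged (s, a) pairs
    phi_next: (n, d) features of (s', pi(s')) under the target policy
    var_hat:  (n,)  estimated variance of each Bellman target"""
    y = rewards + gamma * phi_next @ theta       # Bellman targets
    w = 1.0 / np.maximum(var_hat, 1e-6)          # inverse-variance weights
    A = phi.T @ (w[:, None] * phi) + reg * np.eye(phi.shape[1])
    return np.linalg.solve(A, phi.T @ (w * y))   # weighted ridge solve
```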
- Meta Learning in the Continuous Time Limit [36.23467808322093]
We establish the ordinary differential equation (ODE) that underlies the training dynamics of Model-Agnostic Meta-Learning (MAML).
We propose a new BI-MAML training algorithm that significantly reduces the computational burden associated with existing MAML training methods.
arXiv Detail & Related papers (2020-06-19T01:47:31Z)
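For context on the dynamics being analyzed, here is a minimal first-order MAML step (a common simplification, not the paper's BI-MAML; the `task_grads` interface is assumed). The continuous-time limit studied in the paper arises by letting the outer step size go to zero, turning this discrete update into a gradient-flow ODE.
```python
import numpy as np

def fomaml_step(theta, task_grads, alpha=0.01, beta=0.001):
    """First-order MAML update. task_grads: callables returning each task's
    loss gradient at given parameters. The inner step adapts to each task;
    the outer step moves theta using gradients at the adapted points.
    As beta -> 0 this approaches the ODE
    d(theta)/dt = -mean_i grad_i(theta - alpha * grad_i(theta))."""
    meta_grad = np.zeros_like(theta)
    for g in task_grads:
        adapted = theta - alpha * g(theta)   # inner adaptation step
        meta_grad += g(adapted)              # first-order meta-gradient
    return theta - beta * meta_grad / len(task_grads)
```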
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.