Variational Latent Branching Model for Off-Policy Evaluation
- URL: http://arxiv.org/abs/2301.12056v3
- Date: Wed, 1 Feb 2023 01:57:10 GMT
- Title: Variational Latent Branching Model for Off-Policy Evaluation
- Authors: Qitong Gao, Ge Gao, Min Chi, Miroslav Pajic
- Abstract summary: We propose a variational latent branching model (VLBM) to learn the transition function of Markov decision processes (MDPs).
We introduce a branching architecture to improve the model's robustness against randomly initialized model weights.
We show that the VLBM outperforms existing state-of-the-art OPE methods in general.
- Score: 23.073461349048834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model-based methods have recently shown great potential for off-policy
evaluation (OPE); offline trajectories induced by behavioral policies are
fitted to transitions of Markov decision processes (MDPs), which are used to
rollout simulated trajectories and estimate the performance of policies.
Model-based OPE methods face two key challenges. First, as offline trajectories
are usually fixed, they tend to cover limited state and action space. Second,
the performance of model-based methods can be sensitive to the initialization
of their parameters. In this work, we propose the variational latent branching
model (VLBM) to learn the transition function of MDPs by formulating the
environmental dynamics as a compact latent space, from which the next states
and rewards are then sampled. Specifically, the VLBM leverages and extends the
variational inference framework with recurrent state alignment (RSA), which is
designed to capture as much information as possible from the limited training
data by smoothing the information flow between the variational (encoding) and
generative (decoding) parts of the VLBM. Moreover, we also introduce the
branching architecture to improve the model's robustness against randomly
initialized model weights. The effectiveness of the VLBM is evaluated on the
deep OPE (DOPE) benchmark, whose training trajectories are designed to
produce varied coverage of the state-action space. We show that the VLBM
outperforms existing state-of-the-art OPE methods in general.
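The abstract's two core ideas, encoding environment dynamics into a compact latent space and averaging over independently initialized decoder "branches" to reduce sensitivity to weight initialization, can be illustrated with a minimal numpy sketch. All class and function names here are illustrative stand-ins, not the paper's actual implementation, and the deterministic encoder replaces the variational machinery for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

class BranchingDecoder:
    """One decoder branch: maps a latent z to a predicted next state.
    A linear layer with tanh stands in for a learned generative network."""
    def __init__(self, z_dim, s_dim, rng):
        self.W = rng.normal(scale=0.1, size=(z_dim, s_dim))
        self.b = np.zeros(s_dim)

    def predict(self, z):
        return np.tanh(z @ self.W + self.b)

class LatentBranchingDynamics:
    """Encode (state, action) into a latent z, then decode the next state
    through K independently initialized branches and average their outputs,
    so no single random initialization dominates the prediction."""
    def __init__(self, s_dim, a_dim, z_dim, K, rng):
        self.We = rng.normal(scale=0.1, size=(s_dim + a_dim, z_dim))
        self.branches = [BranchingDecoder(z_dim, s_dim, rng) for _ in range(K)]

    def step(self, s, a):
        z = np.tanh(np.concatenate([s, a]) @ self.We)  # deterministic encoder for brevity
        preds = np.stack([br.predict(z) for br in self.branches])
        return preds.mean(axis=0)  # branch-averaged next-state prediction

model = LatentBranchingDynamics(s_dim=3, a_dim=1, z_dim=4, K=5, rng=rng)
s_next = model.step(np.zeros(3), np.ones(1))
```

In the paper the branches are trained jointly under a variational objective and the RSA loss aligns the recurrent states of the encoder and decoder; this sketch only shows the structural role of branching.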
Related papers
- Optimization of geological carbon storage operations with multimodal latent dynamic model and deep reinforcement learning [1.8549313085249324]
This study introduces the multimodal latent dynamic (MLD) model, a deep learning framework for fast flow prediction and well control optimization in GCS.
Unlike existing models, the MLD supports diverse input modalities, allowing comprehensive data interactions.
The approach outperforms traditional methods, achieving the highest NPV while reducing computational resources by over 60%.
arXiv Detail & Related papers (2024-06-07T01:30:21Z) - Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state-of-the-art in offline reinforcement learning (RL).
QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM).
Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
arXiv Detail & Related papers (2024-05-27T12:12:39Z) - Learning non-Markovian Decision-Making from State-only Sequences [57.20193609153983]
We develop a model-based imitation-learning method for state-only sequences using a non-Markov Decision Process (nMDP).
We demonstrate the efficacy of the proposed method in a path planning task with non-Markovian constraints.
arXiv Detail & Related papers (2023-06-27T02:26:01Z) - When to Update Your Model: Constrained Model-based Reinforcement
Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL).
Our follow-up derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z) - Data Augmentation through Expert-guided Symmetry Detection to Improve
Performance in Offline Reinforcement Learning [0.0]
Offline estimation of the dynamical model of a Markov Decision Process (MDP) is a non-trivial task.
Recent works showed that an expert-guided pipeline relying on Density Estimation methods effectively detects this structure in deterministic environments.
We show that the former results lead to a performance improvement when solving the learned MDP and then applying the optimized policy in the real environment.
arXiv Detail & Related papers (2021-12-18T14:32:32Z) - Revisiting Design Choices in Model-Based Offline Reinforcement Learning [39.01805509055988]
Offline reinforcement learning enables agents to leverage large pre-collected datasets of environment transitions to learn control policies.
This paper compares design choices and proposes novel evaluation protocols to investigate their interaction with other hyperparameters, such as the number of models or the imaginary rollout horizon.
arXiv Detail & Related papers (2021-10-08T13:51:34Z) - Autoregressive Dynamics Models for Offline Policy Evaluation and
Optimization [60.73540999409032]
We show that expressive autoregressive dynamics models generate each dimension of the next state and reward sequentially, conditioned on the previously generated dimensions.
We also show that autoregressive dynamics models are useful for offline policy optimization by serving as a way to enrich the replay buffer.
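The autoregressive factorization described in this entry, generating each dimension of the next state conditioned on the state, action, and dimensions produced so far, can be sketched as follows. The per-dimension linear maps are hypothetical stand-ins for the learned conditional networks:

```python
import numpy as np

def autoregressive_next_state(s, a, weights, rng):
    """Generate next-state dimensions one at a time; weights[i] maps the
    context (s, a, dims generated so far) to the mean of dimension i.
    Linear conditionals stand in for learned networks."""
    generated = []
    for W in weights:
        context = np.concatenate([s, a, np.array(generated)])
        mean = context @ W                              # conditional mean of dim i
        generated.append(mean + 0.01 * rng.normal())    # sample dimension i
    return np.array(generated)

rng = np.random.default_rng(1)
s_dim, a_dim = 3, 1
# weights[i] has one extra input per previously generated dimension
weights = [rng.normal(scale=0.1, size=(s_dim + a_dim + i,)) for i in range(s_dim)]
s_next = autoregressive_next_state(np.zeros(s_dim), np.ones(a_dim), weights, rng)
```

Note how each conditional's input dimension grows with the number of already-generated dimensions; this chain-rule factorization is what lets the model capture dependencies between state dimensions.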
arXiv Detail & Related papers (2021-04-28T16:48:44Z) - Modular Deep Reinforcement Learning for Continuous Motion Planning with
Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between the LDGBA and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z) - COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as, or better than, prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.