BATS: Best Action Trajectory Stitching
- URL: http://arxiv.org/abs/2204.12026v1
- Date: Tue, 26 Apr 2022 01:48:32 GMT
- Title: BATS: Best Action Trajectory Stitching
- Authors: Ian Char, Viraj Mehta, Adam Villaflor, John M. Dolan, Jeff Schneider
- Abstract summary: We introduce an algorithm which forms a tabular Markov Decision Process (MDP) over the logged data by adding new transitions to the dataset.
We prove that this property allows one to make upper and lower bounds on the value function up to appropriate distance metrics.
We show an example in which simply behavior cloning the optimal policy of the MDP created by our algorithm avoids the unwanted behavior caused by uniformly constraining the learned policy to the entire dataset.
- Score: 22.75880303352508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The problem of offline reinforcement learning focuses on learning a good
policy from a log of environment interactions. Past efforts for developing
algorithms in this area have revolved around introducing constraints to online
reinforcement learning algorithms to ensure the actions of the learned policy
are constrained to the logged data. In this work, we explore an alternative
approach by planning on the fixed dataset directly. Specifically, we introduce
an algorithm which forms a tabular Markov Decision Process (MDP) over the
logged data by adding new transitions to the dataset. We do this by using
learned dynamics models to plan short trajectories between states. Since exact
value iteration can be performed on this constructed MDP, it becomes easy to
identify which trajectories are advantageous to add to the MDP. Crucially,
since most transitions in this MDP come from the logged data, trajectories from
the MDP can be rolled out for long periods with confidence. We prove that this
property allows one to make upper and lower bounds on the value function up to
appropriate distance metrics. Finally, we demonstrate empirically how
algorithms that uniformly constrain the learned policy to the entire dataset
can result in unwanted behavior, and we show an example in which simply
behavior cloning the optimal policy of the MDP created by our algorithm avoids
this problem.
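The abstract describes the construction only in prose; the sketch below is a minimal illustration of the core idea under simplifying assumptions: logged states indexed by integers, and a hypothetical `plan_short_trajectory` helper standing in for the learned dynamics model's planner. It is not the authors' implementation.

```python
import numpy as np

# Minimal sketch of the idea in the abstract: form a tabular MDP over the logged
# states, add "stitch" transitions proposed by a learned dynamics model, and run
# exact value iteration on the resulting small MDP.


def build_stitched_mdp(dataset, n_states, plan_short_trajectory):
    """dataset: iterable of logged transitions (s, a, r, s_next) with integer state ids.

    Returns transitions[i] = list of (next_state, reward) edges available from state i.
    """
    transitions = [[] for _ in range(n_states)]
    for (s, _a, r, s_next) in dataset:            # edges taken directly from the log
        transitions[s].append((s_next, r))
    for i in range(n_states):                     # candidate stitches between logged states
        for j in range(n_states):
            if i == j:
                continue
            est_reward = plan_short_trajectory(i, j)   # assumed helper: estimated reward of a
            if est_reward is not None:                 # short model-planned path i -> j, or None
                transitions[i].append((j, est_reward))
    return transitions


def value_iteration(transitions, gamma=0.99, tol=1e-6):
    """Exact value iteration on the small, tabular stitched MDP."""
    values = np.zeros(len(transitions))
    while True:
        new_values = np.array([
            max((r + gamma * values[j] for j, r in edges), default=0.0)
            for edges in transitions
        ])
        if np.max(np.abs(new_values - values)) < tol:
            return new_values
        values = new_values
```

Per the abstract, the best actions identified on this tabular MDP are then distilled into a reactive policy by behavior cloning the chosen state-action pairs; because the stitched segments are short, most of any rolled-out trajectory still consists of logged transitions, which is the property that enables the stated value bounds.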
Related papers
- Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory
Weighting [29.21380944341589]
We show that state-of-the-art offline RL algorithms are overly restrained by low-return trajectories and fail to exploit trajectories to the fullest.
We propose a reweighted sampling strategy, based on trajectory weighting, that may be combined with any offline RL algorithm.
We empirically show that while CQL, IQL, and TD3+BC alone achieve only a part of this potential policy improvement, the same algorithms combined with the reweighted sampling strategy fully exploit the dataset.
arXiv Detail & Related papers (2023-06-22T17:58:02Z)
- Safe Policy Improvement for POMDPs via Finite-State Controllers
We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs).
SPI methods require access neither to a model nor to the environment itself, and aim to reliably improve the behavior policy in an offline manner.
We show that the improved policy, converted into a new finite-state controller (FSC) for the (unknown) POMDP, outperforms the behavior policy with high probability.
arXiv Detail & Related papers (2023-01-12T11:22:54Z)
- Model-based trajectory stitching for improved behavioural cloning and its applications [7.462336024223669]
Trajectory Stitching (TS) generates new trajectories by 'stitching' pairs of states that were disconnected in the original data.
We demonstrate that the iterative process of replacing old trajectories with new ones incrementally improves the underlying behavioural policy.
arXiv Detail & Related papers (2022-12-08T14:18:04Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose MISA, a novel framework that approaches offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing a lower bound of this mutual information is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce three different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- Continuous MDP Homomorphisms and Homomorphic Policy Gradient [51.25171126424949]
We extend the definition of MDP homomorphisms to encompass continuous actions in continuous state spaces.
We propose an actor-critic algorithm that is able to learn the policy and the MDP homomorphism map simultaneously.
arXiv Detail & Related papers (2022-09-15T15:26:49Z)
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained as a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
- Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- Semi-Markov Offline Reinforcement Learning for Healthcare [57.15307499843254]
We introduce three offline RL algorithms, namely, SDQN, SDDQN, and SBCQ.
We experimentally demonstrate that only these algorithms learn the optimal policy in variable-time environments.
We apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention.
arXiv Detail & Related papers (2022-03-17T14:51:21Z)
- Expert-Guided Symmetry Detection in Markov Decision Processes [0.0]
We propose a paradigm that aims to detect transformations of the state-action space under which the MDP dynamics are invariant.
The results show that the model distributional shift is reduced when the dataset is augmented with the data obtained by using the detected symmetries.
arXiv Detail & Related papers (2021-11-19T16:12:30Z)
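On the last entry above: the augmentation step it relies on, adding transitions transformed by a detected invariance, can be sketched generically as below. The sign-flip transformation is purely an assumed placeholder for whatever symmetry the detection procedure validates; it is not taken from that paper.

```python
import numpy as np

# Illustrative only: augment an offline dataset with transitions produced by a
# detected symmetry of the dynamics. The reflection below is a stand-in example.


def reflect_state(state):
    state = np.asarray(state, dtype=float).copy()
    state[0] = -state[0]              # e.g., mirror the first state coordinate
    return state


def reflect_action(action):
    return -np.asarray(action, dtype=float)


def augment_with_symmetry(dataset):
    """dataset: list of (state, action, reward, next_state) transitions."""
    augmented = list(dataset)
    for (s, a, r, s_next) in dataset:
        augmented.append((reflect_state(s), reflect_action(a), r, reflect_state(s_next)))
    return augmented
```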