Dichotomy of Control: Separating What You Can Control from What You Cannot
- URL: http://arxiv.org/abs/2210.13435v1
- Date: Mon, 24 Oct 2022 17:49:56 GMT
- Title: Dichotomy of Control: Separating What You Can Control from What You Cannot
- Authors: Mengjiao Yang, Dale Schuurmans, Pieter Abbeel, Ofir Nachum
- Abstract summary: We propose a future-conditioned supervised learning framework that separates mechanisms within a policy's control (actions) from those beyond a policy's control (environment stochasticity).
We show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior.
- Score: 129.62135987416164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Future- or return-conditioned supervised learning is an emerging paradigm for
offline reinforcement learning (RL), where the future outcome (i.e., return)
associated with an observed action sequence is used as input to a policy
trained to imitate those same actions. While return-conditioning is at the
heart of popular algorithms such as decision transformer (DT), these methods
tend to perform poorly in highly stochastic environments, where an occasional
high return can arise from randomness in the environment rather than the
actions themselves. Such situations can lead to a learned policy that is
inconsistent with its conditioning inputs; i.e., using the policy to act in the
environment, when conditioning on a specific desired return, leads to a
distribution of real returns that is wildly different than desired. In this
work, we propose the dichotomy of control (DoC), a future-conditioned
supervised learning framework that separates mechanisms within a policy's
control (actions) from those beyond a policy's control (environment
stochasticity). We achieve this separation by conditioning the policy on a
latent variable representation of the future, and designing a mutual
information constraint that removes any information from the latent variable
associated with randomness in the environment. Theoretically, we show that DoC
yields policies that are consistent with their conditioning inputs, ensuring
that conditioning a learned policy on a desired high-return future outcome will
correctly induce high-return behavior. Empirically, we show that DoC is able to
achieve significantly better performance than DT on environments that have
highly stochastic rewards and transitions.
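
The mechanism described above (a policy conditioned on a latent representation of the future, with a constraint that keeps environment randomness out of that latent) can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical rendering of that idea, not the authors' implementation: the module names, tensor shapes, the `env_noise` proxy for environment randomness, and the adversarial predictor used as a stand-in for the paper's mutual-information constraint are all assumptions.

```python
# Minimal sketch of the dichotomy-of-control idea (not the authors' code).
# A latent z summarizes the future trajectory; the policy pi(a | s, z)
# imitates observed actions conditioned on z; an adversarial predictor
# (a stand-in for the paper's mutual-information constraint) tries to
# recover environment randomness from z, and the encoder is trained so
# that it cannot. All names, shapes, and weights below are illustrative.
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM, Z_DIM, HIDDEN = 8, 4, 16, 64

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, HIDDEN), nn.ReLU(),
                         nn.Linear(HIDDEN, out_dim))

future_encoder  = mlp(STATE_DIM + ACT_DIM, Z_DIM)  # q(z | future), pooled over steps
policy          = mlp(STATE_DIM + Z_DIM, ACT_DIM)  # pi(a | s, z)
noise_adversary = mlp(Z_DIM, 1)                    # tries to read env noise from z

def doc_losses(batch, leak_weight=0.1):
    """Return (policy/encoder loss, adversary loss) for one batch.

    batch keys (all torch tensors):
      state          [B, STATE_DIM]     current state
      action         [B, ACT_DIM]       observed action (imitation target)
      future_states  [B, T, STATE_DIM]  remainder of the trajectory
      future_actions [B, T, ACT_DIM]
      env_noise      [B, 1]             proxy for uncontrollable randomness
                                        (e.g., a reward residual) -- hypothetical
    """
    fut = torch.cat([batch["future_states"], batch["future_actions"]], dim=-1)
    z = future_encoder(fut).mean(dim=1)             # pool per-step encodings

    # (1) Future-conditioned behavior cloning: imitate the logged action.
    pred_action = policy(torch.cat([batch["state"], z], dim=-1))
    bc_loss = ((pred_action - batch["action"]) ** 2).mean()

    # (2) Adversary learns to predict environment noise from a detached z.
    adversary_loss = ((noise_adversary(z.detach()) - batch["env_noise"]) ** 2).mean()

    # (3) Information-removal penalty: the encoder is rewarded when the
    #     adversary's error is large, i.e. z carries no noise information.
    leak_penalty = -((noise_adversary(z) - batch["env_noise"]) ** 2).mean()

    return bc_loss + leak_weight * leak_penalty, adversary_loss
```

In this sketch the two returned losses would be minimized with separate optimizers, one over the policy and encoder parameters and one over the adversary, so the latent is pushed to retain only the controllable aspects of the future.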
Related papers
- Decision Making in Non-Stationary Environments with Policy-Augmented Search [9.000981144624507]
We introduce Policy-Augmented Monte Carlo Tree Search (PA-MCTS).
It combines action-value estimates from an out-of-date policy with an online search using an up-to-date model of the environment.
We prove theoretical results showing conditions under which PA-MCTS selects the one-step optimal action and also bound the error accrued while following PA-MCTS as a policy.
arXiv Detail & Related papers (2024-01-06T11:51:50Z)
- Coherent Soft Imitation Learning [17.345411907902932]
Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward.
This work derives an imitation method that captures the strengths of both BC and IRL.
arXiv Detail & Related papers (2023-05-25T21:54:22Z)
- Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper learns diverse policies from the history of state-action pairs in a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z)
- Robust Deep Reinforcement Learning for Quadcopter Control [0.8687092759073857]
In this work, we use Robust Markov Decision Processes (RMDP) to train the drone control policy.
It uses pessimistic optimization to handle the potential gap that arises when the policy is transferred from one environment to another.
The trained control policy is tested on the task of quadcopter positional control.
arXiv Detail & Related papers (2021-11-06T16:35:13Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids learning merely mediocre behavior on mixed datasets but is also competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Deep Reinforcement Learning amidst Lifelong Non-Stationarity [67.24635298387624]
We show that an off-policy RL algorithm can reason about and tackle lifelong non-stationarity.
Our method leverages latent variable models to learn a representation of the environment from current and past experiences.
We also introduce several simulation environments that exhibit lifelong non-stationarity, and empirically find that our approach substantially outperforms approaches that do not reason about environment shift.
arXiv Detail & Related papers (2020-06-18T17:34:50Z)
- Evolutionary Stochastic Policy Distillation [139.54121001226451]
We propose a new method called Evolutionary Stochastic Policy Distillation (ESPD) to solve GCRS tasks.
ESPD enables a target policy to learn from a series of its variants through the technique of policy distillation (PD).
The experiments based on the MuJoCo control suite show the high learning efficiency of the proposed method.
arXiv Detail & Related papers (2020-04-27T16:19:25Z)