Knowledge Transfer from Teachers to Learners in Growing-Batch
Reinforcement Learning
- URL: http://arxiv.org/abs/2305.03870v2
- Date: Tue, 9 May 2023 22:25:00 GMT
- Title: Knowledge Transfer from Teachers to Learners in Growing-Batch
Reinforcement Learning
- Authors: Patrick Emedom-Nnamdi, Abram L. Friesen, Bobak Shahriari, Nando de
Freitas, Matt W. Hoffman
- Abstract summary: Control policies in real-world domains are typically trained offline from previously logged data or in a growing-batch manner.
In this setting a fixed policy is deployed to the environment and used to gather an entire batch of new data before being aggregated with past batches and used to update the policy.
While a limited number of such cycles is feasible in real-world domains, the quality and diversity of the resulting data are much lower than in the standard continually-interacting approach.
- Score: 8.665235113831685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard approaches to sequential decision-making exploit an agent's ability
to continually interact with its environment and improve its control policy.
However, due to safety, ethical, and practicality constraints, this type of
trial-and-error experimentation is often infeasible in many real-world domains
such as healthcare and robotics. Instead, control policies in these domains are
typically trained offline from previously logged data or in a growing-batch
manner. In this setting a fixed policy is deployed to the environment and used
to gather an entire batch of new data before being aggregated with past batches
and used to update the policy. This improvement cycle can then be repeated
multiple times. While a limited number of such cycles is feasible in real-world
domains, the quality and diversity of the resulting data are much lower than in
the standard continually-interacting approach. However, data collection in
these domains is often performed in conjunction with human experts, who are
able to label or annotate the collected data. In this paper, we first explore
the trade-offs present in this growing-batch setting, and then investigate how
information provided by a teacher (i.e., demonstrations, expert actions, and
gradient information) can be leveraged at training time to mitigate the sample
complexity and coverage requirements for actor-critic methods. We validate our
contributions on tasks from the DeepMind Control Suite.
Related papers
- Offline Reinforcement Learning from Datasets with Structured Non-Stationarity [50.35634234137108]
Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy.
We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode.
We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation.
arXiv Detail & Related papers (2024-05-23T02:41:36Z) - Where is the Truth? The Risk of Getting Confounded in a Continual World [21.862370510786004]
A dataset is confounded if it is most easily solved via a spurious correlation, which fails to generalize to new data.
In a continual learning setting where confounders may vary in time across tasks, the challenge of mitigating the effect of confounders far exceeds the standard forgetting problem.
arXiv Detail & Related papers (2024-02-09T14:24:18Z) - Generative appearance replay for continual unsupervised domain
adaptation [4.623578780480946]
GarDA is a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data.
We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
arXiv Detail & Related papers (2023-01-03T17:04:05Z) - Let Offline RL Flow: Training Conservative Agents in the Latent Space of
Normalizing Flows [58.762959061522736]
offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions.
We build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model.
We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms.
arXiv Detail & Related papers (2022-11-20T21:57:10Z) - Data augmentation for efficient learning from parametric experts [88.33380893179697]
We focus on what we call the policy cloning setting, in which we use online or offline queries of an expert to inform the behavior of a student policy.
Our approach, augmented policy cloning (APC), uses synthetic states to induce feedback-sensitivity in a region around sampled trajectories.
We achieve highly data-efficient transfer of behavior from an expert to a student policy for high-degrees-of-freedom control problems.
arXiv Detail & Related papers (2022-05-23T16:37:16Z) - Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z) - Constructing a Good Behavior Basis for Transfer using Generalized Policy
Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z) - Learning without Knowing: Unobserved Context in Continuous Transfer
Reinforcement Learning [16.814772057210366]
We consider a transfer Reinforcement Learning problem in continuous state and action spaces under unobserved contextual information.
Our goal is to use the context-aware expert data to learn an optimal context-unaware policy for the learner using only a few new data samples.
arXiv Detail & Related papers (2021-06-07T17:49:22Z) - Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z) - Keep Doing What Worked: Behavioral Modelling Priors for Offline
Reinforcement Learning [25.099754758455415]
Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed data-set of environment interactions is available.
Standard off-policy algorithms fail in the batch setting for continuous control.
arXiv Detail & Related papers (2020-02-19T19:21:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.