Learning without Knowing: Unobserved Context in Continuous Transfer Reinforcement Learning
- URL: http://arxiv.org/abs/2106.03833v1
- Date: Mon, 7 Jun 2021 17:49:22 GMT
- Title: Learning without Knowing: Unobserved Context in Continuous Transfer Reinforcement Learning
- Authors: Chenyu Liu, Yan Zhang, Yi Shen and Michael M. Zavlanos
- Abstract summary: We consider a transfer Reinforcement Learning problem in continuous state and action spaces under unobserved contextual information.
Our goal is to use the context-aware expert data to learn an optimal context-unaware policy for the learner using only a few new data samples.
- Score: 16.814772057210366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider a transfer Reinforcement Learning (RL) problem in
continuous state and action spaces, under unobserved contextual information.
For example, the context can represent the mental view of the world that an
expert agent has formed through past interactions with this world. We assume
that this context is not accessible to a learner agent who can only observe the
expert data. Then, our goal is to use the context-aware expert data to learn an
optimal context-unaware policy for the learner using only a few new data
samples. Such problems are typically solved using imitation learning that
assumes that both the expert and learner agents have access to the same
information. However, if the learner does not know the expert context, using
the expert data alone will result in a biased learner policy and will require
many new data samples to improve. To address this challenge, in this paper, we
formulate the learning problem as a causal bound-constrained Multi-Armed-Bandit
(MAB) problem. The arms of this MAB correspond to a set of basis policy
functions that can be initialized in an unsupervised way using the expert data
and represent the different expert behaviors affected by the unobserved
context. On the other hand, the MAB constraints correspond to causal bounds on
the accumulated rewards of these basis policy functions that we also compute
from the expert data. The solution to this MAB allows the learner agent to
select the best basis policy and improve it online. Moreover, the use of causal
bounds reduces the exploration variance and, therefore, improves the learning
rate. We provide numerical experiments on an autonomous driving example that
show that our proposed transfer RL method improves the learner's policy faster
than existing imitation learning methods and enjoys much lower variance during
training.
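As a rough illustration of the bandit formulation described above, the sketch below shows a UCB-style selection over basis policies whose indices are clipped to causal bounds. All names (`CausalBoundUCB`, `lower`, `upper`) are hypothetical; the paper's actual bound computation and online policy-improvement step are not reproduced here.

```python
import numpy as np

class CausalBoundUCB:
    """UCB over basis policies with indices clipped to causal bounds.

    Hypothetical sketch: `lower` and `upper` stand in for causal bounds on
    each basis policy's expected return, assumed precomputed from expert data.
    """

    def __init__(self, lower, upper):
        self.lower = np.asarray(lower, dtype=float)
        self.upper = np.asarray(upper, dtype=float)
        self.counts = np.zeros(len(self.lower))
        self.means = np.zeros(len(self.lower))
        self.t = 0

    def select(self):
        self.t += 1
        with np.errstate(divide="ignore", invalid="ignore"):
            bonus = np.sqrt(2.0 * np.log(self.t) / self.counts)
        # Untried arms get an infinite index so each basis policy is tried once.
        index = np.where(self.counts > 0, self.means + bonus, np.inf)
        # Clip optimism to the causal upper bounds and prune arms whose upper
        # bound is dominated by the best lower bound.
        index = np.minimum(index, self.upper)
        index = np.where(self.upper >= self.lower.max(), index, -np.inf)
        return int(np.argmax(index))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

In this reading, arms whose upper causal bound falls below the best lower bound can never be optimal and are never explored, which is the mechanism the abstract credits for the reduced exploration variance.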
Related papers
- Knowledge Transfer from Teachers to Learners in Growing-Batch Reinforcement Learning [8.665235113831685]
Control policies in real-world domains are typically trained offline from previously logged data or in a growing-batch manner.
In this setting, a fixed policy is deployed to the environment and used to gather an entire batch of new data before being aggregated with past batches and used to update the policy.
While only a limited number of such cycles is feasible in real-world domains, the quality and diversity of the resulting data are much lower than in the standard continually-interacting approach.
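For reference, the growing-batch pattern described in this summary amounts to a short loop; the helper names below (`collect_batch`, `update_policy`) are placeholders, not the paper's code.

```python
def growing_batch_training(env, policy, collect_batch, update_policy,
                           num_cycles=5, episodes_per_cycle=100):
    """Deploy a fixed policy, gather a whole batch, aggregate, update, repeat."""
    dataset = []
    for _ in range(num_cycles):
        # The deployed policy stays fixed while an entire batch is collected.
        dataset.extend(collect_batch(env, policy, episodes_per_cycle))
        # The policy is then updated offline on all data gathered so far.
        policy = update_policy(policy, dataset)
    return policy
```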
arXiv Detail & Related papers (2023-05-05T22:55:34Z)
- Deconfounding Imitation Learning with Variational Inference [19.99248795957195]
Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent.
This is because partial observability gives rise to hidden confounders in the causal graph.
We propose to train a variational inference model to infer the expert's latent information and use it to train a latent-conditional policy.
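A compact sketch of that latent-conditional idea, under assumed module names and a simple Gaussian latent (not the authors' implementation): an inference network encodes an expert trajectory into a latent context, and the policy conditions on it.

```python
import torch
import torch.nn as nn

class LatentConditionalImitation(nn.Module):
    """Infer a latent context from a trajectory, then condition the policy on it."""

    def __init__(self, obs_dim, act_dim, latent_dim=8, hidden=128):
        super().__init__()
        # Encoder q(z | trajectory): a GRU over (obs, act) pairs.
        self.encoder = nn.GRU(obs_dim + act_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        # Latent-conditional policy pi(a | obs, z).
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs_seq, act_seq):
        _, h = self.encoder(torch.cat([obs_seq, act_seq], dim=-1))
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        z_seq = z.unsqueeze(1).expand(-1, obs_seq.size(1), -1)
        pred_act = self.policy(torch.cat([obs_seq, z_seq], dim=-1))
        # Behavior-cloning loss plus a KL term keeping q(z | tau) close to N(0, I).
        recon = ((pred_act - act_seq) ** 2).mean()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return recon + 1e-3 * kl
```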
arXiv Detail & Related papers (2022-11-04T18:00:02Z)
- Data augmentation for efficient learning from parametric experts [88.33380893179697]
We focus on what we call the policy cloning setting, in which we use online or offline queries of an expert to inform the behavior of a student policy.
Our approach, augmented policy cloning (APC), uses synthetic states to induce feedback-sensitivity in a region around sampled trajectories.
We achieve highly data-efficient transfer of behavior from an expert to a student policy for high-degree-of-freedom control problems.
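One plausible reading of the synthetic-state idea, sketched with placeholder names (this is not the APC algorithm itself): perturb states drawn from expert trajectories and query the expert at the perturbed states, so the student also sees corrective behavior slightly off the demonstrated path.

```python
import numpy as np

def augmented_policy_cloning_batch(states, expert_policy, noise_scale=0.05,
                                   copies_per_state=4):
    """Build a cloning batch with synthetic states around sampled trajectories.

    Hypothetical sketch: the expert is queried at perturbed states so the
    student learns how the expert reacts in a region around the demonstrations.
    """
    synth_states = np.repeat(states, copies_per_state, axis=0)
    synth_states += noise_scale * np.random.randn(*synth_states.shape)
    all_states = np.concatenate([states, synth_states], axis=0)
    all_actions = np.stack([expert_policy(s) for s in all_states])
    return all_states, all_actions  # regression targets for the student policy
```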
arXiv Detail & Related papers (2022-05-23T16:37:16Z)
- Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior.
The retrieval process is trained to retrieve information from the dataset that may be useful in the current context.
We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
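A loose sketch of the retrieval step (illustrative only: the actual method trains the retrieval process end to end, whereas here it is plain cosine similarity over a fixed memory of past experiences):

```python
import numpy as np

def retrieve_neighbors(query_embedding, memory_keys, memory_values, k=5):
    """Return the k past experiences whose keys are closest to the query.

    `memory_keys` is an (M, D) array of experience embeddings and
    `memory_values` the corresponding (M, ...) array of stored experiences.
    """
    q = query_embedding / (np.linalg.norm(query_embedding) + 1e-8)
    keys = memory_keys / (np.linalg.norm(memory_keys, axis=1, keepdims=True) + 1e-8)
    scores = keys @ q
    top = np.argsort(-scores)[:k]
    return memory_values[top]  # concatenated with the agent's state downstream
```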
arXiv Detail & Related papers (2022-02-17T02:44:05Z)
- Knowledge-driven Active Learning [70.37119719069499]
Active learning strategies aim at minimizing the amount of labelled data required to train a Deep Learning model.
Most active strategies are based on uncertainty-driven sample selection and are often restricted to samples lying close to the decision boundary.
Here, we propose taking common domain knowledge into account so that non-expert users can train a model with fewer samples.
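For context, the uncertainty-based selection that this summary contrasts with can be sketched in a few lines (a generic baseline, not the knowledge-driven method proposed in the paper):

```python
import numpy as np

def entropy_based_query(probs, budget=10):
    """Pick the unlabelled samples the model is least sure about.

    `probs` is an (N, C) array of predicted class probabilities over the
    unlabelled pool; the highest-entropy samples are sent for labelling.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:budget]  # indices of the most uncertain samples
```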
arXiv Detail & Related papers (2021-10-15T06:11:53Z)
- On Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning [69.48387059607387]
We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning.
We analyze the limitations of learning from confounded expert data with and without external reward.
We validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks.
arXiv Detail & Related papers (2021-10-13T07:31:31Z)
- Exploring Bayesian Deep Learning for Urgent Instructor Intervention Need in MOOC Forums [58.221459787471254]
Massive Open Online Courses (MOOCs) have become a popular choice for e-learning thanks to their great flexibility.
Due to large numbers of learners and their diverse backgrounds, it is taxing to offer real-time support.
With the large volume of posts and high workloads for MOOC instructors, it is unlikely that the instructors can identify all learners requiring intervention.
This paper is the first to explore Bayesian deep learning on learners' text posts, using two methods: Monte Carlo Dropout and Variational Inference.
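A minimal sketch of the Monte Carlo Dropout idea mentioned here (generic, not the paper's model): dropout is kept active at prediction time, and the spread over repeated passes serves as an uncertainty estimate, e.g. for flagging posts that may need instructor attention.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, passes: int = 20):
    """Monte Carlo Dropout: keep dropout active at test time and average."""
    model.train()  # keeps nn.Dropout layers stochastic during inference
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(passes)]
        )
    # Predictive mean and a simple per-class uncertainty estimate.
    return probs.mean(dim=0), probs.std(dim=0)
```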
arXiv Detail & Related papers (2021-04-26T15:12:13Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to the optimal training distribution and uses it to re-weight the transitions used for training.
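A very loose sketch of that re-weighting idea (DisCor actually trains a separate network to estimate the error in the bootstrapped targets; here those error estimates are taken as given):

```python
import numpy as np

def discor_style_weights(target_error_estimates, temperature=10.0):
    """Down-weight transitions whose bootstrapped targets are likely wrong.

    `target_error_estimates` holds estimated errors of the target values for
    each sampled transition; lower estimated error means higher weight.
    """
    w = np.exp(-np.asarray(target_error_estimates) / temperature)
    return w / w.sum()  # normalized weights for the Q-learning regression loss
```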
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
- Transfer Reinforcement Learning under Unobserved Contextual Information [16.895704973433382]
We study a transfer reinforcement learning problem where the state transitions and rewards are affected by the environmental context.
We develop a method to obtain causal bounds on the transition and reward functions using the demonstrator's data.
We propose new Q-learning and UCB-Q-learning algorithms that converge to the true value function without bias.
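A tabular caricature of how such causal bounds might enter a Q-learning update (hypothetical sketch; the paper's algorithms and bound construction are more involved): the running value estimate is projected onto the interval given by the bounds.

```python
import numpy as np

def causal_bound_q_update(Q, s, a, r, s_next, lower, upper,
                          alpha=0.1, gamma=0.99):
    """One tabular Q-learning step with the estimate clipped to causal bounds.

    `lower` and `upper` stand in for per state-action causal bounds on the
    value, assumed precomputed from the demonstrator's data.
    """
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    Q[s, a] = np.clip(Q[s, a], lower[s, a], upper[s, a])  # bias-reducing projection
    return Q
```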
arXiv Detail & Related papers (2020-03-09T22:00:04Z)