Zero-Shot Compositional Policy Learning via Language Grounding
- URL: http://arxiv.org/abs/2004.07200v2
- Date: Mon, 17 Apr 2023 17:36:49 GMT
- Title: Zero-Shot Compositional Policy Learning via Language Grounding
- Authors: Tianshi Cao, Jingkang Wang, Yining Zhang, Sivabalan Manivasagam
- Abstract summary: Humans can adapt to new tasks quickly by leveraging prior knowledge about the world such as language descriptions.
We introduce a new research platform BabyAI++ in which the dynamics of environments are disentangled from visual appearance.
We find that current language-guided RL/IL techniques overfit to the training environments and suffer from a huge performance drop when facing unseen combinations.
- Score: 13.45138913186308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent breakthroughs in reinforcement learning (RL) and imitation
learning (IL), existing algorithms fail to generalize beyond the training
environments. In reality, humans can adapt to new tasks quickly by leveraging
prior knowledge about the world such as language descriptions. To facilitate
the research on language-guided agents with domain adaptation, we propose a novel
zero-shot compositional policy learning task, where the environments are
characterized as a composition of different attributes. Since there are no
public environments supporting this study, we introduce a new research platform
BabyAI++ in which the dynamics of environments are disentangled from visual
appearance. In each episode, BabyAI++ provides varied vision-dynamics
combinations along with corresponding descriptive texts. To evaluate the
adaptation capability of learned agents, a set of vision-dynamics pairings is
held out for testing on BabyAI++. Unsurprisingly, we find that current
language-guided RL/IL techniques overfit to the training environments and
suffer from a huge performance drop when facing unseen combinations. In
response, we propose a multi-modal fusion method with an attention mechanism to
perform visual language-grounding. Extensive experiments show strong evidence
that language grounding is able to improve the generalization of agents across
environments with varied dynamics.
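As a rough illustration of the kind of attention-based multi-modal fusion the abstract describes, the sketch below cross-attends grid-view visual features to token embeddings of the descriptive text. The module names, dimensions, and residual design are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only: attention-based fusion of visual features and
# language-description embeddings for visual language grounding.
# Layer names and sizes are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, vis_dim=128, txt_dim=128, n_heads=4):
        super().__init__()
        # Cross-attention: each visual grid cell attends to description tokens.
        self.attn = nn.MultiheadAttention(vis_dim, n_heads, kdim=txt_dim,
                                          vdim=txt_dim, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, H*W, vis_dim) grid features from a visual encoder
        # txt_feats: (B, T, txt_dim) token embeddings of the dynamics text
        grounded, _ = self.attn(vis_feats, txt_feats, txt_feats)
        # Residual connection preserves ungrounded visual information.
        return self.norm(vis_feats + grounded)


if __name__ == "__main__":
    fusion = AttentionFusion()
    vis = torch.randn(2, 49, 128)   # e.g. a 7x7 grid observation, flattened
    txt = torch.randn(2, 12, 128)   # embedded descriptive text
    print(fusion(vis, txt).shape)   # torch.Size([2, 49, 128])
```

The fused features would then feed the policy network, so that the same visual observation can be interpreted differently depending on the described dynamics.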
Related papers
- Teaching Embodied Reinforcement Learning Agents: Informativeness and Diversity of Language Use [16.425032085699698]
It is desirable for embodied agents to have the ability to leverage human language to gain explicit or implicit knowledge for learning tasks.
However, it is not clear how to incorporate rich language use to facilitate task learning.
This paper studies different types of language inputs in facilitating reinforcement learning.
arXiv Detail & Related papers (2024-10-31T17:59:52Z)
- How language models extrapolate outside the training data: A case study in Textualized Gridworld [32.5268320198854]
We show that conventional approaches, including next-token prediction and Chain of Thought fine-tuning, fail to generalize in larger, unseen environments.
Inspired by human cognition and dual-process theory, we propose language models should construct cognitive maps before interaction.
arXiv Detail & Related papers (2024-06-21T16:10:05Z)
- LanGWM: Language Grounded World Model [24.86620763902546]
We focus on learning language-grounded visual features to enhance the world model learning.
Our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction.
arXiv Detail & Related papers (2023-11-29T12:41:55Z)
- Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
- SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model can learn on self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Improving Policy Learning via Language Dynamics Distillation [87.27583619910338]
We propose Language Dynamics Distillation (LDD), which pretrains a model to predict environment dynamics given demonstrations with language descriptions.
We show that language descriptions in demonstrations improve sample-efficiency and generalization across environments.
arXiv Detail & Related papers (2022-09-30T19:56:04Z)
- Inner Monologue: Embodied Reasoning through Planning with Language Models [81.07216635735571]
Large Language Models (LLMs) can be applied to domains beyond natural language processing.
LLMs planning in embodied environments need to consider not just which skills to perform, but also how and when to perform them.
We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios.
arXiv Detail & Related papers (2022-07-12T15:20:48Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation task.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- VisualHints: A Visual-Lingual Environment for Multimodal Reinforcement Learning [14.553086325168803]
We present VisualHints, a novel environment for multimodal reinforcement learning (RL) involving text-based interactions along with visual hints obtained from the environment.
We introduce an extension of the TextWorld cooking environment with the addition of visual clues interspersed throughout the environment.
The goal is to force an RL agent to use both text and visual features to predict natural language action commands for solving the final task of cooking a meal.
arXiv Detail & Related papers (2020-10-26T18:51:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.