FILM: Following Instructions in Language with Modular Methods
- URL: http://arxiv.org/abs/2110.07342v1
- Date: Tue, 12 Oct 2021 16:40:01 GMT
- Title: FILM: Following Instructions in Language with Modular Methods
- Authors: So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, Ruslan Salakhutdinov
- Abstract summary: Recent methods for embodied instruction following are typically trained end-to-end using imitation learning.
We propose a modular method with structured representations that builds a semantic map of the scene and performs exploration with a semantic search policy.
Our findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state-tracking and guidance.
- Score: 109.73082108379936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent methods for embodied instruction following are typically trained
end-to-end using imitation learning. This requires the use of expert
trajectories and low-level language instructions. Such approaches assume
learned hidden states will simultaneously integrate semantics from the language
and vision to perform state tracking, spatial memory, exploration, and
long-term planning. In contrast, we propose a modular method with structured
representations that (1) builds a semantic map of the scene, and (2) performs
exploration with a semantic search policy, to achieve the natural language
goal. Our modular method achieves SOTA performance (24.46%) with a substantial
(8.17% absolute) gap from previous work while using less data by eschewing
both expert trajectories and low-level instructions. Leveraging low-level
language, however, can further increase our performance (26.49%). Our findings
suggest that an explicit spatial memory and a semantic search policy can
provide a stronger and more general representation for state-tracking and
guidance, even in the absence of expert trajectories or low-level instructions.
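To make the modular pipeline concrete, below is a minimal sketch (Python/NumPy) of the two structured components the abstract names: a semantic map accumulated from observations and a semantic search policy that proposes where to explore next. All names (SemanticMap, semantic_search_policy), the toy grid sizes, and the exploit-or-explore rule are illustrative assumptions for this sketch, not FILM's actual implementation.

```python
import numpy as np

# Hypothetical names and sizes for illustration only; FILM's real modules
# (semantic mapping, learned semantic search policy, deterministic
# navigation) are described in the paper, not implemented here.

NUM_CLASSES = 4   # toy number of object categories
MAP_SIZE = 10     # toy top-down map resolution (cells per side)


class SemanticMap:
    """Top-down grid that accumulates per-category evidence over time."""

    def __init__(self):
        # one channel per object category
        self.grid = np.zeros((NUM_CLASSES, MAP_SIZE, MAP_SIZE))

    def update(self, detections):
        # detections: list of (category_id, x_cell, y_cell) from perception
        for cat, x, y in detections:
            self.grid[cat, y, x] = 1.0


def semantic_search_policy(sem_map, target_cat, rng):
    """Pick a cell to navigate toward next.

    If the target category is already on the map, go there; otherwise
    sample a cell to explore (a stand-in for a learned search prior).
    """
    ys, xs = np.nonzero(sem_map.grid[target_cat])
    if len(xs) > 0:
        return int(xs[0]), int(ys[0])  # exploit: target already mapped
    return int(rng.integers(MAP_SIZE)), int(rng.integers(MAP_SIZE))  # explore


# toy rollout: register one observation, then query the policy for a goal cell
rng = np.random.default_rng(0)
sem_map = SemanticMap()
sem_map.update([(2, 3, 7)])  # e.g. category 2 ("mug") seen at cell (3, 7)
goal_xy = semantic_search_policy(sem_map, target_cat=2, rng=rng)
print("next navigation goal (x, y):", goal_xy)
```

The point of the sketch is the separation of concerns: perception writes into an explicit spatial memory, and a separate policy reads that memory to decide where to go, rather than relying on a single end-to-end hidden state to do both.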
Related papers
- Structured Exploration Through Instruction Enhancement for Object Navigation [0.0]
We propose a hierarchical learning-based method for object navigation.
The top level is capable of high-level planning and of building a memory at the floorplan level.
We demonstrate the effectiveness of our method on a dynamic domestic environment.
arXiv Detail & Related papers (2022-11-15T19:39:22Z)
- Skill Induction and Planning with Latent Language [94.55783888325165]
We formulate a generative model of action sequences in which goals generate sequences of high-level subtask descriptions.
We describe how to train this model using primarily unannotated demonstrations by parsing demonstrations into sequences of named high-level subtasks.
In trained models, the space of natural language commands indexes a library of skills; agents can use these skills to plan by generating high-level instruction sequences tailored to novel goals.
arXiv Detail & Related papers (2021-10-04T15:36:32Z)
- Neural Abstructions: Abstractions that Support Construction for Grounded Language Learning [69.1137074774244]
Leveraging language interactions effectively requires addressing limitations in the two most common approaches to language grounding.
We introduce the idea of neural abstructions: a set of constraints on the inference procedure of a label-conditioned generative model.
We show that with this method a user population is able to build a semantic modification for an open-ended house task in Minecraft.
arXiv Detail & Related papers (2021-07-20T07:01:15Z)
- Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
- Systematic Generalization on gSCAN with Language Conditioned Embedding [19.39687991647301]
Systematic Generalization refers to a learning algorithm's ability to extrapolate learned behavior to unseen situations.
We propose a novel method that learns objects' contextualized embeddings with dynamic message passing conditioned on the input natural language.
arXiv Detail & Related papers (2020-09-11T17:35:05Z)
- Language Guided Networks for Cross-modal Moment Retrieval [66.49445903955777]
Cross-modal moment retrieval aims to localize a temporal segment from an untrimmed video described by a natural language query.
Existing methods independently extract the features of videos and sentences.
We present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval.
arXiv Detail & Related papers (2020-06-18T12:08:40Z) - Multi-View Learning for Vision-and-Language Navigation [163.20410080001324]
Learn from EveryOne (LEO) is a training paradigm for learning to navigate in a visual environment.
By sharing parameters across instructions, our approach learns more effectively from limited training data.
On the recent Room-to-Room (R2R) benchmark dataset, LEO achieves 16% improvement (absolute) over a greedy agent.
arXiv Detail & Related papers (2020-03-02T13:07:46Z)