Few-shot Object Grounding and Mapping for Natural Language Robot
Instruction Following
- URL: http://arxiv.org/abs/2011.07384v1
- Date: Sat, 14 Nov 2020 20:35:20 GMT
- Title: Few-shot Object Grounding and Mapping for Natural Language Robot
Instruction Following
- Authors: Valts Blukis, Ross A. Knepper, Yoav Artzi
- Abstract summary: We study the problem of learning a robot policy to follow natural language instructions that can be easily extended to reason about new objects.
We introduce a few-shot language-conditioned object grounding method trained from augmented reality data.
We present a learned map representation that encodes object locations and their instructed use, and construct it from our few-shot grounding output.
- Score: 15.896892723068932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of learning a robot policy to follow natural language
instructions that can be easily extended to reason about new objects. We
introduce a few-shot language-conditioned object grounding method trained from
augmented reality data that uses exemplars to identify objects and align them
to their mentions in instructions. We present a learned map representation that
encodes object locations and their instructed use, and construct it from our
few-shot grounding output. We integrate this mapping approach into an
instruction-following policy, thereby allowing it to reason about previously
unseen objects at test-time by simply adding exemplars. We evaluate on the task
of learning to map raw observations and instructions to continuous control of a
physical quadcopter. Our approach significantly outperforms the prior state of
the art in the presence of new objects, even when the prior approach observes
all objects during training.
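To make the exemplar-based grounding idea concrete, below is a minimal Python sketch, not the paper's implementation: each mentioned object is aligned to the detection whose visual embedding best matches a few stored exemplar embeddings, so supporting a new object only requires registering new exemplars. The encoder, the phrase-keyed exemplar lookup, and the similarity threshold are all illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def ground_mentions(mentions, detections, exemplar_db, threshold=0.5):
    """Align each mentioned object to the detection whose visual embedding best
    matches the few exemplar embeddings registered for that object name.

    mentions:    list of object phrases extracted from the instruction
    detections:  list of (object_id, visual_embedding) from the current frame
    exemplar_db: {object_name: [exemplar_embedding, ...]}; extending the robot
                 to a new object only requires adding a few exemplars here
    Returns {mention: object_id or None}.
    """
    out = {}
    for phrase in mentions:
        exemplars = exemplar_db.get(phrase, [])
        best_id, best = None, threshold
        for obj_id, emb in detections:
            score = max((cosine(emb, ex) for ex in exemplars), default=-1.0)
            if score > best:
                best_id, best = obj_id, score
        out[phrase] = best_id
    return out

# Toy usage: random vectors stand in for learned visual features.
rng = np.random.default_rng(0)
mug, box = rng.normal(size=64), rng.normal(size=64)
db = {"blue mug": [mug + 0.05 * rng.normal(size=64)]}
print(ground_mentions(["blue mug"], [("obj_3", mug), ("obj_7", box)], db))
# {'blue mug': 'obj_3'}
```

In the paper, the grounded object identities and their observed locations are what the learned map representation accumulates for the instruction-following policy.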
Related papers
- Object-Centric Instruction Augmentation for Robotic Manipulation [29.491990994901666]
We introduce the Object-Centric Instruction Augmentation (OCI) framework to augment highly semantic and information-dense language instructions with position cues.
We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instruction.
We demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.
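As a rough illustration of augmenting an instruction with position cues (the paper derives the cues with a multi-modal LLM; the string template below is only an assumption):

```python
def augment_instruction(instruction, object_positions):
    """Append coarse position cues for detected objects to an instruction.

    object_positions: {object_name: (x, y)} in normalized image coordinates.
    The OCI framework produces its cues with a multi-modal LLM; this template
    is only a stand-in to show the shape of the augmented instruction.
    """
    cues = [f"{name} is at ({x:.2f}, {y:.2f})"
            for name, (x, y) in object_positions.items()]
    return instruction + " [Positions: " + "; ".join(cues) + "]"

print(augment_instruction(
    "put the red block into the bowl",
    {"red block": (0.32, 0.61), "bowl": (0.70, 0.55)},
))
# put the red block into the bowl [Positions: red block is at (0.32, 0.61); bowl is at (0.70, 0.55)]
```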
arXiv Detail & Related papers (2024-01-05T13:54:45Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- Few-Shot In-Context Imitation Learning via Implicit Graph Alignment [15.215659641228655]
We formulate imitation learning as a conditional alignment problem between graph representations of objects.
We show that this conditioning allows for in-context learning, where a robot can perform a task on a set of new objects immediately after the demonstrations.
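A minimal sketch of aligning a demonstration object graph to a new scene by matching node features; the paper learns this alignment implicitly end to end, so the explicit Hungarian matching here is only a stand-in.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_object_graphs(demo_feats, test_feats):
    """Match nodes of a demonstration object graph to nodes of a new scene
    graph by minimizing feature distance (Hungarian assignment).

    demo_feats, test_feats: (N, D) arrays of per-object node features.
    Returns a list of (demo_index, test_index) pairs.
    """
    cost = np.linalg.norm(demo_feats[:, None, :] - test_feats[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

demo = np.array([[0.0, 1.0], [1.0, 0.0]])   # two demo objects
test = np.array([[1.1, 0.1], [0.1, 0.9]])   # same objects, new arrangement
print(align_object_graphs(demo, test))       # [(0, 1), (1, 0)]
```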
arXiv Detail & Related papers (2023-10-18T18:26:01Z)
- Object Goal Navigation with Recursive Implicit Maps [92.6347010295396]
We propose an implicit spatial map for object goal navigation.
Our method significantly outperforms the state of the art on the challenging MP3D dataset.
We deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes.
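A toy sketch of the recursive structure of such an implicit map, where a latent vector is updated from each new observation; the real model uses learned, attention-based updates rather than this single linear recurrence, so the update rule and dimensions below are assumptions.

```python
import numpy as np

def update_implicit_map(prev_map, obs_feat, W_m, W_o):
    """One recursive update of a latent map vector from the current
    observation feature (a simple recurrent cell as a stand-in)."""
    return np.tanh(prev_map @ W_m + obs_feat @ W_o)

rng = np.random.default_rng(0)
d_map, d_obs = 32, 16
W_m = 0.1 * rng.normal(size=(d_map, d_map))
W_o = 0.1 * rng.normal(size=(d_obs, d_map))
m = np.zeros(d_map)
for t in range(5):                       # integrate five observations
    m = update_implicit_map(m, rng.normal(size=d_obs), W_m, W_o)
print(m.shape)                           # (32,)
```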
arXiv Detail & Related papers (2023-08-10T14:21:33Z)
- Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
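A hypothetical data-structure sketch of a multi-granularity map cell that keeps both a coarse semantic class and a fine-grained appearance feature; the field names and sizes are illustrative assumptions, not the paper's.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MapCell:
    """One cell of a hypothetical multi-granularity semantic map: a coarse
    semantic class label plus a fine-grained appearance feature describing
    attributes such as color and texture."""
    semantic_class: str = "unknown"
    fine_feature: np.ndarray = field(default_factory=lambda: np.zeros(32))

# An 8x8 toy grid map; one cell holds a detected chair and its feature.
grid = [[MapCell() for _ in range(8)] for _ in range(8)]
grid[2][5] = MapCell("chair", np.random.default_rng(0).normal(size=32))
print(grid[2][5].semantic_class)   # "chair"
```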
arXiv Detail & Related papers (2022-10-14T04:23:27Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
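A simplified sketch of such a POMDP-style decision loop: a belief over candidate target objects is updated with scores from the grounding module, and the robot asks a clarifying question while it is still uncertain. The observation model, OBR reasoning, and grasp planning from the actual system are omitted.

```python
import numpy as np

def update_belief(belief, grounding_scores):
    """Bayes-style update of the belief over which detected object is the
    target, using per-object scores from a visual grounding module."""
    posterior = belief * grounding_scores
    return posterior / posterior.sum()

def choose_action(belief, confidence=0.8):
    """Ask a clarifying question while uncertain, otherwise grasp the most
    likely object (removal of blocking objects is omitted here)."""
    if belief.max() < confidence:
        return "ask_question"
    return f"grasp_object_{int(belief.argmax())}"

belief = np.ones(3) / 3                               # three candidate objects
belief = update_belief(belief, np.array([0.1, 0.7, 0.2]))
print(choose_action(belief))                          # "ask_question"
```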
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Are We There Yet? Learning to Localize in Embodied Instruction Following [1.7300690315775575]
Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem.
Key challenges for this task include localizing target locations and navigating to them through visual inputs.
We augment the agent's field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each timestep.
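A small sketch of the kind of relative-spatial-relation label such a per-timestep prediction could be trained against; the direction-binning scheme below is an assumption, not the one used in the paper.

```python
import numpy as np

def relative_relation(agent_pos, agent_yaw, target_pos, n_bins=8):
    """Compute a coarse (direction_bin, distance) label for the target
    relative to the agent's pose; such a label can supervise an auxiliary
    prediction head at each navigation timestep."""
    delta = np.asarray(target_pos, dtype=float) - np.asarray(agent_pos, dtype=float)
    angle = (np.arctan2(delta[1], delta[0]) - agent_yaw) % (2 * np.pi)
    direction_bin = int(angle / (2 * np.pi / n_bins))
    return direction_bin, float(np.linalg.norm(delta))

# Target two units straight ahead of an agent facing along the x-axis.
print(relative_relation((0.0, 0.0), 0.0, (2.0, 0.0)))  # direction bin and distance
```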
arXiv Detail & Related papers (2021-01-09T21:49:41Z)
- COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
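A toy version of the training signal: pull a detected object's visual feature toward the contextualized embedding of its narrated word and push it away from other words. This is an InfoNCE-style stand-in; the exact loss used by COBE may differ.

```python
import numpy as np

def contrastive_cobe_loss(visual_feat, pos_word_emb, neg_word_embs, temp=0.07):
    """Contrastive loss encouraging a detector's visual feature to predict the
    contextualized embedding of the narrated object word."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    logits = np.array([cos(visual_feat, pos_word_emb)] +
                      [cos(visual_feat, n) for n in neg_word_embs]) / temp
    logits -= logits.max()                       # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

rng = np.random.default_rng(0)
v = rng.normal(size=128)
print(contrastive_cobe_loss(v, v + 0.1 * rng.normal(size=128),
                            [rng.normal(size=128) for _ in range(4)]))
```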
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
- Learning visual policies for building 3D shape categories [130.7718618259183]
Previous work in this domain often assembles particular instances of objects from known sets of primitives.
We learn a visual policy to assemble other instances of the same category.
Our visual assembly policies are trained with no real images and reach up to 95% success rate when evaluated on a real robot.
arXiv Detail & Related papers (2020-04-15T17:29:10Z)
- Learning Object Permanence from Video [46.34427538905761]
This paper introduces the setup of learning Object Permanence from data.
We explain why this learning problem should be dissected into four components, where objects are (1) visible, (2) occluded, (3) contained by another object, and (4) carried by a containing object.
We then present a unified deep architecture that learns to predict object location under these four scenarios.
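A rule-based stand-in that only illustrates the four reasoning regimes; the paper instead trains a single network to handle all four cases, and the function below is not its architecture.

```python
def predict_location(scenario, last_visible_loc, container_loc=None, carrier_loc=None):
    """Toy dispatch over the four object-permanence scenarios."""
    if scenario == "visible":
        return last_visible_loc           # directly observed
    if scenario == "occluded":
        return last_visible_loc           # assume the object stayed put
    if scenario == "contained":
        return container_loc              # follow the containing object
    if scenario == "carried":
        return carrier_loc                # follow whoever carries the container
    raise ValueError(scenario)

print(predict_location("carried", (2, 3), carrier_loc=(5, 1)))  # (5, 1)
```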
arXiv Detail & Related papers (2020-03-23T18:03:01Z)