Object-and-Action Aware Model for Visual Language Navigation
- URL: http://arxiv.org/abs/2007.14626v1
- Date: Wed, 29 Jul 2020 06:32:18 GMT
- Title: Object-and-Action Aware Model for Visual Language Navigation
- Authors: Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton van den Hengel, Qi Wu
- Abstract summary: Vision-and-Language Navigation (VLN) is unique in that it requires turning relatively general natural-language instructions into robot agent actions.
We propose an Object-and-Action Aware Model (OAAM) that processes these two different forms of natural-language instruction separately.
This enables each branch to flexibly match object-centered or action-centered instructions to their counterpart visual-perception or action-orientation features.
- Score: 70.33142095637515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation (VLN) is unique in that it requires turning
relatively general natural-language instructions into robot agent actions, on
the basis of the visible environment. This requires extracting value from two
very different types of natural-language information. The first is the object
description (e.g., 'table', 'door'), each instance of which serves as a cue for the
agent to determine the next action by finding the corresponding item in the visible
environment; the second is the action specification (e.g., 'go straight', 'turn left'),
which allows the robot to predict its next movements directly, without relying on
visual perception. However, most existing methods pay little attention to
distinguishing these two types of information during instruction encoding, and they
mix together the matching between textual object/action encodings and the visual
perception/orientation features of candidate viewpoints. In this paper, we
propose an Object-and-Action Aware Model (OAAM) that processes these two
different forms of natural-language instruction separately. This enables
each branch to flexibly match object-centered or action-centered instructions to
their counterpart visual-perception or action-orientation features. However, one
side issue of this solution is that an object mentioned in the instructions
may be observed in the direction of two or more candidate viewpoints, so the
OAAM may not predict the viewpoint on the shortest path as the next action. To
handle this problem, we design a simple but effective path loss that penalizes
trajectories deviating from the ground-truth path. Experimental results
demonstrate the effectiveness of the proposed model and path loss, and the
superiority of their combination, which achieves a 50% SPL score on the R2R dataset
and a 40% CLS score on the R4R dataset in unseen environments, outperforming the
previous state of the art.
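The abstract describes two components: a two-branch matcher that scores candidate viewpoints by pairing object-related words with visual appearance and action-related words with orientation, and a path loss that penalizes deviation from the ground-truth path. The PyTorch sketch below is only an illustration of those two ideas under stated assumptions, not the authors' implementation; the module names, feature dimensions, pooled text contexts, and the exact form of the deviation penalty are all assumptions for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoBranchActionPredictor(nn.Module):
    """Illustrative two-branch matcher: object words vs. visual features,
    action words vs. orientation features; the two branch distributions are
    fused by learned, instruction-conditioned weights."""

    def __init__(self, txt_dim=512, vis_dim=2048, ori_dim=128, hid=512):
        super().__init__()
        self.obj_txt = nn.Linear(txt_dim, hid)   # projects object-related text context
        self.act_txt = nn.Linear(txt_dim, hid)   # projects action-related text context
        self.vis_proj = nn.Linear(vis_dim, hid)  # projects candidate visual features
        self.ori_proj = nn.Linear(ori_dim, hid)  # projects candidate orientation features
        self.fuse = nn.Linear(txt_dim, 2)        # predicts the two branch weights

    def forward(self, obj_ctx, act_ctx, cand_vis, cand_ori):
        # obj_ctx, act_ctx: (B, txt_dim) pooled object-/action-centred text features
        # cand_vis: (B, K, vis_dim) appearance features of K candidate viewpoints
        # cand_ori: (B, K, ori_dim) orientation (heading/elevation) features
        obj_logits = torch.einsum('bd,bkd->bk', self.obj_txt(obj_ctx), self.vis_proj(cand_vis))
        act_logits = torch.einsum('bd,bkd->bk', self.act_txt(act_ctx), self.ori_proj(cand_ori))
        w = F.softmax(self.fuse(obj_ctx + act_ctx), dim=-1)            # (B, 2)
        return (w[:, :1] * F.softmax(obj_logits, dim=-1)
                + w[:, 1:] * F.softmax(act_logits, dim=-1))            # (B, K) action probs


def path_penalty(step_log_probs, deviation):
    # step_log_probs: (T,) log-probability of the action taken at each step
    # deviation:      (T,) assumed distance from the reached viewpoint to the
    #                 nearest node on the ground-truth path (zero when on it)
    # A plausible instantiation of the path-loss idea: actions that take the
    # agent off the ground-truth path are discouraged in proportion to how far
    # off they land.
    return (deviation * (-step_log_probs)).mean()
```

In the paper the object- and action-related text contexts are produced inside the model; here they are simply assumed to be given as pooled inputs so the matching and fusion logic stays self-contained.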
Related papers
- SOOD: Towards Semi-Supervised Oriented Object Detection [57.05141794402972]
This paper proposes a novel Semi-supervised Oriented Object Detection model, termed SOOD, built upon the mainstream pseudo-labeling framework.
Our experiments show that when trained with the two proposed losses, SOOD surpasses the state-of-the-art SSOD methods under various settings on the DOTA-v1.5 benchmark.
arXiv Detail & Related papers (2023-04-10T11:10:42Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation [57.12508968239015]
This work presents a transformer-based vision-and-language navigation (VLN) agent.
It uses two different visual encoders -- a scene classification network and an object detector.
Scene features contribute high-level contextual information that supports object-level processing.
arXiv Detail & Related papers (2021-10-27T03:29:34Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks [29.671268927569063]
Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task.
This paper proposes a new method, which outperforms the previous methods by a large margin.
arXiv Detail & Related papers (2021-06-01T16:06:09Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that describes the route step by step.
This approach deviates from real-world problems, in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)