VISITRON: Visual Semantics-Aligned Interactively Trained
Object-Navigator
- URL: http://arxiv.org/abs/2105.11589v1
- Date: Tue, 25 May 2021 00:21:54 GMT
- Title: VISITRON: Visual Semantics-Aligned Interactively Trained
Object-Navigator
- Authors: Ayush Shrivastava, Karthik Gopalakrishnan, Yang Liu, Robinson Piramuthu, Gokhan Tür, Devi Parikh, Dilek Hakkani-Tür
- Abstract summary: Interactive robots navigating photo-realistic environments face challenges underlying vision-and-language navigation (VLN).
We present VISITRON, a navigator better suited to the interactive regime inherent to CVDN.
We perform extensive ablations with VISITRON to gain empirical insights and improve performance on CVDN.
- Score: 41.060371177425175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interactive robots navigating photo-realistic environments face challenges
underlying vision-and-language navigation (VLN), but in addition, they need to
be trained to handle the dynamic nature of dialogue. However, research in
Cooperative Vision-and-Dialog Navigation (CVDN), where a navigator interacts
with a guide in natural language in order to reach a goal, treats the dialogue
history as a VLN-style static instruction. In this paper, we present VISITRON,
a navigator better suited to the interactive regime inherent to CVDN by being
trained to: i) identify and associate object-level concepts and semantics
between the environment and dialogue history, ii) identify when to interact vs.
navigate via imitation learning of a binary classification head. We perform
extensive ablations with VISITRON to gain empirical insights and improve
performance on CVDN. VISITRON is competitive with models on the static CVDN
leaderboard. We also propose a generalized interactive regime to fine-tune and
evaluate VISITRON and future such models with pre-trained guides for
adaptability.
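To make objective (ii) above concrete, here is a minimal, hypothetical PyTorch sketch of a navigator that pairs an action-prediction head with a binary interact-vs.-navigate head supervised by imitation; the encoder, feature sizes, and oracle labels are placeholders and do not reflect the released VISITRON architecture.
```python
# Hypothetical sketch (not the authors' code): a navigator with an extra
# binary "interact vs. navigate" head trained by imitation learning.
import torch
import torch.nn as nn

class InteractiveNavigator(nn.Module):
    def __init__(self, hidden_dim=768, num_candidates=16):
        super().__init__()
        # Placeholder multimodal encoder standing in for whatever fuses
        # dialogue-history tokens with panoramic visual features.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(hidden_dim, num_candidates)  # which candidate viewpoint to move to
        self.interact_head = nn.Linear(hidden_dim, 2)             # 0 = keep navigating, 1 = ask the guide

    def forward(self, fused_tokens):
        # fused_tokens: (batch, seq_len, hidden_dim) dialogue + vision tokens
        h = self.encoder(fused_tokens)
        pooled = h[:, 0]  # first token as a summary representation
        return self.action_head(pooled), self.interact_head(pooled)

# One imitation-learning step: both heads are supervised with oracle labels.
model = InteractiveNavigator()
fused = torch.randn(4, 32, 768)               # dummy fused inputs
gold_action = torch.randint(0, 16, (4,))      # oracle next-viewpoint label
gold_ask = torch.randint(0, 2, (4,))          # oracle "should the agent ask?" label
action_logits, ask_logits = model(fused)
loss = nn.CrossEntropyLoss()(action_logits, gold_action) \
     + nn.CrossEntropyLoss()(ask_logits, gold_ask)
loss.backward()
```
The design point is simply that the same pooled multimodal state feeds both decisions, so learning when to ask the guide becomes part of the same policy.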
Related papers
- Vision-and-Language Navigation Generative Pretrained Transformer [0.0]
Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT) adopts a transformer decoder model (GPT-2) to model trajectory sequence dependencies, bypassing the need for historical encoding modules.
Performance assessments on the VLN dataset reveal that VLN-GPT surpasses complex state-of-the-art encoder-based models.
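A rough illustration of that decoder-only formulation (a sketch under assumptions, not the VLN-GPT release: the action space, feature sizes, and interleaving scheme are invented here) is to feed the trajectory to a GPT-2 backbone as a causal sequence of observation and action embeddings.
```python
# Hedged sketch: model a trajectory with a GPT-2 decoder so that no separate
# history-encoding module is needed. Sizes and the action space are assumed.
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2Model

NUM_ACTIONS, HID = 6, 768                      # assumed discrete action space and hidden size

decoder = GPT2Model(GPT2Config(n_embd=HID, n_layer=4, n_head=8))
obs_proj = nn.Linear(2048, HID)                # project visual features into GPT-2's space
act_embed = nn.Embedding(NUM_ACTIONS, HID)
act_head = nn.Linear(HID, NUM_ACTIONS)

# Interleave observation and action embeddings along the time axis;
# causal attention inside GPT-2 conditions each step on the full history.
obs = torch.randn(1, 10, 2048)                 # dummy trajectory of 10 frames
acts = torch.randint(0, NUM_ACTIONS, (1, 10))
tokens = torch.stack([obs_proj(obs), act_embed(acts)], dim=2).flatten(1, 2)
hidden = decoder(inputs_embeds=tokens).last_hidden_state
next_action_logits = act_head(hidden[:, 0::2]) # predict each action from its observation token
```
Because GPT-2's attention is causal, each action prediction conditions on the entire preceding trajectory without an explicit history encoder.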
arXiv Detail & Related papers (2024-05-27T09:42:04Z)
- VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training [8.479135285935113]
Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation.
Most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects.
We propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP)
arXiv Detail & Related papers (2024-03-12T22:33:08Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN)
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
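A toy version of that language-as-perception step might look like the sketch below; the captioning and detection checkpoints are common off-the-shelf examples, not necessarily the ones LangNav uses.
```python
# Hypothetical sketch: turn one egocentric view into a text description using
# off-the-shelf captioning and object detection (example checkpoints only).
from transformers import pipeline
from PIL import Image

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

def view_to_text(image: Image.Image) -> str:
    caption = captioner(image)[0]["generated_text"]
    objects = sorted({d["label"] for d in detector(image)})
    return f"You see {caption}. Visible objects: {', '.join(objects)}."

# Concatenating one such line per panoramic heading, plus the instruction,
# yields the purely textual observation a language-model policy conditions on.
```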
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
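The image-language association step can be illustrated with a short CLIP scoring sketch; the graph construction with ViNG and the GPT-3 landmark extraction are omitted, and the checkpoint and inputs here are placeholders.
```python
# Illustrative only: score how well each landmark phrase matches each image
# node of a topological graph using CLIP; planning over the graph is omitted.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

landmarks = ["a stop sign", "a blue dumpster", "a picnic table"]  # e.g., phrases parsed from the instruction
node_images = [Image.new("RGB", (224, 224)) for _ in range(2)]    # stand-ins for images at two graph nodes

inputs = processor(text=landmarks, images=node_images, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image   # shape: (num_nodes, num_landmarks)
# A planner would then search the graph for a path that visits high-scoring
# nodes for each landmark in the order the instruction mentions them.
```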
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- Transferring ConvNet Features from Passive to Active Robot Self-Localization: The Use of Ego-Centric and World-Centric Views [2.362412515574206]
A standard visual place recognition (VPR) subsystem is assumed to be available, and its domain-invariant state recognition ability is transferred to train a domain-invariant next-best-view (NBV) planner.
We divide the visual cues available from the CNN model into two types: the output layer cue (OLC) and the intermediate layer cue (ILC).
In our framework, the ILC and OLC are mapped to a state vector and subsequently used to train a multiview NBV planner via deep reinforcement learning.
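A schematic of combining an intermediate-layer cue and an output-layer cue into a single state vector is sketched below; the backbone, chosen layers, and pooling are assumptions, not the paper's setup.
```python
# Rough sketch: concatenate a pooled intermediate-layer feature (ILC) with the
# final-layer output (OLC) of a pre-trained CNN to form an RL state vector.
import torch
from torchvision.models import resnet18, ResNet18_Weights
from torchvision.models.feature_extraction import create_feature_extractor

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
extractor = create_feature_extractor(backbone, return_nodes={"layer3": "ilc", "fc": "olc"})

def state_vector(images: torch.Tensor) -> torch.Tensor:
    feats = extractor(images)
    ilc = feats["ilc"].mean(dim=(2, 3))   # spatially pool the intermediate feature map
    olc = feats["olc"]                    # output-layer (class-score-like) cue
    return torch.cat([ilc, olc], dim=1)   # state passed to the NBV planner's policy

s = state_vector(torch.randn(1, 3, 224, 224))   # here: shape (1, 256 + 1000)
```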
arXiv Detail & Related papers (2022-04-22T04:42:33Z)
- Cross-modal Map Learning for Vision and Language Navigation [82.04247028482244]
We consider the problem of Vision-and-Language Navigation (VLN)
In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations.
We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
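As a toy illustration of predicting top-down semantics on an egocentric grid, including cells the agent has not yet observed, consider the following sketch; all shapes and the feature encoder are placeholders, not the paper's model.
```python
# Toy sketch: a small convolutional head that outputs per-cell semantic class
# logits over an egocentric top-down grid; supervision would cover both
# observed and unobserved cells. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CLASSES, GRID = 27, 64            # assumed semantic classes and map resolution

map_head = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(128, NUM_CLASSES, kernel_size=1),
)

fused_features = torch.randn(1, 256, GRID, GRID)          # placeholder fused egocentric features
semantic_logits = map_head(fused_features)                # (1, NUM_CLASSES, GRID, GRID)
target = torch.randint(0, NUM_CLASSES, (1, GRID, GRID))   # ground-truth top-down semantics
loss = nn.CrossEntropyLoss()(semantic_logits, target)     # loss over all cells, observed or not
```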
arXiv Detail & Related papers (2022-03-10T03:30:12Z)
- Vision-Dialog Navigation by Exploring Cross-modal Memory [107.13970721435571]
Vision-dialog navigation is posed as a new holy-grail task in the vision-and-language field.
We propose the Cross-modal Memory Network (CMN) for remembering and understanding the rich information relevant to historical navigation actions.
Our CMN outperforms the previous state-of-the-art model by a significant margin on both seen and unseen environments.
arXiv Detail & Related papers (2020-03-15T03:08:06Z)
- Environment-agnostic Multitask Learning for Natural Language Grounded Navigation [88.69873520186017]
We introduce a multitask navigation model that can be seamlessly trained on Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks.
Experiments show that environment-agnostic multitask learning significantly reduces the performance gap between seen and unseen environments.
arXiv Detail & Related papers (2020-03-01T09:06:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.