CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2211.16649v1
- Date: Wed, 30 Nov 2022 00:38:54 GMT
- Title: CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation
- Authors: Vishnu Sashank Dorbala, Gunnar Sigurdsson, Robinson Piramuthu, Jesse
Thomason, Gaurav S. Sukhatme
- Abstract summary: Embodied agents performing Vision-and-Language Navigation (VLN) in the wild must be able to handle the visual diversity of household environments.
We ask whether Vision-Language models like CLIP are also capable of zero-shot language grounding.
- Score: 17.443411731092567
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Household environments are visually diverse. Embodied agents performing
Vision-and-Language Navigation (VLN) in the wild must be able to handle this
diversity, while also following arbitrary language instructions. Recently,
Vision-Language models like CLIP have shown great performance on the task of
zero-shot object recognition. In this work, we ask if these models are also
capable of zero-shot language grounding. In particular, we utilize CLIP to
tackle the novel problem of zero-shot VLN using natural language referring
expressions that describe target objects, in contrast to past work that used
simple language templates describing object classes. We examine CLIP's
capability in making sequential navigational decisions without any
dataset-specific finetuning, and study how it influences the path that an agent
takes. Our results on the coarse-grained instruction following task of REVERIE
demonstrate the navigational capability of CLIP, surpassing the supervised
baseline in terms of both success rate (SR) and success weighted by path length
(SPL). More importantly, we quantitatively show that our CLIP-based zero-shot
approach generalizes better than SOTA fully supervised learning approaches,
exhibiting more consistent performance across environments when evaluated via
Relative Change in Success (RCS).
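As a concrete illustration of the zero-shot grounding step described in the abstract, the sketch below uses CLIP to score an agent's candidate views against a natural-language referring expression and picks the most similar one. This is a minimal sketch, not the authors' implementation: the Hugging Face checkpoint name, the pick_direction helper, and the example view files are illustrative assumptions.

```python
# Minimal sketch of CLIP-based zero-shot grounding for one navigation decision.
# Assumptions: Hugging Face `transformers` CLIP weights ("openai/clip-vit-base-patch32")
# and one PIL image per navigable direction in `candidate_views`.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_direction(candidate_views: list[Image.Image], instruction: str) -> int:
    """Return the index of the candidate view that best matches the referring expression."""
    inputs = processor(text=[instruction], images=candidate_views,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text has shape (1, num_views): similarity of the instruction to each view.
    scores = outputs.logits_per_text[0]
    return int(torch.argmax(scores))

# Example (hypothetical filenames): head toward the view most similar to the target description.
# views = [Image.open(p) for p in ("north.jpg", "east.jpg", "south.jpg", "west.jpg")]
# best = pick_direction(views, "the blue armchair next to the fireplace in the living room")
```

For reference, success weighted by path length is commonly computed as SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is a binary success indicator, l_i the shortest-path distance to the goal, and p_i the length of the path the agent actually took.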
Related papers
- Visual Grounding for Object-Level Generalization in Reinforcement Learning [35.39214541324909] (arXiv: 2024-08-04T06:34:24Z)
  Generalization is a pivotal challenge for agents following natural language instructions.
  We leverage a vision-language model (VLM) for visual grounding and transfer its vision-language knowledge into reinforcement learning.
  We show that our intrinsic reward significantly improves performance on challenging skill learning.
- SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models [19.005364038603204] (arXiv: 2024-03-20T03:00:21Z)
  We introduce a novel fine-tuning paradigm named Self-Consistency Tuning (SC-Tune).
  SC-Tune features the synergistic learning of a cyclic describer-locator system.
  We demonstrate that SC-Tune significantly elevates performance across a spectrum of object-level vision-language benchmarks.
- TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation [11.591176410027224] (arXiv: 2024-03-13T05:22:39Z)
  This paper presents a Vision-Language Navigation (VLN) agent based on Large Language Models (LLMs).
  We propose the Thinking, Interacting, and Action framework to compensate for the shortcomings of LLMs in environmental perception.
  Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604] (arXiv: 2023-10-11T20:52:30Z)
  We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
  Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions (a rough sketch of this captioning step appears after this list).
- Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation [58.3480730643517] (arXiv: 2023-03-06T20:19:19Z)
  We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON).
  Our approach makes use of Large Language Models (LLMs) for this task.
  We achieve state-of-the-art zero-shot object navigation results on RoboTHOR with a success rate (SR) improvement of over 27% over the current baseline.
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841] (arXiv: 2022-10-17T17:57:46Z)
  We study the validity of non-contrastive language-image pre-training (nCLIP).
  We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
- CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment [102.17010696898113] (arXiv: 2022-03-14T15:29:27Z)
  We show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language.
  We propose a parameter-efficient fine-tuning strategy to boost the few-shot performance on the VQA task.
- Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [66.16980504844233] (arXiv: 2021-12-08T06:32:52Z)
  A vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction.
  Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions.
  We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
- How Much Can CLIP Benefit Vision-and-Language Tasks? [121.46042421728016] (arXiv: 2021-07-13T20:48:12Z)
  We show that CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has a strong zero-shot capability on various vision tasks.
  We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
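The LangNav entry above describes converting an agent's egocentric panoramic view into natural-language descriptions with off-the-shelf vision systems. The snippet below is a rough sketch of that captioning step under stated assumptions (the BLIP checkpoint, the describe_panorama helper, and the view filenames are illustrative), not the paper's actual pipeline.

```python
# Sketch: turn a discretized egocentric panorama into a text observation
# using an off-the-shelf image captioner (assumed checkpoint: Salesforce/blip-image-captioning-base).
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def describe_panorama(view_paths: list[str]) -> str:
    """Caption each discretized view and join the captions into one observation string."""
    captions = []
    for heading, path in zip(("ahead", "to your right", "behind you", "to your left"), view_paths):
        result = captioner(path)  # returns a list of dicts with a "generated_text" field
        captions.append(f"{heading.capitalize()}: {result[0]['generated_text']}.")
    return " ".join(captions)

# Hypothetical usage with four views spaced 90 degrees apart:
# obs = describe_panorama(["view_0.jpg", "view_90.jpg", "view_180.jpg", "view_270.jpg"])
```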