Simple but Effective: CLIP Embeddings for Embodied AI
- URL: http://arxiv.org/abs/2111.09888v1
- Date: Thu, 18 Nov 2021 18:59:59 GMT
- Title: Simple but Effective: CLIP Embeddings for Embodied AI
- Authors: Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, Aniruddha Kembhavi
- Abstract summary: Contrastive language image pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks.
We build incredibly simple baselines, named EmbCLIP, with no task specific architectures.
We find that our improved baselines perform very well across a range of tasks and simulators.
- Score: 38.02562593292301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive language image pretraining (CLIP) encoders have been shown to be
beneficial for a range of visual tasks from classification and detection to
captioning and image manipulation. We investigate the effectiveness of CLIP
visual backbones for embodied AI tasks. We build incredibly simple baselines,
named EmbCLIP, with no task specific architectures, inductive biases (such as
the use of semantic maps), auxiliary tasks during training, or depth maps --
yet we find that our improved baselines perform very well across a range of
tasks and simulators. EmbCLIP tops the RoboTHOR ObjectNav leaderboard by a huge
margin of 20 pts (Success Rate). It tops the iTHOR 1-Phase Rearrangement
leaderboard, beating the next best submission, which employs Active Neural
Mapping, and more than doubling the % Fixed Strict metric (0.08 to 0.17). It
also beats the winners of the 2021 Habitat ObjectNav Challenge, which employ
auxiliary tasks, depth maps, and human demonstrations, and those of the 2019
Habitat PointNav Challenge. We evaluate the ability of CLIP's visual
representations at capturing semantic information about input observations --
primitives that are useful for navigation-heavy embodied tasks -- and find that
CLIP's representations encode these primitives more effectively than
ImageNet-pretrained backbones. Finally, we extend one of our baselines,
producing an agent capable of zero-shot object navigation that can navigate to
objects that were not used as targets during training.
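To make the setup concrete, here is a minimal sketch of the kind of agent the abstract describes: a frozen CLIP visual backbone feeding a small recurrent actor-critic head. It assumes PyTorch and the open-source OpenAI "clip" package; the pooled image embedding, hidden size, and GRU head are illustrative simplifications, not the exact EmbCLIP architecture.

```python
# Minimal sketch (not the authors' implementation): a frozen CLIP visual encoder
# feeding a recurrent actor-critic head. Assumes PyTorch and the OpenAI "clip"
# package (pip install git+https://github.com/openai/CLIP.git); the pooled
# embedding, hidden size, and GRU head are illustrative choices.
import torch
import torch.nn as nn
import clip


class CLIPNavPolicy(nn.Module):
    def __init__(self, num_actions, hidden_size=512, device="cpu"):
        super().__init__()
        # Load CLIP's ResNet-50 visual backbone and freeze it.
        self.clip_model, self.preprocess = clip.load("RN50", device=device)
        for p in self.clip_model.parameters():
            p.requires_grad = False

        embed_dim = self.clip_model.visual.output_dim  # 1024 for RN50
        self.rnn = nn.GRU(embed_dim, hidden_size, batch_first=True)
        self.actor = nn.Linear(hidden_size, num_actions)   # action logits
        self.critic = nn.Linear(hidden_size, 1)            # state-value estimate

    @torch.no_grad()
    def encode(self, frames):
        # frames: (batch, 3, 224, 224), already run through self.preprocess
        return self.clip_model.encode_image(frames).float()

    def forward(self, frames, hidden=None):
        feats = self.encode(frames).unsqueeze(1)   # (batch, 1, embed_dim)
        out, hidden = self.rnn(feats, hidden)      # carry recurrent state across steps
        logits = self.actor(out[:, -1])
        value = self.critic(out[:, -1])
        return logits, value, hidden
```

In this reading, the comparison made in the abstract roughly amounts to swapping the frozen CLIP backbone for an ImageNet-pretrained one while keeping the downstream policy unchanged.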
Related papers
- CLIP with Quality Captions: A Strong Pretraining for Vision Tasks [16.208506912410147]
We show that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods.
We find that mobile architectures also benefit significantly from CLIP pretraining.
arXiv Detail & Related papers (2024-05-14T19:06:24Z)
- Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation.
Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception.
The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigation-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego²-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
- OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav [62.32806118504701]
We present a single neural network architecture that achieves state-of-the-art results on both the ImageNav and ObjectNav tasks.
Such general-purpose methods offer advantages of simplicity in design, positive scaling with available compute, and versatile applicability to multiple tasks.
arXiv Detail & Related papers (2023-03-14T11:15:37Z)
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- How Much Can CLIP Benefit Vision-and-Language Tasks? [121.46042421728016]
We show that CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, provides strong zero-shot capability on various vision tasks (a minimal scoring sketch is given after this list).
We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
arXiv Detail & Related papers (2021-07-13T20:48:12Z)
- MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation [4.127128889779478]
This work focuses on performing better than, or comparably to, existing learning-based solutions for visual navigation by autonomous agents.
We propose a method to encode vital scene semantics into a semantically informed, top-down egocentric map representation.
We conduct experiments on 3-D reconstructed indoor PointGoal visual navigation and demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-03-21T12:01:23Z)
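As referenced in the CLIP vision-and-language entry above, and as a rough illustration of the zero-shot goal scoring that the abstract's zero-shot object-navigation extension builds on, the sketch below scores an egocentric frame against text prompts for candidate target objects. It assumes the OpenAI "clip" package and PIL; the image path and object names are placeholders, and this is not the paper's actual zero-shot agent.

```python
# Minimal sketch of CLIP's zero-shot image-text scoring. Assumes the OpenAI
# "clip" package and PIL; the frame path and candidate object names are
# illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

# Egocentric frame from the simulator (path is a placeholder).
image = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)
prompts = [f"a photo of a {obj}" for obj in ["television", "apple", "bed", "toilet"]]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarities -> probabilities over candidate goal objects.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(prompts, probs.squeeze(0).tolist())))
```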