Selective Visual Representations Improve Convergence and Generalization
for Embodied AI
- URL: http://arxiv.org/abs/2311.04193v2
- Date: Sun, 10 Mar 2024 01:55:47 GMT
- Title: Selective Visual Representations Improve Convergence and Generalization
for Embodied AI
- Authors: Ainaz Eftekhar, Kuo-Hao Zeng, Jiafei Duan, Ali Farhadi, Ani Kembhavi,
Ranjay Krishna
- Abstract summary: Embodied AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations.
This introduces noise within the learning process and distracts the agent's focus from task-relevant visual cues.
Inspired by selective attention in humans, the process through which people filter their perception based on their experiences, knowledge, and the task at hand, we introduce a parameter-efficient approach to filter visual stimuli for embodied AI.
- Score: 44.33711781750707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embodied AI models often employ off-the-shelf vision backbones like CLIP to
encode their visual observations. Although such general purpose representations
encode rich syntactic and semantic information about the scene, much of this
information is often irrelevant to the specific task at hand. This introduces
noise within the learning process and distracts the agent's focus from
task-relevant visual cues. Inspired by selective attention in humans, the
process through which people filter their perception based on their
experiences, knowledge, and the task at hand, we introduce a parameter-efficient
approach to filter visual stimuli for embodied AI. Our approach induces a
task-conditioned bottleneck using a small learnable codebook module. This
codebook is trained jointly to optimize task reward and acts as a
task-conditioned selective filter over the visual observation. Our experiments
showcase state-of-the-art performance for object goal navigation and object
displacement across 5 benchmarks: ProcTHOR, ArchitecTHOR, RoboTHOR, AI2-iTHOR,
and ManipulaTHOR. The filtered representations produced by the codebook are
also able to generalize better and converge faster when adapted to other
simulation environments such as Habitat. Our qualitative analyses show that
agents explore their environments more effectively and their representations
retain task-relevant information like target object recognition while ignoring
superfluous information about other objects. Code and pretrained models are
available at our project website: https://embodied-codebook.github.io.
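The core mechanism described above, a small learnable codebook that acts as a task-conditioned bottleneck over frozen backbone features, can be sketched in a few lines. The sketch below is illustrative only: the module name, layer sizes, and the soft attention over codes are assumptions, and the paper's exact architecture and joint training with task reward may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskConditionedCodebook(nn.Module):
    """Illustrative sketch of a task-conditioned codebook bottleneck.

    A frozen vision backbone (e.g. CLIP) yields a visual feature; a small set
    of learnable codes is mixed via attention weights computed from both the
    visual feature and a task/goal embedding, producing a compact,
    task-filtered representation for the downstream policy.
    """

    def __init__(self, feat_dim=768, task_dim=32, num_codes=256, code_dim=10):
        super().__init__()
        # Learnable codebook: num_codes entries of dimension code_dim (sizes assumed).
        self.codes = nn.Parameter(torch.randn(num_codes, code_dim))
        # Maps (visual feature, task embedding) to attention logits over the codes.
        self.selector = nn.Sequential(
            nn.Linear(feat_dim + task_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_codes),
        )
        # Projects the bottlenecked code mixture back to the policy's input size.
        self.out_proj = nn.Linear(code_dim, feat_dim)

    def forward(self, visual_feat, task_emb):
        # visual_feat: (B, feat_dim) from a frozen backbone; task_emb: (B, task_dim).
        logits = self.selector(torch.cat([visual_feat, task_emb], dim=-1))
        weights = F.softmax(logits, dim=-1)      # soft selection over codebook entries
        filtered = weights @ self.codes          # (B, code_dim) bottleneck
        return self.out_proj(filtered)           # task-filtered visual representation
```

In an RL setup, a module like this would sit between the frozen visual encoder and the policy head and be updated end to end from the task reward, so only the codebook and the small selector are trained while the backbone stays fixed.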
Related papers
- Augmented Commonsense Knowledge for Remote Object Grounding [67.30864498454805]
We propose an augmented commonsense knowledge model (ACK) to leverage commonsense information as a temporal knowledge graph for improving agent navigation.
ACK consists of knowledge graph-aware cross-modal and concept aggregation modules to enhance visual representation and visual-textual data alignment.
We add a new pipeline for the commonsense-based decision-making process, which leads to more accurate local action prediction.
arXiv Detail & Related papers (2024-06-03T12:12:33Z)
- Learning Object-Centric Representation via Reverse Hierarchy Guidance [73.05170419085796]
Object-Centric Learning (OCL) seeks to enable Neural Networks to identify individual objects in visual scenes.
RHGNet introduces a top-down pathway that works in different ways in the training and inference processes.
Our model achieves SOTA performance on several commonly used datasets.
arXiv Detail & Related papers (2024-05-17T07:48:27Z)
- Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
- TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z)
- Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation.
Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception.
The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- What do navigation agents learn about their environment? [39.74076893981299]
We introduce the Interpretability System for Embodied agEnts (iSEE) for Point Goal and Object Goal navigation agents.
We use iSEE to probe the dynamic representations produced by these agents for the presence of information about the agent as well as the environment.
arXiv Detail & Related papers (2022-06-17T01:33:43Z)
- Embodied Visual Active Learning for Semantic Segmentation [33.02424587900808]
We study the task of embodied visual active learning, where an agent is set to explore a 3D environment with the goal of acquiring visual scene understanding.
We develop a battery of agents, both learnt and pre-specified, with different levels of knowledge of the environment.
We extensively evaluate the proposed models using the Matterport3D simulator and show that a fully learnt method outperforms comparable pre-specified counterparts.
arXiv Detail & Related papers (2020-12-17T11:02:34Z)