Find What You Want: Learning Demand-conditioned Object Attribute Space
for Demand-driven Navigation
- URL: http://arxiv.org/abs/2309.08138v3
- Date: Mon, 6 Nov 2023 11:02:58 GMT
- Title: Find What You Want: Learning Demand-conditioned Object Attribute Space
for Demand-driven Navigation
- Authors: Hongcheng Wang, Andy Guan Hong Chen, Xiaoqi Li, Mingdong Wu, Hao Dong
- Abstract summary: The task of Visual Object Navigation (VON) involves an agent's ability to locate a particular object within a given scene.
In real-world scenarios, it is often challenging to ensure that these conditions are always met.
We propose Demand-driven Navigation (DDN), which leverages the user's demand as the task instruction.
- Score: 5.106884746419666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of Visual Object Navigation (VON) involves an agent's ability to
locate a particular object within a given scene. In order to successfully
accomplish the VON task, two essential conditions must be fulfilled: 1) the user
must know the name of the desired object; and 2) the user-specified object must
actually be present within the scene. To meet these conditions, a simulator can
incorporate pre-defined object names and positions into the metadata of the
scene. However, in real-world scenarios, it is often challenging to ensure that
these conditions are always met. Humans in an unfamiliar environment may not
know which objects are present in the scene, or they may mistakenly specify an
object that is not actually present. Despite these challenges, they may still
have a demand for an object, and that demand could potentially be fulfilled in
an equivalent manner by other objects present in the scene.
Hence, we propose Demand-driven Navigation (DDN), which leverages the user's
demand as the task instruction and prompts the agent to find an object that
matches the specified demand. DDN aims to relax the stringent conditions of VON by
focusing on fulfilling the user's demand rather than relying solely on
predefined object categories or names. We propose a method that first acquires
textual attribute features of objects by extracting common knowledge from a
large language model. These textual attribute features are subsequently aligned
with visual attribute features using Contrastive Language-Image Pre-training
(CLIP). By incorporating the visual attribute features as prior knowledge, we
enhance the navigation process. Experiments on AI2Thor with the ProcThor
dataset demonstrate that the visual attribute features improve the agent's
navigation performance and that our method outperforms the baseline methods
commonly used in VON.
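To make the attribute-alignment idea concrete, below is a minimal sketch, not the authors' implementation: it reuses an off-the-shelf CLIP checkpoint from Hugging Face Transformers as a stand-in for the learned demand-conditioned attribute space, and the demand, the attribute phrases, and the image file name are illustrative assumptions.

```python
# Minimal sketch (assumptions noted in comments; not the paper's code):
# encode LLM-derived attribute phrases with CLIP's text encoder, encode an
# egocentric RGB observation with CLIP's image encoder, and use the cosine
# similarities as a demand-conditioned prior for a navigation policy.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical attribute phrases a large language model might return for the
# demand "I am thirsty"; DDN distills such common knowledge offline.
attribute_phrases = [
    "an object that can hold a drinkable liquid",
    "an object typically found in a kitchen or a fridge",
    "an object a person can pick up with one hand",
]

# Hypothetical file name for the agent's current egocentric observation.
observation = Image.open("egocentric_view.png")

with torch.no_grad():
    text_inputs = processor(text=attribute_phrases, return_tensors="pt",
                            padding=True).to(device)
    text_feat = model.get_text_features(**text_inputs)      # (A, D)

    image_inputs = processor(images=observation, return_tensors="pt").to(device)
    img_feat = model.get_image_features(**image_inputs)     # (1, D)

    # Cosine similarity between the observation and each attribute phrase.
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    attribute_prior = img_feat @ text_feat.T                 # (1, A)

# The prior could then be concatenated to the policy's observation features.
print(attribute_prior.squeeze(0).tolist())
```

In the paper the textual attributes are extracted from a large language model and the text-visual alignment is trained with CLIP; the pretrained joint embedding space above merely illustrates how such a per-observation attribute prior could be computed and handed to the navigation policy.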
Related papers
- Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments [44.6372390798904]
We propose a new task denominated Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object.
In each episode, the target object is presented to the agent using two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions.
arXiv Detail & Related papers (2024-10-23T18:01:09Z)
- TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments [170.43912741137655]
We construct a comprehensive dataset called Reasoning Intention-Oriented Objects (RIO)
RIO is specifically designed to incorporate diverse real-world scenarios and a wide range of object categories.
We evaluate the ability of several existing models to reason about intention-oriented objects in open environments.
arXiv Detail & Related papers (2023-10-26T10:15:21Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection [42.2847114428716]
Task driven object detection aims to detect object instances suitable for affording a task in an image.
The challenge is that the object categories suitable for a given task are too diverse to be covered by the closed vocabulary of traditional object detection.
We propose to explore fundamental affordances rather than object categories, i.e., common attributes that enable different objects to accomplish the same task.
arXiv Detail & Related papers (2023-09-03T06:18:39Z)
- KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803]
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location by following natural language instructions in real scenes.
Most of the previous approaches utilize the entire features or object-centric features to represent navigable candidates.
We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
arXiv Detail & Related papers (2023-03-28T08:00:46Z)
- Instance-Specific Image Goal Navigation: Training Embodied Agents to Find Object Instances [90.61897965658183]
We consider the problem of embodied visual navigation given an image-goal (ImageNav).
Unlike related navigation tasks, ImageNav does not have a standardized task definition which makes comparison across methods difficult.
We present the Instance-specific ImageNav task (InstanceImageNav) to address these limitations.
arXiv Detail & Related papers (2022-11-29T02:29:35Z)
- SceneGen: Generative Contextual Scene Augmentation using Scene Graph Priors [3.1969855247377827]
We introduce SceneGen, a generative contextual augmentation framework that predicts virtual object positions and orientations within existing scenes.
SceneGen takes a semantically segmented scene as input, and outputs positional and orientational probability maps for placing virtual content.
We formulate a novel spatial Scene Graph representation, which encapsulates explicit topological properties between objects, object groups, and the room.
To demonstrate our system in action, we develop an Augmented Reality application, in which objects can be contextually augmented in real-time.
arXiv Detail & Related papers (2020-09-25T18:36:27Z)
- ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects [119.46959413000594]
This document summarizes the consensus recommendations of a working group on ObjectNav.
We make recommendations on subtle but important details of evaluation criteria.
We provide a detailed description of the instantiation of these recommendations in challenges organized at the Embodied AI workshop at CVPR 2020.
arXiv Detail & Related papers (2020-06-23T17:18:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.