SD-OVON: A Semantics-aware Dataset and Benchmark Generation Pipeline for Open-Vocabulary Object Navigation in Dynamic Scenes
- URL: http://arxiv.org/abs/2505.18881v1
- Date: Sat, 24 May 2025 21:37:06 GMT
- Title: SD-OVON: A Semantics-aware Dataset and Benchmark Generation Pipeline for Open-Vocabulary Object Navigation in Dynamic Scenes
- Authors: Dicong Qiu, Jiadi You, Zeying Gong, Ronghe Qiu, Hui Xiong, Junwei Liang
- Abstract summary: We present the Semantics-aware Dataset and Benchmark Generation Pipeline for Open-vocabulary Object Navigation in Dynamic Scenes (SD-OVON). It utilizes pretrained multimodal foundation models to generate infinite unique photo-realistic scene variants that adhere to real-world semantics and daily commonsense for the training and evaluation of navigation agents. We offer two pre-generated object navigation task datasets, SD-OVON-3k and SD-OVON-10k, comprising respectively about 3k and 10k episodes of the open-vocabulary object navigation task.
- Score: 15.178229677519063
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present the Semantics-aware Dataset and Benchmark Generation Pipeline for Open-vocabulary Object Navigation in Dynamic Scenes (SD-OVON). It utilizes pretrained multimodal foundation models to generate infinite unique photo-realistic scene variants that adhere to real-world semantics and daily commonsense for the training and evaluation of navigation agents, accompanied by a plugin for generating object navigation task episodes compatible with the Habitat simulator. In addition, we offer two pre-generated object navigation task datasets, SD-OVON-3k and SD-OVON-10k, comprising respectively about 3k and 10k episodes of the open-vocabulary object navigation task, derived from the SD-OVON-Scenes dataset with 2.5k photo-realistic scans of real-world environments and the SD-OVON-Objects dataset with 0.9k manually inspected scanned and artist-created manipulatable object models. Unlike prior datasets limited to static environments, SD-OVON covers dynamic scenes and manipulatable objects, facilitating both real-to-sim and sim-to-real robotic applications. This approach enhances the realism of navigation tasks and improves the training and evaluation of open-vocabulary object navigation agents in complex settings. To demonstrate the effectiveness of our pipeline and datasets, we propose two baselines and evaluate them along with state-of-the-art baselines on SD-OVON-3k. The datasets, benchmark and source code are publicly available.
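The abstract describes a plugin that emits object navigation episodes consumable by the Habitat simulator. As a rough illustration of what a single episode record in a dataset such as SD-OVON-3k might contain, here is a minimal Python sketch; the field names (`scene_id`, `object_category`, `start_position`, and so on) and the file path are modeled loosely on the general shape of Habitat-style ObjectNav episode files and are illustrative assumptions, not the actual SD-OVON schema.

```python
# Hypothetical sketch of an open-vocabulary ObjectNav episode record.
# Field names follow the general shape of Habitat-style ObjectNav episodes,
# but are assumptions; they are NOT the actual SD-OVON schema.
import gzip
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectNavEpisode:
    episode_id: str            # unique id within the dataset
    scene_id: str              # path to the (possibly re-arranged) scene variant
    object_category: str       # open-vocabulary goal, e.g. "ceramic flower vase"
    start_position: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])
    start_rotation: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0, 1.0])
    goal_positions: List[List[float]] = field(default_factory=list)  # viewpoints near goal instances

def load_episodes(path: str) -> List[ObjectNavEpisode]:
    """Load a gzipped-JSON episode file of the kind Habitat-style datasets commonly use."""
    with gzip.open(path, "rt") as f:
        raw = json.load(f)
    return [ObjectNavEpisode(**e) for e in raw["episodes"]]

if __name__ == "__main__":
    episodes = load_episodes("sd_ovon_3k/val/val.json.gz")  # illustrative path, not the released layout
    print(f"{len(episodes)} episodes, first goal: {episodes[0].object_category}")
```

This only illustrates episode-level bookkeeping; in practice, a pipeline like the one described would also have to register the generated scene-variant assets and manipulatable object models with the simulator.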
Related papers
- Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation [54.04601077224252]
Embodied scene understanding requires not only comprehending visual-spatial information but also determining where to explore next in the 3D physical world. 3D vision-language learning enables embodied agents to effectively explore and understand their environment. The model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images.
arXiv Detail & Related papers (2025-07-05T14:15:52Z) - SEMNAV: A Semantic Segmentation-Driven Approach to Visual Semantic Navigation [1.2582887633807602]
Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment. Most state-of-the-art VSN models are trained in simulation environments, where, at best, rendered scenes of the real world are used. We propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment.
arXiv Detail & Related papers (2025-06-02T08:19:41Z) - HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation [39.54854283833085]
We present the Habitat-Matterport 3D Open Vocabulary Object Goal Navigation dataset (HM3D-OVON)
HM3D-OVON incorporates over 15k annotated instances of household objects across 379 distinct categories.
We find that HM3D-OVON can be used to train an open-vocabulary ObjectNav agent that achieves both higher performance and is more robust to localization and actuation noise than the state-of-the-art ObjectNav approach.
arXiv Detail & Related papers (2024-09-22T02:12:29Z) - CityNav: A Large-Scale Dataset for Real-World Aerial Navigation [25.51740922661166]
We introduce CityNav, the first large-scale real-world dataset for aerial VLN. Our dataset consists of 32,637 human demonstration trajectories, each paired with a natural language description. We provide a methodology for creating geographic semantic maps that can be used as an auxiliary modality input during navigation.
arXiv Detail & Related papers (2024-06-20T12:08:27Z) - Object Goal Navigation with Recursive Implicit Maps [92.6347010295396]
We propose an implicit spatial map for object goal navigation.
Our method significantly outperforms the state of the art on the challenging MP3D dataset.
We deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes.
arXiv Detail & Related papers (2023-08-10T14:21:33Z) - Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego²-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z) - Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation [70.82403156865057]
We investigate the impact of synthetic 3D scene dataset scale and realism on the task of training embodied agents to find and navigate to objects.
Our experiments show that agents trained on our smaller-scale dataset can match or outperform agents trained on much larger datasets.
arXiv Detail & Related papers (2023-06-20T05:07:23Z) - Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting [64.7364925689825]
Argoverse 2 (AV2) is a collection of three datasets for perception and forecasting research in the self-driving domain.
The Lidar dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose.
The Motion Forecasting dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene.
arXiv Detail & Related papers (2023-01-02T00:36:22Z) - PEANUT: Predicting and Navigating to Unseen Targets [18.87376347895365]
Efficient ObjectGoal navigation (ObjectNav) in novel environments requires an understanding of the spatial and semantic regularities in environment layouts.
We present a method for learning these regularities by predicting the locations of unobserved objects from incomplete semantic maps.
Our prediction model is lightweight and can be trained in a supervised manner using a relatively small amount of passively collected data.
arXiv Detail & Related papers (2022-12-05T18:58:58Z) - Image-based Navigation in Real-World Environments via Multiple Mid-level Representations: Fusion Models, Benchmark and Efficient Evaluation [13.207579081178716]
In recent learning-based navigation approaches, the scene understanding and navigation abilities of the agent are achieved simultaneously.
Unfortunately, even though simulators are an efficient tool for training navigation policies, the resulting models often fail when transferred to the real world.
One possible solution is to provide the navigation model with mid-level visual representations containing important domain-invariant properties of the scene.
arXiv Detail & Related papers (2022-02-02T15:00:44Z) - ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments [85.81157224163876]
We combine Vision-and-Language Navigation, assembly of collected objects, and object referring expression comprehension to create a novel joint navigation-and-assembly task, named ArraMon.
During this task, the agent is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment.
We present results for several baseline models (integrated and biased) and metrics (nDTW, CTC, rPOD, and PTC), and the large model-human performance gap demonstrates that our task is challenging and presents a wide scope for future work.
arXiv Detail & Related papers (2020-11-15T23:30:36Z)