Related papers: DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes

DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes

URL: http://arxiv.org/abs/2410.02730v3
Date: Mon, 01 Sep 2025 03:33:43 GMT
Title: DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes
Authors: Zhaowei Wang, Hongming Zhang, Tianqing Fang, Ye Tian, Yue Yang, Kaixin Ma, Xiaoman Pan, Yangqiu Song, Dong Yu,
Abstract summary: We first study the challenge of open-vocabulary object navigation by introducing DivScene.<n>Our dataset provides a much greater diversity of target objects and scene types than existing datasets.<n>We fine-tuned LVLMs to predict the next action with CoT explanations.
Score: 76.24687327731031
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Large Vision-Language Models (LVLMs) have achieved significant progress in tasks like visual question answering and document understanding. However, their potential to comprehend embodied environments and navigate within them remains underexplored. In this work, we first study the challenge of open-vocabulary object navigation by introducing DivScene, a large-scale dataset with 4,614 houses across 81 scene types and 5,707 kinds of target objects. Our dataset provides a much greater diversity of target objects and scene types than existing datasets, enabling a comprehensive task evaluation. We evaluated various methods with LVLMs and LLMs on our dataset and found that current models still fall short of open-vocab object navigation ability. Then, we fine-tuned LVLMs to predict the next action with CoT explanations. We observe that LVLM's navigation ability can be improved substantially with only BFS-generated shortest paths without any human supervision, surpassing GPT-4o by over 20% in success rates.

Related papers

FOM-Nav: Frontier-Object Maps for Object Goal Navigation [65.76906445210112]
FOM-Nav is a framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models.<n>To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments.<n> FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly in navigation efficiency metric SPL.
arXiv Detail & Related papers (2025-11-30T18:16:09Z)
History-Augmented Vision-Language Models for Frontier-Based Zero-Shot Object Navigation [5.343932820859596]
This paper introduces a novel zero-shot ObjectNav framework that pioneers the use of dynamic, history-aware prompting.<n>Our core innovation lies in providing the VLM with action history context, enabling it to generate semantic guidance scores for navigation actions.<n>We also introduce a VLM-assisted waypoint generation mechanism for refining the final approach to detected objects.
arXiv Detail & Related papers (2025-06-19T21:50:16Z)
TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation [52.422619828854984]
We introduce TopV-Nav, an MLLM-based method that directly reasons on the top-view map with sufficient spatial information.<n>To fully unlock the MLLM's spatial reasoning potential in top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method.
arXiv Detail & Related papers (2024-11-25T14:27:55Z)
ROOT: VLM based System for Indoor Scene Understanding and Beyond [83.71252153660078]
ROOT is a VLM-based system designed to enhance the analysis of indoor scenes. rootname facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI.
arXiv Detail & Related papers (2024-11-24T04:51:24Z)
SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation [83.4599149936183]
Existing zero-shot object navigation methods prompt LLM with the text of spatially closed objects. We propose to represent the observed scene with 3D scene graph. We conduct extensive experiments on MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks.
arXiv Detail & Related papers (2024-10-10T17:57:19Z)
HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation [39.54854283833085]
We present the Habitat-Matterport 3D Open Vocabulary Object Goal Navigation dataset (HM3D-OVON) HM3D-OVON incorporates over 15k annotated instances of household objects across 379 distinct categories. We find that HM3D-OVON can be used to train an open-vocabulary ObjectNav agent that achieves both higher performance and is more robust to localization and actuation noise than the state-of-the-art ObjectNav approach.
arXiv Detail & Related papers (2024-09-22T02:12:29Z)
FLAME: Learning to Navigate with Multimodal LLM in Urban Environments [12.428873051106702]
Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks.<n>LLMs struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models.<n>We introduce FLAME, a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks.
arXiv Detail & Related papers (2024-08-20T17:57:46Z)
Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models make the most effort to the global image information. In this study, we find that the intermediate layers of models can encode more global semantic information. We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z)
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models [30.20915403608803]
Griffon is a language-prompted localization dataset for large vision language models. It is trained end-to-end through a well-designed pipeline. It achieves state-of-the-art performance on the fine-grained RefCOCO series and Flickr30K Entities.
arXiv Detail & Related papers (2023-11-24T15:35:07Z)
SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments [14.179677726976056]
SayNav is a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks. SayNav achieves state-of-the-art results and even outperforms an oracle based baseline with strong ground-truth assumptions by more than 8% in terms of success rate.
arXiv Detail & Related papers (2023-09-08T02:24:37Z)
VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation(VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning. This task unifies spatial and temporal localization in video. We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation [58.3480730643517]
We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON) Our approach makes use of Large Language Models (LLMs) for this task. We achieve state-of-the-art zero-shot object navigation results on RoboTHOR with a success rate (SR) improvement of over 27% over the current baseline.
arXiv Detail & Related papers (2023-03-06T20:19:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.