3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting
- URL: http://arxiv.org/abs/2602.12159v1
- Date: Thu, 12 Feb 2026 16:41:26 GMT
- Title: 3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting
- Authors: Wancai Zheng, Hao Chen, Xianlong Lu, Linlin Ou, Xinyi Yu
- Abstract summary: 3DGSNav is a novel framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for vision-language models (VLMs) to enhance spatial reasoning. 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views. During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification.
- Score: 12.057873540714098
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object navigation is a core capability of embodied intelligence, enabling an agent to locate target objects in unknown environments. Recent advances in vision-language models (VLMs) have facilitated zero-shot object navigation (ZSON). However, existing methods often rely on scene abstractions that convert environments into semantic maps or textual representations, causing high-level decision making to be constrained by the accuracy of low-level perception. In this work, we present 3DGSNav, a novel ZSON framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for VLMs to enhance spatial reasoning. Through active perception, 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views. Moreover, we design structured visual prompts and integrate them with Chain-of-Thought (CoT) prompting to further improve VLM reasoning. During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification, ensuring efficient and reliable recognition. Extensive evaluations across multiple benchmarks and real-world experiments on a quadruped robot demonstrate that our method achieves robust and competitive performance against state-of-the-art approaches. Project page: https://aczheng-cai.github.io/3dgsnav.github.io/
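The abstract outlines a complete decision loop: grow the 3DGS map, render frontier-aware views, let the VLM pick a frontier, and re-verify detections from an actively chosen viewpoint. A minimal sketch of that loop follows; every name in it (GaussianMap, detect_objects, vlm_confirm, and so on) is a hypothetical stand-in for the authors' components, not their actual API.

```python
# Hypothetical sketch of the 3DGSNav-style decision loop described above.
# All helper names are stand-ins, not the authors' implementation.

def navigate(agent, target_label, max_steps=500):
    gaussians = GaussianMap()                  # incremental 3DGS memory
    for _ in range(max_steps):
        rgb, pose = agent.observe()
        gaussians.integrate(rgb, pose)         # active perception: grow the map

        # A fast detector proposes candidate targets in the current view.
        detections = detect_objects(rgb, target_label)
        if detections:
            # The VLM re-verifies the candidate from a better viewpoint,
            # rendered from the 3DGS map rather than physically revisited.
            view = gaussians.render(best_verification_pose(detections, pose))
            if vlm_confirm(view, target_label):
                return agent.approach(detections[0])

        # Otherwise, render frontier-aware first-person views and let the VLM
        # (with structured visual prompts + CoT) choose where to explore next.
        frontiers = gaussians.frontiers()
        views = [gaussians.render(pose_toward(f, pose)) for f in frontiers]
        goal = frontiers[vlm_select_frontier(views, target_label)]
        agent.move_to(goal)
    return None
```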
Related papers
- OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [54.661157616245966]
Open-world navigation requires robots to make decisions in complex everyday environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language. We propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models.
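Several entries here, including this one and 3DGSNav itself, reason over frontiers: the boundary between observed free space and unexplored space. A minimal occupancy-grid extraction, generic rather than taken from any of these papers:

```python
import numpy as np

FREE, UNKNOWN, OCCUPIED = 0, -1, 1

def frontier_cells(grid: np.ndarray) -> np.ndarray:
    """Return (row, col) indices of free cells bordering unknown space."""
    free = grid == FREE
    unknown = grid == UNKNOWN
    # A free cell is a frontier if any 4-neighbor is unknown.
    neighbor_unknown = np.zeros_like(unknown)
    neighbor_unknown[1:, :] |= unknown[:-1, :]
    neighbor_unknown[:-1, :] |= unknown[1:, :]
    neighbor_unknown[:, 1:] |= unknown[:, :-1]
    neighbor_unknown[:, :-1] |= unknown[:, 1:]
    return np.argwhere(free & neighbor_unknown)
```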
arXiv Detail & Related papers (2026-03-05T17:02:22Z) - RANGER: A Monocular Zero-Shot Semantic Navigation Framework through Contextual Adaptation [9.379574254353352]
RANGER is a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. By simply observing a short video of a new environment, the system can significantly improve task efficiency without requiring architectural modifications or fine-tuning. Experiments on the HM3D benchmark and real-world environments demonstrate that RANGER achieves competitive performance in terms of navigation success rate and exploration efficiency.
arXiv Detail & Related papers (2025-12-30T13:25:22Z) - Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation [54.04601077224252]
Embodied scene understanding requires not only comprehending visual-spatial information but also determining where to explore next in the 3D physical world. 3D vision-language learning enables embodied agents to effectively explore and understand their environment. The model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images.
arXiv Detail & Related papers (2025-07-05T14:15:52Z) - History-Augmented Vision-Language Models for Frontier-Based Zero-Shot Object Navigation [5.343932820859596]
This paper introduces a novel zero-shot ObjectNav framework that pioneers the use of dynamic, history-aware prompting. Our core innovation lies in providing the VLM with action history context, enabling it to generate semantic guidance scores for navigation actions. We also introduce a VLM-assisted waypoint generation mechanism for refining the final approach to detected objects.
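One way to picture the history-aware prompting this abstract describes is assembling recent actions and candidate actions into a single scoring query. The prompt wording and score format below are assumptions for illustration, not the paper's implementation:

```python
# Illustrative only: building a history-aware prompt that asks a VLM
# for semantic guidance scores over candidate navigation actions.

def build_prompt(target: str, history: list[str], actions: list[str]) -> str:
    recent = "; ".join(history[-5:]) or "none"
    choices = "\n".join(f"{i}. {a}" for i, a in enumerate(actions))
    return (
        f"You are navigating to find a {target}.\n"
        f"Recent actions: {recent}.\n"
        f"Candidate actions:\n{choices}\n"
        "For each candidate, output a guidance score from 0 to 10 "
        "reflecting how likely it is to lead toward the target, "
        "as lines of the form '<index>: <score>'."
    )

prompt = build_prompt("sofa", ["move_forward", "turn_left"],
                      ["move_forward", "turn_left", "turn_right"])
```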
arXiv Detail & Related papers (2025-06-19T21:50:16Z) - SemNav: A Model-Based Planner for Zero-Shot Object Goal Navigation Using Vision-Foundation Models [10.671262416557704]
Vision Foundation Models (VFMs) offer powerful capabilities for visual understanding and reasoning. We present a zero-shot object goal navigation framework that integrates the perceptual strength of VFMs with a model-based planner. We evaluate our approach on the HM3D dataset using the Habitat simulator and demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-06-04T03:04:54Z) - HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation [39.54854283833085]
We present the Habitat-Matterport 3D Open Vocabulary Object Goal Navigation dataset (HM3D-OVON).
HM3D-OVON incorporates over 15k annotated instances of household objects across 379 distinct categories.
We find that HM3D-OVON can be used to train an open-vocabulary ObjectNav agent that achieves higher performance and is more robust to localization and actuation noise than the state-of-the-art ObjectNav approach.
arXiv Detail & Related papers (2024-09-22T02:12:29Z) - GaussNav: Gaussian Splatting for Visual Navigation [92.13664084464514]
Instance ImageGoal Navigation (IIN) requires an agent to locate a specific object depicted in a goal image within an unexplored environment. We propose a new framework for IIN, Gaussian Splatting for Visual Navigation (GaussNav), which constructs a novel map representation based on 3D Gaussian Splatting (3DGS). Our GaussNav framework demonstrates a significant performance improvement, with Success weighted by Path Length (SPL) increasing from 0.347 to 0.578 on the challenging Habitat-Matterport 3D (HM3D) dataset.
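For reference, Success weighted by Path Length (SPL), the metric cited here and in several entries on this page, is the standard embodied-navigation metric from Anderson et al. (2018); it discounts each success by how far the agent's path deviated from the shortest path:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length (Anderson et al., 2018).

    successes[i]        -- 1 if episode i reached the goal, else 0
    shortest_lengths[i] -- geodesic shortest-path length to the goal
    path_lengths[i]     -- length of the path the agent actually took
    """
    n = len(successes)
    return sum(
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, path_lengths)
    ) / n
```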
arXiv Detail & Related papers (2024-03-18T09:56:48Z) - OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection [29.530177591608297]
Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost.
Most of the current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm.
We propose an Object-Centric query-BEV detector OCBEV, which can carve the temporal and spatial cues of moving targets more effectively.
arXiv Detail & Related papers (2023-06-02T17:59:48Z) - Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation [58.3480730643517]
We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON).
Our approach makes use of Large Language Models (LLMs) for this task.
We achieve state-of-the-art zero-shot object navigation results on RoboTHOR with a success rate (SR) improvement of over 27% over the current baseline.
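LGX's core idea, steering exploration with an LLM's commonsense priors, can be illustrated with a toy decision step. The prompt and the query_llm helper are hypothetical stand-ins, not the authors' code:

```python
# Toy illustration of LLM-guided direction selection, in the spirit of LGX.
# query_llm is a hypothetical stand-in for any chat-completion API call.

def choose_direction(target: str, observed: dict[str, list[str]]) -> str:
    """observed maps each candidate direction to object labels seen there."""
    options = "\n".join(
        f"- {d}: {', '.join(objs)}" for d, objs in observed.items()
    )
    answer = query_llm(
        f"I am looking for a {target}. I can move in these directions, "
        f"where I currently see these objects:\n{options}\n"
        "Answer with the single most promising direction."
    )
    return answer.strip()
```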
arXiv Detail & Related papers (2023-03-06T20:19:19Z) - AGO-Net: Association-Guided 3D Point Cloud Object Detection Network [86.10213302724085]
We propose a novel 3D detection framework that associates intact features for objects via domain adaptation.
We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed.
arXiv Detail & Related papers (2022-08-24T16:54:38Z) - Object Manipulation via Visual Target Localization [64.05939029132394]
Training agents to manipulate objects poses many challenges.
We propose an approach that explores the environment in search for target objects, computes their 3D coordinates once they are located, and then continues to estimate their 3D locations even when the objects are not visible.
Our evaluations show a 3x improvement in success rate over a model that has access to the same sensory suite.
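Computing an object's 3D coordinates once it is located, as this abstract describes, typically amounts to back-projecting the detection's pixel through a depth map with the pinhole camera model. A generic sketch, not the paper's code:

```python
import numpy as np

def pixel_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) into camera coordinates.

    depth is a depth map indexed [row, col] in meters;
    fx, fy, cx, cy are the pinhole intrinsics.
    """
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# World coordinates follow by applying the camera-to-world pose:
# p_world = R @ p_cam + t
```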
arXiv Detail & Related papers (2022-03-15T17:59:01Z) - Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines in both SR and SPL metrics.
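A generic sketch of the kind of fusion this abstract describes, concatenating visual features with a 3D spatial-relation embedding before the policy head; the architecture details are assumptions, not the paper's:

```python
import torch
import torch.nn as nn

class FusionPolicy(nn.Module):
    """Toy policy fusing visual features with a spatial-relation embedding."""
    def __init__(self, visual_dim=512, spatial_dim=128, n_actions=6):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(visual_dim + spatial_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, visual_feat, spatial_feat):
        # Concatenate the two representations and map to action logits.
        return self.fuse(torch.cat([visual_feat, spatial_feat], dim=-1))

policy = FusionPolicy()
logits = policy(torch.randn(1, 512), torch.randn(1, 128))  # action logits
```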
arXiv Detail & Related papers (2020-04-29T08:46:38Z)