SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding
- URL: http://arxiv.org/abs/2508.20758v1
- Date: Thu, 28 Aug 2025 13:15:37 GMT
- Title: SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding
- Authors: Jiawen Lin, Shiran Bian, Yihang Zhu, Wenbin Tan, Yachao Zhang, Yuan Xie, Yanyun Qu
- Abstract summary: 3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. We propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively.
- Score: 40.60812160987424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods suffer from spatially limited reasoning due to their reliance on single-view localization, as well as contextual omissions and detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details in the conversion from 3D point clouds to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging the VLM's cross-modal reasoning capabilities to identify textually specified objects. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2% and surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, advancing 3DVG toward greater generalization and real-world applicability. The code is available at https://github.com/JiawLin/SeqVLM.
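The abstract describes a staged pipeline: instance proposals, semantic filtering, proposal-guided multi-view projection, and batched (dynamically scheduled) VLM reasoning. The control flow can be sketched as follows; every function and data structure here is an illustrative stub invented for this sketch, not the authors' implementation (the real system uses a 3D segmentation network and a vision-language model in place of these placeholders).

```python
# Hypothetical sketch of the SeqVLM-style pipeline from the abstract.
# All helpers are stubs; the real stages call a 3D segmentation
# network, a camera-projection module, and a VLM respectively.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Proposal:
    label: str          # semantic class from the 3D segmentation stage
    views: List[str]    # image views this proposal projects into


def semantic_filter(proposals: List[Proposal], query: str) -> List[Proposal]:
    # Stage 2: keep only candidates whose class is mentioned in the query.
    return [p for p in proposals if p.label in query.lower()]


def project_to_views(proposal: Proposal) -> List[str]:
    # Stage 3: proposal-guided multi-view projection (stubbed: the views
    # are assumed precomputed on the proposal).
    return proposal.views


def vlm_select(batch: List[Tuple[Proposal, List[str]]], query: str) -> Proposal:
    # Stage 4 stub: a real VLM would reason over the image sequences and
    # the query; here we just prefer the proposal seen from most views.
    return max(batch, key=lambda item: len(item[1]))[0]


def seqvlm_ground(proposals: List[Proposal], query: str,
                  batch_size: int = 2) -> Optional[Proposal]:
    candidates = semantic_filter(proposals, query)
    sequences = [(p, project_to_views(p)) for p in candidates]
    # Dynamic scheduling: iterate over sequence-query prompts in small
    # batches so no single VLM call is overloaded.
    winner: Optional[Proposal] = None
    for i in range(0, len(sequences), batch_size):
        pick = vlm_select(sequences[i:i + batch_size], query)
        if winner is None or len(pick.views) > len(winner.views):
            winner = pick
    return winner


proposals = [
    Proposal("chair", ["v1", "v2"]),
    Proposal("table", ["v3"]),
    Proposal("chair", ["v4", "v5", "v6"]),
]
target = seqvlm_ground(proposals, "the chair next to the window")
print(target.views)  # the surviving chair proposal with the most views
```

The batching loop is the point of interest: rather than prompting the VLM with every candidate at once, candidates are fed in fixed-size slices and the best pick so far is carried forward, which caps per-call input size.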
Related papers
- Z3D: Zero-Shot 3D Visual Grounding from Images [7.756226313216256]
3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries. We introduce Z3D, a universal grounding pipeline that flexibly operates on multi-view images.
arXiv Detail & Related papers (2026-02-03T10:35:18Z) - View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs [19.27758108925572]
3D visual grounding identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms suited to VLM processing. We propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning.
arXiv Detail & Related papers (2025-12-10T00:59:17Z) - ChangingGrounding: 3D Visual Grounding in Changing Scenes [92.00984845186679]
Real-world robots localize objects from natural-language instructions while the scenes around them keep changing. Most existing 3D visual grounding (3DVG) methods still assume a reconstructed and up-to-date point cloud. We introduce ChangingGrounding, the first benchmark that explicitly measures how well an agent can exploit past observations.
arXiv Detail & Related papers (2025-10-16T17:59:16Z) - Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset [56.533371387182065]
MV-ScanQA is a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views. We present TripAlign, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M (2D view, set of 3D objects, text) triplets. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to the 3D domain with TripAlign.
arXiv Detail & Related papers (2025-08-14T20:35:59Z) - GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond [56.677984098204696]
Multimodal language models are driving the development of 3D Vision-Language Models (VLMs). We propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. We present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images.
arXiv Detail & Related papers (2025-07-01T15:52:59Z) - Zero-Shot 3D Visual Grounding from Vision-Language Models [10.81711535075112]
3D Visual Grounding (3DVG) seeks to locate target objects in 3D scenes using natural language descriptions. We present SeeGround, a zero-shot 3DVG framework that leverages 2D Vision-Language Models (VLMs) to bypass the need for 3D-specific training.
arXiv Detail & Related papers (2025-05-28T14:53:53Z) - MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [87.30919771444117]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z) - SLGaussian: Fast Language Gaussian Splatting in Sparse Views [15.0280871846496]
We propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints. SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse view conditions.
arXiv Detail & Related papers (2024-12-11T12:18:30Z) - VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding [57.04804711488706]
3D visual grounding is crucial for robots, requiring integration of natural language and 3D scene understanding.
We present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images.
arXiv Detail & Related papers (2024-10-17T17:59:55Z) - CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.