Related papers: VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

Z3D: Zero-Shot 3D Visual Grounding from Images [7.756226313216256]
3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries. We introduce Z3D, a universal grounding pipeline that flexibly operates on multi-view images.
arXiv Detail & Related papers (2026-02-03T10:35:18Z)
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight [105.9472902251177]
We present a VLM-native recipe that casts 3D detection as a next-token prediction problem. Our model achieves state-of-the-art results, with 49.89 AP_3D, an absolute improvement of 15.51 over the previous best.
arXiv Detail & Related papers (2025-11-25T18:59:45Z)
Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding [40.60812160987424]
3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. We propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively.
arXiv Detail & Related papers (2025-08-28T13:15:37Z)
Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset [56.533371387182065]
MV-ScanQA is a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views. We present TripAlign, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M <2D view, set of 3D objects, text> triplets. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to the 3D domain with TripAlign.
arXiv Detail & Related papers (2025-08-14T20:35:59Z)
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors [23.66183317100899]
Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. We propose a novel and efficient method, the Video-3D Geometry Large Language Model (VG LLM). Our approach employs a 3D visual geometry encoder that extracts 3D prior information from video sequences.
arXiv Detail & Related papers (2025-05-30T14:16:41Z)
Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D [68.23391872643268]
LOCATE 3D is a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp". It operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices.
arXiv Detail & Related papers (2025-04-19T02:51:24Z)
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding [10.81711535075112]
3D Visual Grounding aims to locate objects in 3D scenes based on textual descriptions. We introduce SeeGround, a zero-shot 3DVG framework leveraging 2D Vision-Language Models (VLMs) trained on large-scale 2D data. We propose two modules: the Perspective Adaptation Module, which dynamically selects viewpoints for query-relevant image rendering, and the Fusion Alignment Module, which integrates 2D images with 3D spatial descriptions.
arXiv Detail & Related papers (2024-12-05T17:58:43Z)
EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration. An online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z)
LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image [72.14973729674995]
Current 3D perception methods, particularly small models, struggle with processing logical reasoning, question-answering, and handling open scenario categories. We propose solutions: Spatial-Enhanced Local Feature Mining for better spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling camera focal length variations.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the first largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
Unified Scene Representation and Reconstruction for 3D Large Language Models [40.693839066536505]
Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models. We introduce Uni3DR2, which extracts 3D geometric and semantic-aware representation features via frozen 2D foundation models. Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs.
arXiv Detail & Related papers (2024-04-19T17:58:04Z)
VFMM3D: Releasing the Potential of Image by Vision Foundation Model for Monocular 3D Object Detection [80.62052650370416]
Monocular 3D object detection holds significant importance across various applications, including autonomous driving and robotics. In this paper, we present VFMM3D, an innovative framework that leverages the capabilities of Vision Foundation Models (VFMs) to accurately transform single-view images into LiDAR point cloud representations.
arXiv Detail & Related papers (2024-04-15T03:12:12Z)
Data-Efficient 3D Visual Grounding via Order-Aware Referring [31.96736077210907]
Vigor is a novel Data-Efficient 3D Visual Grounding framework via Order-aware Referring. We present an order-aware warm-up training strategy, which augments referential orders for pre-training the visual grounding framework.
arXiv Detail & Related papers (2024-03-25T08:31:14Z)
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent [23.134180979449823]
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. We propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries.
arXiv Detail & Related papers (2023-09-21T17:59:45Z)
Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans [6.936271803454143]
We present a novel task for cross-dataset visual grounding in 3D scenes (Cross3DVG). We created RIORefer, a large-scale 3D visual grounding dataset. It includes more than 63k diverse descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan.
arXiv Detail & Related papers (2023-05-23T09:52:49Z)
CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework. Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene. In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection [3.330229314824913]
We present FCAF3D, a first-in-class fully convolutional anchor-free indoor 3D object detection method. It is a simple yet effective method that uses a voxel representation of a point cloud and processes voxels with sparse convolutions. It can handle large-scale scenes with minimal runtime through a single fully convolutional feed-forward pass.
arXiv Detail & Related papers (2021-12-01T07:28:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.