MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding
- URL: http://arxiv.org/abs/2512.24605v1
- Date: Wed, 31 Dec 2025 03:56:28 GMT
- Title: MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding
- Authors: Panquan Yang, Junfei Huang, Zongzhangbao Yin, Yingsong Hu, Anni Xu, Xinyi Luo, Xueqi Sun, Hai Wu, Sheng Ao, Zhaoxing Zhu, Chenglu Wen, Cheng Wang
- Abstract summary: 3D visual grounding aims to localize the objects in 3D point cloud scenes that semantically correspond to given natural language sentences. MoniRefer is the first real-world large-scale multi-modal dataset for roadside-level 3D visual grounding.
- Score: 30.52190342330071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D visual grounding aims to localize the objects in 3D point cloud scenes that semantically correspond to given natural language sentences. It is critical for roadside infrastructure systems to interpret natural language and localize the relevant target objects in complex traffic environments. However, most existing datasets and approaches for 3D visual grounding focus on indoor scenes and outdoor driving scenes; outdoor monitoring scenarios remain unexplored due to the scarcity of paired point cloud-text data captured by roadside infrastructure sensors. In this paper, we introduce a novel task of 3D Visual Grounding for Outdoor Monitoring Scenarios, which enables infrastructure-level understanding of traffic scenes beyond the ego-vehicle perspective. To support this task, we construct MoniRefer, the first real-world large-scale multi-modal dataset for roadside-level 3D visual grounding. The dataset consists of 136,018 objects with 411,128 natural language expressions collected from multiple complex traffic intersections in real-world environments. To ensure the quality and accuracy of the dataset, we manually verified all linguistic descriptions and 3D labels for objects. We also propose a new end-to-end method, named Moni3DVG, which utilizes the rich appearance information provided by images and the geometric and optical information from point clouds for multi-modal feature learning and 3D object localization. Extensive experiments and ablation studies on the proposed benchmarks demonstrate the superiority and effectiveness of our method. Our dataset and code will be released.
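The abstract stops short of architectural detail, but the fusion idea it describes can be sketched compactly. Below is a minimal, hypothetical PyTorch sketch of scoring points against a language query after fusing geometric and image-appearance features; all module names and dimensions are illustrative assumptions, not the authors' Moni3DVG implementation.

```python
import torch
import torch.nn as nn

class FusionGrounder(nn.Module):
    """Toy image + point-cloud fusion head for language-driven localization."""
    def __init__(self, img_dim=256, text_dim=256, hidden=256):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.img_proj = nn.Linear(img_dim, hidden)    # appearance cues
        self.text_proj = nn.Linear(text_dim, hidden)  # language query
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, xyz, img_feats, text_feat):
        # xyz: (N, 3) point coordinates; img_feats: (N, img_dim) image features
        # sampled at each point's 2D projection; text_feat: (text_dim,)
        # sentence embedding of the referring expression.
        geo = self.point_mlp(xyz)
        app = self.img_proj(img_feats)
        fused = torch.relu(self.fuse(torch.cat([geo, app], dim=-1)))
        scores = fused @ self.text_proj(text_feat)    # per-point relevance
        return scores.softmax(dim=0)                  # grounding heatmap

model = FusionGrounder()
probs = model(torch.randn(1000, 3), torch.randn(1000, 256), torch.randn(256))
print(int(probs.argmax()))  # index of the most language-relevant point
```

In practice such a head would sit on top of pretrained point, image, and text backbones, with the per-point heatmap pooled into 3D box predictions.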
Related papers
- Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation [54.04601077224252]
Embodied scene understanding requires not only comprehending visual-spatial information but also determining where to explore next in the 3D physical world. 3D vision-language learning enables embodied agents to effectively explore and understand their environment. The model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images.
arXiv Detail & Related papers (2025-07-05T14:15:52Z)
- NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving [7.007334645975593]
We introduce NuGrounding, the first large-scale benchmark for multi-view 3D visual grounding in autonomous driving. We propose a novel paradigm that seamlessly combines the instruction-comprehension abilities of multi-modal LLMs with the precise localization abilities of specialist detection models.
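As a rough illustration of that paradigm (stand-in code, not NuGrounding's actual API): the LLM embeds the referring instruction, a specialist detector supplies 3D box proposals with features, and the proposal most similar to the instruction embedding is returned.

```python
import torch
import torch.nn.functional as F

def select_box(text_embed, boxes, box_feats):
    # text_embed: (D,) instruction embedding from a multi-modal LLM;
    # boxes: (K, 7) detector proposals as (x, y, z, l, w, h, yaw);
    # box_feats: (K, D) features from a specialist detection model.
    sims = F.cosine_similarity(box_feats, text_embed.unsqueeze(0), dim=-1)
    return boxes[sims.argmax()]  # proposal best matching the instruction

print(select_box(torch.randn(256), torch.randn(12, 7), torch.randn(12, 256)))
```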
arXiv Detail & Related papers (2025-03-28T13:55:16Z)
- AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene. Existing approaches commonly encounter a shortage of text-3D pairs available for training. We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z)
- ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding [9.289977174410824]
3D visual grounding involves localizing entities in a 3D scene referred to by natural language text. We introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns.
arXiv Detail & Related papers (2025-01-02T17:20:41Z)
- Space3D-Bench: Spatial 3D Question Answering Benchmark [49.259397521459114]
We present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset.
We provide an assessment system that grades natural language responses based on predefined ground-truth answers (a toy grading sketch follows this entry).
Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval.
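For a concrete sense of what such grading can look like, here is a minimal sketch using token-level F1, a common question-answering metric; the metric choice is an assumption for illustration, not the benchmark's actual grader.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted answer and the ground truth."""
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    if not pred or not gold:
        return float(pred == gold)      # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the chair is next to the table",
               "a chair is next to the table"))  # ~0.857
```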
arXiv Detail & Related papers (2024-08-29T16:05:22Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations, MMScan. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
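As a minimal illustration of the voxelization step that underlies such a representation (the bounds, resolution, and occupancy-only output are simplifying assumptions):

```python
import numpy as np

def voxelize(points: np.ndarray, bounds, resolution: float = 0.25) -> np.ndarray:
    """Mark which cells of a regular 3D grid contain at least one point."""
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    shape = np.ceil((hi - lo) / resolution).astype(int)
    occupancy = np.zeros(shape, dtype=bool)
    idx = np.floor((points - lo) / resolution).astype(int)
    inside = np.all((idx >= 0) & (idx < shape), axis=1)  # drop out-of-range points
    occupancy[tuple(idx[inside].T)] = True
    return occupancy

pts = np.random.uniform(0.0, 10.0, size=(5000, 3))  # toy point cloud
grid = voxelize(pts, ((0, 0, 0), (10, 10, 10)))
print(grid.shape, grid.mean())  # (40, 40, 40) and the occupied fraction
```

A full VER-style model would attach features and prediction heads to these cells; the grid above is only the structural scaffold.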
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization [80.3424839706698]
We present NeurOCS, a framework that uses instance masks and 3D boxes as input to learn 3D object shapes by means of differentiable rendering.
Our approach rests on insights in learning a category-level shape prior directly from real driving scenes.
We make critical design choices to learn object coordinates more effectively from an object-centric view.
arXiv Detail & Related papers (2023-05-28T16:18:41Z)
- WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language [31.691159120136064]
We introduce the task of 3D visual grounding in large-scale dynamic scenes based on natural linguistic descriptions and online captured multi-modal visual data.
We present a novel method, dubbed WildRefer, for this task by fully utilizing the rich appearance information in images and the position and geometric clues in point clouds.
Our datasets are significant for research on 3D visual grounding in the wild and have huge potential to boost the development of autonomous driving and service robots.
arXiv Detail & Related papers (2023-04-12T06:48:26Z)