LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers
- URL: http://arxiv.org/abs/2411.04351v1
- Date: Thu, 07 Nov 2024 01:12:01 GMT
- Title: LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers
- Authors: Yeong-Seung Baek, Heung-Seon Oh
- Abstract summary: LidaRefer is a transformer-based 3D VG framework designed for large-scale outdoor scenes.
We introduce a simple and effective localization method, which supervises the decoder's queries to localize ambiguous objects.
LidaRefer achieves state-of-the-art performance on Talk2Car-3D, a 3D VG dataset for autonomous driving.
- Score: 1.0589208420411014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D visual grounding (VG) aims to locate relevant objects or regions within 3D scenes based on natural language descriptions. Although recent methods for indoor 3D VG have successfully employed transformer-based architectures to capture global contextual information and enable fine-grained cross-modal fusion, they are unsuitable for outdoor environments due to differences in the distribution of point clouds between indoor and outdoor settings. Specifically, first, extensive LiDAR point clouds demand unacceptable computational and memory resources within transformers due to the high-dimensional visual features. Second, dominant background points and empty spaces in sparse LiDAR point clouds complicate cross-modal fusion owing to their irrelevant visual information. To address these challenges, we propose LidaRefer, a transformer-based 3D VG framework designed for large-scale outdoor scenes. Moreover, during training, we introduce a simple and effective localization method, which supervises the decoder's queries to localize not only a target object but also ambiguous objects that might be confused with the target because they exhibit similar attributes in a scene or because the language description is misunderstood. This supervision enhances the model's ability to distinguish ambiguous objects from a target by learning the differences in their spatial relationships and attributes. LidaRefer achieves state-of-the-art performance on Talk2Car-3D, a 3D VG dataset for autonomous driving, with significant improvements under various evaluation settings.
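As a rough illustration of this query supervision, the hypothetical sketch below matches decoder queries to both the target box and the ambiguous distractor boxes and supervises localization for all of them, while only the target-matched query receives the positive classification label. The matching scheme, tensor shapes, and names are assumptions, not the paper's code.

```python
# Hypothetical sketch of LidaRefer-style query supervision: besides the
# target box, queries matched to "ambiguous" distractor boxes are also
# supervised on localization, so the model learns to tell them apart.
import torch
import torch.nn.functional as F

def query_supervision_loss(pred_boxes, pred_logits, target_box, ambiguous_boxes):
    """
    pred_boxes:      (Q, 7) boxes from Q decoder queries (x, y, z, l, w, h, yaw)
    pred_logits:     (Q,)   per-query target/non-target scores
    target_box:      (7,)   ground-truth box of the referred object
    ambiguous_boxes: (A, 7) boxes of distractors with similar attributes
    """
    gt = torch.cat([target_box[None], ambiguous_boxes], dim=0)  # (1 + A, 7)
    # Greedy matching by center distance (a stand-in for Hungarian matching).
    dists = torch.cdist(pred_boxes[:, :3], gt[:, :3])           # (Q, 1 + A)
    matched = dists.argmin(dim=0)                               # query index per GT box
    # Box regression on matched queries: target AND ambiguous objects.
    loc_loss = F.l1_loss(pred_boxes[matched], gt)
    # Classification: only the query matched to the target is positive.
    labels = torch.zeros_like(pred_logits)
    labels[matched[0]] = 1.0
    cls_loss = F.binary_cross_entropy_with_logits(pred_logits, labels)
    return loc_loss + cls_loss
```

Reserving the positive label for the target while still regressing distractor boxes is one plausible way to encode "distinguish ambiguous objects from the target".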
Related papers
- ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning [68.4209681278336]
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions.
Current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals.
We propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping.
arXiv Detail & Related papers (2025-03-30T03:40:35Z)
- Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving [25.28104119280405]
We propose TPCNet, the first outdoor 3D visual grounding model built on the paradigm of prompt-guided point cloud sensor combination.
To balance the features of these two sensors required by the prompt, we have designed a multi-fusion paradigm called Two-Stage Heterogeneous Modal Adaptive Fusion.
Our experiments have demonstrated that TPCNet achieves the state-of-the-art performance on both the Talk2Radar and Talk2Car datasets.
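The abstract only names the Two-Stage Heterogeneous Modal Adaptive Fusion; as a rough, hypothetical illustration of what prompt-guided two-stage fusion of LiDAR and radar features might look like (module names and shapes below are assumptions, not TPCNet's code):

```python
# Hypothetical sketch: stage 1 gates each modality by the text prompt;
# stage 2 adaptively weights the two gated modalities before output.
import torch
import torch.nn as nn

class TwoStageAdaptiveFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Stage 1: per-modality refinement conditioned on the prompt.
        self.lidar_gate = nn.Linear(dim, dim)
        self.radar_gate = nn.Linear(dim, dim)
        # Stage 2: adaptive weighting between the refined modalities.
        self.weight = nn.Sequential(nn.Linear(dim, 2), nn.Softmax(dim=-1))
        self.out = nn.Linear(dim, dim)

    def forward(self, lidar_feat, radar_feat, text_feat):
        # lidar_feat, radar_feat: (B, N, dim) flattened BEV features
        # text_feat: (B, dim) pooled prompt embedding
        t = text_feat.unsqueeze(1)                           # (B, 1, dim)
        l = lidar_feat * torch.sigmoid(self.lidar_gate(t))   # prompt-gated LiDAR
        r = radar_feat * torch.sigmoid(self.radar_gate(t))   # prompt-gated radar
        w = self.weight(text_feat)                           # (B, 2) modality weights
        fused = w[:, 0:1, None] * l + w[:, 1:2, None] * r
        return self.out(fused)
```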
arXiv Detail & Related papers (2025-03-11T11:48:27Z)
- AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene.
Existing approaches commonly encounter a shortage of text-3D pairs available for training.
We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z)
- Neural Attention Field: Emerging Point Relevance in 3D Scenes for One-Shot Dexterous Grasping [34.98831146003579]
One-shot transfer of dexterous grasps to novel scenes with object and context variations has been a challenging problem.
We propose the neural attention field for representing semantic-aware dense feature fields in 3D space.
arXiv Detail & Related papers (2024-10-30T14:06:51Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
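A minimal sketch of the core voxelization step behind such a volumetric representation, assuming illustrative grid bounds and resolution (not VER's actual configuration):

```python
# Minimal sketch: bucket a point cloud into structured 3D cells and
# record binary occupancy per cell.
import numpy as np

def voxelize(points, bounds=((-10, 10), (-10, 10), (0, 4)), resolution=0.2):
    """points: (N, 3) array of xyz; returns a binary occupancy grid."""
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    shape = np.ceil((hi - lo) / resolution).astype(int)
    occupancy = np.zeros(shape, dtype=bool)
    idx = np.floor((points - lo) / resolution).astype(int)
    # Keep only points that fall inside the grid bounds.
    valid = np.all((idx >= 0) & (idx < shape), axis=1)
    ix, iy, iz = idx[valid].T
    occupancy[ix, iy, iz] = True
    return occupancy
```

Heads for occupancy, room layout, and bounding boxes would then share this voxel grid as a common input.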
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- Uni3DETR: Unified 3D Detection Transformer [75.35012428550135]
We propose a unified 3D detector that addresses indoor and outdoor detection within the same framework.
Specifically, we employ the detection transformer with point-voxel interaction for object prediction.
We then propose the mixture of query points, which sufficiently exploits global information for dense small-range indoor scenes and local information for large-range sparse outdoor ones.
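One plausible reading of the "mixture of query points" is combining scene-wide global queries (suiting dense small-range indoor scenes) with queries seeded at observed points (suiting sparse large-range outdoor ones); the sketch below is a hypothetical illustration, not Uni3DETR's implementation:

```python
# Hypothetical sketch: global queries uniformly cover the scene extent,
# local queries are sampled at observed LiDAR points.
import torch

def mixture_of_query_points(points, n_global=256, n_local=256, scene_range=50.0):
    """points: (N, 3) LiDAR points; returns (n_global + n_local, 3) queries."""
    # Global queries: uniform over [-scene_range, scene_range]^3.
    global_q = (torch.rand(n_global, 3) * 2 - 1) * scene_range
    # Local queries: sampled at observed points, so they track real geometry.
    idx = torch.randint(points.shape[0], (n_local,))
    local_q = points[idx]
    return torch.cat([global_q, local_q], dim=0)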
arXiv Detail & Related papers (2023-10-09T13:20:20Z)
- Dense Object Grounding in 3D Scenes [28.05720194887322]
Localizing objects in 3D scenes according to the semantics of a given natural language is a fundamental yet important task in the field of multimedia understanding.
We introduce 3D Dense Object Grounding (3D DOG) to jointly localize multiple objects described in a complicated paragraph rather than a single sentence.
Our proposed 3DOGSFormer outperforms state-of-the-art 3D single-object grounding methods and their dense-object variants by significant margins.
arXiv Detail & Related papers (2023-09-05T13:27:19Z)
- Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving [91.91552963872596]
We propose a new multi-modal visual grounding task, termed LiDAR Grounding.
It jointly learns the LiDAR-based object detector with the language features and predicts the targeted region directly from the detector.
Our work offers a deeper insight into the LiDAR-based grounding task and we expect it presents a promising direction for the autonomous driving community.
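As a rough, hypothetical sketch of the idea of predicting the referred region directly from a language-conditioned detector head (module names and box format are assumptions, not this paper's code):

```python
# Hypothetical sketch: cross-attend detector proposal features to language
# tokens, then score each proposal as the referred target and refine its box.
import torch
import torch.nn as nn

class LanguageConditionedHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.score = nn.Linear(dim, 1)   # per-proposal "is the referred target"
        self.box = nn.Linear(dim, 7)     # (x, y, z, l, w, h, yaw) refinement

    def forward(self, proposal_feats, text_feats):
        # proposal_feats: (B, P, dim) detector proposal features
        # text_feats:     (B, T, dim) token embeddings of the description
        fused, _ = self.attn(proposal_feats, text_feats, text_feats)
        return self.score(fused).squeeze(-1), self.box(fused)
```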
arXiv Detail & Related papers (2023-05-25T06:22:10Z)
- Generating Visual Spatial Description via Holistic 3D Scene Understanding [88.99773815159345]
Visual spatial description (VSD) aims to generate texts that describe the spatial relations of the given objects within images.
With an external 3D scene extractor, we obtain the 3D objects and scene features for input images.
We construct a target object-centered 3D spatial scene graph (Go3D-S2G), such that we model the spatial semantics of target objects within the holistic 3D scenes.
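A minimal sketch of a target-centered spatial scene graph: nodes are objects, edges are coarse spatial relations to the target. The relation vocabulary and axis conventions below are illustrative assumptions, not the paper's Go3D-S2G construction:

```python
# Minimal sketch: derive a coarse spatial relation from each object's
# offset to the target (x = left/right, y = front/behind).
import numpy as np

def spatial_scene_graph(target_center, object_centers, names):
    """Return (name, relation, offset) edges from each object to the target."""
    edges = []
    for name, center in zip(names, object_centers):
        off = np.asarray(center) - np.asarray(target_center)
        lr = "left-of" if off[0] < 0 else "right-of"
        fb = "behind" if off[1] < 0 else "in-front-of"
        # Pick the dominant axis as the relation label.
        rel = lr if abs(off[0]) > abs(off[1]) else fb
        edges.append((name, rel, off))
    return edges
```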
arXiv Detail & Related papers (2023-05-19T15:53:56Z)
- SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds [44.635939022626744]
3D object detection in point clouds is a core component for modern robotics and autonomous driving systems.
A key challenge in 3D object detection comes from the inherent sparse nature of point occupancy within the 3D scene.
We propose Sparse Window Transformer (SWFormer), a scalable and accurate model for 3D object detection.
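The core trick of a sparse window transformer is to bucket only the occupied locations into windows, so attention runs per non-empty window instead of over a dense grid. A hypothetical sketch of that grouping step (the window size is an assumption):

```python
# Hypothetical sketch: group occupied BEV coordinates into sparse windows;
# self-attention would then run within each bucket only.
from collections import defaultdict
import numpy as np

def sparse_windows(coords, window=10):
    """coords: (N, 2) integer BEV coordinates of non-empty voxels.
    Returns {window_id: indices of voxels inside that window}."""
    buckets = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        buckets[(x // window, y // window)].append(i)
    return buckets
```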
arXiv Detail & Related papers (2022-10-13T21:37:53Z)
- AGO-Net: Association-Guided 3D Point Cloud Object Detection Network [86.10213302724085]
We propose a novel 3D detection framework that associates intact features for objects via domain adaptation.
We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed.
arXiv Detail & Related papers (2022-08-24T16:54:38Z)
- Progressive Coordinate Transforms for Monocular 3D Object Detection [52.00071336733109]
We propose a novel and lightweight approach, dubbed Progressive Coordinate Transforms (PCT), to facilitate learning coordinate representations.
arXiv Detail & Related papers (2021-08-12T15:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.