Text-Driven 3D Lidar Place Recognition for Autonomous Driving
- URL: http://arxiv.org/abs/2503.18035v2
- Date: Tue, 15 Apr 2025 08:22:14 GMT
- Title: Text-Driven 3D Lidar Place Recognition for Autonomous Driving
- Authors: Tianyi Shang, Zhenyu Li, Pengjie Xu, Zhaojun Deng, Ruirui Zhang,
- Abstract summary: We present Des4Pos, a novel two-stage text-driven remote sensing localization framework. Experiments on the KITTI360Pose test set demonstrate that Des4Pos achieves state-of-the-art performance in text-to-point-cloud place recognition, attaining a top-1 accuracy of 40% and a top-10 accuracy of 77% under a 5-meter radius threshold.
- Score: 2.3093110834423616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Environment description-based localization in large-scale point cloud maps constructed through remote sensing is critically significant for the advancement of large-scale autonomous systems, such as delivery robots operating in the last mile. However, current approaches encounter challenges due to the inability of point cloud encoders to effectively capture local details and long-range spatial relationships, as well as a significant modality gap between text and point cloud representations. To address these challenges, we present Des4Pos, a novel two-stage text-driven remote sensing localization framework. In the coarse stage, the point-cloud encoder utilizes the Multi-scale Fusion Attention Mechanism (MFAM) to enhance local geometric features, followed by a bidirectional Long Short-Term Memory (LSTM) module to strengthen global spatial relationships. Concurrently, the Stepped Text Encoder (STE) integrates cross-modal prior knowledge from CLIP [1] and aligns text and point-cloud features using this prior knowledge, effectively bridging modality discrepancies. In the fine stage, we introduce a Cascaded Residual Attention (CRA) module to fuse cross-modal features and predict relative localization offsets, thereby achieving greater localization precision. Experiments on the KITTI360Pose test set demonstrate that Des4Pos achieves state-of-the-art performance in text-to-point-cloud place recognition. Specifically, it attains a top-1 accuracy of 40% and a top-10 accuracy of 77% under a 5-meter radius threshold, surpassing the best existing methods by 7% and 7%, respectively.
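The coarse-to-fine recipe in the abstract (retrieve candidate submaps from a text description, then regress a position offset) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration rather than the authors' implementation: the feature sizes, the plain-MLP stand-ins for the MFAM + bidirectional LSTM point-cloud encoder and the Stepped Text Encoder, and the simple concatenation used in place of the Cascaded Residual Attention module.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseRetrieval(nn.Module):
    """Coarse stage: rank candidate submaps by text/point-cloud similarity."""
    def __init__(self, dim=256):
        super().__init__()
        # Plain MLP heads stand in for the paper's point-cloud encoder
        # (MFAM + bidirectional LSTM) and Stepped Text Encoder.
        self.pc_head = nn.Sequential(nn.Linear(1024, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.txt_head = nn.Sequential(nn.Linear(512, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, txt_feat, submap_feats):
        q = F.normalize(self.txt_head(txt_feat), dim=-1)     # (B, dim)
        k = F.normalize(self.pc_head(submap_feats), dim=-1)  # (N, dim)
        return q @ k.t()                                      # (B, N) cosine similarities

class FineOffset(nn.Module):
    """Fine stage: fuse text and retrieved-submap features, regress a 2D offset."""
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, txt_emb, pc_emb):
        return self.fuse(torch.cat([txt_emb, pc_emb], dim=-1))  # offset from submap centre

coarse, fine = CoarseRetrieval(), FineOffset()
scores = coarse(torch.randn(1, 512), torch.randn(100, 1024))    # 100 candidate submaps
top1 = scores.argmax(dim=-1)                                    # index of best-matching submap
```
A full pipeline would train the coarse stage with a contrastive loss over matched text/submap pairs and supervise the fine stage with ground-truth offsets; those details are omitted here.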
Related papers
- SpatiaLoc: Leveraging Multi-Level Spatial Enhanced Descriptors for Cross-Modal Localization [14.55605595737025]
Cross-modal localization using text and point clouds enables robots to localize themselves via natural language descriptions. We present SpatiaLoc, a framework that emphasizes spatial relationships at both the instance and global levels. Experiments on KITTI360Pose demonstrate that SpatiaLoc significantly outperforms existing state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2026-01-07T04:50:39Z) - TransBridge: Boost 3D Object Detection by Scene-Level Completion with Transformer Decoder [66.22997415145467]
This paper presents a joint completion and detection framework that improves detection features in sparse areas. Specifically, we propose TransBridge, a novel transformer-based up-sampling block that fuses the features from the detection and completion networks. The results show that our framework consistently improves end-to-end 3D object detection, with mean average precision (mAP) gains ranging from 0.7 to 1.5 across multiple methods.
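As a rough illustration of the fusion idea summarized above, the sketch below lets detection features attend to scene-completion features through a standard transformer decoder and then upsamples the result. The layer sizes, the nearest-neighbour upsampling, and the module name are assumptions, not the TransBridge code.
```python
import torch
import torch.nn as nn

class FusionDecoder(nn.Module):
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        # Nearest-neighbour upsampling stands in for whatever the paper actually uses.
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, det_feats, comp_feats):
        # det_feats: (B, N_det, dim) sparse detection features (queries)
        # comp_feats: (B, N_comp, dim) dense completion features (memory)
        fused = self.decoder(tgt=det_feats, memory=comp_feats)   # cross-attention fusion
        return self.upsample(fused.transpose(1, 2)).transpose(1, 2)

fused = FusionDecoder()(torch.randn(2, 64, 128), torch.randn(2, 256, 128))
print(fused.shape)  # torch.Size([2, 128, 128]) -- (batch, upsampled points, dim)
```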
arXiv Detail & Related papers (2025-12-12T00:08:03Z) - Generative MIMO Beam Map Construction for Location Recovery and Beam Tracking [67.65578956523403]
This paper proposes a generative framework to recover location labels directly from sparse channel state information (CSI) measurements. Instead of directly storing raw CSI, we learn a compact low-dimensional radio map embedding and leverage a generative model to reconstruct the high-dimensional CSI. Numerical experiments demonstrate that the proposed model can improve localization accuracy by over 30% and achieve a 20% capacity gain in non-line-of-sight (NLOS) scenarios.
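A minimal sketch of the compress-then-reconstruct idea above: an encoder maps high-dimensional CSI to a compact embedding that can be stored as the radio map, and a decoder reconstructs the CSI from it. The dimensions and the use of a plain autoencoder decoder in place of the paper's generative model are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSICompressor(nn.Module):
    def __init__(self, csi_dim=1024, embed_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(csi_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))
        self.decoder = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, csi_dim))

    def forward(self, csi):
        z = self.encoder(csi)        # compact embedding stored in the radio map
        return self.decoder(z), z    # reconstructed CSI and its embedding

model = CSICompressor()
csi = torch.randn(8, 1024)           # CSI measurements (illustrative values)
recon, z = model(csi)
loss = F.mse_loss(recon, csi)        # reconstruction objective (illustrative)
```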
arXiv Detail & Related papers (2025-11-21T07:25:49Z) - Cross3DReg: Towards a Large-scale Real-world Cross-source Point Cloud Registration Benchmark [57.42211080221526]
Cross-source point cloud registration, which aims to align point cloud data from different sensors, is a fundamental task in 3D vision. The lack of publicly available large-scale real-world datasets for training deep registration models, and the inherent differences between point clouds captured by multiple sensors, pose challenges. We construct Cross3DReg, currently the largest real-world multi-modal cross-source point cloud registration dataset. A visual-geometric attention guided matching module is proposed to enhance the consistency of cross-source point cloud features.
arXiv Detail & Related papers (2025-09-08T09:01:13Z) - Ground Awareness in Deep Learning for Large Outdoor Point Cloud Segmentation [0.0]
In dense outdoor point clouds, the receptive field of a machine learning model may be too small to accurately determine the surroundings and context of a point.
By computing Digital Terrain Models (DTMs) from the point clouds, we extract the relative elevation feature, which is the vertical distance from the terrain to a point.
RandLA-Net is employed for efficient semantic segmentation of large-scale point clouds.
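The relative-elevation feature described above reduces to looking up the terrain height under each point and subtracting it. A minimal NumPy sketch, assuming the DTM is a regular grid with a known origin and cell size (both assumptions for illustration):
```python
import numpy as np

def relative_elevation(points, dtm, origin, cell_size):
    """points: (N, 3) xyz; dtm: (H, W) terrain heights; origin: (x0, y0) of dtm[0, 0]."""
    cols = np.clip(((points[:, 0] - origin[0]) / cell_size).astype(int), 0, dtm.shape[1] - 1)
    rows = np.clip(((points[:, 1] - origin[1]) / cell_size).astype(int), 0, dtm.shape[0] - 1)
    terrain_z = dtm[rows, cols]
    return points[:, 2] - terrain_z   # vertical distance from terrain to each point

# Example: a flat 10 m terrain and two points roughly 1.5 m and 0.2 m above it.
dtm = np.full((100, 100), 10.0)
pts = np.array([[5.0, 5.0, 11.5], [20.0, 30.0, 10.2]])
print(relative_elevation(pts, dtm, origin=(0.0, 0.0), cell_size=1.0))  # approx. [1.5 0.2]
```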
arXiv Detail & Related papers (2025-01-30T10:27:28Z) - MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies [12.485905108032146]
This paper introduces MetaOcc, a novel multi-modal framework for omnidirectional 3D occupancy prediction. To address the limitations of directly applying encoders to sparse radar data, we propose a Radar Height Self-Attention module. To reduce reliance on expensive point cloud annotations, we propose a pseudo-label generation pipeline based on an open-set segmentor.
arXiv Detail & Related papers (2025-01-26T03:51:56Z) - MambaPlace: Text-to-Point-Cloud Cross-Modal Place Recognition with Attention Mamba Mechanisms [2.4775350526606355]
Vision Language Place Recognition (VLVPR) enhances robot localization performance by incorporating natural language descriptions from images. By utilizing language information, VLVPR directs robot place matching, overcoming the constraint of solely depending on vision. This paper proposes MambaPlace, a novel coarse-to-fine, end-to-end connected cross-modal place recognition framework.
arXiv Detail & Related papers (2024-08-28T12:06:11Z) - Local All-Pair Correspondence for Point Tracking [59.76186266230608]
We introduce LocoTrack, a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences.
LocoTrack achieves unmatched accuracy on all TAP-Vid benchmarks and operates at a speed almost 6 times faster than the current state-of-the-art.
arXiv Detail & Related papers (2024-07-22T06:49:56Z) - Instance-free Text to Point Cloud Localization with Relative Position Awareness [37.22900045434484]
Text-to-point-cloud cross-modal localization is an emerging vision-language task critical for future robot-human collaboration.
We address two key limitations of existing approaches: 1) their reliance on ground-truth instances as input; and 2) their neglect of the relative positions among potential instances.
Our proposed model follows a two-stage pipeline, including a coarse stage for text-cell retrieval and a fine stage for position estimation.
arXiv Detail & Related papers (2024-04-27T09:46:49Z) - Bridging the Gap Between End-to-End and Two-Step Text Spotting [88.14552991115207]
Bridging Text Spotting is a novel approach that resolves the error accumulation and suboptimal performance issues in two-step methods.
We demonstrate the effectiveness of the proposed method through extensive experiments.
arXiv Detail & Related papers (2024-04-06T13:14:04Z) - Multiway Point Cloud Mosaicking with Diffusion and Global Optimization [74.3802812773891]
We introduce a novel framework for multiway point cloud mosaicking (named Wednesday).
At the core of our approach is ODIN, a learned pairwise registration algorithm that identifies overlaps and refines attention scores.
Tested on four diverse, large-scale datasets, our method achieves state-of-the-art pairwise and rotation registration results, outperforming prior approaches by a large margin on all benchmarks.
arXiv Detail & Related papers (2024-03-30T17:29:13Z) - Point Cloud Mamba: Point Cloud Learning via State Space Model [73.7454734756626]
We show that Mamba-based point cloud methods can outperform previous methods based on transformers or multi-layer perceptrons (MLPs).
Point Cloud Mamba surpasses the state-of-the-art (SOTA) point-based method PointNeXt and achieves new SOTA performance on the ScanObjectNN, ModelNet40, ShapeNetPart, and S3DIS datasets.
arXiv Detail & Related papers (2024-03-01T18:59:03Z) - AGO-Net: Association-Guided 3D Point Cloud Object Detection Network [86.10213302724085]
We propose a novel 3D detection framework that associates intact features for objects via domain adaptation.
We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed.
arXiv Detail & Related papers (2022-08-24T16:54:38Z) - Cross-modal Learning of Graph Representations using Radar Point Cloud for Long-Range Gesture Recognition [6.9545038359818445]
We propose a novel architecture for a long-range (1m - 2m) gesture recognition solution.
We use a point cloud-based cross-learning approach from camera point cloud to 60-GHz FMCW radar point cloud.
In the experimental results section, we demonstrate our model's overall accuracy of 98.4% for five gestures and its generalization capability.
arXiv Detail & Related papers (2022-03-31T14:34:36Z) - Point Cloud Segmentation Using Sparse Temporal Local Attention [30.969737698335944]
We propose a novel Sparse Temporal Local Attention (STELA) module which aggregates intermediate features from a local neighbourhood in previous point cloud frames.
We achieve a competitive mIoU of 64.3% on the SemanticKITTI dataset, and demonstrate significant improvement over the single-frame baseline.
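A hedged sketch of the aggregation idea above: each point in the current frame attends over the features of its k nearest neighbours in the previous frame. The value of k, the feature size, and the dot-product attention form are assumptions, not the STELA module.
```python
import torch
import torch.nn.functional as F

def temporal_local_attention(cur_xyz, cur_feat, prev_xyz, prev_feat, k=8):
    # cur_xyz: (N, 3), cur_feat: (N, C); prev_xyz: (M, 3), prev_feat: (M, C)
    dists = torch.cdist(cur_xyz, prev_xyz)                   # (N, M) pairwise distances
    knn = dists.topk(k, largest=False).indices               # (N, k) nearest previous-frame points
    neigh = prev_feat[knn]                                    # (N, k, C) neighbour features
    scores = (cur_feat.unsqueeze(1) * neigh).sum(-1) / cur_feat.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)                          # (N, k) attention weights
    return cur_feat + (attn.unsqueeze(-1) * neigh).sum(1)     # residual temporal aggregation

out = temporal_local_attention(torch.rand(1000, 3), torch.randn(1000, 64),
                               torch.rand(1200, 3), torch.randn(1200, 64))
```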
arXiv Detail & Related papers (2021-12-01T06:00:50Z) - Learning Semantic Segmentation of Large-Scale Point Clouds with Random Sampling [52.464516118826765]
We introduce RandLA-Net, an efficient and lightweight neural architecture to infer per-point semantics for large-scale point clouds.
The key to our approach is to use random point sampling instead of more complex point selection approaches.
Our RandLA-Net can process 1 million points in a single pass up to 200x faster than existing approaches.
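The random sampling at the heart of the summary above is essentially a one-liner compared with farthest-point or KNN-based selection; a tiny NumPy sketch (sizes are illustrative):
```python
import numpy as np

def random_sample(points, n_keep):
    # Uniform subsampling in O(N), with no neighbourhood search.
    idx = np.random.choice(points.shape[0], size=n_keep, replace=False)
    return points[idx]

cloud = np.random.rand(1_000_000, 3)   # one million points
sub = random_sample(cloud, 65_536)
print(sub.shape)                        # (65536, 3)
```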
arXiv Detail & Related papers (2021-07-06T05:08:34Z) - SOE-Net: A Self-Attention and Orientation Encoding Network for Point Cloud based Place Recognition [50.9889997200743]
We tackle the problem of place recognition from point cloud data with a self-attention and orientation encoding network (SOE-Net).
SOE-Net fully explores the relationship between points and incorporates long-range context into point-wise local descriptors.
Experiments on various benchmark datasets demonstrate superior performance of the proposed network over the current state-of-the-art approaches.
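As a rough illustration of injecting long-range context into point-wise descriptors, the sketch below runs standard multi-head self-attention over per-point local features and pools them into a single place descriptor. The sizes and the max-pooling step are assumptions, not SOE-Net's architecture.
```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
local_desc = torch.randn(1, 4096, 256)                   # per-point local descriptors for one submap
context, _ = attn(local_desc, local_desc, local_desc)    # every point attends to every other point
global_desc = context.max(dim=1).values                   # pooled global place descriptor, shape (1, 256)
```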
arXiv Detail & Related papers (2020-11-24T22:28:25Z) - InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling [65.47126868838836]
We propose a novel 3D object detection framework with dynamic information modeling.
Coarse predictions are generated in the first stage via a voxel-based region proposal network.
Experiments are conducted on the large-scale nuScenes 3D detection benchmark.
arXiv Detail & Related papers (2020-07-16T18:27:08Z) - RPM-Net: Robust Point Matching using Learned Features [79.52112840465558]
RPM-Net is a less sensitive and more robust deep learning-based approach for rigid point cloud registration.
Unlike some existing methods, our RPM-Net handles missing correspondences and point clouds with partial visibility.
arXiv Detail & Related papers (2020-03-30T13:45:27Z)