MambaPlace: Text-to-Point-Cloud Cross-Modal Place Recognition with Attention Mamba Mechanisms
- URL: http://arxiv.org/abs/2408.15740v3
- Date: Thu, 20 Feb 2025 07:50:16 GMT
- Title: MambaPlace: Text-to-Point-Cloud Cross-Modal Place Recognition with Attention Mamba Mechanisms
- Authors: Tianyi Shang, Zhenyu Li, Pengjie Xu, Jinwei Qiao
- Abstract summary: Vision Language Place Recognition (VLVPR) enhances robot localization performance by incorporating natural language descriptions from images. By utilizing language information, VLVPR directs robot place matching, overcoming the constraint of solely depending on vision. This paper proposes a novel coarse-to-fine, end-to-end connected cross-modal place recognition framework, called MambaPlace.
- Score: 2.4775350526606355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Language Place Recognition (VLVPR) enhances robot localization performance by incorporating natural language descriptions from images. By utilizing language information, VLVPR directs robot place matching, overcoming the constraint of solely depending on vision. The essence of multimodal fusion lies in mining the complementary information between different modalities. However, general fusion methods rely on traditional neural architectures and are not well equipped to capture the dynamics of cross-modal interactions, especially in the presence of complex intra-modal and inter-modal correlations. To this end, this paper proposes a novel coarse-to-fine, end-to-end connected cross-modal place recognition framework, called MambaPlace. In the coarse localization stage, the text description and 3D point cloud are encoded by the pretrained T5 and an instance encoder, respectively. They are then processed using Text Attention Mamba (TAM) and Point Clouds Mamba (PCM) for data enhancement and alignment. In the subsequent fine localization stage, the features of the text description and 3D point cloud are cross-modally fused and further enhanced through cascaded Cross Attention Mamba (CCAM). Finally, we predict the positional offset from the fused text-point cloud features, achieving the most accurate localization. Extensive experiments show that MambaPlace achieves improved localization accuracy on the KITTI360Pose dataset compared to state-of-the-art methods.
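To make the coarse-to-fine pipeline above concrete, below is a minimal, hypothetical PyTorch sketch of the data flow described in the abstract: pretrained T5 token features and per-instance point features are refined by sequence blocks standing in for TAM and PCM, a cosine-similarity retrieval provides the coarse submap match, and a cross-attention fusion standing in for CCAM feeds a head that regresses the positional offset. All module names, tensor shapes, and hyperparameters here are illustrative assumptions; the gated recurrence is a simplified stand-in for a true selective-scan Mamba kernel, not the authors' implementation.

```python
# Hypothetical sketch of a coarse-to-fine text-to-point-cloud localization pipeline.
# Names (GatedScanBlock, CrossFusionBlock, offset_head, etc.) are illustrative, not MambaPlace's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedScanBlock(nn.Module):
    """Stand-in for a Mamba-style block: a gated, input-dependent 1D recurrence over tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.gate = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        h, z = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        a = torch.sigmoid(self.gate(x))                  # input-dependent "forget" gate
        states, s = [], torch.zeros_like(h[:, 0])
        for t in range(h.size(1)):                       # sequential scan over the token axis
            s = a[:, t] * s + (1 - a[:, t]) * h[:, t]
            states.append(s)
        y = torch.stack(states, dim=1) * F.silu(z)
        return x + self.out_proj(y)                      # residual connection


class CrossFusionBlock(nn.Module):
    """Stand-in for CCAM: cross-attention from text tokens to point-cloud tokens, then mixing."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mix = GatedScanBlock(dim)

    def forward(self, text, pts):
        fused, _ = self.attn(query=text, key=pts, value=pts)
        return self.mix(text + fused)


class CoarseToFinePlaceRecognizer(nn.Module):
    def __init__(self, text_dim=512, pt_dim=3, dim=256):
        super().__init__()
        # Text side: assumes frozen T5 token embeddings are computed upstream.
        self.text_proj = nn.Linear(text_dim, dim)
        self.tam = GatedScanBlock(dim)           # ~ Text Attention Mamba (TAM)
        # Point-cloud side: a tiny per-instance encoder (PointNet-like max pooling).
        self.inst_enc = nn.Sequential(nn.Linear(pt_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.pcm = GatedScanBlock(dim)           # ~ Point Clouds Mamba (PCM)
        self.ccam = CrossFusionBlock(dim)        # ~ cascaded Cross Attention Mamba (CCAM)
        self.offset_head = nn.Linear(dim, 2)     # fine stage: regress a (dx, dy) offset

    def encode_text(self, t5_tokens):            # (B, Lt, text_dim)
        return self.tam(self.text_proj(t5_tokens))

    def encode_submap(self, instances):          # (B, Li, Np, 3): instances in a submap
        per_point = self.inst_enc(instances)     # (B, Li, Np, dim)
        inst_feat = per_point.max(dim=2).values  # symmetric pooling per instance
        return self.pcm(inst_feat)               # (B, Li, dim)

    def coarse_scores(self, text_feat, submap_feat):
        # Global descriptors via mean pooling, cosine similarity for submap retrieval.
        t = F.normalize(text_feat.mean(dim=1), dim=-1)
        p = F.normalize(submap_feat.mean(dim=1), dim=-1)
        return t @ p.T                           # (num_queries, num_submaps)

    def fine_offset(self, text_feat, submap_feat):
        fused = self.ccam(text_feat, submap_feat)
        return self.offset_head(fused.mean(dim=1))  # (B, 2)


if __name__ == "__main__":
    model = CoarseToFinePlaceRecognizer()
    t5_tokens = torch.randn(2, 16, 512)          # pretend T5 outputs for 2 text queries
    instances = torch.randn(2, 8, 64, 3)         # 2 submaps, 8 instances, 64 points each
    tf, pf = model.encode_text(t5_tokens), model.encode_submap(instances)
    print(model.coarse_scores(tf, pf).shape, model.fine_offset(tf, pf).shape)
```

In the actual framework, the coarse scores would rank candidate submaps before the fine stage refines the position within the best match; the sketch only mirrors that two-stage interface.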
Related papers
- Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought [55.65577137924979]
We propose a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space. Experiments on three benchmarks demonstrate that NV-CoT significantly improves localization precision and final answer accuracy.
arXiv Detail & Related papers (2026-02-27T12:04:07Z) - Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection [16.398581898787608]
We propose a novel backbone, Fore-Mamba3D, to focus on foreground enhancement by modifying the Mamba-based encoder. Considering the response attenuation in the interaction of foreground voxels across different instances, we design a regional-to-global sliding window. Our method emphasizes foreground-only encoding and alleviates the distance-based and causal dependencies in the linear autoregressive model.
arXiv Detail & Related papers (2026-02-23T06:03:07Z) - TextMamba: Scene Text Detector with Mamba [6.992080935409672]
We propose a novel scene text detector based on Mamba that integrates the selection mechanism with attention layers. We adopt the Top_k algorithm to explicitly select key information and reduce the interference of irrelevant information in Mamba modeling. Our method achieves state-of-the-art or competitive performance on various benchmarks.
arXiv Detail & Related papers (2025-12-07T05:06:19Z) - AtrousMamba: An Atrous-Window Scanning Visual State Space Model for Remote Sensing Change Detection [29.004019252136565]
We propose a novel model, AtrousMamba, which balances the extraction of fine-grained local details with the integration of global contextual information. By leveraging the atrous window scan visual state space (AWVSS) module, we design dedicated end-to-end Mamba-based frameworks for binary change detection (BCD) and semantic change detection (SCD). Experimental results on six benchmark datasets show that the proposed framework outperforms existing CNN-based, Transformer-based, and Mamba-based methods.
arXiv Detail & Related papers (2025-07-22T02:36:16Z) - MOL: Joint Estimation of Micro-Expression, Optical Flow, and Landmark via Transformer-Graph-Style Convolution [46.600316142855334]
Facial micro-expression recognition (MER) is a challenging problem, due to transient and subtle micro-expression (ME) actions. We propose an end-to-end micro-action-aware deep learning framework with advantages from transformer, graph convolution, and vanilla convolution. Our framework outperforms the state-of-the-art MER methods on the CASME II, SAMM, and SMIC benchmarks.
arXiv Detail & Related papers (2025-06-17T13:35:06Z) - Text-Driven 3D Lidar Place Recognition for Autonomous Driving [2.3093110834423616]
We present Des4Pos, a novel two-stage text-driven remote sensing localization framework.
Experiments on the KITTI360Pose test set demonstrate that Des4Pos achieves state-of-the-art performance in text-to-point-cloud place recognition.
It attains a top-1 accuracy of 40% and a top-10 accuracy of 77% under a 5-meter radius threshold.
arXiv Detail & Related papers (2025-03-23T11:36:19Z) - Spectral Informed Mamba for Robust Point Cloud Processing [17.74824534094739]
This paper introduces a new methodology leveraging Mamba and Masked Autoencoder networks for point cloud data.
We propose three key contributions to enhance Mamba's capability in processing complex point cloud structures.
arXiv Detail & Related papers (2025-03-06T20:32:59Z) - CrossOver: 3D Scene Cross-Modal Alignment [78.3057713547313]
CrossOver is a novel framework for cross-modal 3D scene understanding.
It learns a unified, modality-agnostic embedding space for scenes by aligning modalities.
It supports robust scene retrieval and object localization, even with missing modalities.
arXiv Detail & Related papers (2025-02-20T20:05:30Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the first largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - OverlapMamba: Novel Shift State Space Model for LiDAR-based Place Recognition [10.39935021754015]
We develop OverlapMamba, a novel network for place recognition as sequences.
Our method effectively detects loop closures even when traversing previously visited locations from different directions.
Relying on raw range view inputs, it outperforms typical LiDAR and multi-view combination methods in time complexity and speed.
arXiv Detail & Related papers (2024-05-13T17:46:35Z) - A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion [14.293042131263924]
In image fusion tasks, images from different sources possess distinct characteristics.
Mamba, as a state space model, has emerged in the field of natural language processing.
Motivated by these challenges, we customize and improve the vision Mamba network designed for the image fusion task.
arXiv Detail & Related papers (2024-04-14T16:09:33Z) - Point Cloud Mamba: Point Cloud Learning via State Space Model [73.7454734756626]
We show that Mamba-based point cloud methods can outperform previous methods based on transformers or multi-layer perceptrons (MLPs).
Point Cloud Mamba surpasses the state-of-the-art (SOTA) point-based method PointNeXt and achieves new SOTA performance on the ScanObjectNN, ModelNet40, ShapeNetPart, and S3DIS datasets.
arXiv Detail & Related papers (2024-03-01T18:59:03Z) - Towards Compact 3D Representations via Point Feature Enhancement Masked Autoencoders [52.66195794216989]
We propose Point Feature Enhancement Masked Autoencoders (Point-FEMAE) to learn compact 3D representations.
Point-FEMAE consists of a global branch and a local branch to capture latent semantic features.
Our method significantly improves the pre-training efficiency compared to cross-modal alternatives.
arXiv Detail & Related papers (2023-12-17T14:17:05Z) - Text2Loc: 3D Point Cloud Localization from Natural Language [49.01851743372889]
We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions.
We introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text.
Text2Loc improves the localization accuracy by up to $2\times$ over the state-of-the-art on the KITTI360Pose dataset.
arXiv Detail & Related papers (2023-11-27T16:23:01Z) - Progressive Coordinate Transforms for Monocular 3D Object Detection [52.00071336733109]
We propose a novel and lightweight approach, dubbed Progressive Coordinate Transforms (PCT), to facilitate learning coordinate representations for monocular 3D object detection.
arXiv Detail & Related papers (2021-08-12T15:22:33Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.