HybridTM: Combining Transformer and Mamba for 3D Semantic Segmentation
- URL: http://arxiv.org/abs/2507.18575v1
- Date: Thu, 24 Jul 2025 16:48:50 GMT
- Title: HybridTM: Combining Transformer and Mamba for 3D Semantic Segmentation
- Authors: Xinyu Wang, Jinghua Hou, Zhe Liu, Yingying Zhu
- Abstract summary: We propose HybridTM, the first hybrid architecture that integrates Transformer and Mamba for 3D semantic segmentation. In addition, we propose the Inner Layer Hybrid Strategy, which combines attention and Mamba at a finer granularity. Our HybridTM achieves state-of-the-art performance on ScanNet, ScanNet200, and nuScenes benchmarks.
- Score: 7.663855540620183
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based methods have demonstrated remarkable capabilities in 3D semantic segmentation through their powerful attention mechanisms, but the quadratic complexity limits their modeling of long-range dependencies in large-scale point clouds. While recent Mamba-based approaches offer efficient processing with linear complexity, they struggle with feature representation when extracting 3D features. However, effectively combining these complementary strengths remains an open challenge in this field. In this paper, we propose HybridTM, the first hybrid architecture that integrates Transformer and Mamba for 3D semantic segmentation. In addition, we propose the Inner Layer Hybrid Strategy, which combines attention and Mamba at a finer granularity, enabling simultaneous capture of long-range dependencies and fine-grained local features. Extensive experiments demonstrate the effectiveness and generalization of our HybridTM on diverse indoor and outdoor datasets. Furthermore, our HybridTM achieves state-of-the-art performance on ScanNet, ScanNet200, and nuScenes benchmarks. The code will be made available at https://github.com/deepinact/HybridTM.
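The core idea of the abstract, mixing quadratic-cost attention with a linear-time state-space scan inside a single layer, can be illustrated with a toy, framework-free sketch. The names below (`inner_layer_hybrid`, `ssm_scan`) are hypothetical and not taken from the paper's code; the scan is reduced to a fixed-decay linear recurrence rather than a full Mamba selective scan:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(tokens):
    """Single-head self-attention over scalar tokens: O(n^2) pairwise scores."""
    out = []
    for q in tokens:
        weights = softmax([q * k for k in tokens])
        out.append(sum(w * v for w, v in zip(weights, tokens)))
    return out

def ssm_scan(tokens, decay=0.9):
    """Simplified stand-in for a Mamba-style scan: a fixed-decay
    linear recurrence processed in a single O(n) pass."""
    state, out = 0.0, []
    for x in tokens:
        state = decay * state + (1 - decay) * x
        out.append(state)
    return out

def inner_layer_hybrid(tokens):
    """Hypothetical inner-layer hybrid: an attention step followed by an
    SSM scan within the same layer, each with a residual connection."""
    attended = [t + a for t, a in zip(tokens, attention(tokens))]
    return [a + s for a, s in zip(attended, ssm_scan(attended))]
```

The sketch only conveys the structural point: attention captures pairwise (long-range) interactions at quadratic cost, while the recurrence aggregates sequential context in linear time, and the Inner Layer Hybrid Strategy interleaves both operators inside one layer rather than stacking separate Transformer and Mamba blocks.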
Related papers
- VMatcher: State-Space Semi-Dense Local Feature Matching [0.0]
VMatcher is a hybrid Mamba-Transformer network for semi-dense feature matching between image pairs. VMatcher integrates Mamba's highly efficient long-sequence processing with the Transformer's attention mechanism.
arXiv Detail & Related papers (2025-07-31T09:39:16Z) - MVNet: Hyperspectral Remote Sensing Image Classification Based on Hybrid Mamba-Transformer Vision Backbone Architecture [12.168520751389622]
Hyperspectral image (HSI) classification faces challenges such as high-dimensional data, limited training samples, and spectral redundancy. This paper proposes a novel MVNet network architecture that integrates 3D-CNN's local feature extraction, Transformer's global modeling, and Mamba's linear sequence modeling capabilities. On IN, UP, and KSC datasets, MVNet outperforms mainstream hyperspectral image classification methods in both classification accuracy and computational efficiency.
arXiv Detail & Related papers (2025-07-06T14:52:26Z) - Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection [88.47928738482719]
Linear State Space Models (SSMs) offer remarkable performance gains in sequence modeling. Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations. We introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts.
arXiv Detail & Related papers (2025-06-22T19:26:55Z) - MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection [4.757840725810513]
YOLO-series models have set strong benchmarks by balancing speed and accuracy. Transformers have high computational complexity because of their self-attention mechanism. We propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency.
arXiv Detail & Related papers (2025-06-04T07:46:24Z) - Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing [21.15110217419682]
We propose a lightweight Mamba-based binary neural network for efficient demosaicing of HybridEVS RAW images. Bi-Mamba binarizes all projections while retaining the core Selective Scan in full precision. We conduct quantitative and qualitative experiments to demonstrate the effectiveness of BMTNet in both performance and computational efficiency.
arXiv Detail & Related papers (2025-03-20T13:32:27Z) - UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection [64.65405058535262]
Recent advances in LiDAR 3D detection have demonstrated the effectiveness of Transformer-based frameworks in capturing the global dependencies from point cloud spaces. Due to the considerable number of 3D voxels and quadratic complexity of Transformers, multiple sequences are grouped before feeding to Transformers, leading to a limited receptive field. Inspired by the impressive performance of State Space Models (SSM) achieved in the field of 2D vision tasks, we propose a novel Unified Mamba (UniMamba). Specifically, a UniMamba block is designed which mainly consists of locality modeling, Z-order serialization and local-global sequential aggregator.
arXiv Detail & Related papers (2025-03-15T06:22:31Z) - Multi-granular body modeling with Redundancy-Free Spatiotemporal Fusion for Text-Driven Motion Generation [10.843503146808839]
We introduce HiSTF Mamba, a framework with three parts: Dual-Spatial Mamba, Bi-Temporal Mamba, and a Dynamic Spatiotemporal Fusion Module (DSFM). Experiments on the HumanML3D benchmark show that HiSTF Mamba performs well across several metrics, achieving high fidelity and tight semantic alignment between text and motion.
arXiv Detail & Related papers (2025-03-10T04:01:48Z) - ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z) - The Mamba in the Llama: Distilling and Accelerating Hybrid Models [76.64055251296548]
We show how to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model achieves performance comparable to the original Transformer in chat benchmarks. We also introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models.
arXiv Detail & Related papers (2024-08-27T17:56:11Z) - Prototype Learning Guided Hybrid Network for Breast Tumor Segmentation in DCE-MRI [58.809276442508256]
We propose a hybrid network via the combination of convolutional neural network (CNN) and transformer layers.
The experimental results on private and public DCE-MRI datasets demonstrate that the proposed hybrid network achieves superior performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-08-11T15:46:00Z) - HybridGait: A Benchmark for Spatial-Temporal Cloth-Changing Gait Recognition with Hybrid Explorations [66.5809637340079]
We propose the first in-the-wild benchmark CCGait for cloth-changing gait recognition.
We exploit both temporal dynamics and the projected 2D information of 3D human meshes.
Our contributions are twofold: we provide a challenging benchmark CCGait that captures realistic appearance changes across expanded time and space.
arXiv Detail & Related papers (2023-12-30T16:12:13Z) - 3D Mitochondria Instance Segmentation with Spatio-Temporal Transformers [101.44668514239959]
We propose a hybrid encoder-decoder framework that efficiently computes spatial and temporal attentions in parallel.
We also introduce a semantic clutter-background adversarial loss during training that aids in delineating mitochondria instances from the background.
arXiv Detail & Related papers (2023-03-21T17:58:49Z) - Hybrid Dual Mean-Teacher Network With Double-Uncertainty Guidance for Semi-Supervised Segmentation of MRI Scans [11.762045723792266]
We present a Hybrid Dual Mean-Teacher (HD-Teacher) model with hybrid, semi-supervised, and multi-task learning to achieve highly effective semi-supervised segmentation.
arXiv Detail & Related papers (2023-03-09T09:16:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.