Related papers: HTMNet: A Hybrid Network with Transformer-Mamba Bottleneck Multimodal Fusion for Transparent and Reflective Objects Depth Completion

URL: http://arxiv.org/abs/2505.20904v2
Date: Wed, 28 May 2025 08:36:38 GMT
Title: HTMNet: A Hybrid Network with Transformer-Mamba Bottleneck Multimodal Fusion for Transparent and Reflective Objects Depth Completion
Authors: Guanghu Xie, Yonglong Zhang, Zhiduo Jiang, Yang Liu, Zongwu Xie, Baoshi Cao, Hong Liu
Abstract summary: Transparent and reflective objects pose significant challenges for depth sensors. We propose HTMNet, a novel hybrid model integrating Transformer, CNN, and Mamba architectures. We introduce a novel multimodal fusion module grounded in self-attention mechanisms and state space models.
Score: 9.235004977824026
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transparent and reflective objects pose significant challenges for depth sensors, resulting in incomplete depth information that adversely affects downstream robotic perception and manipulation tasks. To address this issue, we propose HTMNet, a novel hybrid model integrating Transformer, CNN, and Mamba architectures. The encoder is based on a dual-branch CNN-Transformer framework, the bottleneck fusion module adopts a Transformer-Mamba architecture, and the decoder is built upon a multi-scale fusion module. We introduce a novel multimodal fusion module grounded in self-attention mechanisms and state space models, marking the first application of the Mamba architecture in the field of transparent object depth completion and revealing its promising potential. Additionally, we design an innovative multi-scale fusion module for the decoder that combines channel attention, spatial attention, and multi-scale feature extraction techniques to effectively integrate multi-scale features through a down-fusion strategy. Extensive evaluations on multiple public datasets demonstrate that our model achieves state-of-the-art (SOTA) performance, validating the effectiveness of our approach.
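Illustrative sketch: the abstract describes a decoder module that combines channel attention and spatial attention when merging features from different scales. The NumPy code below is a minimal, parameter-free illustration of those two gating ideas only. It is not the authors' implementation: the function names, the use of pooling statistics in place of learned layers, and the nearest-neighbour upsampling used to align scales are all assumptions made for this example, and the paper's down-fusion strategy is not reproduced.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat):
    """Reweight each channel of a (C, H, W) feature map.

    Assumption: the gate comes directly from global average pooling;
    a real module would pass the pooled vector through learned layers.
    """
    pooled = feat.mean(axis=(1, 2))       # (C,) one statistic per channel
    gate = sigmoid(pooled)                # (C,) values in (0, 1)
    return feat * gate[:, None, None]     # broadcast over H and W


def spatial_attention(feat):
    """Reweight each spatial location of a (C, H, W) feature map.

    Assumption: the gate sums channel-wise mean and max maps;
    a real module would usually apply a learned convolution here.
    """
    avg_map = feat.mean(axis=0)           # (H, W)
    max_map = feat.max(axis=0)            # (H, W)
    gate = sigmoid(avg_map + max_map)     # (H, W) values in (0, 1)
    return feat * gate[None, :, :]        # broadcast over channels


def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_scales(coarse, fine):
    """Hypothetical two-scale fusion: align, add, then gate.

    coarse: (C, H/2, W/2), fine: (C, H, W).
    """
    merged = upsample_nearest(coarse, 2) + fine
    return spatial_attention(channel_attention(merged))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coarse = rng.standard_normal((4, 8, 8))
    fine = rng.standard_normal((4, 16, 16))
    print(fuse_scales(coarse, fine).shape)
```

Because both gates lie strictly between 0 and 1, this toy version can only attenuate features; learned variants can reshape them far more flexibly.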

Related papers

OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation [91.45421429922506]
OneCAT is a unified multimodal model that seamlessly integrates understanding, generation, and editing. Our framework eliminates the need for external components such as Vision Transformers (ViT) or a vision tokenizer during inference.
arXiv Detail & Related papers (2025-09-03T17:29:50Z)
CSFMamba: Cross State Fusion Mamba Operator for Multimodal Remote Sensing Image Classification [12.959829835589453]
We propose the Cross State Fusion Mamba (CSFMamba) Network. Specifically, we first design a preprocessing module for remote sensing image information to meet the needs of the Mamba structure. Secondly, a cross-state module based on the Mamba operator is designed to fully fuse the features of the two modalities.
arXiv Detail & Related papers (2025-08-31T03:08:34Z)
An Efficient and Mixed Heterogeneous Model for Image Restoration [71.85124734060665]
Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. We propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion.
arXiv Detail & Related papers (2025-04-15T08:19:12Z)
Unity is Strength: Unifying Convolutional and Transformeral Features for Better Person Re-Identification [60.9670254833103]
Person Re-identification (ReID) aims to retrieve the specific person across non-overlapping cameras. We propose a novel fusion framework called FusionReID to unify the strengths of CNNs and Transformers for image-based person ReID.
arXiv Detail & Related papers (2024-12-23T03:19:19Z)
SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection [18.090706979440334]
Multimodal object detection leverages diverse modal information to enhance the accuracy and robustness of detectors. Current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at various depths of the network. In this paper, we introduce an accurate and efficient object detection method named SeaDATE.
arXiv Detail & Related papers (2024-10-15T07:26:39Z)
MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic Segmentation [8.443065903814821]
This study introduces a novel Multimodal Adapter-based Network (MANet) for multimodal remote sensing semantic segmentation. At the core of this approach is the development of a Multimodal Adapter (MMAdapter), which fine-tunes SAM's image encoder to effectively leverage the model's general knowledge for multimodal data. This work not only introduces a novel network for multimodal fusion, but also demonstrates, for the first time, SAM's powerful generalization capabilities with Digital Surface Model (DSM) data.
arXiv Detail & Related papers (2024-10-15T00:52:16Z)
A Hybrid Transformer-Mamba Network for Single Image Deraining [70.64069487982916]
Existing deraining Transformers employ self-attention mechanisms with fixed-range windows or along channel dimensions. We introduce a novel dual-branch hybrid Transformer-Mamba network, denoted as TransMamba, aimed at effectively capturing long-range rain-related dependencies.
arXiv Detail & Related papers (2024-08-31T10:03:19Z)
Reliable Object Tracking by Multimodal Hybrid Feature Extraction and Transformer-Based Fusion [18.138433117711177]
We propose a novel multimodal hybrid tracker (MMHT) that utilizes frame-event-based data for reliable single object tracking. The MMHT model employs a hybrid backbone consisting of an artificial neural network (ANN) and a spiking neural network (SNN) to extract dominant features from different visual modalities. Extensive experiments demonstrate that the MMHT model exhibits competitive performance in comparison with other state-of-the-art methods.
arXiv Detail & Related papers (2024-05-28T07:24:56Z)
FusionMamba: Efficient Remote Sensing Image Fusion with State Space Model [35.57157248152558]
Current deep learning (DL) methods typically employ convolutional neural networks (CNNs) or Transformers for feature extraction and information integration. We propose FusionMamba, an innovative method for efficient remote sensing image fusion.
arXiv Detail & Related papers (2024-04-11T17:29:56Z)
Dual Aggregation Transformer for Image Super-Resolution [92.41781921611646]
We propose a novel Transformer model, Dual Aggregation Transformer (DAT), for image SR. Our DAT aggregates features across spatial and channel dimensions, in an inter-block and intra-block dual manner. Our experiments show that our DAT surpasses current methods.
arXiv Detail & Related papers (2023-08-07T07:39:39Z)
General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation [35.100738362291416]
Multimodal AI seeks to exploit complementary data sources, particularly for complex tasks like semantic segmentation. Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance. We propose a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously.
arXiv Detail & Related papers (2023-07-07T04:58:34Z)
Multimodal Image Fusion based on Hybrid CNN-Transformer and Non-local Cross-modal Attention [12.167049432063132]
We present a hybrid model consisting of a convolutional encoder and a Transformer-based decoder to fuse multimodal images. A branch fusion module is designed to adaptively fuse the features of the two branches.
arXiv Detail & Related papers (2022-10-18T13:30:52Z)
DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation. We propose to leverage the Transformer to model this global context with an effective attention mechanism. Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
Multi-Scale Boosted Dehazing Network with Dense Feature Fusion [92.92572594942071]
We propose a Multi-Scale Boosted Dehazing Network with Dense Feature Fusion based on the U-Net architecture. We show that the proposed model performs favorably against the state-of-the-art approaches on the benchmark datasets as well as real-world hazy images.
arXiv Detail & Related papers (2020-04-28T09:34:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.