Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion
- URL: http://arxiv.org/abs/2405.03177v3
- Date: Sun, 22 Jun 2025 07:01:44 GMT
- Title: Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion
- Authors: Yunfeng Li, Bo Wang, Ye Li
- Abstract summary: The main problem in RGB-T tracking is the correct and optimal merging of the cross-modal features of visible and thermal images. CSTNet aims to achieve a direct fusion of cross-modal channel and spatial features in RGB-T tracking. CSTNet and CSTNet-small achieve real-time speeds of 21 fps and 33 fps on the Nvidia Jetson Xavier.
- Score: 4.963745612929956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The main problem in RGB-T tracking is the correct and optimal merging of the cross-modal features of visible and thermal images. Some previous methods either do not fully exploit the potential of RGB and TIR information for channel and spatial feature fusion or lack direct interaction between the template and the search area, which limits the model's ability to fully utilize the original semantic information of both modalities. To address these limitations, we investigate how to achieve a direct fusion of cross-modal channel and spatial features in RGB-T tracking and propose CSTNet. It uses a Vision Transformer (ViT) as the backbone and adds a Joint Spatial and Channel Fusion Module (JSCFM) and a Spatial Fusion Module (SFM), integrated between the transformer blocks, to facilitate cross-modal feature interaction. The JSCFM module achieves joint modeling of channel and multi-level spatial features. The SFM module includes a cross-attention-like architecture for cross-modal modeling and joint learning of RGB and TIR features. Comprehensive experiments show that CSTNet achieves state-of-the-art performance. To enhance practicality, we retrain the model without the JSCFM and SFM modules, using CSTNet as the pretraining weights, and propose CSTNet-small, which achieves a 50% speedup at the cost of an average 1-2% decrease in SR and PR. CSTNet and CSTNet-small run in real time at 21 fps and 33 fps on the Nvidia Jetson Xavier, meeting practical deployment requirements. Code is available at https://github.com/LiYunfengLYF/CSTNet.
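To make the SFM's cross-attention-like design concrete, here is a minimal PyTorch sketch of bidirectional cross-attention between RGB and TIR token sequences. It illustrates the general mechanism only; the class name, dimensions, and placement between ViT blocks are assumptions, not the released CSTNet code.

```python
import torch
import torch.nn as nn

class CrossModalFusionSketch(nn.Module):
    """Cross-attention-style fusion of RGB and TIR token sequences.

    A minimal sketch of the mechanism described in the abstract: each
    modality attends to the other and the result is added back residually.
    Illustrative only -- not the released CSTNet JSCFM/SFM modules.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.tir_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.rgb_to_tir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_tir = nn.LayerNorm(dim)

    def forward(self, rgb: torch.Tensor, tir: torch.Tensor):
        # rgb, tir: (B, N, C) token sequences from a shared ViT backbone,
        # e.g. concatenated template and search-region tokens.
        r, t = self.norm_rgb(rgb), self.norm_tir(tir)
        rgb_out, _ = self.tir_to_rgb(query=r, key=t, value=t)  # RGB queries TIR
        tir_out, _ = self.rgb_to_tir(query=t, key=r, value=r)  # TIR queries RGB
        return rgb + rgb_out, tir + tir_out  # residual cross-modal update

fuse = CrossModalFusionSketch()
rgb_f, tir_f = fuse(torch.randn(2, 256, 768), torch.randn(2, 256, 768))
print(rgb_f.shape, tir_f.shape)  # torch.Size([2, 256, 768]) twice
```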
Related papers
- HyPSAM: Hybrid Prompt-driven Segment Anything Model for RGB-Thermal Salient Object Detection [75.406055413928]
We propose a novel prompt-driven segment anything model (HyPSAM) for RGB-T SOD. DFNet employs dynamic convolution and multi-branch decoding to facilitate adaptive cross-modality interaction. Experiments on three public datasets demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-09-23T07:32:11Z)
- TUNI: Real-time RGB-T Semantic Segmentation with Unified Multi-Modal Feature Extraction and Cross-Modal Feature Fusion [11.878642970457646]
RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre-trained on RGB images to extract features from both RGB and infrared inputs. We propose TUNI, with an RGB-T encoder consisting of multiple stacked blocks that simultaneously perform multi-modal feature extraction and cross-modal fusion.
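As a rough, hypothetical illustration of a block that extracts per-modality features and fuses them in the same step, consider the sketch below; the specific layers and the residual injection are assumptions, not TUNI's published architecture.

```python
import torch
import torch.nn as nn

class UnifiedRGBTBlock(nn.Module):
    """One encoder block that both extracts features from each modality
    and fuses them, in the spirit of the TUNI summary above. Layer
    choices are illustrative assumptions, not the TUNI design."""

    def __init__(self, channels: int):
        super().__init__()
        # Shared-weight extraction applied to each modality separately.
        self.extract = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Lightweight cross-modal fusion: mix the two streams per pixel.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor):
        rgb, thermal = self.extract(rgb), self.extract(thermal)
        shared = self.fuse(torch.cat([rgb, thermal], dim=1))
        # Inject the fused signal back into both streams residually.
        return rgb + shared, thermal + shared
```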
arXiv Detail & Related papers (2025-09-12T07:02:45Z)
- Cross-modal State Space Modeling for Real-time RGB-thermal Wild Scene Semantic Segmentation [31.147154902692748]
The integration of RGB and thermal data can significantly improve semantic segmentation performance in wild environments for field robots. We introduce CM-SSM, an efficient RGB-thermal semantic segmentation architecture leveraging a cross-modal state space modeling (SSM) approach. CM-SSM achieves state-of-the-art performance on the CART dataset with fewer parameters and lower computational cost.
arXiv Detail & Related papers (2025-06-22T01:53:11Z)
- CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features [0.6143225301480709]
CSTrack focuses on modeling compact spatiotemporal features to achieve simple yet effective tracking. The new method achieves new SOTA results on mainstream RGB-X benchmarks.
arXiv Detail & Related papers (2025-05-26T02:53:12Z)
- Cross Fusion RGB-T Tracking with Bi-directional Adapter [8.425592063392857]
We propose a novel Cross Fusion RGB-T Tracking architecture (CFBT).
The effectiveness of CFBT relies on three newly designed cross-temporal information fusion modules.
Experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance.
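The "bi-directional adapter" named above suggests a lightweight bottleneck exchanging information between the two modality branches. A speculative sketch of that pattern, with assumed dimensions and placement:

```python
import torch
import torch.nn as nn

class BiDirectionalAdapterSketch(nn.Module):
    """Bottleneck adapters passing low-rank corrections between the RGB
    and TIR token streams. A speculative sketch of the adapter pattern,
    not the CFBT implementation."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.rgb_to_tir = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        self.tir_to_rgb = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))

    def forward(self, rgb: torch.Tensor, tir: torch.Tensor):
        # Each branch is updated with a correction computed from the other.
        return rgb + self.tir_to_rgb(tir), tir + self.rgb_to_tir(rgb)
```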
arXiv Detail & Related papers (2024-08-30T02:45:56Z)
- CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes [0.0]
Multimodal semantic segmentation methods suffer from high computational complexity and low inference speed.
We propose the Cosine Similarity Fusion Network (CSFNet) as a real-time RGB-X semantic segmentation model.
CSFNet has competitive accuracy with state-of-the-art methods while being state-of-the-art in terms of speed.
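One plausible reading of "cosine similarity fusion" is gating the auxiliary modality by how well its channels agree with the RGB features. A minimal sketch under that assumption, not CSFNet's published design:

```python
import torch
import torch.nn.functional as F

def cosine_similarity_fusion(rgb: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Fuse RGB and X (e.g. thermal/depth) feature maps of shape (B, C, H, W),
    gating each X channel by its cosine similarity to the matching RGB channel.
    A hedged sketch of the idea named by CSFNet, not its actual architecture."""
    b, c, _, _ = rgb.shape
    # Per-channel cosine similarity across spatial positions: (B, C)
    sim = F.cosine_similarity(rgb.flatten(2), x.flatten(2), dim=2)
    gate = sim.sigmoid().view(b, c, 1, 1)
    # Channels of X that agree with RGB contribute more to the fusion.
    return rgb + gate * x
```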
arXiv Detail & Related papers (2024-07-01T14:34:32Z)
- WCCNet: Wavelet-integrated CNN with Crossmodal Rearranging Fusion for Fast Multispectral Pedestrian Detection [16.43119521684829]
We propose a novel framework named WCCNet that is able to differentially extract rich features of different spectra with lower computational complexity.
Based on the well extracted features, we elaborately design the crossmodal rearranging fusion module (CMRF)
We conduct comprehensive evaluations on KAIST and FLIR benchmarks, in which WCCNet outperforms state-of-the-art methods with considerable computational efficiency and competitive accuracy.
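For reference, the wavelet decomposition such a network builds on can be as simple as a single-level 2D Haar transform, which splits a feature map into one low-frequency band and three detail bands at half resolution. The sketch below shows only this standard transform, not WCCNet's differential extraction:

```python
import torch

def haar_dwt2d(x: torch.Tensor):
    """Single-level 2D Haar wavelet transform of (B, C, H, W), H and W even.
    Standard transform only -- WCCNet's wavelet integration is more involved."""
    a = x[..., ::2, ::2]    # top-left of each 2x2 block
    b = x[..., ::2, 1::2]   # top-right
    c = x[..., 1::2, ::2]   # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # low-frequency approximation
    lh = (a + b - c - d) / 2  # vertical detail
    hl = (a - b + c - d) / 2  # horizontal detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh
```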
arXiv Detail & Related papers (2023-08-02T09:35:21Z)
- Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation [19.41334573257174]
Traditional methods mostly use RGB images, which are heavily affected by lighting conditions, e.g., darkness.
Recent studies show thermal images are robust to the night scenario as a compensating modality for segmentation.
This work proposes a Residual Spatial Fusion Network (RSFNet) for RGB-T semantic segmentation.
arXiv Detail & Related papers (2023-06-17T14:28:08Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection [12.126413875108993]
We propose a cross-modality fusion model SwinNet for RGB-D and RGB-T salient object detection.
The proposed model outperforms the state-of-the-art models on RGB-D and RGB-T datasets.
arXiv Detail & Related papers (2022-04-12T07:37:39Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that transformer is a uniform operation which presents great efficacy in both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
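The "uniform operation" claim can be pictured by concatenating the tokens of both modalities and running one standard transformer layer over them, so intra-modal enhancement and cross-modal fusion happen in the same attention pass. A toy sketch with assumed token counts and dimensions:

```python
import torch
import torch.nn as nn

# One standard transformer layer over the concatenated RGB and depth tokens:
# self-attention then covers both intra-modal enhancement and cross-modal
# fusion in a single pass. Shapes are illustrative assumptions.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)

rgb_tokens = torch.randn(2, 196, 256)    # assumed 14x14 patch grid
depth_tokens = torch.randn(2, 196, 256)
tokens = torch.cat([rgb_tokens, depth_tokens], dim=1)  # (2, 392, 256)

fused = layer(tokens)                    # attends within and across modalities
rgb_out, depth_out = fused.split(196, dim=1)
```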
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- Middle-level Fusion for Lightweight RGB-D Salient Object Detection [81.43951906434175]
A novel lightweight RGB-D SOD model is presented in this paper.
With IMFF and L modules incorporated in the middle-level fusion structure, our proposed model has only 3.9M parameters and runs at 33 FPS.
The experimental results on several benchmark datasets verify the effectiveness and superiority of the proposed method over some state-of-the-art methods.
arXiv Detail & Related papers (2021-04-23T11:37:15Z)
- Self-Supervised Representation Learning for RGB-D Salient Object Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets to perform pre-training, which makes the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
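A toy version of the cross-modal auto-encoder pretext task: encode one modality and reconstruct the other, so pre-training needs paired images but no annotations. The networks below are placeholders, not the paper's architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder encoder/decoder; the paper's networks are more elaborate.
encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
decoder = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 1, 3, padding=1))  # predict depth

rgb = torch.randn(4, 3, 64, 64)    # unlabeled RGB images
depth = torch.randn(4, 1, 64, 64)  # paired depth maps (still no labels)

pred = decoder(encoder(rgb))       # RGB -> depth reconstruction
loss = F.l1_loss(pred, depth)      # pretext objective
loss.backward()                    # drives representation learning
```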
arXiv Detail & Related papers (2021-01-29T09:16:06Z)
- RGB-D Salient Object Detection with Cross-Modality Modulation and Selection [126.4462739820643]
We present an effective method to progressively integrate and refine the cross-modality complementarities for RGB-D salient object detection (SOD).
The proposed network mainly solves two challenging issues: 1) how to effectively integrate the complementary information from RGB image and its corresponding depth map, and 2) how to adaptively select more saliency-related features.
arXiv Detail & Related papers (2020-07-14T14:22:50Z)
- Jointly Modeling Motion and Appearance Cues for Robust RGB-T Tracking [85.333260415532]
We develop a novel late fusion method to infer the fusion weight maps of both RGB and thermal (T) modalities.
When the appearance cue is unreliable, we take motion cues into account to make the tracker robust.
Numerous results on three recent RGB-T tracking datasets show that the proposed tracker performs significantly better than other state-of-the-art algorithms.
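The late-fusion step described above can be pictured as predicting per-pixel weight maps over the two modalities' response maps and blending them. A minimal sketch with a placeholder weight predictor, not the paper's network:

```python
import torch
import torch.nn as nn

class LateFusionWeightsSketch(nn.Module):
    """Predict per-pixel fusion weights for RGB and thermal response maps
    and blend them. The one-layer weight predictor is a placeholder."""

    def __init__(self):
        super().__init__()
        # Infer two weight maps from the stacked single-channel responses.
        self.predict = nn.Conv2d(2, 2, kernel_size=3, padding=1)

    def forward(self, rgb_resp: torch.Tensor, t_resp: torch.Tensor):
        stacked = torch.cat([rgb_resp, t_resp], dim=1)   # (B, 2, H, W)
        w = torch.softmax(self.predict(stacked), dim=1)  # weights sum to 1 per pixel
        return w[:, 0:1] * rgb_resp + w[:, 1:2] * t_resp
```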
arXiv Detail & Related papers (2020-07-04T08:11:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and accepts no responsibility for any consequences of its use.