X Modality Assisting RGBT Object Tracking
- URL: http://arxiv.org/abs/2312.17273v2
- Date: Mon, 24 Feb 2025 15:06:13 GMT
- Title: X Modality Assisting RGBT Object Tracking
- Authors: Zhaisheng Ding, Haiyan Li, Ruichao Hou, Yanyu Liu, Shidong Xie,
- Abstract summary: A novel X Modality Assisting Network (X-Net) is introduced, which explores the impact of the fusion paradigm by decoupling visual object tracking into three distinct levels. X-Net achieves performance gains of 0.47%/1.2% in the average precision rate and success rate.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing robust multi-modal feature representations is crucial for enhancing object tracking performance. In pursuit of this objective, a novel X Modality Assisting Network (X-Net) is introduced, which explores the impact of the fusion paradigm by decoupling visual object tracking into three distinct levels, thereby facilitating subsequent processing. Initially, to overcome the challenges associated with feature learning due to significant discrepancies between RGB and thermal modalities, a plug-and-play pixel-level generation module (PGM) based on knowledge distillation learning is proposed. This module effectively generates the X modality, bridging the gap between the two patterns while minimizing noise interference. Subsequently, to optimize sample feature representation and promote cross-modal interactions, a feature-level interaction module (FIM) is introduced, integrating a mixed feature interaction transformer and a spatial dimensional feature translation strategy. Finally, to address random drifting caused by missing instance features, a flexible online optimization strategy called the decision-level refinement module (DRM) is proposed, which incorporates optical flow and refinement mechanisms. The efficacy of X-Net is validated through experiments on three benchmarks, demonstrating its superiority over state-of-the-art trackers. Notably, X-Net achieves performance gains of 0.47%/1.2% in the average precision rate and success rate, respectively. Additionally, the research content, data, and code are pledged to be made publicly accessible at https://github.com/DZSYUNNAN/XNet.
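The three-level decoupling described in the abstract can be sketched as a simple pipeline. This is a hypothetical simplification for illustration only: the real PGM, FIM, and DRM are learned neural modules, whereas the stand-ins below reduce each level to elementary arithmetic (a pixel blend, an elementwise mix, and a flow-based box shift), and all function names are invented for this sketch.

```python
# Illustrative sketch of X-Net's three-level decoupling.
# All modules here are hypothetical stand-ins, not the paper's implementation.

def pixel_level_generation(rgb, thermal, alpha=0.5):
    """PGM stand-in: fuse RGB and thermal pixels into an 'X modality'.
    The real module is a distillation-trained generator; here we simply
    blend the two inputs with a fixed weight."""
    return [alpha * r + (1 - alpha) * t for r, t in zip(rgb, thermal)]

def feature_level_interaction(feat_a, feat_b):
    """FIM stand-in: cross-modal interaction reduced to elementwise mixing
    (the paper uses a mixed feature interaction transformer)."""
    return [a + b for a, b in zip(feat_a, feat_b)]

def decision_level_refinement(box, flow):
    """DRM stand-in: refine the predicted box (x, y, w, h) by an
    optical-flow offset (dx, dy)."""
    x, y, w, h = box
    dx, dy = flow
    return (x + dx, y + dy, w, h)

def track_frame(rgb, thermal, prev_box, flow):
    """Run the three levels in order for one frame."""
    x_mod = pixel_level_generation(rgb, thermal)
    feats = feature_level_interaction(x_mod, thermal)
    # A real tracker would localize the target from `feats`; this sketch
    # reuses the previous box and applies only the flow-based refinement.
    return decision_level_refinement(prev_box, flow)
```

For example, `track_frame([0.2, 0.4], [0.6, 0.8], (10, 10, 5, 5), (1, -1))` shifts the previous box by the flow offset and returns `(11, 9, 5, 5)`. The point of the sketch is the separation of concerns: each level consumes the output of the previous one, so any single module can be swapped out independently.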
Related papers
- CFMD: Dynamic Cross-layer Feature Fusion for Salient Object Detection [7.262250906929891]
Cross-layer feature pyramid networks (CFPNs) have achieved notable progress in multi-scale feature fusion and boundary detail preservation for salient object detection.
To address these challenges, we propose CFMD, a novel cross-layer feature pyramid network that introduces two key innovations.
First, we design a context-aware feature aggregation module (CFLMA), which incorporates the state-of-the-art Mamba architecture to construct a dynamic weight distribution mechanism.
Second, we introduce an adaptive dynamic upsampling unit (CFLMD) that preserves spatial details during resolution recovery.
arXiv Detail & Related papers (2025-04-02T03:22:36Z) - SMamba: Sparse Mamba for Event-based Object Detection [17.141967728323714]
Transformer-based methods have achieved remarkable performance in event-based object detection, owing to the global modeling ability.
To mitigate cost, some researchers propose window attention based sparsification strategies to discard unimportant regions.
We propose Sparse Mamba, which performs adaptive sparsification to reduce computational effort while maintaining global modeling ability.
arXiv Detail & Related papers (2025-01-21T08:33:32Z) - Dual Mutual Learning Network with Global-local Awareness for RGB-D Salient Object Detection [10.353412441955436]
We propose the GL-DMNet, a novel dual mutual learning network with global-local awareness.
We present a position mutual fusion module and a channel mutual fusion module to exploit the interdependencies among different modalities.
Our proposed GL-DMNet performs better than 24 RGB-D SOD methods, achieving an average improvement of 3%.
arXiv Detail & Related papers (2025-01-03T05:37:54Z) - Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection [70.84835546732738]
RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images.
Traditional encoder-decoder architectures may not have adequately considered the robustness against noise originating from defective modalities.
We propose the ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy.
arXiv Detail & Related papers (2024-12-02T14:44:39Z) - PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
The integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN)
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - Unified Single-Stage Transformer Network for Efficient RGB-T Tracking [47.88113335927079]
We propose a single-stage Transformer RGB-T tracking network, namely USTrack, which unifies the above three stages into a single ViT (Vision Transformer) backbone.
With this structure, the network can extract fusion features of the template and search region under the mutual interaction of modalities.
Experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance while maintaining the fastest inference speed of 84.2 FPS.
arXiv Detail & Related papers (2023-08-26T05:09:57Z) - MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection [54.52102265418295]
We propose a novel and effective Multi-Level Fusion network, named as MLF-DET, for high-performance cross-modal 3D object DETection.
For the feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features.
For the decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module, which exploits image semantics to rectify the confidence of detection candidates.
arXiv Detail & Related papers (2023-07-18T11:26:02Z) - An Efficient End-to-End Transformer with Progressive Tri-modal Attention for Multi-modal Emotion Recognition [27.96711773593048]
We propose the multi-modal end-to-end transformer (ME2ET), which can effectively model the tri-modal features interaction.
At the low-level, we propose the progressive tri-modal attention, which can model the tri-modal feature interactions by adopting a two-pass strategy.
At the high-level, we introduce the tri-modal feature fusion layer to explicitly aggregate the semantic representations of three modalities.
arXiv Detail & Related papers (2022-09-20T14:51:38Z) - Modal-Adaptive Gated Recoding Network for RGB-D Salient Object Detection [2.9153096940947796]
We propose a novel gated recoding network (GRNet) to evaluate the information validity of the two modes.
A perception encoder is adopted to extract multi-level single-modal features.
A modal-adaptive gate unit is proposed to suppress the invalid information and transfer the effective modal features to the recoding mixer and the hybrid branch decoder.
arXiv Detail & Related papers (2021-08-13T15:08:21Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature-passing (i.e., transmission and receiving) simultaneously, before the fusion and decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z) - RGB-D Salient Object Detection with Cross-Modality Modulation and Selection [126.4462739820643]
We present an effective method to progressively integrate and refine the cross-modality complementarities for RGB-D salient object detection (SOD).
The proposed network mainly solves two challenging issues: 1) how to effectively integrate the complementary information from RGB image and its corresponding depth map, and 2) how to adaptively select more saliency-related features.
arXiv Detail & Related papers (2020-07-14T14:22:50Z) - Hierarchical Dynamic Filtering Network for RGB-D Salient Object Detection [91.43066633305662]
The main purpose of RGB-D salient object detection (SOD) is to better integrate and utilize cross-modal fusion information.
In this paper, we explore these issues from a new perspective.
We implement a kind of more flexible and efficient multi-scale cross-modal feature processing.
arXiv Detail & Related papers (2020-07-13T07:59:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.