X Modality Assisting RGBT Object Tracking
- URL: http://arxiv.org/abs/2312.17273v1
- Date: Wed, 27 Dec 2023 05:38:54 GMT
- Title: X Modality Assisting RGBT Object Tracking
- Authors: Zhaisheng Ding, Haiyan Li, Ruichao Hou, Yanyu Liu, Shidong Xie,
Dongming Zhou and Jinde Cao
- Abstract summary: We propose a novel X Modality Assisting Network (X-Net) to shed light on the impact of the fusion paradigm.
To tackle the feature learning hurdles stemming from significant differences between RGB and thermal modalities, a plug-and-play pixel-level generation module (PGM) is proposed.
We also propose a feature-level interaction module (FIM) that incorporates a mixed feature interaction transformer and a spatial-dimensional feature translation strategy.
- Score: 36.614908357546035
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Learning robust multi-modal feature representations is critical for boosting
tracking performance. To this end, we propose a novel X Modality Assisting
Network (X-Net) to shed light on the impact of the fusion paradigm by
decoupling visual object tracking into three distinct levels, facilitating
subsequent processing. Firstly, to tackle the feature learning hurdles stemming
from significant differences between RGB and thermal modalities, a
plug-and-play pixel-level generation module (PGM) is proposed based on
self-knowledge distillation learning, which effectively generates X modality to
bridge the gap between the dual patterns while reducing noise interference.
Subsequently, to further achieve the optimal sample feature representation and
facilitate cross-modal interactions, we propose a feature-level interaction
module (FIM) that incorporates a mixed feature interaction transformer and a
spatial-dimensional feature translation strategy. Ultimately, to address random
drift caused by missing instance features, we propose a flexible online
optimization strategy called the decision-level refinement module (DRM), which
contains optical flow and refinement mechanisms. Experiments are conducted on
three benchmarks to verify that the proposed X-Net outperforms state-of-the-art
trackers.
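To make the three-level split concrete, here is a minimal PyTorch sketch of how pixel-, feature-, and decision-level modules could be wired together. Every name, shape, and layer choice below is an illustrative assumption, not the authors' released code; in particular, the self-knowledge distillation training of PGM and the optical-flow machinery of DRM are omitted.

```python
# Illustrative sketch of the pixel/feature/decision decomposition described
# in the abstract. Shapes and internals are assumptions, not the authors'
# implementation; training (e.g. self-knowledge distillation for PGM) is
# omitted.
import torch
import torch.nn as nn


class PGM(nn.Module):
    """Pixel-level generation: synthesize an X modality from RGB + thermal."""

    def __init__(self):
        super().__init__()
        self.gen = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1),   # 3 RGB channels + 1 thermal
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 3, 3, padding=1),   # 3-channel X modality
        )

    def forward(self, rgb, thermal):
        return self.gen(torch.cat([rgb, thermal], dim=1))


class FIM(nn.Module):
    """Feature-level interaction: mix backbone features with a small
    transformer acting on flattened spatial tokens."""

    def __init__(self, dim=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feats):                      # feats: (B, C, H, W)
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H*W, C)
        return self.mixer(tokens).transpose(1, 2).view(b, c, h, w)


def drm_refine(box, prev_box, max_shift=20.0):
    """Decision-level refinement stub: damp per-frame box drift by clamping
    the displacement, standing in for the optical-flow-based refinement."""
    shift = (box - prev_box).clamp(-max_shift, max_shift)
    return prev_box + shift
```

In this reading, PGM acts on raw pixels before the backbone, FIM on intermediate features, and DRM on the final box prediction, mirroring the pixel/feature/decision ordering in the abstract.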
Related papers
- PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
The integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called the Point-Voxel Attention Fusion Network (PVAFN).
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - Unified Single-Stage Transformer Network for Efficient RGB-T Tracking [47.88113335927079]
We propose a single-stage Transformer RGB-T tracking network, namely USTrack, which unifies the three stages of the conventional RGB-T tracking pipeline into a single ViT (Vision Transformer) backbone.
With this structure, the network can extract fusion features of the template and search region under the mutual interaction of modalities.
Experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance while maintaining the fastest inference speed of 84.2 FPS.
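Read as pseudocode, the single-stage idea amounts to embedding template and search crops from both modalities into one token sequence and letting a shared transformer fuse them through attention. The sketch below is a hedged approximation with assumed shapes, not the USTrack code:

```python
import torch
import torch.nn as nn


class SingleStageFusionViT(nn.Module):
    """Illustrative single-stage RGB-T fusion: a shared transformer encoder
    over the concatenated template/search tokens of both modalities."""

    def __init__(self, dim=256, depth=4, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def tokens(self, img):                        # img: (B, 3, H, W)
        return self.embed(img).flatten(2).transpose(1, 2)

    def forward(self, z_rgb, z_tir, x_rgb, x_tir):
        # z_*: template crops, x_*: search-region crops
        seq = torch.cat([self.tokens(z_rgb), self.tokens(z_tir),
                         self.tokens(x_rgb), self.tokens(x_tir)], dim=1)
        return self.encoder(seq)                  # fused tokens for a box head
```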
arXiv Detail & Related papers (2023-08-26T05:09:57Z) - MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection [54.52102265418295]
We propose a novel and effective Multi-Level Fusion network, named MLF-DET, for high-performance cross-modal 3D object DETection.
For the feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features.
For the decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module, which exploits image semantics to rectify the confidence of detection candidates.
arXiv Detail & Related papers (2023-07-18T11:26:02Z) - An Efficient End-to-End Transformer with Progressive Tri-modal Attention for Multi-modal Emotion Recognition [27.96711773593048]
We propose the multi-modal end-to-end transformer (ME2ET), which can effectively model the tri-modal feature interactions.
At the low-level, we propose the progressive tri-modal attention, which can model the tri-modal feature interactions by adopting a two-pass strategy.
At the high-level, we introduce the tri-modal feature fusion layer to explicitly aggregate the semantic representations of three modalities.
arXiv Detail & Related papers (2022-09-20T14:51:38Z) - Modal-Adaptive Gated Recoding Network for RGB-D Salient Object Detection [2.9153096940947796]
We propose a novel gated recoding network (GRNet) to evaluate the information validity of the two modalities.
A perception encoder is adopted to extract multi-level single-modal features.
A modal-adaptive gate unit is proposed to suppress the invalid information and transfer the effective modal features to the recoding mixer and the hybrid branch decoder.
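A common way to realize such a gate, sketched here as an assumption rather than GRNet's actual design, is to predict a per-channel sigmoid weight from both modalities and blend them accordingly:

```python
import torch
import torch.nn as nn


class ModalAdaptiveGate(nn.Module):
    """Illustrative gate: per-channel weights decide how much each modality
    contributes, suppressing the less reliable one."""

    def __init__(self, channels=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),              # global context per channel
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),                         # gate g in (0, 1)
        )

    def forward(self, f_rgb, f_depth):            # (B, C, H, W) each
        g = self.gate(torch.cat([f_rgb, f_depth], dim=1))
        return g * f_rgb + (1.0 - g) * f_depth    # gated fusion
```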
arXiv Detail & Related papers (2021-08-13T15:08:21Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z) - RGB-D Salient Object Detection with Cross-Modality Modulation and Selection [126.4462739820643]
We present an effective method to progressively integrate and refine the cross-modality complementarities for RGB-D salient object detection (SOD).
The proposed network mainly solves two challenging issues: 1) how to effectively integrate the complementary information from RGB image and its corresponding depth map, and 2) how to adaptively select more saliency-related features.
arXiv Detail & Related papers (2020-07-14T14:22:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.