Cross-modal Orthogonal High-rank Augmentation for RGB-Event
Transformer-trackers
- URL: http://arxiv.org/abs/2307.04129v2
- Date: Tue, 5 Sep 2023 02:09:30 GMT
- Title: Cross-modal Orthogonal High-rank Augmentation for RGB-Event
Transformer-trackers
- Authors: Zhiyu Zhu, Junhui Hou, and Dapeng Oliver Wu
- Abstract summary: We explore the great potential of a pre-trained vision Transformer (ViT) to bridge the vast distribution gap between two modalities.
We propose a mask modeling strategy that randomly masks a specific modality of some tokens, forcing tokens from different modalities to interact proactively.
Experiments demonstrate that our plug-and-play training augmentation techniques can significantly boost state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate.
- Score: 58.802352477207094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of cross-modal object tracking from RGB
videos and event data. Rather than constructing a complex cross-modal fusion
network, we explore the great potential of a pre-trained vision Transformer
(ViT). Particularly, we delicately investigate plug-and-play training
augmentations that encourage the ViT to bridge the vast distribution gap
between the two modalities, enabling comprehensive cross-modal information
interaction and thus enhancing its ability. Specifically, we propose a mask
modeling strategy that randomly masks a specific modality of some tokens,
forcing tokens from different modalities to interact proactively. To mitigate
network oscillations resulting from the masking
strategy and further amplify its positive effect, we then theoretically propose
an orthogonal high-rank loss to regularize the attention matrix. Extensive
experiments demonstrate that our plug-and-play training augmentation techniques
can significantly boost state-of-the-art one-stream and two-stream trackers to a
large extent in terms of both tracking precision and success rate. Our new
perspective and findings will potentially bring insights to the field of
leveraging powerful pre-trained ViTs to model cross-modal data. The code will
be publicly available.
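To make the mask modeling strategy concrete, the following is a minimal PyTorch-style sketch of per-token random modality masking. It assumes the RGB and event inputs have already been tokenized into tensors of shape (B, N, D); the function name, the mask ratio, and the 50/50 modality choice are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def random_modality_mask(rgb_tokens, event_tokens, mask_ratio=0.5):
    """Hypothetical sketch: randomly mask one modality per selected token position.

    rgb_tokens, event_tokens: (B, N, D) token embeddings from the two modalities.
    For a random subset of the N positions, the token of one randomly chosen
    modality is zeroed out, so the remaining modality must supply the missing
    information through cross-modal attention.
    """
    B, N, _ = rgb_tokens.shape
    # Positions whose tokens will be masked in exactly one modality.
    masked = torch.rand(B, N, device=rgb_tokens.device) < mask_ratio
    # For each masked position, pick which modality to drop (False = RGB, True = event).
    drop_event = torch.rand(B, N, device=rgb_tokens.device) < 0.5
    rgb_keep = ~(masked & ~drop_event)
    event_keep = ~(masked & drop_event)
    rgb_out = rgb_tokens * rgb_keep.unsqueeze(-1).to(rgb_tokens.dtype)
    event_out = event_tokens * event_keep.unsqueeze(-1).to(event_tokens.dtype)
    return rgb_out, event_out
```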
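The orthogonal high-rank regularization can be illustrated with a common orthogonality penalty on the attention matrix: pushing the Gram matrix of the attention rows toward the identity encourages mutually orthogonal rows and thus keeps the matrix close to full rank. This is a hedged sketch of that general idea, not the exact loss derived in the paper.

```python
import torch

def orthogonal_high_rank_loss(attn):
    """Sketch of an orthogonality penalty on attention weights (illustrative only).

    attn: (B, H, N, N) attention matrices. Penalizing the deviation of
    attn @ attn^T from the identity encourages orthogonal rows, which
    discourages collapsed, low-rank attention patterns.
    """
    B, H, N, _ = attn.shape
    eye = torch.eye(N, device=attn.device, dtype=attn.dtype).expand(B, H, N, N)
    gram = attn @ attn.transpose(-1, -2)  # (B, H, N, N)
    return ((gram - eye) ** 2).mean()
```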
Related papers
- Multi-layer Learnable Attention Mask for Multimodal Tasks [2.378535917357144]
The Learnable Attention Mask (LAM) is strategically designed to globally regulate attention maps and prioritize critical tokens.
LAM adeptly captures associations between tokens in a BERT-like transformer network.
The approach is validated through comprehensive experiments on various datasets, such as MADv2, QVHighlights, ImageNet 1K, and MSRVTT.
arXiv Detail & Related papers (2024-06-04T20:28:02Z) - GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer [44.44603063754173]
Cross-modal transformers have demonstrated superiority in various vision tasks by effectively integrating different modalities.
We propose GeminiFusion, a pixel-wise fusion approach that capitalizes on aligned cross-modal representations.
We employ a layer-adaptive noise to adaptively control their interplay on a per-layer basis, thereby achieving a harmonized fusion process.
arXiv Detail & Related papers (2024-06-03T11:24:15Z) - Hyper-Transformer for Amodal Completion [82.4118011026855]
Amodal object completion is a complex task that involves predicting the invisible parts of an object based on visible segments and background information.
We introduce a novel framework named the Hyper-Transformer Amodal Network (H-TAN)
This framework utilizes a hyper transformer equipped with a dynamic convolution head to directly learn shape priors and accurately predict amodal masks.
arXiv Detail & Related papers (2024-05-30T11:11:54Z) - Cross-BERT for Point Cloud Pretraining [61.762046503448936]
We propose a new cross-modal BERT-style self-supervised learning paradigm, called Cross-BERT.
To facilitate pretraining for irregular and sparse point clouds, we design two self-supervised tasks to boost cross-modal interaction.
Our work highlights the effectiveness of leveraging cross-modal 2D knowledge to strengthen 3D point cloud representation and the transferable capability of BERT across modalities.
arXiv Detail & Related papers (2023-12-08T08:18:12Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Mutual Information Regularization for Weakly-supervised RGB-D Salient
Object Detection [33.210575826086654]
We present a weakly-supervised RGB-D salient object detection model.
We focus on effective multimodal representation learning via inter-modal mutual information regularization.
arXiv Detail & Related papers (2023-06-06T12:36:57Z) - Activating More Pixels in Image Super-Resolution Transformer [53.87533738125943]
Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution.
We propose a novel Hybrid Attention Transformer (HAT) to activate more input pixels for better reconstruction.
Our overall method significantly outperforms the state-of-the-art methods by more than 1dB.
arXiv Detail & Related papers (2022-05-09T17:36:58Z) - Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
arXiv Detail & Related papers (2022-04-19T07:47:50Z) - Cross-Modality Fusion Transformer for Multispectral Object Detection [0.0]
Multispectral image pairs can provide the combined information, making object detection applications more reliable and robust.
We present a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT) in this paper.
arXiv Detail & Related papers (2021-10-30T15:34:12Z)