RGB-T Tracking via Multi-Modal Mutual Prompt Learning
- URL: http://arxiv.org/abs/2308.16386v1
- Date: Thu, 31 Aug 2023 01:13:01 GMT
- Title: RGB-T Tracking via Multi-Modal Mutual Prompt Learning
- Authors: Yang Luo, Xiqing Guo, Hui Feng, Lei Ao
- Abstract summary: Object tracking based on the fusion of visible and thermal images, known as RGB-T tracking, has gained increasing attention from researchers in recent years.
We propose a tracking architecture based on mutual prompt learning between the two modalities.
We also design a lightweight prompter that incorporates attention mechanisms in two dimensions to transfer information from one modality to the other with lower computational costs.
- Score: 5.301062575633768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object tracking based on the fusion of visible and thermal images, known as
RGB-T tracking, has gained increasing attention from researchers in recent
years. How to achieve a more comprehensive fusion of information from the two
modalities at lower computational cost has been a problem that researchers
have been exploring. Recently, with the rise of prompt learning in computer
vision, we can better transfer knowledge from large vision models to downstream
tasks. Considering the strong complementarity between the visible and thermal
modalities, we propose a tracking architecture based on mutual prompt learning
between the two modalities. We also design a lightweight prompter that
incorporates attention mechanisms in two dimensions to transfer information
from one modality to the other at lower computational cost, embedding it
into each layer of the backbone. Extensive experiments have demonstrated that
our proposed tracking architecture is effective and efficient, achieving
state-of-the-art performance while maintaining high running speeds.
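The abstract describes the two-dimensional attention prompter only at a high level. As a rough illustration, the PyTorch sketch below shows one plausible reading, in which channel attention and spatial attention (the two dimensions) distill a prompt from one modality's features and inject it into the other's at a given backbone layer. The class name, the SE/CBAM-style attention blocks, and the residual injection are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn

class LightweightPrompter(nn.Module):
    """Hypothetical cross-modal prompter: attends over the channel and
    spatial dimensions of the source modality's features to build a prompt,
    then adds it to the target modality's features. Illustrative only."""

    def __init__(self, dim, reduction=8):
        super().__init__()
        # Channel attention (SE-style): squeeze spatially, excite channels.
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention (CBAM-style): squeeze channels, excite locations.
        self.spatial_attn = nn.Sequential(
            nn.Conv2d(dim, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, src, tgt):
        prompt = src * self.channel_attn(src)        # weight channels
        prompt = prompt * self.spatial_attn(prompt)  # weight spatial locations
        return tgt + prompt                          # inject prompt residually

# Mutual prompting at one backbone layer: each modality prompts the other.
rgb = torch.randn(1, 256, 16, 16)  # visible-modality features
tir = torch.randn(1, 256, 16, 16)  # thermal-modality features
tir_prompted = LightweightPrompter(256)(rgb, tir)  # visible -> thermal
rgb_prompted = LightweightPrompter(256)(tir, rgb)  # thermal -> visible
```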
Related papers
- Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents [60.095739427926524]
Long videos, characterized by temporal and sparse task-relevant information, pose significant reasoning challenges for AI systems. Inspired by human progressive visual cognition, we propose CogniGPT for efficient and reliable long video understanding.
arXiv Detail & Related papers (2025-09-29T15:42:55Z) - From Two-Stream to One-Stream: Efficient RGB-T Tracking via Mutual Prompt Learning and Knowledge Distillation [9.423279246172923]
Inspired by visual prompt learning, we designed a novel two-stream RGB-T tracking architecture based on cross-modal mutual prompt learning.
Our teacher model achieved the highest precision rate, while the student model, with comparable precision, achieved an inference speed more than three times faster than the teacher's.
arXiv Detail & Related papers (2024-03-25T14:57:29Z) - Leveraging the Power of Data Augmentation for Transformer-based Tracking [64.46371987827312]
We propose two data augmentation methods customized for tracking.
First, we optimize existing random cropping via a dynamic search radius mechanism and simulation for boundary samples.
Second, we propose a token-level feature mixing augmentation strategy, which strengthens the model against challenges like background interference.
arXiv Detail & Related papers (2023-09-15T09:18:54Z) - Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection [57.13665112065285]
Human-Object Interaction (HOI) detection is a challenging computer vision task.
We present a framework that enhances HOI detection by incorporating structured text knowledge.
arXiv Detail & Related papers (2023-07-25T14:20:52Z) - A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition [24.02488085447691]
We introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp, to provide additional temporal regularization for motion recognition.
Second, a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, is proposed for video representation learning.
arXiv Detail & Related papers (2022-11-16T19:00:23Z) - Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of learning robust feature representations, aiming to generalize well on multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet, and Something-Something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z) - Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection [65.30079184700755]
This study addresses the issue of fusing infrared and visible images that appear differently for object detection.
Previous approaches discover commonalities underlying the two modalities and fuse in the common space either by iterative optimization or deep networks.
This paper proposes a bilevel optimization formulation for the joint problem of fusion and detection, and then unrolls to a target-aware Dual Adversarial Learning (TarDAL) network for fusion and a commonly used detection network.
arXiv Detail & Related papers (2022-03-30T11:44:56Z) - Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition [15.845205542668472]
Sensor-based human activity recognition (HAR) aims to recognize human activities based on the availability of rich time-series data collected from multi-modal sensors such as accelerometers and gyroscopes.
Recent deep learning methods focus on one view of the data, i.e., the temporal view, while shallow methods tend to utilize hand-crafted features for recognition, e.g., the statistics view.
We propose a novel method, namely multi-view fusion transformer (MVFT) along with a novel attention mechanism.
arXiv Detail & Related papers (2022-02-16T07:15:22Z) - Temporal Aggregation for Adaptive RGBT Tracking [14.00078027541162]
We propose an RGBT tracker which takes temporal clues into account for robust appearance model learning. Unlike most existing RGBT trackers, which perform tracking with only spatial information, this method further considers temporal information.
arXiv Detail & Related papers (2022-01-22T02:31:56Z) - Aerial Images Meet Crowdsourced Trajectories: A New Approach to Robust Road Extraction [110.61383502442598]
We introduce a novel neural network framework termed Cross-Modal Message Propagation Network (CMMPNet).
CMMPNet is composed of two deep Auto-Encoders for modality-specific representation learning and a tailor-designed Dual Enhancement Module for cross-modal representation refinement.
Experiments on three real-world benchmarks demonstrate the effectiveness of our CMMPNet for robust road extraction.
arXiv Detail & Related papers (2021-11-30T04:30:10Z) - Robust Correlation Tracking via Multi-channel Fused Features and Reliable Response Map [10.079856376445598]
This paper proposes a robust correlation tracking algorithm (RCT) based on two ideas.
First, we propose a method to fuse features in order to more naturally describe the gradient and color information of the tracked object.
Second, we present a novel strategy to significantly reduce noise in the response map and therefore ease the problem of model drift.
arXiv Detail & Related papers (2020-11-25T07:15:03Z) - Jointly Modeling Motion and Appearance Cues for Robust RGB-T Tracking [85.333260415532]
We develop a novel late fusion method to infer the fusion weight maps of both RGB and thermal (T) modalities.
When the appearance cue is unreliable, we take motion cues into account to make the tracker robust.
Numerous results on three recent RGB-T tracking datasets show that the proposed tracker performs significantly better than other state-of-the-art algorithms.
arXiv Detail & Related papers (2020-07-04T08:11:33Z)
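The last entry above describes its late-fusion scheme in a single line. Purely as a hypothetical sketch (not that paper's actual method), inferring fusion weight maps could look like the following: a small convolutional head predicts per-pixel weights that blend the RGB and thermal response maps; all names and shapes here are assumptions.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Hypothetical late-fusion head: predicts per-pixel weight maps for the
    RGB and thermal (T) response maps and blends them. Illustrative only."""

    def __init__(self, dim):
        super().__init__()
        # Two-channel output: one weight map per modality.
        self.head = nn.Conv2d(2 * dim, 2, kernel_size=1)

    def forward(self, feat_rgb, feat_t, resp_rgb, resp_t):
        # Softmax over the modality axis keeps the weights convex (sum to 1).
        w = torch.softmax(self.head(torch.cat([feat_rgb, feat_t], dim=1)), dim=1)
        return w[:, 0:1] * resp_rgb + w[:, 1:2] * resp_t

# Usage with dummy features and response maps.
f_rgb, f_t = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
r_rgb, r_t = torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32)
fused = LateFusion(64)(f_rgb, f_t, r_rgb, r_t)  # shape (1, 1, 32, 32)
```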
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.