Visual and Memory Dual Adapter for Multi-Modal Object Tracking
- URL: http://arxiv.org/abs/2506.23972v1
- Date: Mon, 30 Jun 2025 15:38:26 GMT
- Title: Visual and Memory Dual Adapter for Multi-Modal Object Tracking
- Authors: Boyue Xu, Ruichao Hou, Tongwei Ren, Gangshan Wu
- Abstract summary: We propose a novel visual and memory dual adapter (VMDA) to construct more robust representations for multi-modal tracking. We develop a simple but effective visual adapter that adaptively transfers discriminative cues from the auxiliary modality to the dominant modality. We also design a memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations.
- Score: 34.406308400305385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt-learning-based multi-modal trackers have achieved promising progress by employing lightweight visual adapters to incorporate auxiliary modality features into frozen foundation models. However, existing approaches often struggle to learn reliable prompts due to limited exploitation of critical cues across the frequency and temporal domains. In this paper, we propose a novel visual and memory dual adapter (VMDA) to construct more robust and discriminative representations for multi-modal tracking. Specifically, we develop a simple but effective visual adapter that adaptively transfers discriminative cues from the auxiliary modality to the dominant modality by jointly modeling frequency, spatial, and channel-wise features. Additionally, we design a memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations to ensure the consistent propagation of reliable temporal information across video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various multi-modal tracking tasks, including RGB-Thermal, RGB-Depth, and RGB-Event tracking. Code and models are available at https://github.com/xuboyue1999/mmtrack.git.
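The abstract outlines two components: a visual adapter that injects frequency, spatial, and channel-wise cues from the auxiliary modality into the dominant one, and a memory adapter that stores, updates, and retrieves global temporal cues. The following is a minimal PyTorch sketch of how such a dual adapter could be wired; every module name, gating choice, and dimension here is an illustrative assumption rather than the paper's actual design, for which the linked repository is the authoritative reference.

```python
import torch
import torch.nn as nn


class VisualAdapter(nn.Module):
    """Injects auxiliary-modality cues into the dominant modality by jointly
    gating frequency, spatial, and channel-wise features (assumed design)."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        hidden = max(dim // reduction, 1)
        # Channel branch: squeeze-and-excitation style gate.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, dim, 1), nn.Sigmoid())
        # Spatial branch: per-pixel gate from a lightweight convolution.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(dim, 1, kernel_size=3, padding=1), nn.Sigmoid())
        # Frequency branch: gate applied to the FFT magnitude spectrum.
        self.freq_gate = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, dominant: torch.Tensor, auxiliary: torch.Tensor) -> torch.Tensor:
        _, _, h, w = auxiliary.shape
        # Frequency-domain cue: filter the auxiliary spectrum, then invert.
        spec = torch.fft.rfft2(auxiliary, norm="ortho")
        spec = spec * self.freq_gate(spec.abs())
        freq_cue = torch.fft.irfft2(spec, s=(h, w), norm="ortho")
        # Spatial and channel cues computed from the auxiliary feature.
        spatial_cue = auxiliary * self.spatial_gate(auxiliary)
        channel_cue = auxiliary * self.channel_gate(auxiliary)
        # Fuse the three cues and add them to the dominant-modality feature.
        return dominant + self.proj(freq_cue + spatial_cue + channel_cue)


class MemoryAdapter(nn.Module):
    """Keeps a small bank of temporal tokens: retrieval is cross-attention,
    update is a confidence-gated moving average (assumed mechanics)."""

    def __init__(self, dim: int, slots: int = 8):
        super().__init__()
        self.memory = nn.Parameter(torch.zeros(slots, dim), requires_grad=False)
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.write_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def retrieve(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) features of the current frame.
        q, k, v = self.q(tokens), self.k(self.memory), self.v(self.memory)
        attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)
        return tokens + attn @ v  # propagate stored temporal cues

    @torch.no_grad()
    def update(self, tokens: torch.Tensor) -> None:
        # Write a gated summary of the current frame into the memory bank.
        summary = tokens.mean(dim=(0, 1))    # (C,)
        gate = self.write_gate(summary)      # scalar in (0, 1)
        self.memory.data = (1 - gate) * self.memory.data + gate * summary


# Toy usage on hypothetical 256-channel, 16x16 search-region features.
visual, memory = VisualAdapter(256), MemoryAdapter(256)
rgb, thermal = torch.randn(2, 256, 16, 16), torch.randn(2, 256, 16, 16)
fused = visual(rgb, thermal)                            # (2, 256, 16, 16)
tokens = memory.retrieve(fused.flatten(2).transpose(1, 2))
memory.update(tokens)
```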
Related papers
- Towards Universal Modal Tracking with Online Dense Temporal Token Learning [66.83607018706519]
We propose a universal video-level modality-awareness tracking model with online dense temporal token learning. We expand the model's inputs to a video sequence level, aiming to see a richer video context from a near-global perspective.
arXiv Detail & Related papers (2025-07-27T08:47:42Z)
- Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking [9.353589376846902]
We propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network. The source code and pre-trained models will be released at https://github.com/Event-AHU/Mamba_FETrack.
arXiv Detail & Related papers (2025-06-30T12:24:01Z)
- Diff-MM: Exploring Pre-trained Text-to-Image Generation Model for Unified Multi-modal Object Tracking [45.341224888996514]
Multi-modal object tracking integrates auxiliary modalities such as depth, thermal infrared, event flow, and language. Existing methods typically start from an RGB-based tracker and learn to understand auxiliary modalities only from training data. This work proposes a unified multi-modal tracker, Diff-MM, that exploits the multi-modal understanding capability of a pre-trained text-to-image generation model.
arXiv Detail & Related papers (2025-05-19T01:42:13Z)
- XTrack: Multimodal Training Boosts RGB-X Video Object Trackers [88.72203975896558]
It is crucial to ensure that knowledge gained from multimodal sensing is effectively shared. Similar samples across different modalities have more knowledge to share than otherwise. We propose a method that boosts RGB-X trackers during inference, achieving an average +3% precision improvement over the current SOTA.
arXiv Detail & Related papers (2024-05-28T03:00:58Z)
- SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking [19.50096632818305]
Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness.
Recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data.
We propose a novel symmetric multimodal tracking framework called SDSTrack.
arXiv Detail & Related papers (2024-03-24T04:15:50Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both full fine-tuning methods and prompt-learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Visual Prompt Multi-Modal Tracking [71.53972967568251]
Visual Prompt multi-modal Tracking (ViPT) learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks.
ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking.
arXiv Detail & Related papers (2023-03-20T01:51:07Z)
- Prompting for Multi-Modal Tracking [70.0522146292258]
We propose a novel multi-modal prompt tracker (ProTrack) for multi-modal tracking.
ProTrack can transfer the multi-modal inputs to a single modality by the prompt paradigm.
Our ProTrack can achieve high-performance multi-modal tracking by only altering the inputs, even without any extra training on multi-modal data.
arXiv Detail & Related papers (2022-07-29T09:35:02Z)
- RGBT Tracking via Multi-Adapter Network with Hierarchical Divergence Loss [37.99375824040946]
We propose a novel multi-adapter network to jointly perform modality-shared, modality-specific and instance-aware target representation learning.
Experiments on two RGBT tracking benchmark datasets demonstrate the outstanding performance of the proposed tracker.
arXiv Detail & Related papers (2020-11-14T01:50:46Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections; a minimal sketch of the central-difference idea appears after this list.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
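For the last entry above, the 3D Central Difference Convolution (3D-CDC) family augments a vanilla 3D convolution with a central-difference term, y(p0) = sum_pn w(pn) * x(p0+pn) - theta * x(p0) * sum_pn w(pn). The sketch below implements this generic spatio-temporal form; the theta default and the exact 3D-CDC-T/ST variants searched in the paper (which differ in which kernel elements enter the difference term) are assumptions, so treat it as an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CDC3D(nn.Module):
    """Generic 3D central-difference convolution: a vanilla Conv3d blended with
    a central-difference term controlled by theta (theta=0 recovers Conv3d)."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3,
                 padding: int = 1, theta: float = 0.6):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x)  # vanilla aggregation term
        if self.theta == 0:
            return out
        # Central-difference term: theta * x(p0) * (sum of kernel weights),
        # realized as a 1x1x1 convolution with the spatio-temporally summed kernel.
        kernel_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        diff = F.conv3d(x, kernel_sum, stride=self.conv.stride)
        return out - self.theta * diff


# Toy usage: an 8-frame RGB clip in (B, C, T, H, W) layout.
clip = torch.randn(1, 3, 8, 32, 32)
print(CDC3D(3, 16)(clip).shape)  # torch.Size([1, 16, 8, 32, 32])
```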
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.