Bi-directional Adapter for Multi-modal Tracking
- URL: http://arxiv.org/abs/2312.10611v1
- Date: Sun, 17 Dec 2023 05:27:31 GMT
- Title: Bi-directional Adapter for Multi-modal Tracking
- Authors: Bing Cao, Junliang Guo, Pengfei Zhu, Qinghua Hu
- Abstract summary: We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance compared with both full fine-tuning methods and prompt learning-based methods.
- Score: 67.01179868400229
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Due to the rapid development of computer vision, single-modal (RGB) object
tracking has made significant progress in recent years. Considering the
limitations of a single imaging sensor, multi-modal images (RGB, infrared, etc.)
are introduced to compensate for this deficiency and enable all-weather object
tracking in complex environments. However, because sufficient multi-modal
tracking data is hard to acquire and the dominant modality changes with the open
environment, most existing techniques fail to extract multi-modal complementary
information dynamically, yielding unsatisfactory tracking performance. To
handle this problem, we propose a novel multi-modal visual prompt tracking
model based on a universal bi-directional adapter, cross-prompting multiple
modalities mutually. Our model consists of a universal bi-directional adapter
and multiple modality-specific transformer encoder branches with shared
parameters. The encoders extract features of each modality separately using
a frozen pre-trained foundation model. We develop a simple but effective light
feature adapter to transfer modality-specific information from one modality to
another, performing visual feature prompt fusion in an adaptive manner. By
adding only 0.32M trainable parameters, our model achieves superior tracking
performance compared with both full fine-tuning methods and prompt
learning-based methods. Our code is available at:
https://github.com/SparkTempest/BAT.
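To make the core idea concrete, here is a minimal sketch of a bottleneck-style bi-directional feature adapter that exchanges visual prompts between two frozen modality branches. The class name, layer sizes, and the cross_prompt helper are illustrative assumptions for exposition, not the authors' released implementation (see the repository above for the official code).

```python
# Illustrative sketch only: a light bottleneck adapter that cross-prompts two
# modality branches. Names and dimensions are assumptions, not the BAT release.
import torch
import torch.nn as nn


class BiDirectionalAdapter(nn.Module):
    """Light feature adapter: down-project -> nonlinearity -> up-project.

    A narrow bottleneck keeps the number of trainable parameters small while
    the modality-specific transformer branches and the pre-trained foundation
    model remain frozen.
    """

    def __init__(self, dim: int = 768, bottleneck: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) features from one modality branch.
        return self.up(self.act(self.down(x)))


def cross_prompt(rgb_feat, aux_feat, rgb_to_aux, aux_to_rgb):
    """Exchange adapted features in both directions as residual visual prompts."""
    rgb_out = rgb_feat + aux_to_rgb(aux_feat)  # auxiliary modality prompts RGB
    aux_out = aux_feat + rgb_to_aux(rgb_feat)  # RGB prompts the auxiliary modality
    return rgb_out, aux_out


if __name__ == "__main__":
    rgb = torch.randn(2, 196, 768)  # dummy ViT-style token features
    tir = torch.randn(2, 196, 768)  # e.g. a thermal-infrared branch
    r2a, a2r = BiDirectionalAdapter(), BiDirectionalAdapter()
    fused_rgb, fused_tir = cross_prompt(rgb, tir, r2a, a2r)
    print(fused_rgb.shape, fused_tir.shape)  # torch.Size([2, 196, 768]) each
```

In this sketch only the two adapters would be trained; inserting one such pair per encoder block, each with a narrow bottleneck, is how a trainable-parameter budget on the order of 0.32M could plausibly be kept.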
Related papers
- DMTrack: Spatio-Temporal Multimodal Tracking via Dual-Adapter [27.594612913364447]
We introduce a novel dual-temporal architecture for multimodal tracking, dubbed DMTrack.
These designs achieve promising multimodal tracking performance with merely 0.93M trainable parameters.
Experiments on five benchmarks show that DMTrack achieves state-of-the-art results.
arXiv Detail & Related papers (2025-08-03T05:13:27Z) - Visual and Memory Dual Adapter for Multi-Modal Object Tracking [34.406308400305385]
We propose a novel visual and memory dual adapter (VMDA) to construct more robust representations for multi-modal tracking.
We develop a simple but effective visual adapter that adaptively transfers discriminative cues from the auxiliary modality to the dominant modality.
We also design a memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations.
arXiv Detail & Related papers (2025-06-30T15:38:26Z) - Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking [9.353589376846902]
We propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network.
The source code and pre-trained models will be released at https://github.com/Event-AHU/Mamba_FETrack.
arXiv Detail & Related papers (2025-06-30T12:24:01Z) - Diff-MM: Exploring Pre-trained Text-to-Image Generation Model for Unified Multi-modal Object Tracking [45.341224888996514]
Multi-modal object tracking integrates auxiliary modalities such as depth, thermal infrared, event flow, and language.
Existing methods typically start from an RGB-based tracker and learn to understand auxiliary modalities only from training data.
This work proposes a unified multi-modal tracker Diff-MM by exploiting the multi-modal understanding capability of the pre-trained text-to-image generation model.
arXiv Detail & Related papers (2025-05-19T01:42:13Z) - MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching [54.740256498985026]
Keypoint detection and description methods often struggle with multimodal data.
We propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching.
arXiv Detail & Related papers (2025-01-20T06:56:30Z) - SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing.
It is designed to accurately detect horizontal or oriented objects from any sensor modality.
This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z) - Heterogeneous Graph Transformer for Multiple Tiny Object Tracking in RGB-T Videos [31.910202172609313]
Existing multi-object tracking algorithms generally focus on single-modality scenes.
We propose a novel framework called HGT-Track (Heterogeneous Graph Transformer based Multi-Tiny-Object Tracking).
This paper introduces the first benchmark VT-Tiny-MOT (Visible-Thermal Tiny Multi-Object Tracking) for RGB-T fused multiple tiny object tracking.
arXiv Detail & Related papers (2024-12-14T15:17:49Z) - FoRA: Low-Rank Adaptation Model beyond Multimodal Siamese Network [19.466279425330857]
We propose a novel multimodal object detector, named Low-rank Modal Adaptors (LMA) with a shared backbone.
Our work was submitted to ACM MM in April 2024, but was rejected.
arXiv Detail & Related papers (2024-07-23T02:27:52Z) - DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with
Competitive Query Selection and Adaptive Feature Fusion [82.2425759608975]
Infrared-visible object detection aims to achieve robust, full-day object detection by fusing the complementary information of infrared and visible images.
We propose a Dynamic Adaptive Multispectral Detection Transformer (DAMSDet) to address these two challenges.
Experiments on four public datasets demonstrate significant improvements compared to other state-of-the-art methods.
arXiv Detail & Related papers (2024-03-01T07:03:27Z) - Single-Model and Any-Modality for Video Object Tracking [85.83753760853142]
We introduce Un-Track, a Unified Tracker with a single set of parameters for any modality.
To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques.
Our Un-Track achieves a +8.1 absolute F-score gain on the DepthTrack dataset while introducing only +2.14 GFLOPs (over 21.50) and +6.6M parameters (over 93M).
arXiv Detail & Related papers (2023-11-27T14:17:41Z) - UniTR: A Unified and Efficient Multi-Modal Transformer for
Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z) - Visual Prompt Multi-Modal Tracking [71.53972967568251]
Visual Prompt multi-modal Tracking (ViPT) learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks.
ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking.
arXiv Detail & Related papers (2023-03-20T01:51:07Z) - Prompting for Multi-Modal Tracking [70.0522146292258]
We propose a novel multi-modal prompt tracker (ProTrack) for multi-modal tracking.
ProTrack can transfer the multi-modal inputs to a single modality via the prompt paradigm.
Our ProTrack can achieve high-performance multi-modal tracking by only altering the inputs, even without any extra training on multi-modal data.
arXiv Detail & Related papers (2022-07-29T09:35:02Z) - Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object
Tracking [23.130490413184596]
We introduce PointNet++ to obtain multi-scale deep representations of the point cloud, making it adaptive to our proposed Interactive Feature Fusion.
Our method can achieve good performance on the KITTI benchmark and outperform other approaches without using multi-scale feature fusion.
arXiv Detail & Related papers (2022-03-30T13:00:27Z) - RGBT Tracking via Multi-Adapter Network with Hierarchical Divergence
Loss [37.99375824040946]
We propose a novel multi-adapter network to jointly perform modality-shared, modality-specific and instance-aware target representation learning.
Experiments on two RGBT tracking benchmark datasets demonstrate the outstanding performance of the proposed tracker.
arXiv Detail & Related papers (2020-11-14T01:50:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.