Bi-directional Adapter for Multi-modal Tracking
- URL: http://arxiv.org/abs/2312.10611v1
- Date: Sun, 17 Dec 2023 05:27:31 GMT
- Title: Bi-directional Adapter for Multi-modal Tracking
- Authors: Bing Cao, Junliang Guo, Pengfei Zhu, Qinghua Hu
- Abstract summary: We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
- Score: 67.01179868400229
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Due to the rapid development of computer vision, single-modal (RGB) object
tracking has made significant progress in recent years. Considering the
limitations of a single imaging sensor, multi-modal images (RGB, infrared, etc.)
are introduced to compensate for this deficiency and enable all-weather object
tracking in complex environments. However, because sufficient multi-modal
tracking data are hard to acquire and the dominant modality changes with the open
environment, most existing techniques fail to extract multi-modal complementary
information dynamically, yielding unsatisfactory tracking performance. To
handle this problem, we propose a novel multi-modal visual prompt tracking
model based on a universal bi-directional adapter that cross-prompts multiple
modalities mutually. Our model consists of a universal bi-directional adapter
and multiple modality-specific transformer encoder branches with shared
parameters. The encoders extract features of each modality separately by using
a frozen pre-trained foundation model. We develop a simple but effective light
feature adapter to transfer modality-specific information from one modality to
another, performing visual feature prompt fusion in an adaptive manner. By
adding only a few (0.32M) trainable parameters, our model achieves superior tracking
performance compared with both full fine-tuning methods and prompt
learning-based methods. Our code is available at:
https://github.com/SparkTempest/BAT.
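As a rough illustration of the architecture described above (not the authors' released implementation; see the repository linked above for that), the sketch below wires two light bottleneck adapters between an RGB branch and a thermal-infrared branch that share one frozen transformer block. All class names, the hidden dimension, the bottleneck width, and the additive fusion rule are illustrative assumptions.

```python
# Minimal sketch of a bi-directional cross-modal adapter (illustrative, not the paper's code).
import torch
import torch.nn as nn


class LightAdapter(nn.Module):
    """Bottleneck adapter: down-project, activate, up-project (few trainable params)."""

    def __init__(self, dim: int = 768, bottleneck: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so the cross-modal prompt starts as a no-op
        # (a common adapter initialization choice, assumed here).
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))


class BiDirectionalBlock(nn.Module):
    """One frozen encoder block shared by both modality branches, plus two light
    adapters that prompt each branch with the other modality's features."""

    def __init__(self, frozen_block: nn.Module, dim: int = 768):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False          # foundation model stays frozen
        self.rgb_to_tir = LightAdapter(dim)  # RGB features prompt the TIR branch
        self.tir_to_rgb = LightAdapter(dim)  # TIR features prompt the RGB branch

    def forward(self, rgb_tokens: torch.Tensor, tir_tokens: torch.Tensor):
        # Each branch runs the shared frozen block on its own tokens plus the
        # adapted features of the other modality (additive prompt fusion assumed).
        rgb_out = self.block(rgb_tokens + self.tir_to_rgb(tir_tokens))
        tir_out = self.block(tir_tokens + self.rgb_to_tir(rgb_tokens))
        return rgb_out, tir_out


if __name__ == "__main__":
    frozen = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    block = BiDirectionalBlock(frozen, dim=768)
    rgb = torch.randn(2, 196, 768)   # (batch, tokens, dim) for each modality
    tir = torch.randn(2, 196, 768)
    rgb_out, tir_out = block(rgb, tir)
    print(rgb_out.shape, tir_out.shape)
```

With the illustrative sizes above (hidden dimension 768, bottleneck 8, two adapters in each of 12 encoder blocks), the trainable adapters amount to roughly 0.31M parameters, the same order as the 0.32M reported in the abstract.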
Related papers
- FoRA: Low-Rank Adaptation Model beyond Multimodal Siamese Network [19.466279425330857]
We propose a novel multimodal object detector, named Low-rank Modal Adaptors (LMA), with a shared backbone.
Our work was submitted to ACM MM in April 2024, but was rejected.
arXiv Detail & Related papers (2024-07-23T02:27:52Z)
- DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion [82.2425759608975]
Infrared-visible object detection aims to achieve robust, full-day object detection by fusing the complementary information of infrared and visible images.
We propose a Dynamic Adaptive Multispectral Detection Transformer (DAMSDet) to address these two challenges.
Experiments on four public datasets demonstrate significant improvements compared to other state-of-the-art methods.
arXiv Detail & Related papers (2024-03-01T07:03:27Z)
- Single-Model and Any-Modality for Video Object Tracking [85.83753760853142]
We introduce Un-Track, a Unified Tracker with a single set of parameters for any modality.
To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques.
Our Un-Track achieves a +8.1 absolute F-score gain on the DepthTrack dataset while introducing only +2.14 GFLOPs (over 21.50) and +6.6M parameters (over 93M).
arXiv Detail & Related papers (2023-11-27T14:17:41Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- Visual Prompt Multi-Modal Tracking [71.53972967568251]
Visual Prompt multi-modal Tracking (ViPT) learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks.
ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking.
arXiv Detail & Related papers (2023-03-20T01:51:07Z)
- Prompting for Multi-Modal Tracking [70.0522146292258]
We propose a novel multi-modal prompt tracker (ProTrack) for multi-modal tracking.
ProTrack transfers the multi-modal inputs to a single modality via the prompt paradigm.
Our ProTrack can achieve high-performance multi-modal tracking by only altering the inputs, even without any extra training on multi-modal data.
arXiv Detail & Related papers (2022-07-29T09:35:02Z)
- Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object Tracking [23.130490413184596]
We introduce PointNet++ to obtain multi-scale deep representations of the point cloud, making it adaptive to our proposed Interactive Feature Fusion.
Our method can achieve good performance on the KITTI benchmark and outperform other approaches without using multi-scale feature fusion.
arXiv Detail & Related papers (2022-03-30T13:00:27Z)
- RGBT Tracking via Multi-Adapter Network with Hierarchical Divergence Loss [37.99375824040946]
We propose a novel multi-adapter network to jointly perform modality-shared, modality-specific and instance-aware target representation learning.
Experiments on two RGBT tracking benchmark datasets demonstrate the outstanding performance of the proposed tracker.
arXiv Detail & Related papers (2020-11-14T01:50:46Z)