Visual Prompt Multi-Modal Tracking
- URL: http://arxiv.org/abs/2303.10826v2
- Date: Sat, 25 Mar 2023 02:29:48 GMT
- Title: Visual Prompt Multi-Modal Tracking
- Authors: Jiawen Zhu, Simiao Lai, Xin Chen, Dong Wang, Huchuan Lu
- Abstract summary: Visual Prompt multi-modal Tracking (ViPT) learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks.
ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking.
- Score: 71.53972967568251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visible-modal object tracking gives rise to a series of downstream
multi-modal tracking tributaries. To inherit the powerful representations of
the foundation model, a natural modus operandi for multi-modal tracking is full
fine-tuning on the RGB-based parameters. Albeit effective, this manner is not
optimal due to factors such as the scarcity of downstream data and poor transferability.
In this paper, inspired by the recent success of prompt learning in
language models, we develop Visual Prompt multi-modal Tracking (ViPT), which
learns the modal-relevant prompts to adapt the frozen pre-trained foundation
model to various downstream multimodal tracking tasks. ViPT finds a better way
to stimulate the knowledge of the RGB-based model that is pre-trained at scale,
meanwhile only introducing a few trainable parameters (less than 1% of model
parameters). ViPT outperforms the full fine-tuning paradigm on multiple
downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event
tracking. Extensive experiments show the potential of visual prompt learning
for multi-modal tracking, and ViPT can achieve state-of-the-art performance
while satisfying parameter efficiency. Code and models are available at
https://github.com/jiawen-zhu/ViPT.
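To make the parameter-efficiency idea above concrete, here is a minimal PyTorch sketch of prompt tuning with a frozen backbone: a generic transformer encoder stands in for the pre-trained RGB foundation model, and a small set of learnable prompt tokens, fused with auxiliary-modality features, are the only trainable parameters. The class names, tensor shapes, and fusion choice below are illustrative assumptions and are not taken from the ViPT codebase.

```python
import torch
import torch.nn as nn


class PromptedBackbone(nn.Module):
    """Frozen RGB foundation-model stand-in plus learnable modal prompts."""

    def __init__(self, d_model: int = 256, num_layers: int = 6, num_prompts: int = 8):
        super().__init__()
        # Stand-in for the pre-trained RGB foundation model; its weights stay frozen.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Modal-relevant prompts: the only trainable parameters.
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, d_model))
        nn.init.trunc_normal_(self.prompts, std=0.02)

    def forward(self, rgb_tokens: torch.Tensor, aux_tokens: torch.Tensor) -> torch.Tensor:
        # Crude fusion of auxiliary-modality tokens (depth/thermal/event) into the
        # prompts; ViPT itself uses dedicated prompt blocks, so treat this as a placeholder.
        prompts = self.prompts + aux_tokens.mean(dim=1, keepdim=True)
        x = torch.cat([prompts, rgb_tokens], dim=1)
        return self.backbone(x)


model = PromptedBackbone()
rgb = torch.randn(2, 64, 256)   # RGB template/search tokens (batch, tokens, dim)
aux = torch.randn(2, 64, 256)   # auxiliary-modality tokens
out = model(rgb, aux)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(out.shape, f"trainable fraction: {trainable / total:.4%}")
```

Even with this small stand-in backbone, the printed trainable fraction is well below 1%, mirroring the parameter-efficiency argument made in the abstract.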
Related papers
- Middle Fusion and Multi-Stage, Multi-Form Prompts for Robust RGB-T Tracking [1.8843687952462744]
M3PT is a novel RGB-T prompt tracking method that leverages middle fusion together with multi-modal and multi-stage visual prompts to overcome the challenges of RGB-T tracking.
Based on the meta-framework, we utilize multiple flexible prompt strategies to adapt the pre-trained model, enabling comprehensive exploration of uni-modal patterns.
arXiv Detail & Related papers (2024-03-27T02:06:25Z)
- SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking [19.50096632818305]
Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness.
Recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data.
We propose a novel symmetric multimodal tracking framework called SDSTrack.
arXiv Detail & Related papers (2024-03-24T04:15:50Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Single-Model and Any-Modality for Video Object Tracking [85.83753760853142]
We introduce Un-Track, a Unified Tracker of a single set of parameters for any modality.
To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques (see the sketch after this list).
Our Un-Track achieves a +8.1 absolute F-score gain on the DepthTrack dataset while introducing only +2.14 GFLOPs (over 21.50) and +6.6M parameters (over 93M).
arXiv Detail & Related papers (2023-11-27T14:17:41Z)
- Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing [19.142582966452935]
We investigate three key factors (i.e., inputs, pre-training, and finetuning) in ViT for multimodal FAS with RGB, Infrared (IR), and Depth.
We propose the modality-asymmetric masked autoencoder (M$^{2}$A$^{2}$E) for multimodal FAS self-supervised pre-training without costly annotated labels.
arXiv Detail & Related papers (2023-02-11T17:02:34Z)
- Prompting for Multi-Modal Tracking [70.0522146292258]
We propose a novel multi-modal prompt tracker (ProTrack) for multi-modal tracking.
ProTrack transfers multi-modal inputs into a single modality via the prompt paradigm.
Our ProTrack can achieve high-performance multi-modal tracking by only altering the inputs, even without any extra training on multi-modal data.
arXiv Detail & Related papers (2022-07-29T09:35:02Z)
- Visible-Thermal UAV Tracking: A Large-Scale Benchmark and New Baseline [80.13652104204691]
In this paper, we construct a large-scale benchmark with high diversity for visible-thermal UAV tracking (VTUAV).
We provide a coarse-to-fine attribute annotation, where frame-level attributes are provided to exploit the potential of challenge-specific trackers.
In addition, we design a new RGB-T baseline, named Hierarchical Multi-modal Fusion Tracker (HMFT), which fuses RGB-T data in various levels.
arXiv Detail & Related papers (2022-04-08T15:22:33Z)
- Robust Visual Object Tracking with Two-Stream Residual Convolutional Networks [62.836429958476735]
We propose a Two-Stream Residual Convolutional Network (TS-RCN) for visual tracking.
Our TS-RCN can be integrated with existing deep-learning-based visual trackers.
To further improve the tracking performance, we adopt a "wider" residual network ResNeXt as its feature extraction backbone.
arXiv Detail & Related papers (2020-05-13T19:05:42Z)
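For the shared low-rank latent space mentioned in the Un-Track summary above, the sketch below shows one way a factorization-plus-reconstruction adapter could look; the class and variable names are hypothetical and are not taken from the Un-Track code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankLatentAdapter(nn.Module):
    """Project features of any modality into one shared low-rank latent space."""

    def __init__(self, dim: int = 256, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)  # low-rank factorization: dim -> rank
        self.up = nn.Linear(rank, dim, bias=False)    # reconstruction: rank -> dim

    def forward(self, tokens: torch.Tensor):
        latent = self.down(tokens)  # shared latent code for RGB/depth/thermal/event tokens
        recon = self.up(latent)     # reconstructed features for a reconstruction objective
        return latent, recon


adapter = LowRankLatentAdapter()
x = torch.randn(2, 64, 256)          # tokens from some auxiliary modality
latent, recon = adapter(x)
recon_loss = F.mse_loss(recon, x)    # encourages the shared latent space to stay faithful
```

Choosing a rank far smaller than the feature dimension is what keeps the added parameters and compute small, in the spirit of the +6.6M parameters and +2.14 GFLOPs reported above.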
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences of its use.