Bi-directional Adapter for Multi-modal Tracking
- URL: http://arxiv.org/abs/2312.10611v1
- Date: Sun, 17 Dec 2023 05:27:31 GMT
- Title: Bi-directional Adapter for Multi-modal Tracking
- Authors: Bing Cao, Junliang Guo, Pengfei Zhu, Qinghua Hu
- Abstract summary: We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance compared with both full fine-tuning methods and prompt learning-based methods.
- Score: 67.01179868400229
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Due to the rapid development of computer vision, single-modal (RGB) object
tracking has made significant progress in recent years. Considering the
limitations of a single imaging sensor, multi-modal images (RGB, infrared, etc.)
are introduced to compensate for this deficiency and enable all-weather object
tracking in complex environments. However, because sufficient multi-modal
tracking data is hard to acquire and the dominant modality changes with the open
environment, most existing techniques fail to extract multi-modal complementary
information dynamically, yielding unsatisfactory tracking performance. To
handle this problem, we propose a novel multi-modal visual prompt tracking
model based on a universal bi-directional adapter, cross-prompting multiple
modalities mutually. Our model consists of a universal bi-directional adapter
and multiple modality-specific transformer encoder branches with shared
parameters. The encoders extract features of each modality separately using
a frozen pre-trained foundation model. We develop a simple but effective light
feature adapter to transfer modality-specific information from one modality to
another, performing visual feature prompt fusion in an adaptive manner. By
adding only 0.32M trainable parameters, our model achieves superior tracking
performance compared with both full fine-tuning methods and prompt
learning-based methods. Our code is available at:
https://github.com/SparkTempest/BAT.
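To make the core idea concrete, here is a minimal sketch of a bottleneck-style bi-directional feature adapter that exchanges visual prompts between two frozen modality branches. The class name, layer sizes, and the cross_prompt helper are illustrative assumptions for exposition, not the authors' released implementation (see the repository above for the official code).

```python
# Illustrative sketch only: a light bottleneck adapter that cross-prompts two
# modality branches. Names and dimensions are assumptions, not the BAT release.
import torch
import torch.nn as nn


class BiDirectionalAdapter(nn.Module):
    """Light feature adapter: down-project -> nonlinearity -> up-project.

    A narrow bottleneck keeps the number of trainable parameters small while
    the modality-specific transformer branches and the pre-trained foundation
    model remain frozen.
    """

    def __init__(self, dim: int = 768, bottleneck: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) features from one modality branch.
        return self.up(self.act(self.down(x)))


def cross_prompt(rgb_feat, aux_feat, rgb_to_aux, aux_to_rgb):
    """Exchange adapted features in both directions as residual visual prompts."""
    rgb_out = rgb_feat + aux_to_rgb(aux_feat)  # auxiliary modality prompts RGB
    aux_out = aux_feat + rgb_to_aux(rgb_feat)  # RGB prompts the auxiliary modality
    return rgb_out, aux_out


if __name__ == "__main__":
    rgb = torch.randn(2, 196, 768)  # dummy ViT-style token features
    tir = torch.randn(2, 196, 768)  # e.g. a thermal-infrared branch
    r2a, a2r = BiDirectionalAdapter(), BiDirectionalAdapter()
    fused_rgb, fused_tir = cross_prompt(rgb, tir, r2a, a2r)
    print(fused_rgb.shape, fused_tir.shape)  # torch.Size([2, 196, 768]) each
```

In this sketch only the two adapters would be trained; inserting one such pair per encoder block, each with a narrow bottleneck, is how a trainable-parameter budget on the order of 0.32M could plausibly be kept.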
Related papers
- DMTrack: Spatio-Temporal Multimodal Tracking via Dual-Adapter [27.594612913364447]
We introduce a novel dual-temporal architecture for multimodal tracking, dubbed DMTrack.
These designs achieve promising multimodal tracking performance with merely 0.93M trainable parameters.
Experiments on five benchmarks show that DMTrack achieves state-of-the-art results.
arXiv Detail & Related papers (2025-08-03T05:13:27Z) - Visual and Memory Dual Adapter for Multi-Modal Object Tracking [34.406308400305385]
We propose a novel visual and memory dual adapter (VMDA) to construct more robust representations for multi-modal tracking.
We develop a simple but effective visual adapter that adaptively transfers discriminative cues from the auxiliary modality to the dominant modality.
We also design a memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations.
arXiv Detail & Related papers (2025-06-30T15:38:26Z) - Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking [9.353589376846902]
We propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network.
The source code and pre-trained models will be released at https://github.com/Event-AHU/Mamba_FETrack.
arXiv Detail & Related papers (2025-06-30T12:24:01Z) - Diff-MM: Exploring Pre-trained Text-to-Image Generation Model for Unified Multi-modal Object Tracking [45.341224888996514]
Multi-modal object tracking integrates auxiliary modalities such as depth, thermal infrared, event flow, and language.
Existing methods typically start from an RGB-based tracker and learn to understand auxiliary modalities only from training data.
This work proposes a unified multi-modal tracker Diff-MM by exploiting the multi-modal understanding capability of the pre-trained text-to-image generation model.
arXiv Detail & Related papers (2025-05-19T01:42:13Z) - MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching [54.740256498985026]
Keypoint detection and description methods often struggle with multimodal data.
We propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching.
arXiv Detail & Related papers (2025-01-20T06:56:30Z) - SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing.
It is designed to accurately detect horizontal or oriented objects from any sensor modality.
This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z) - Heterogeneous Graph Transformer for Multiple Tiny Object Tracking in RGB-T Videos [31.910202172609313]
Existing multi-object tracking algorithms generally focus on single-modality scenes.
We propose a novel framework called HGT-Track (Heterogeneous Graph Transformer based Multi-Tiny-Object Tracking).
This paper introduces the first benchmark VT-Tiny-MOT (Visible-Thermal Tiny Multi-Object Tracking) for RGB-T fused multiple tiny object tracking.
arXiv Detail & Related papers (2024-12-14T15:17:49Z) - FoRA: Low-Rank Adaptation Model beyond Multimodal Siamese Network [19.466279425330857]
We propose a novel multimodal object detector, named Low-rank Modal Adaptors (LMA) with a shared backbone.
Our work was submitted to ACM MM in April 2024, but was rejected.
arXiv Detail & Related papers (2024-07-23T02:27:52Z) - DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with
Competitive Query Selection and Adaptive Feature Fusion [82.2425759608975]
Infrared-visible object detection aims to achieve robust, full-day object detection by fusing the complementary information of infrared and visible images.
We propose a Dynamic Adaptive Multispectral Detection Transformer (DAMSDet) to address these two challenges.
Experiments on four public datasets demonstrate significant improvements compared to other state-of-the-art methods.
arXiv Detail & Related papers (2024-03-01T07:03:27Z) - Single-Model and Any-Modality for Video Object Tracking [85.83753760853142]
We introduce Un-Track, a Unified Tracker with a single set of parameters for any modality.
To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques.
Our Un-Track achieves a +8.1 absolute F-score gain on the DepthTrack dataset while introducing only +2.14 GFLOPs (over 21.50) and +6.6M parameters (over 93M).
arXiv Detail & Related papers (2023-11-27T14:17:41Z) - UniTR: A Unified and Efficient Multi-Modal Transformer for
Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z) - Visual Prompt Multi-Modal Tracking [71.53972967568251]
Visual Prompt multi-modal Tracking (ViPT) learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks.
ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking.
arXiv Detail & Related papers (2023-03-20T01:51:07Z) - Prompting for Multi-Modal Tracking [70.0522146292258]
We propose a novel multi-modal prompt tracker (ProTrack) for multi-modal tracking.
ProTrack can transfer the multi-modal inputs to a single modality via the prompt paradigm.
Our ProTrack can achieve high-performance multi-modal tracking by only altering the inputs, even without any extra training on multi-modal data.
arXiv Detail & Related papers (2022-07-29T09:35:02Z) - Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object
Tracking [23.130490413184596]
We introduce PointNet++ to obtain multi-scale deep representations of the point cloud, making it adaptive to our proposed Interactive Feature Fusion.
Our method can achieve good performance on the KITTI benchmark and outperform other approaches without using multi-scale feature fusion.
arXiv Detail & Related papers (2022-03-30T13:00:27Z) - RGBT Tracking via Multi-Adapter Network with Hierarchical Divergence
Loss [37.99375824040946]
We propose a novel multi-adapter network to jointly perform modality-shared, modality-specific and instance-aware target representation learning.
Experiments on two RGBT tracking benchmark datasets demonstrate the outstanding performance of the proposed tracker.
arXiv Detail & Related papers (2020-11-14T01:50:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.