From Two-Stream to One-Stream: Efficient RGB-T Tracking via Mutual Prompt Learning and Knowledge Distillation
- URL: http://arxiv.org/abs/2403.16834v2
- Date: Sun, 7 Apr 2024 09:56:54 GMT
- Title: From Two-Stream to One-Stream: Efficient RGB-T Tracking via Mutual Prompt Learning and Knowledge Distillation
- Authors: Yang Luo, Xiqing Guo, Hao Li
- Abstract summary: Inspired by visual prompt learning, we designed a novel two-stream RGB-T tracking architecture based on cross-modal mutual prompt learning.
Our designed teacher model achieved the highest precision rate, while the student model, with a precision rate comparable to the teacher's, realized an inference speed more than three times faster.
- Score: 9.423279246172923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the complementary nature of visible light and thermal infrared modalities, object tracking based on the fusion of visible light images and thermal images (referred to as RGB-T tracking) has received increasing attention from researchers in recent years. How to achieve a more comprehensive fusion of information from the two modalities at a lower cost remains an open question. Inspired by visual prompt learning, we designed a novel two-stream RGB-T tracking architecture based on cross-modal mutual prompt learning, and used this model as a teacher to guide a one-stream student model through knowledge distillation. Extensive experiments show that, compared to similar RGB-T trackers, our teacher model achieved the highest precision rate, while the student model, with a precision rate comparable to the teacher's, realized an inference speed more than three times faster. (Code will be made available upon acceptance.)
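The abstract's core idea, a two-stream teacher guiding a one-stream student via knowledge distillation, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all module names, shapes, and the plain MSE feature-matching loss are assumptions for exposition.

```python
# Hypothetical sketch: a two-stream teacher (separate RGB and thermal
# encoders whose features are fused) guides a one-stream student that
# processes channel-concatenated RGB-T input in a single, faster backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamTeacher(nn.Module):
    """One encoder per modality, followed by a fusion layer."""
    def __init__(self, dim=64):
        super().__init__()
        self.rgb_enc = nn.Conv2d(3, dim, 3, padding=1)
        self.tir_enc = nn.Conv2d(1, dim, 3, padding=1)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, rgb, tir):
        f_rgb = self.rgb_enc(rgb)
        f_tir = self.tir_enc(tir)
        return self.fuse(torch.cat([f_rgb, f_tir], dim=1))

class OneStreamStudent(nn.Module):
    """A single encoder over stacked RGB-T channels (cheaper at inference)."""
    def __init__(self, dim=64):
        super().__init__()
        self.enc = nn.Conv2d(4, dim, 3, padding=1)

    def forward(self, rgb, tir):
        return self.enc(torch.cat([rgb, tir], dim=1))

def distill_loss(student_feat, teacher_feat):
    # Match student features to the frozen (detached) teacher features.
    return F.mse_loss(student_feat, teacher_feat.detach())

teacher, student = TwoStreamTeacher(), OneStreamStudent()
rgb, tir = torch.randn(2, 3, 32, 32), torch.randn(2, 1, 32, 32)
loss = distill_loss(student(rgb, tir), teacher(rgb, tir))
```

At deployment only the one-stream student runs, which is where the reported speed-up over the two-stream teacher comes from.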
Related papers
- Lightweight Facial Landmark Detection in Thermal Images via Multi-Level Cross-Modal Knowledge Transfer [13.887803692033073]
Facial Landmark Detection in thermal imagery is critical for applications in challenging lighting conditions. We propose a novel framework that decouples high-fidelity RGB-to-thermal knowledge transfer from model compression. Experiments show that our approach sets a new state-of-the-art on public thermal FLD benchmarks, notably outperforming previous methods.
arXiv Detail & Related papers (2025-10-13T08:19:56Z)
- 4D Contrastive Superflows are Dense 3D Representation Learners [62.433137130087445]
We introduce SuperFlow, a novel framework designed to harness consecutive LiDAR-camera pairs for establishing pretraining objectives.
To further boost learning efficiency, we incorporate a plug-and-play view consistency module that enhances alignment of the knowledge distilled from camera views.
arXiv Detail & Related papers (2024-07-08T17:59:54Z)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- Let All be Whitened: Multi-teacher Distillation for Efficient Visual Retrieval [57.17075479691486]
We propose a multi-teacher distillation framework Whiten-MTD, which is able to transfer knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval.
Our source code is released at https://github.com/Maryeon/whiten_mtd.
arXiv Detail & Related papers (2023-12-15T11:43:56Z)
- RGB-T Tracking via Multi-Modal Mutual Prompt Learning [5.301062575633768]
Object tracking based on the fusion of visible and thermal images, known as RGB-T tracking, has gained increasing attention from researchers in recent years.
We propose a tracking architecture based on mutual prompt learning between the two modalities.
We also design a lightweight prompter that incorporates attention mechanisms in two dimensions to transfer information from one modality to the other with lower computational costs.
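The "attention in two dimensions" idea can be sketched as below, assuming the two dimensions are channel and spatial attention (in the spirit of CBAM); the paper's actual prompter design may differ, and all names here are illustrative.

```python
# Hypothetical lightweight prompter: one modality's features are
# re-weighted along the channel dimension, then the spatial dimension,
# and the result is injected into the other modality's features as a prompt.
import torch
import torch.nn as nn

class LightweightPrompter(nn.Module):
    def __init__(self, dim=64, reduction=4):
        super().__init__()
        # Channel attention: squeeze spatial dims, excite channels.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(),
            nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid())
        # Spatial attention: a single-channel weighting map over H x W.
        self.spatial = nn.Sequential(
            nn.Conv2d(dim, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, src, dst):
        prompt = src * self.channel(src)        # channel re-weighting
        prompt = prompt * self.spatial(prompt)  # spatial re-weighting
        return dst + prompt                     # inject prompt into target

p = LightweightPrompter()
rgb_feat = torch.randn(2, 64, 16, 16)
tir_feat = torch.randn(2, 64, 16, 16)
out = p(rgb_feat, tir_feat)  # thermal features prompted by RGB
```

The attention modules add only a few small convolutions, which is what keeps the cross-modal information transfer cheap.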
arXiv Detail & Related papers (2023-08-31T01:13:01Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
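A hedged sketch of this geometry-based distillation: rather than matching raw embeddings (which would force the student to share the teacher's dimensionality), the student is trained so that its query-document similarity structure mirrors the teacher's. The function name and the exact loss below are illustrative, not the paper's implementation.

```python
# Distill the teacher's relative geometry: match pairwise cosine
# similarity matrices between queries and documents, so a much smaller
# student embedding can preserve the teacher's ranking structure.
import torch
import torch.nn.functional as F

def relative_geometry_loss(student_q, student_d, teacher_q, teacher_d):
    # (num_queries x num_docs) cosine-similarity matrices.
    s_sim = F.normalize(student_q, dim=-1) @ F.normalize(student_d, dim=-1).T
    t_sim = F.normalize(teacher_q, dim=-1) @ F.normalize(teacher_d, dim=-1).T
    # The student matches the teacher's similarity structure, not its raw
    # vectors, so the embedding dimensions need not agree.
    return F.mse_loss(s_sim, t_sim.detach())

teacher_q, teacher_d = torch.randn(8, 768), torch.randn(16, 768)  # large teacher
student_q, student_d = torch.randn(8, 128), torch.randn(16, 128)  # small student
loss = relative_geometry_loss(student_q, student_d, teacher_q, teacher_d)
```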
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Lightning Fast Video Anomaly Detection via Adversarial Knowledge Distillation [106.42167050921718]
We propose a very fast frame-level model for anomaly detection in video.
It learns to detect anomalies by distilling knowledge from multiple highly accurate object-level teacher models.
Our model achieves the best trade-off between speed and accuracy, due to its previously unheard-of speed of 1480 FPS.
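Distilling from multiple teachers can be illustrated as follows; this minimal sketch regresses the averaged teacher outputs and omits the adversarial component the paper's title refers to. Module names and the MSE objective are assumptions.

```python
# Illustrative multi-teacher distillation: a fast student regresses the
# mean prediction of several (frozen) teacher models.
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_teacher_target(teachers, x):
    # Average the teachers' outputs; no gradients flow into the teachers.
    with torch.no_grad():
        return torch.stack([t(x) for t in teachers]).mean(dim=0)

student = nn.Linear(32, 1)
teachers = [nn.Linear(32, 1) for _ in range(3)]
x = torch.randn(4, 32)
loss = F.mse_loss(student(x), multi_teacher_target(teachers, x))
```

The student's speed advantage comes from replacing several heavy teachers with one small network at inference time.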
arXiv Detail & Related papers (2022-11-28T17:50:19Z)
- Students taught by multimodal teachers are superior action recognizers [41.821485757189656]
The focal point of egocentric video understanding is modelling hand-object interactions.
Standard models -- CNNs, Vision Transformers, etc. -- which receive RGB frames as input perform well; however, their performance improves further when additional modalities such as object detections, optical flow, or audio are employed.
The goal of this work is to retain the performance of such multimodal approaches, while using only the RGB images as input at inference time.
arXiv Detail & Related papers (2022-10-09T19:37:17Z)
- Temporal Aggregation for Adaptive RGBT Tracking [14.00078027541162]
We propose an RGBT tracker which takes temporal clues into account for robust appearance model learning.
Unlike most existing RGBT trackers that implement object tracking tasks with only spatial information included, temporal information is further considered in this method.
arXiv Detail & Related papers (2022-01-22T02:31:56Z)
- Unsupervised Cross-Modal Distillation for Thermal Infrared Tracking [39.505507508776404]
The target representation learned by convolutional neural networks plays an important role in Thermal Infrared (TIR) tracking.
We propose to distill representations of the TIR modality from the RGB modality with Cross-Modal Distillation (CMD).
Our tracker outperforms the baseline tracker by achieving absolute gains of 2.3% Success, 2.7% Precision, and 2.5% Normalized Precision respectively.
arXiv Detail & Related papers (2021-07-31T09:19:59Z)
- Jointly Modeling Motion and Appearance Cues for Robust RGB-T Tracking [85.333260415532]
We develop a novel late fusion method to infer the fusion weight maps of both RGB and thermal (T) modalities.
When the appearance cue is unreliable, we take motion cues into account to make the tracker robust.
Numerous results on three recent RGB-T tracking datasets show that the proposed tracker performs significantly better than other state-of-the-art algorithms.
arXiv Detail & Related papers (2020-07-04T08:11:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.