Towards Universal Modal Tracking with Online Dense Temporal Token Learning
- URL: http://arxiv.org/abs/2507.20177v1
- Date: Sun, 27 Jul 2025 08:47:42 GMT
- Title: Towards Universal Modal Tracking with Online Dense Temporal Token Learning
- Authors: Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, Rongrong Ji,
- Abstract summary: We propose a universal video-level modality-awareness tracking model with online dense temporal token learning. We expand the model's inputs to a video sequence level, aiming to see a richer video context from a near-global perspective.
- Score: 66.83607018706519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called {\modaltracker}). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, using the same model architecture and parameters. Specifically, our model is designed around three core goals: \textbf{Video-level Sampling}. We expand the model's inputs to the video-sequence level, aiming to see a richer video context from a near-global perspective. \textbf{Video-level Association}. Furthermore, we introduce two simple yet effective online dense temporal token association mechanisms to propagate the target's appearance and motion-trajectory information in a video-stream manner. \textbf{Modality Scalable}. We propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and subsequently compress them into the same set of model parameters via one-shot training for multi-task inference. This new solution brings the following benefits: (i) The purified token sequences can serve as temporal prompts for inference on subsequent video frames, whereby previous information is leveraged to guide future inference. (ii) Unlike multi-modal trackers that require independent training, our one-shot training scheme not only alleviates the training burden but also improves model representation. Extensive experiments on visible and multi-modal benchmarks show that our {\modaltracker} achieves new \textit{SOTA} performance. The code will be available at https://github.com/GXNU-ZhongLab/ODTrack.
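The listing includes no code, but the two mechanisms named in the abstract, gated attention for cross-modal fusion and dense temporal tokens propagated as prompts across frames, can be illustrated with a small PyTorch sketch. Everything below (module names, token counts, the gating formula, shapes) is a hypothetical rendering for intuition, not the authors' ODTrack implementation.

```python
import torch
import torch.nn as nn

class GatedCrossModalPerceiver(nn.Module):
    """Hypothetical sketch of a gated-attention fusion block: auxiliary-modality
    tokens (thermal/depth/event) are attended into the RGB stream, and a learned
    sigmoid gate decides how much of the fused signal to keep per token."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, aux_tokens):
        # Cross-modal attention: RGB queries attend to auxiliary keys/values.
        fused, _ = self.cross_attn(rgb_tokens, aux_tokens, aux_tokens)
        # Per-token gate blends the fused features back into the RGB stream.
        g = self.gate(torch.cat([rgb_tokens, fused], dim=-1))
        return self.norm(rgb_tokens + g * fused)

class TemporalTokenPropagator(nn.Module):
    """Hypothetical sketch of online temporal-token propagation: a small set of
    tokens summarises past frames and is reused as a prompt for the next frame."""

    def __init__(self, dim: int, num_tokens: int = 4, num_heads: int = 8):
        super().__init__()
        self.init_tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens, prev_tokens=None):
        b = frame_tokens.size(0)
        tokens = prev_tokens if prev_tokens is not None else self.init_tokens.expand(b, -1, -1)
        # Temporal tokens query the current frame to absorb appearance/motion cues.
        updated, _ = self.attn(tokens, frame_tokens, frame_tokens)
        return updated  # acts as the temporal prompt for the next frame

# Toy usage with random features standing in for backbone outputs.
fuse = GatedCrossModalPerceiver(dim=256)
prop = TemporalTokenPropagator(dim=256)
prev = None
for _ in range(3):                      # three "frames"
    rgb = torch.randn(1, 196, 256)      # RGB patch tokens
    aux = torch.randn(1, 196, 256)      # auxiliary-modality patch tokens
    feat = fuse(rgb, aux)
    prev = prop(feat, prev)             # temporal prompt carried to the next frame
```

In a real tracker the propagated tokens would be fed alongside the next frame's search-region tokens, which is how previous appearance and motion information could guide future inference in the spirit of the abstract.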
Related papers
- Visual and Memory Dual Adapter for Multi-Modal Object Tracking [34.406308400305385]
We propose a novel visual and memory dual adapter (VMDA) to construct more robust representations for multi-modal tracking. We develop a simple but effective visual adapter that adaptively transfers discriminative cues from the auxiliary modality to the dominant modality. We also design a memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations.
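As a rough illustration of a memory-style adapter with dynamic update and retrieval, the sketch below keeps a small bank of slots, reads from it with attention, and writes with a momentum update. The slot count, momentum rule, and all names are assumptions for illustration, not the VMDA design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAdapter(nn.Module):
    """Hypothetical memory adapter: a fixed-size bank of slots stores global
    temporal cues; retrieval is attention over the slots, and the update is a
    momentum write into the most similar slot."""

    def __init__(self, dim: int, num_slots: int = 16, momentum: float = 0.9):
        super().__init__()
        self.register_buffer("memory", torch.zeros(num_slots, dim))
        self.momentum = momentum
        self.query_proj = nn.Linear(dim, dim)

    def retrieve(self, feat):
        # feat: (batch, dim). Attention weights over memory slots.
        q = self.query_proj(feat)                         # (B, D)
        attn = F.softmax(q @ self.memory.t(), dim=-1)     # (B, S)
        return attn @ self.memory                         # (B, D) retrieved cue

    @torch.no_grad()
    def update(self, feat):
        # Momentum write into the slot each sample is most similar to.
        sims = feat @ self.memory.t()                     # (B, S)
        idx = sims.argmax(dim=-1)                         # (B,)
        for b, s in enumerate(idx):
            self.memory[s] = self.momentum * self.memory[s] + (1 - self.momentum) * feat[b]

# Toy usage: retrieve a temporal cue, then write the current feature back.
adapter = MemoryAdapter(dim=256)
frame_feat = torch.randn(2, 256)
cue = adapter.retrieve(frame_feat)
adapter.update(frame_feat)
```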
arXiv Detail & Related papers (2025-06-30T15:38:26Z) - Test-Time Training Done Right [61.8429380523577]
Test-Time Training (TTT) models context by adapting part of the model's weights (referred to as fast weights) during inference. Existing TTT methods have struggled to show effectiveness in handling long-context data. We develop Large Chunk Test-Time Training (LaCT), which improves hardware utilization by orders of magnitude.
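The idea of adapting fast weights on large chunks rather than per token can be sketched as follows; the reconstruction loss, learning rate, and chunk size are placeholders chosen for the toy example, not the LaCT recipe.

```python
import torch
import torch.nn as nn

class FastWeightLayer(nn.Module):
    """Hypothetical sketch of large-chunk test-time training: a small 'fast-weight'
    linear map is adapted on each big chunk of context with a self-supervised loss,
    while the rest of the model stays frozen."""

    def __init__(self, dim: int, lr: float = 1e-2):
        super().__init__()
        self.fast = nn.Linear(dim, dim)
        self.lr = lr

    def adapt_on_chunk(self, chunk, steps: int = 1):
        # chunk: (chunk_len, dim). A few gradient steps per large chunk amortise
        # the update cost instead of updating after every token.
        opt = torch.optim.SGD(self.fast.parameters(), lr=self.lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = (self.fast(chunk) - chunk).pow(2).mean()   # toy self-supervised target
            loss.backward()
            opt.step()

    def forward(self, x):
        return self.fast(x)

# Toy usage: stream long context in large chunks, adapting before predicting.
layer = FastWeightLayer(dim=64)
context = torch.randn(8192, 64)
for chunk in context.split(2048):   # large chunks instead of per-token updates
    layer.adapt_on_chunk(chunk)
    out = layer(chunk)
```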
arXiv Detail & Related papers (2025-05-29T17:50:34Z) - ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
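A minimal sketch of accuracy-rewarded frame selection, assuming a REINFORCE-style update and a 0/1 reward from a stubbed downstream model; the policy architecture and hyperparameters are illustrative, not the ViaRL implementation.

```python
import torch
import torch.nn as nn

class FrameSelector(nn.Module):
    """Hypothetical policy that scores frames and samples a subset to keep."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats, k: int = 4):
        # frame_feats: (num_frames, feat_dim)
        logits = self.score(frame_feats).squeeze(-1)
        probs = torch.softmax(logits, dim=-1)
        idx = torch.multinomial(probs, k, replacement=False)
        log_prob = torch.log(probs[idx] + 1e-8).sum()
        return idx, log_prob

def reinforce_step(selector, optimizer, frame_feats, answer_is_correct: bool):
    idx, log_prob = selector(frame_feats)
    reward = 1.0 if answer_is_correct else 0.0   # rule-based, accuracy-derived reward
    loss = -reward * log_prob                    # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return idx

# Toy usage with random features and a stubbed downstream verdict.
selector = FrameSelector(feat_dim=128)
opt = torch.optim.Adam(selector.parameters(), lr=1e-3)
feats = torch.randn(32, 128)                     # 32 candidate frames
chosen = reinforce_step(selector, opt, feats, answer_is_correct=True)
```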
arXiv Detail & Related papers (2025-05-21T12:29:40Z) - Diff-MM: Exploring Pre-trained Text-to-Image Generation Model for Unified Multi-modal Object Tracking [45.341224888996514]
Multi-modal object tracking integrates auxiliary modalities such as depth, thermal infrared, event flow, and language. Existing methods typically start from an RGB-based tracker and learn to understand auxiliary modalities only from training data. This work proposes a unified multi-modal tracker, Diff-MM, by exploiting the multi-modal understanding capability of the pre-trained text-to-image generation model.
arXiv Detail & Related papers (2025-05-19T01:42:13Z) - Efficient Transfer Learning for Video-language Foundation Models [13.166348605993292]
We propose a parameter-efficient Multi-modal Spatio-Temporal Adapter (MSTA) to enhance alignment between textual and visual representations. We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-novel generalization, and fully-supervised learning.
arXiv Detail & Related papers (2024-11-18T01:25:58Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - Visual Prompt Multi-Modal Tracking [71.53972967568251]
Visual Prompt multi-modal Tracking (ViPT) learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks.
ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking.
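A minimal sketch of the general prompt-learning pattern this entry describes: freeze a pre-trained backbone and train only a small branch that turns the auxiliary modality into prompt tokens. The prompt generator, token count, and the stand-in backbone below are assumptions, not the ViPT architecture.

```python
import torch
import torch.nn as nn

class ModalPromptAdapter(nn.Module):
    """Hypothetical visual-prompt adapter: a frozen pre-trained backbone is steered
    by a few prompt tokens generated from the auxiliary modality; only the prompt
    branch is trained."""

    def __init__(self, backbone: nn.Module, dim: int, num_prompts: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)               # foundation model stays frozen
        self.prompt_gen = nn.Sequential(          # lightweight, trainable prompt branch
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_prompts * dim)
        )
        self.num_prompts = num_prompts
        self.dim = dim

    def forward(self, rgb_tokens, aux_tokens):
        # Summarise the auxiliary modality and turn it into prompt tokens.
        summary = aux_tokens.mean(dim=1)                                   # (B, D)
        prompts = self.prompt_gen(summary).view(-1, self.num_prompts, self.dim)
        # Prepend prompts to the RGB tokens before the frozen backbone.
        return self.backbone(torch.cat([prompts, rgb_tokens], dim=1))

# Toy usage: a frozen transformer encoder layer stands in for the foundation model.
backbone = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
tracker = ModalPromptAdapter(backbone, dim=256)
out = tracker(torch.randn(1, 196, 256), torch.randn(1, 196, 256))
```

Training only the prompt branch is what keeps this family of methods cheaper than full fine-tuning while still adapting the frozen model to each downstream modality.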
arXiv Detail & Related papers (2023-03-20T01:51:07Z) - Routing with Self-Attention for Multimodal Capsule Networks [108.85007719132618]
We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework.
To adapt the capsules to large-scale input data, we propose a novel routing by self-attention mechanism that selects relevant capsules.
This allows not only for robust training with noisy video data, but also for scaling up the size of the capsule network compared to traditional routing methods.
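Routing by self-attention can be sketched as attention from learned higher-level capsule queries over projected lower-level capsules; the projections, dimensions, and scaling below are illustrative assumptions rather than the paper's exact routing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionRouting(nn.Module):
    """Hypothetical routing-by-self-attention: higher-level capsules are formed as
    attention-weighted combinations of lower-level capsules, so only the most
    relevant input capsules contribute."""

    def __init__(self, in_caps_dim: int, out_caps: int, out_caps_dim: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(out_caps, out_caps_dim))
        self.key_proj = nn.Linear(in_caps_dim, out_caps_dim)
        self.val_proj = nn.Linear(in_caps_dim, out_caps_dim)

    def forward(self, in_caps):
        # in_caps: (batch, num_in_caps, in_caps_dim)
        k = self.key_proj(in_caps)                          # (B, N, D)
        v = self.val_proj(in_caps)                          # (B, N, D)
        attn = F.softmax(self.queries @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return attn @ v                                     # (B, out_caps, D)

# Toy usage: route 64 low-level capsules into 10 higher-level capsules.
router = SelfAttentionRouting(in_caps_dim=16, out_caps=10, out_caps_dim=32)
high_caps = router(torch.randn(4, 64, 16))
```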
arXiv Detail & Related papers (2021-12-01T19:01:26Z) - Video Moment Retrieval via Natural Language Queries [7.611718124254329]
We propose a novel method for video moment retrieval (VMR) that achieves state-of-the-art (SOTA) performance on R@1 metrics.
Our model has a simple architecture, which enables faster training and inference while maintaining performance.
arXiv Detail & Related papers (2020-09-04T22:06:34Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model is also comprised of dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass more relevant information to the classifier.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.