UniSOT: A Unified Framework for Multi-Modality Single Object Tracking
- URL: http://arxiv.org/abs/2511.01427v1
- Date: Mon, 03 Nov 2025 10:23:53 GMT
- Title: UniSOT: A Unified Framework for Multi-Modality Single Object Tracking
- Authors: Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Xu Zhou, Feng Wu,
- Abstract summary: We present a unified tracker, UniSOT, for different combinations of three reference modalities and four video modalities with uniform parameters.<n>UniSOT shows superior performance against modality-specific counterparts.<n> Notably, UniSOT outperforms previous counterparts by over 3.0% AUC on TNL2K across all three reference modalities and outperforms Un-Track by over 2.0% main metric across all three RGB+X video modalities.
- Score: 60.21741689231572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Single object tracking aims to localize target object with specific reference modalities (bounding box, natural language or both) in a sequence of specific video modalities (RGB, RGB+Depth, RGB+Thermal or RGB+Event.). Different reference modalities enable various human-machine interactions, and different video modalities are demanded in complex scenarios to enhance tracking robustness. Existing trackers are designed for single or several video modalities with single or several reference modalities, which leads to separate model designs and limits practical applications. Practically, a unified tracker is needed to handle various requirements. To the best of our knowledge, there is still no tracker that can perform tracking with these above reference modalities across these video modalities simultaneously. Thus, in this paper, we present a unified tracker, UniSOT, for different combinations of three reference modalities and four video modalities with uniform parameters. Extensive experimental results on 18 visual tracking, vision-language tracking and RGB+X tracking benchmarks demonstrate that UniSOT shows superior performance against modality-specific counterparts. Notably, UniSOT outperforms previous counterparts by over 3.0\% AUC on TNL2K across all three reference modalities and outperforms Un-Track by over 2.0\% main metric across all three RGB+X video modalities.
Related papers
- UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking [40.8191099453086]
We introduce a novel multi-modal tracking framework based on a mamba-style state, termed UBATrack.<n>UBATrack comprises two simple yet effective work space: a S-temporal Mamba Adapter (MA) and a Dynamic Multi-modal Feature Mixer.<n>Experiments show that UBATrack outperforms state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks.
arXiv Detail & Related papers (2026-01-21T09:24:19Z) - SwiTrack: Tri-State Switch for Cross-Modal Object Tracking [74.15663758681849]
Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities.<n>We propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams.
arXiv Detail & Related papers (2025-11-20T10:52:54Z) - Collaborating Vision, Depth, and Thermal Signals for Multi-Modal Tracking: Dataset and Algorithm [103.36490810025752]
Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal.<n>This work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR)<n>We propose a novel multi-modal tracker, dubbed RDTTrack, which integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model.
arXiv Detail & Related papers (2025-09-29T13:05:15Z) - Towards Universal Modal Tracking with Online Dense Temporal Token Learning [66.83607018706519]
We propose a universal video-level modality-awareness tracking model with online dense temporal token learning.<n>We expand the model's inputs to a video sequence level, aiming to see a richer video context from a near-global perspective.
arXiv Detail & Related papers (2025-07-27T08:47:42Z) - Unifying Visual and Vision-Language Tracking via Contrastive Learning [34.49865598433915]
Single object tracking aims to locate the target object in a video sequence according to different modal references.
Due to the gap between different modalities, most existing trackers are designed for single or partial of these reference settings.
We present a unified tracker called UVLTrack, which can simultaneously handle all three reference settings.
arXiv Detail & Related papers (2024-01-20T13:20:54Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - Single-Model and Any-Modality for Video Object Tracking [85.83753760853142]
We introduce Un-Track, a Unified Tracker of a single set of parameters for any modality.
To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques.
Our Un-Track achieves +8.1 absolute F-score gain, on the DepthTrack dataset, by introducing only +2.14 (over 21.50) GFLOPs with +6.6M (over 93M) parameters.
arXiv Detail & Related papers (2023-11-27T14:17:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.