Related papers: UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking

URL: http://arxiv.org/abs/2601.14799v1
Date: Wed, 21 Jan 2026 09:24:19 GMT
Title: UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking
Authors: Qihua Liang, Liang Chen, Yaozong Zheng, Jian Nong, Zhiyi Mo, Bineng Zhong
Abstract summary: We introduce a novel multi-modal tracking framework based on a mamba-style state space model, termed UBATrack. UBATrack comprises two simple yet effective modules: a Spatio-temporal Mamba Adapter (STMA) and a Dynamic Multi-modal Feature Mixer. Experiments show that UBATrack outperforms state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks.
Score: 40.8191099453086
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-modal object tracking has attracted considerable attention by integrating multiple complementary inputs (e.g., thermal, depth, and event data) to achieve outstanding performance. Although current general-purpose multi-modal trackers primarily unify various modal tracking tasks (i.e., RGB-Thermal infrared, RGB-Depth or RGB-Event tracking) through prompt learning, they still overlook the effective capture of spatio-temporal cues. In this work, we introduce a novel multi-modal tracking framework based on a mamba-style state space model, termed UBATrack. Our UBATrack comprises two simple yet effective modules: a Spatio-temporal Mamba Adapter (STMA) and a Dynamic Multi-modal Feature Mixer. The former leverages Mamba's long-sequence modeling capability to jointly model cross-modal dependencies and spatio-temporal visual cues in an adapter-tuning manner. The latter further enhances multi-modal representation capacity across multiple feature dimensions to improve tracking robustness. In this way, UBATrack eliminates the need for costly full-parameter fine-tuning, thereby improving the training efficiency of multi-modal tracking algorithms. Experiments show that UBATrack outperforms state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks, achieving outstanding results on the LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent datasets.
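Note: the abstract's point about avoiding full-parameter fine-tuning can be illustrated with a rough parameter count. The sketch below is not the STMA module or any code from the paper; it assumes a generic bottleneck adapter (down-projection plus up-projection) attached to each block of a frozen ViT-style backbone, and the function names and the sizes used (dim=768, bottleneck=64, depth=12) are hypothetical values chosen only for illustration.

```python
# Illustrative only: a generic bottleneck adapter, NOT the paper's STMA.
# All names and sizes below are assumptions made for this sketch.

def adapter_params(dim: int, bottleneck: int) -> int:
    """Weights and biases of a down-projection followed by an up-projection."""
    down = dim * bottleneck + bottleneck
    up = bottleneck * dim + dim
    return down + up


def backbone_block_params(dim: int, mlp_ratio: int = 4) -> int:
    """Approximate size of one standard ViT block (layer norms omitted)."""
    # Attention: QKV projection (3*dim*dim + 3*dim) plus output projection (dim*dim + dim).
    attn = 4 * dim * dim + 4 * dim
    # MLP: dim -> mlp_ratio*dim -> dim, with biases.
    hidden = mlp_ratio * dim
    mlp = dim * hidden + hidden + hidden * dim + dim
    return attn + mlp


def trainable_fraction(dim: int, bottleneck: int, depth: int) -> float:
    """Share of parameters updated when only the adapters are trained."""
    frozen = depth * backbone_block_params(dim)
    trainable = depth * adapter_params(dim, bottleneck)
    return trainable / (frozen + trainable)


if __name__ == "__main__":
    frac = trainable_fraction(dim=768, bottleneck=64, depth=12)
    print(f"trainable share with adapters only: {frac:.2%}")
```

Under these assumed sizes the adapters account for only a small share of the total parameters, which is the general reason adapter tuning is cheaper to train than updating the whole backbone. The actual figures for UBATrack depend on its architecture and are not given in the abstract.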

Related papers

UETrack: A Unified and Efficient Framework for Single Object Tracking [46.50641228786134]
UETrack is an efficient framework for single object tracking. It efficiently handles multiple modalities including RGB, Depth, Thermal, Event, and Language. It achieves a superior speed-accuracy trade-off compared to previous methods.
arXiv Detail & Related papers (2026-03-02T03:32:30Z)
Learning Frequency and Memory-Aware Prompts for Multi-Modal Object Tracking [74.15663758681849]
We present Learning Frequency and Memory-Aware Prompts, a dual-adapter framework that injects lightweight prompts into a frozen RGB tracker. A frequency-guided visual adapter adaptively transfers complementary cues across modalities. A multilevel memory adapter with short, long, and permanent memory stores, updates, and retrieves reliable temporal context.
arXiv Detail & Related papers (2025-06-30T15:38:26Z)
Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking [9.353589376846902]
We propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network. The source code and pre-trained models will be released at https://github.com/Event-AHU/Mamba_FETrack.
arXiv Detail & Related papers (2025-06-30T12:24:01Z)
Diff-MM: Exploring Pre-trained Text-to-Image Generation Model for Unified Multi-modal Object Tracking [45.341224888996514]
Multi-modal object tracking integrates auxiliary modalities such as depth, thermal infrared, event flow, and language. Existing methods typically start from an RGB-based tracker and learn to understand auxiliary modalities only from training data. This work proposes a unified multi-modal tracker Diff-MM by exploiting the multi-modal understanding capability of the pre-trained text-to-image generation model.
arXiv Detail & Related papers (2025-05-19T01:42:13Z)
CAMELTrack: Context-Aware Multi-cue ExpLoitation for Online Multi-Object Tracking [68.24998698508344]
We introduce CAMEL, a novel association module for Context-Aware Multi-Cue ExpLoitation. Unlike end-to-end detection-by-tracking approaches, our method remains lightweight and fast to train while being able to leverage external off-the-shelf models. Our proposed online tracking pipeline, CAMELTrack, achieves state-of-the-art performance on multiple tracking benchmarks.
arXiv Detail & Related papers (2025-05-02T13:26:23Z)
SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking [19.50096632818305]
Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. We propose a novel symmetric multimodal tracking framework called SDSTrack.
arXiv Detail & Related papers (2024-03-24T04:15:50Z)
Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter. We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another. Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
Single-Model and Any-Modality for Video Object Tracking [85.83753760853142]
We introduce Un-Track, a Unified Tracker with a single set of parameters for any modality. To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques. Un-Track achieves a +8.1 absolute F-score gain on the DepthTrack dataset by introducing only +2.14 (over 21.50) GFLOPs with +6.6M (over 93M) parameters.
arXiv Detail & Related papers (2023-11-27T14:17:41Z)
Visual Prompt Multi-Modal Tracking [71.53972967568251]
Visual Prompt multi-modal Tracking (ViPT) learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks. ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking.
arXiv Detail & Related papers (2023-03-20T01:51:07Z)
Prompting for Multi-Modal Tracking [70.0522146292258]
We propose a novel multi-modal prompt tracker (ProTrack) for multi-modal tracking. ProTrack can transfer the multi-modal inputs to a single modality by the prompt paradigm. Our ProTrack can achieve high-performance multi-modal tracking by only altering the inputs, even without any extra training on multi-modal data.
arXiv Detail & Related papers (2022-07-29T09:35:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.