Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking
- URL: http://arxiv.org/abs/2412.15691v1
- Date: Fri, 20 Dec 2024 09:10:17 GMT
- Title: Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking
- Authors: Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, Jian Yang
- Abstract summary: We propose a unified multimodal spatial-temporal tracking approach named STTrack.
In contrast to previous paradigms, we introduce a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information.
These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target.
- Score: 53.33637391723555
- License:
- Abstract: Multimodal tracking has garnered widespread attention as a result of its ability to effectively address the inherent limitations of traditional RGB tracking. However, existing multimodal trackers mainly focus on the fusion and enhancement of spatial features or merely leverage the sparse temporal relationships between video frames. These approaches do not fully exploit the temporal correlations in multimodal videos, making it difficult to capture the dynamic changes and motion information of targets in complex scenarios. To alleviate this problem, we propose a unified multimodal spatial-temporal tracking approach named STTrack. In contrast to previous paradigms that rely solely on updating reference information, we introduce a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target. Furthermore, at the spatial level, we introduce the Mamba fusion and background suppression interactive (BSI) modules. These modules establish a dual-stage mechanism for coordinating information interaction and fusion between modalities. Extensive comparisons on five benchmark datasets illustrate that STTrack achieves state-of-the-art performance across various multimodal tracking scenarios. Code is available at: https://github.com/NJU-PCALab/STTrack.
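The abstract describes the TSG only at a high level. As a rough, non-authoritative sketch of the token-propagation idea (temporal tokens refreshed from each frame's fused multimodal features, then used to condition the next frame's search features), the following PyTorch snippet may help. All module names, dimensions, and the choice of cross-attention are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
# Minimal sketch (assumptions only) of a temporal-token generator in the spirit
# of the TSG described above. Not the official STTrack implementation.
import torch
import torch.nn as nn


class TemporalStateGenerator(nn.Module):
    """Maintains a small set of temporal tokens that summarize multimodal history
    and are refreshed from the fused features of each new frame."""

    def __init__(self, dim: int = 256, num_tokens: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable initial temporal state used before the first frame.
        self.init_tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))
        # Temporal tokens (queries) read the current frame's fused RGB-X features.
        self.update_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Search-region features (queries) read the temporal tokens.
        self.inject_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_tokens = nn.LayerNorm(dim)
        self.norm_feats = nn.LayerNorm(dim)

    def update(self, tokens: torch.Tensor, fused_feats: torch.Tensor) -> torch.Tensor:
        """Refresh temporal tokens from the current frame's fused multimodal features."""
        upd, _ = self.update_attn(tokens, fused_feats, fused_feats)
        return self.norm_tokens(tokens + upd)

    def inject(self, search_feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """Condition the next frame's search features on the temporal tokens."""
        ctx, _ = self.inject_attn(search_feats, tokens, tokens)
        return self.norm_feats(search_feats + ctx)


if __name__ == "__main__":
    B, N, D = 2, 320, 256                       # batch, feature tokens, channels
    tsg = TemporalStateGenerator(dim=D)
    tokens = tsg.init_tokens.expand(B, -1, -1)  # temporal state at t=0
    for _ in range(3):                          # toy 3-frame sequence
        fused = torch.randn(B, N, D)            # stand-in for fused RGB-X features
        search = torch.randn(B, N, D)           # stand-in for search-region features
        search = tsg.inject(search, tokens)     # guide localization at frame t
        tokens = tsg.update(tokens, fused)      # propagate state to frame t+1
    print(search.shape, tokens.shape)
```

Read this way, the tokens act as a compact recurrent state carried across frames, which is what lets the tracker keep long-range context without re-feeding all past frames.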
Related papers
- Spatio-temporal Graph Learning on Adaptive Mined Key Frames for High-performance Multi-Object Tracking [5.746443489229576]
The Key Frame Extraction (KFE) module leverages reinforcement learning to adaptively segment videos.
The Intra-Frame Feature Fusion (IFF) module uses a Graph Convolutional Network (GCN) to facilitate information exchange between the target and surrounding objects.
Our proposed tracker achieves impressive results on the MOT17 dataset.
arXiv Detail & Related papers (2025-01-17T11:36:38Z) - Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation [1.6584112749108326]
TCDSG, Temporally Consistent Dynamic Scene Graphs, is an end-to-end framework that detects, tracks, and links subject-object relationships across time.
Our work sets a new standard in multi-frame video analysis, opening new avenues for high-impact applications in surveillance, autonomous navigation, and beyond.
arXiv Detail & Related papers (2024-12-03T20:19:20Z) - Multi-step Temporal Modeling for UAV Tracking [14.687636301587045]
We introduce MT-Track, a streamlined and efficient multi-step temporal modeling framework for enhanced UAV tracking.
We unveil a unique temporal correlation module that dynamically assesses the interplay between the template and search region features.
We propose a mutual transformer module to refine the correlation maps of historical and current frames by modeling the temporal knowledge in the tracking sequence.
arXiv Detail & Related papers (2024-03-07T09:48:13Z) - Temporal Adaptive RGBT Tracking with Modality Prompt [10.431364270734331]
RGBT tracking has been widely used in various fields such as robotics, surveillance, and autonomous driving.
Existing RGBT trackers fully explore the spatial information between the template and the search region and locate the target based on the appearance matching results.
These RGBT trackers have very limited exploitation of temporal information, either ignoring temporal information or exploiting it through online sampling and training.
arXiv Detail & Related papers (2024-01-02T15:20:50Z) - SpikeMOT: Event-based Multi-Object Tracking with Sparse Motion Features [52.213656737672935]
SpikeMOT is an event-based multi-object tracker.
SpikeMOT uses spiking neural networks to extract sparse spatiotemporal features from event streams associated with objects.
arXiv Detail & Related papers (2023-09-29T05:13:43Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
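The summary above only hints at how the memory bank is used. A minimal sketch of the general idea (the current frame's features attending over stored historical features, with the bank capped at the most recent K frames) might look as follows; the names and the attention-based interaction are illustrative assumptions, not that paper's actual design.

```python
# Illustrative sketch (assumed design) of a per-frame feature memory bank.
from collections import deque

import torch
import torch.nn as nn


class FrameMemoryBank(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4, max_frames: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.bank = deque(maxlen=max_frames)  # stores per-frame feature tensors

    def forward(self, cur_feats: torch.Tensor) -> torch.Tensor:
        """cur_feats: (B, N, D) features of the current frame only."""
        if self.bank:
            history = torch.cat(list(self.bank), dim=1)      # (B, K*N, D)
            ctx, _ = self.attn(cur_feats, history, history)  # read the memory
            cur_feats = self.norm(cur_feats + ctx)
        self.bank.append(cur_feats.detach())                 # write back
        return cur_feats
```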
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - Tracking Objects and Activities with Attention for Temporal Sentence Grounding [51.416914256782505]
Temporal sentence grounding (TSG) aims to localize the temporal segment which is semantically aligned with a natural language query in an untrimmed video.
We propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal targets and search space, and (B) a Temporal Sentence Tracker to track multi-modal targets' behavior and to predict the query-related segment.
arXiv Detail & Related papers (2023-02-21T16:42:52Z) - Tracking by Associating Clips [110.08925274049409]
In this paper, we investigate an alternative by treating object association as clip-wise matching.
Our new perspective views a single long video sequence as multiple short clips, and then the tracking is performed both within and between the clips.
The benefits of this new approach are twofold. First, our method is robust to tracking error accumulation or propagation, as the video chunking allows bypassing the interrupted frames.
Second, multi-frame information is aggregated during the clip-wise matching, resulting in more accurate long-range track association than current frame-wise matching.
arXiv Detail & Related papers (2022-12-20T10:33:17Z) - Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers [20.806348407522083]
MO3TR is an end-to-end online multi-object tracking framework.
It encodes object interactions into long-term temporal embeddings.
It handles track initiation and termination without the need for an explicit data association module.
arXiv Detail & Related papers (2021-03-27T07:23:38Z) - Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of such features and their spatial-temporal correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)