MambaLCT: Boosting Tracking via Long-term Context State Space Model
- URL: http://arxiv.org/abs/2412.13615v1
- Date: Wed, 18 Dec 2024 08:53:52 GMT
- Title: MambaLCT: Boosting Tracking via Long-term Context State Space Model
- Authors: Xiaohai Li, Bineng Zhong, Qihua Liang, Guorong Li, Zhiyi Mo, Shuxiang Song
- Abstract summary: We propose MambaLCT, which constructs and utilizes target variation cues from the first frame to the current frame for robust tracking.
MambaLCT achieves new SOTA performance on six benchmarks while maintaining real-time running speeds.
- Score: 19.893090586286995
- License:
- Abstract: Effectively constructing context information with long-term dependencies from video sequences is crucial for object tracking. However, the context length constructed by existing work is limited: it considers only object information from adjacent frames or video clips, leading to insufficient utilization of contextual information. To address this issue, we propose MambaLCT, which constructs and utilizes target variation cues from the first frame to the current frame for robust tracking. First, a novel unidirectional Context Mamba module is designed to scan frame features along the temporal dimension, gathering target change cues throughout the entire sequence. Specifically, target-related information in frame features is compressed into a hidden state space through a selective scanning mechanism, and target information across the entire video is continuously aggregated into target variation cues. Next, we inject the target change cues into the attention mechanism, providing temporal information for modeling the relationship between the template and search frames. The advantage of MambaLCT is its ability to continuously extend the length of the context, capturing complete target change cues and enhancing the stability and robustness of the tracker. Extensive experiments show that long-term context information enhances the model's ability to perceive targets in complex scenarios. MambaLCT achieves new SOTA performance on six benchmarks while maintaining real-time running speed.
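The abstract describes compressing per-frame target information into a hidden state via a unidirectional selective scan. A minimal NumPy sketch of such a scan (a diagonal, input-dependently discretized SSM in the style of Mamba) is shown below; all shapes, parameter names, and the readout are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x, A, W_B, W_C, w_dt):
    """Unidirectional selective scan over frame features (diagonal SSM).
    x: (T, D) per-frame features; A: (D, N) negative decay rates;
    W_B, W_C: (D, N) projections giving input-dependent B_t, C_t;
    w_dt: (D,) step-size weights.
    Returns per-frame cues (T, D) and the final hidden state (D, N)."""
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                     # compressed target-change state
    ys = np.empty((T, D))
    for t in range(T):
        xt = x[t]
        dt = softplus(xt * w_dt)             # (D,) input-dependent step size
        Bt = xt @ W_B                        # (N,) selective input projection
        Ct = xt @ W_C                        # (N,) selective output projection
        # discretized state update: decay old state, write current frame info
        h = np.exp(dt[:, None] * A) * h + (dt[:, None] * Bt[None, :]) * xt[:, None]
        ys[t] = h @ Ct                       # read out target-variation cue
    return ys, h

rng = np.random.default_rng(0)
T, D, N = 6, 8, 4                            # frames, feature dim, state dim
x = rng.standard_normal((T, D))
A = -softplus(rng.standard_normal((D, N)))   # A < 0 keeps the scan stable
ys, h = selective_scan(x, A,
                       0.1 * rng.standard_normal((D, N)),
                       0.1 * rng.standard_normal((D, N)),
                       rng.standard_normal(D))
print(ys.shape, h.shape)                     # (6, 8) (8, 4)
```

Because the scan is a recurrence with O(1) state per channel, the context it summarizes can extend from the first frame to the current one without the quadratic cost of attending over all past frames.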
Related papers
- Spatio-temporal Graph Learning on Adaptive Mined Key Frames for High-performance Multi-Object Tracking [5.746443489229576]
Key Frame Extraction (KFE) module leverages reinforcement learning to adaptively segment videos.
Intra-Frame Feature Fusion (IFF) module uses a Graph Convolutional Network (GCN) to facilitate information exchange between the target and surrounding objects.
Our proposed tracker achieves impressive results on the MOT17 dataset.
arXiv Detail & Related papers (2025-01-17T11:36:38Z) - Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking [53.33637391723555]
We propose a unified multimodal spatial-temporal tracking approach named STTrack.
In contrast to previous paradigms, we introduced a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information.
These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target.
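The token-guidance idea described above, in which temporal tokens condition target localization in the next frame, can be sketched as a simple cross-attention step. This is a hedged, illustrative sketch only: the function name, shapes, and single-head residual form are assumptions, not STTrack's actual architecture.

```python
import numpy as np

def inject_temporal_tokens(search_feats, temporal_tokens):
    """Single-head cross-attention: search-frame features attend to a small
    set of temporal tokens so cues from earlier frames guide localization.
    search_feats: (S, D); temporal_tokens: (K, D)."""
    D = search_feats.shape[1]
    scores = search_feats @ temporal_tokens.T / np.sqrt(D)   # (S, K)
    scores -= scores.max(axis=1, keepdims=True)              # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)                  # rows sum to 1
    return search_feats + attn @ temporal_tokens             # residual injection

rng = np.random.default_rng(1)
search = rng.standard_normal((16, 32))   # 16 search locations, feature dim 32
tokens = rng.standard_normal((4, 32))    # 4 temporal tokens from past frames
out = inject_temporal_tokens(search, tokens)
print(out.shape)                         # (16, 32)
```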
arXiv Detail & Related papers (2024-12-20T09:10:17Z) - MotionTrack: Learning Robust Short-term and Long-term Motions for Multi-Object Tracking [56.92165669843006]
We propose MotionTrack, which learns robust short-term and long-term motions in a unified framework to associate trajectories from a short to long range.
For dense crowds, we design a novel Interaction Module to learn interaction-aware motions from short-term trajectories, which can estimate the complex movement of each target.
For extreme occlusions, we build a novel Refind Module to learn reliable long-term motions from the target's history trajectory, which can link the interrupted trajectory with its corresponding detection.
arXiv Detail & Related papers (2023-03-18T12:38:33Z) - Tracking Objects and Activities with Attention for Temporal Sentence Grounding [51.416914256782505]
Temporal sentence grounding (TSG) aims to localize the temporal segment that is semantically aligned with a natural language query in an untrimmed video.
We propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal templates and a search space, and (B) a Temporal Sentence Tracker to track the multi-modal targets' behavior and predict the query-related segment.
arXiv Detail & Related papers (2023-02-21T16:42:52Z) - PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection [28.879484515844375]
We introduce a progressive way to incorporate both temporal and spatial information for an integrated enhancement.
PTSEFormer follows an end-to-end fashion to avoid heavy post-processing procedures while achieving 88.1% mAP on the ImageNet VID dataset.
arXiv Detail & Related papers (2022-09-06T06:32:57Z) - Context-aware Visual Tracking with Joint Meta-updating [11.226947525556813]
We propose a context-aware tracking model that optimizes the tracker over the representation space, jointly meta-updating both branches by exploiting information along the whole sequence.
The proposed tracking method achieves an EAO score of 0.514 on VOT2018 at 40 FPS, demonstrating its ability to improve the accuracy and robustness of the underlying tracker with little speed drop.
arXiv Detail & Related papers (2022-04-04T14:16:00Z) - Learning Dynamic Compact Memory Embedding for Deformable Visual Object Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms strong segmentation-based trackers such as D3S and SiamMask on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z) - Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers [20.806348407522083]
MO3TR is an end-to-end online multi-object tracking framework.
It encodes object interactions into long-term temporal embeddings.
It handles track initiation and termination without the need for an explicit data association module.
arXiv Detail & Related papers (2021-03-27T07:23:38Z) - Learning to Track with Object Permanence [61.36492084090744]
We introduce an end-to-end trainable approach for joint object detection and tracking.
Our model, trained jointly on synthetic and real data, outperforms the state of the art on the KITTI and MOT17 datasets.
arXiv Detail & Related papers (2021-03-26T04:43:04Z) - Visual Tracking by TridentAlign and Context Embedding [71.60159881028432]
We propose novel TridentAlign and context embedding modules for Siamese network-based visual tracking methods.
The performance of the proposed tracker is comparable to that of state-of-the-art trackers, while running at real-time speed.
arXiv Detail & Related papers (2020-07-14T08:00:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.