ODTrack: Online Dense Temporal Token Learning for Visual Tracking
- URL: http://arxiv.org/abs/2401.01686v1
- Date: Wed, 3 Jan 2024 11:44:09 GMT
- Title: ODTrack: Online Dense Temporal Token Learning for Visual Tracking
- Authors: Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang,
Xianxian Li
- Abstract summary: ODTrack is a video-level tracking pipeline that densely associates contextual relationships of video frames in an online token propagation manner.
It achieves a new itSOTA performance on seven benchmarks, while running at real-time speed.
- Score: 22.628561792412686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online contextual reasoning and association across consecutive video frames
are critical to perceive instances in visual tracking. However, most current
top-performing trackers persistently lean on sparse temporal relationships
between reference and search frames via an offline mode. Consequently, they can
only interact independently within each image-pair and establish limited
temporal correlations. To alleviate the above problem, we propose a simple,
flexible and effective video-level tracking pipeline, named \textbf{ODTrack},
which densely associates the contextual relationships of video frames in an
online token propagation manner. ODTrack receives video frames of arbitrary
length to capture the spatio-temporal trajectory relationships of an instance,
and compresses the discrimination features (localization information) of a
target into a token sequence to achieve frame-to-frame association. This new
solution brings the following benefits: 1) the purified token sequences can
serve as prompts for the inference in the next video frame, whereby past
information is leveraged to guide future inference; 2) the complex online
update strategies are effectively avoided by the iterative propagation of token
sequences, and thus we can achieve more efficient model representation and
computation. ODTrack achieves a new \textit{SOTA} performance on seven
benchmarks, while running at real-time speed. Code and models are available at
\url{https://github.com/GXNU-ZhongLab/ODTrack}.
Related papers
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts.
Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion.
We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
arXiv Detail & Related papers (2025-03-25T17:58:48Z) - DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation [61.59996525424585]
DIFFVSGG is an online VSGG solution that frames this task as an iterative scene graph update problem.
We unify the decoding of object classification, bounding box regression, and graph generation three tasks using one shared feature embedding.
DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs.
arXiv Detail & Related papers (2025-03-18T06:49:51Z) - Online Dense Point Tracking with Streaming Memory [54.22820729477756]
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video.
Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one.
We present a lightweight and fast model with textbfStreaming memory for dense textbfPOint textbfTracking and online video processing.
arXiv Detail & Related papers (2025-03-09T06:16:49Z) - Track-On: Transformer-based Online Point Tracking with Memory [34.744546679670734]
We introduce Track-On, a simple transformer-based model designed for online long-term point tracking.
Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames.
At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy.
arXiv Detail & Related papers (2025-01-30T17:04:11Z) - Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking [53.33637391723555]
We propose a unified multimodal spatial-temporal tracking approach named STTrack.
In contrast to previous paradigms, we introduced a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information.
These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target.
arXiv Detail & Related papers (2024-12-20T09:10:17Z) - ACTrack: Adding Spatio-Temporal Condition for Visual Object Tracking [0.5371337604556311]
Efficiently modeling-temporal relations of objects is a key challenge in visual object tracking (VOT)
Existing methods track by appearance-based similarity or long-term relation modeling, resulting in rich temporal contexts between consecutive frames being easily overlooked.
In this paper we present ACTrack, a new framework with additive pre-temporal tracking framework with large memory conditions. It preserves the quality and capabilities of the pre-trained backbone by freezing its parameters, and makes a trainable lightweight additive net to model temporal relations in tracking.
We design an additive siamese convolutional network to ensure the integrity of spatial features and temporal sequence
arXiv Detail & Related papers (2024-02-27T07:34:08Z) - Explicit Visual Prompts for Visual Object Tracking [23.561539973210248]
textbfEVPTrack is a visual tracking framework that exploits explicit visual prompts between consecutive frames.
We show that our framework can achieve competitive performance at a real-time by exploiting both explicit and multi-scale information.
arXiv Detail & Related papers (2024-01-06T07:12:07Z) - DVIS: Decoupled Video Instance Segmentation Framework [15.571072365208872]
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing.
Existing methods often underperform on complex and long videos in real world, primarily due to two factors.
We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
arXiv Detail & Related papers (2023-06-06T05:24:15Z) - Tracking by Associating Clips [110.08925274049409]
In this paper, we investigate an alternative by treating object association as clip-wise matching.
Our new perspective views a single long video sequence as multiple short clips, and then the tracking is performed both within and between the clips.
The benefits of this new approach are two folds. First, our method is robust to tracking error accumulation or propagation, as the video chunking allows bypassing the interrupted frames.
Second, the multiple frame information is aggregated during the clip-wise matching, resulting in a more accurate long-range track association than the current frame-wise matching.
arXiv Detail & Related papers (2022-12-20T10:33:17Z) - Real-time Online Multi-Object Tracking in Compressed Domain [66.40326768209]
Recent online Multi-Object Tracking (MOT) methods have achieved desirable tracking performance.
Inspired by the fact that the adjacent frames are highly relevant and redundant, we divide the frames into key and non-key frames.
Our tracker is about 6x faster while maintaining a comparable tracking performance.
arXiv Detail & Related papers (2022-04-05T09:47:24Z) - Joint Feature Learning and Relation Modeling for Tracking: A One-Stream
Framework [76.70603443624012]
We propose a novel one-stream tracking (OSTrack) framework that unifies feature learning and relation modeling.
In this way, discriminative target-oriented features can be dynamically extracted by mutual guidance.
OSTrack achieves state-of-the-art performance on multiple benchmarks, in particular, it shows impressive results on the one-shot tracking benchmark GOT-10k.
arXiv Detail & Related papers (2022-03-22T18:37:11Z) - End-to-end video instance segmentation via spatial-temporal graph neural
networks [30.748756362692184]
Video instance segmentation is a challenging task that extends image instance segmentation to the video domain.
Existing methods either rely only on single-frame information for the detection and segmentation subproblems or handle tracking as a separate post-processing step.
We propose a novel graph-neural-network (GNN) based method to handle the aforementioned limitation.
arXiv Detail & Related papers (2022-03-07T05:38:08Z) - Modelling Neighbor Relation in Joint Space-Time Graph for Video
Correspondence Learning [53.74240452117145]
This paper presents a self-supervised method for learning reliable visual correspondence from unlabeled videos.
We formulate the correspondence as finding paths in a joint space-time graph, where nodes are grid patches sampled from frames, and are linked by two types of edges.
Our learned representation outperforms the state-of-the-art self-supervised methods on a variety of visual tasks.
arXiv Detail & Related papers (2021-09-28T05:40:01Z) - Video Annotation for Visual Tracking via Selection and Refinement [74.08109740917122]
We present a new framework to facilitate bounding box annotations for video sequences.
A temporal assessment network is proposed which is able to capture the temporal coherence of target locations.
A visual-geometry refinement network is also designed to further enhance the selected tracking results.
arXiv Detail & Related papers (2021-08-09T05:56:47Z) - Continuity-Discrimination Convolutional Neural Network for Visual Object
Tracking [150.51667609413312]
This paper proposes a novel model, named Continuity-Discrimination Convolutional Neural Network (CD-CNN) for visual object tracking.
To address this problem, CD-CNN models temporal appearance continuity based on the idea of temporal slowness.
In order to alleviate inaccurate target localization and drifting, we propose a novel notion, object-centroid.
arXiv Detail & Related papers (2021-04-18T06:35:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.