ODTrack: Online Dense Temporal Token Learning for Visual Tracking
- URL: http://arxiv.org/abs/2401.01686v1
- Date: Wed, 3 Jan 2024 11:44:09 GMT
- Title: ODTrack: Online Dense Temporal Token Learning for Visual Tracking
- Authors: Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang,
Xianxian Li
- Abstract summary: ODTrack is a video-level tracking pipeline that densely associates contextual relationships of video frames in an online token propagation manner.
It achieves a new SOTA performance on seven benchmarks, while running at real-time speed.
- Score: 22.628561792412686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online contextual reasoning and association across consecutive video frames
are critical to perceive instances in visual tracking. However, most current
top-performing trackers persistently lean on sparse temporal relationships
between reference and search frames via an offline mode. Consequently, they can
only interact independently within each image-pair and establish limited
temporal correlations. To alleviate the above problem, we propose a simple,
flexible and effective video-level tracking pipeline, named \textbf{ODTrack},
which densely associates the contextual relationships of video frames in an
online token propagation manner. ODTrack receives video frames of arbitrary
length to capture the spatio-temporal trajectory relationships of an instance,
and compresses the discrimination features (localization information) of a
target into a token sequence to achieve frame-to-frame association. This new
solution brings the following benefits: 1) the purified token sequences can
serve as prompts for the inference in the next video frame, whereby past
information is leveraged to guide future inference; 2) the complex online
update strategies are effectively avoided by the iterative propagation of token
sequences, and thus we can achieve more efficient model representation and
computation. ODTrack achieves a new \textit{SOTA} performance on seven
benchmarks, while running at real-time speed. Code and models are available at
\url{https://github.com/GXNU-ZhongLab/ODTrack}.
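As the abstract describes, ODTrack compresses a target's localization cues into a token sequence that is propagated online and used as a prompt for the next frame. The snippet below is a minimal sketch of that token-propagation loop, not the authors' implementation: the module names (`OnlineTokenTracker`, `forward_frame`, `box_head`) are hypothetical stand-ins, and the encoder and box readout are deliberately simplified. The real architecture is in the official repository linked above.

```python
import torch
import torch.nn as nn

class OnlineTokenTracker(nn.Module):
    """Sketch of video-level tracking via online temporal-token propagation.

    Hypothetical module names; see https://github.com/GXNU-ZhongLab/ODTrack
    for the actual ODTrack implementation.
    """

    def __init__(self, dim=256, num_tokens=4):
        super().__init__()
        # Learnable temporal tokens used to seed the first frame.
        self.init_tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))
        # Stand-ins for a ViT-style joint encoder and a box head.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h), normalized

    def forward_frame(self, template_feat, search_feat, temporal_tokens):
        """One frame: jointly encode template, search region, and tokens."""
        x = torch.cat([template_feat, search_feat, temporal_tokens], dim=1)
        x = self.encoder(x)
        n_t, n_s = template_feat.size(1), search_feat.size(1)
        search_out = x[:, n_t:n_t + n_s]       # refined search-region features
        new_tokens = x[:, n_t + n_s:]          # updated temporal tokens
        box = self.box_head(search_out.mean(dim=1))  # crude box readout
        return box, new_tokens

    def track(self, template_feat, search_feats):
        """Iterate over per-frame search features of arbitrary length."""
        tokens = self.init_tokens.expand(template_feat.size(0), -1, -1)
        boxes = []
        for search_feat in search_feats:
            box, tokens = self.forward_frame(template_feat, search_feat, tokens)
            tokens = tokens.detach()  # propagate online, no backprop through time
            boxes.append(box)
        return torch.stack(boxes, dim=1)
```

In this sketch the updated tokens are detached and fed forward as a prompt for the next frame, which mirrors the paper's claim that iterative token propagation can stand in for hand-crafted online update strategies.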
Related papers
- Track-On: Transformer-based Online Point Tracking with Memory [34.744546679670734]
We introduce Track-On, a simple transformer-based model designed for online long-term point tracking.
Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames.
At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy.
arXiv Detail & Related papers (2025-01-30T17:04:11Z) - Understanding Long Videos via LLM-Powered Entity Relation Graphs [51.13422967711056]
GraphVideoAgent is a framework that maps and monitors the evolving relationships between visual entities throughout the video sequence.
Our approach demonstrates remarkable effectiveness when tested against industry benchmarks.
arXiv Detail & Related papers (2025-01-27T10:57:24Z) - NextStop: An Improved Tracker For Panoptic LIDAR Segmentation Data [0.6144680854063939]
4D panoptic LiDAR segmentation is essential for scene understanding in autonomous driving and robotics.
Current methods, like 4D-PLS and 4D-STOP, use a tracking-by-detection methodology, employing deep learning networks to perform semantic and instance segmentation on each frame.
NextStop demonstrates enhanced tracking performance, particularly for small-sized objects like people and bicyclists, with fewer ID switches, earlier tracking initiation, and improved reliability in complex environments.
arXiv Detail & Related papers (2025-01-08T09:08:06Z) - Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking [53.33637391723555]
We propose a unified multimodal spatial-temporal tracking approach named STTrack.
In contrast to previous paradigms, we introduce a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information.
These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target.
arXiv Detail & Related papers (2024-12-20T09:10:17Z) - Explicit Visual Prompts for Visual Object Tracking [23.561539973210248]
EVPTrack is a visual tracking framework that exploits explicit visual prompts between consecutive frames.
We show that our framework can achieve competitive performance at real-time speed by exploiting both explicit and multi-scale information.
arXiv Detail & Related papers (2024-01-06T07:12:07Z) - DVIS: Decoupled Video Instance Segmentation Framework [15.571072365208872]
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing.
Existing methods often underperform on complex and long videos in the real world, primarily due to two factors.
We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
arXiv Detail & Related papers (2023-06-06T05:24:15Z) - Tracking by Associating Clips [110.08925274049409]
In this paper, we investigate an alternative by treating object association as clip-wise matching.
Our new perspective views a single long video sequence as multiple short clips, and then the tracking is performed both within and between the clips.
The benefits of this new approach are twofold. First, our method is robust to tracking error accumulation or propagation, as the video chunking allows bypassing the interrupted frames.
Second, multi-frame information is aggregated during the clip-wise matching, resulting in more accurate long-range track association than the current frame-wise matching (a toy sketch of this clip-wise idea appears after the list below).
arXiv Detail & Related papers (2022-12-20T10:33:17Z) - Modelling Neighbor Relation in Joint Space-Time Graph for Video Correspondence Learning [53.74240452117145]
This paper presents a self-supervised method for learning reliable visual correspondence from unlabeled videos.
We formulate the correspondence as finding paths in a joint space-time graph, where nodes are grid patches sampled from frames, and are linked by two types of edges.
Our learned representation outperforms the state-of-the-art self-supervised methods on a variety of visual tasks.
arXiv Detail & Related papers (2021-09-28T05:40:01Z) - Video Annotation for Visual Tracking via Selection and Refinement [74.08109740917122]
We present a new framework to facilitate bounding box annotations for video sequences.
A temporal assessment network is proposed which is able to capture the temporal coherence of target locations.
A visual-geometry refinement network is also designed to further enhance the selected tracking results.
arXiv Detail & Related papers (2021-08-09T05:56:47Z) - Continuity-Discrimination Convolutional Neural Network for Visual Object Tracking [150.51667609413312]
This paper proposes a novel model, named Continuity-Discrimination Convolutional Neural Network (CD-CNN) for visual object tracking.
To address this problem, CD-CNN models temporal appearance continuity based on the idea of temporal slowness.
In order to alleviate inaccurate target localization and drifting, we propose a novel notion, object-centroid.
arXiv Detail & Related papers (2021-04-18T06:35:03Z)
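As a toy illustration of the clip-wise association idea summarized above for "Tracking by Associating Clips": a long video is split into short clips, tracks are formed within each clip, and clips are then linked by matching aggregated track features. Everything below is an assumption-laden sketch; the function names are hypothetical and the matching is reduced to greedy cosine-similarity assignment rather than that paper's actual method.

```python
import numpy as np

def chunk_into_clips(frame_features, clip_len=5):
    """Split a list of per-frame detection feature arrays into short clips."""
    return [frame_features[i:i + clip_len]
            for i in range(0, len(frame_features), clip_len)]

def track_within_clip(clip):
    """Toy within-clip tracking: average each detection's feature over the clip.

    Assumes the same number of detections per frame, already in a consistent
    order; a real tracker would solve per-frame assignment here.
    """
    return np.mean(np.stack(clip, axis=0), axis=0)  # (num_tracks, feat_dim)

def associate_clips(prev_tracks, next_tracks):
    """Greedy cosine-similarity matching between two clips' track features."""
    a = prev_tracks / np.linalg.norm(prev_tracks, axis=1, keepdims=True)
    b = next_tracks / np.linalg.norm(next_tracks, axis=1, keepdims=True)
    sim = a @ b.T
    matches, used = {}, set()
    for i in np.argsort(-sim.max(axis=1)):       # most confident rows first
        for j in np.argsort(-sim[i]):
            if j not in used:
                matches[int(i)] = int(j)
                used.add(int(j))
                break
    return matches  # prev-clip track index -> next-clip track index

# Usage: 20 frames, 3 detections each, 64-dim appearance features.
frames = [np.random.rand(3, 64) for _ in range(20)]
clips = chunk_into_clips(frames, clip_len=5)
clip_tracks = [track_within_clip(c) for c in clips]
links = [associate_clips(clip_tracks[k], clip_tracks[k + 1])
         for k in range(len(clip_tracks) - 1)]
```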