CoTracker: It is Better to Track Together
- URL: http://arxiv.org/abs/2307.07635v2
- Date: Tue, 26 Dec 2023 12:13:18 GMT
- Title: CoTracker: It is Better to Track Together
- Authors: Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova,
Andrea Vedaldi, Christian Rupprecht
- Abstract summary: CoTracker tracks dense points in a frame jointly across a video sequence.
We show that joint tracking results in significantly higher tracking accuracy and robustness.
CoTracker operates causally on short windows, but is trained by unrolling the windows across longer video sequences.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce CoTracker, a transformer-based model that tracks dense points in
a frame jointly across a video sequence. This differs from most existing
state-of-the-art approaches that track points independently, ignoring their
correlation. We show that joint tracking results in significantly higher
tracking accuracy and robustness. We also provide several technical
innovations, including the concept of virtual tracks, which allows CoTracker to
track 70k points jointly and simultaneously. Furthermore, CoTracker operates
causally on short windows (hence, it is suitable for online tasks), but is
trained by unrolling the windows across longer video sequences, which enables
and significantly improves long-term tracking. We demonstrate qualitatively
impressive tracking results, where points can be tracked for a long time even
when they are occluded or leave the field of view. Quantitatively, CoTracker
outperforms all recent trackers on standard benchmarks, often by a substantial
margin.
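The abstract's two key mechanisms, causal inference over short windows and training by unrolling windows across longer sequences, can be illustrated with a minimal sketch. The toy WindowTracker, the window length, and the stride below are assumptions for illustration, not CoTracker's actual transformer; the sketch only shows how track state is carried causally across windows while gradients flow through the whole unrolled sequence.

```python
import torch
import torch.nn as nn

class WindowTracker(nn.Module):
    """Toy stand-in for a joint point tracker over one window
    (hypothetical; the frames are unused in this toy module)."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Linear(2, 2)  # refines all (x, y) tracks at once

    def forward(self, window_frames, tracks):
        # All N tracks are updated together -- the "track together"
        # idea: points can share evidence with each other.
        return tracks + 0.1 * self.refine(tracks)

def track_unrolled(frames, init_tracks, tracker, win=8, stride=4):
    """Causal sliding-window tracking, unrolled over the whole video.
    frames: (B, T, C, H, W); init_tracks: (B, N, 2) point locations.
    Each window sees only its own frames plus the carried-forward
    track state, so inference is causal and suitable for online use."""
    T = frames.shape[1]
    tracks, outputs = init_tracks, []
    for start in range(0, max(T - win, 0) + 1, stride):
        window = frames[:, start:start + win]  # short overlapping window
        tracks = tracker(window, tracks)       # state carried forward
        outputs.append(tracks)
    return outputs

# Usage: 24 frames, 5 points, windows of 8 frames with stride 4.
frames = torch.randn(1, 24, 3, 32, 32)
outs = track_unrolled(frames, torch.rand(1, 5, 2), WindowTracker())
loss = torch.stack(outs).sum()  # a real loss would compare to GT tracks
loss.backward()                 # backprop through all unrolled windows
```

A real training loss would supervise every window's predictions against ground-truth tracks, so the parameters receive gradients shaped by long-range behaviour even though inference stays causal and windowed.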
Related papers
- Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object Tracking
Joint Detection and Embedding (JDE) trackers have demonstrated excellent performance in Multi-Object Tracking (MOT) tasks.
We propose a new learning approach using cross-correlation to capture temporal information of objects.
Our tracker, named TCBTrack, achieves state-of-the-art performance on multiple public benchmarks.
arXiv Detail & Related papers (2024-07-19T07:48:45Z)
- OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning
We present a general framework to unify various tracking tasks, termed OneTracker.
OneTracker first performs large-scale pre-training on an RGB tracker called Foundation Tracker.
Then we regard other modality information as a prompt and build Prompt Tracker upon Foundation Tracker.
arXiv Detail & Related papers (2024-03-14T17:59:13Z)
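The Foundation-plus-Prompt recipe above can be sketched generically: freeze a pre-trained RGB tracker and train only a small adapter that injects the auxiliary modality as a prompt. Everything below (module names, shapes, the additive prompt) is an assumption for illustration, not OneTracker's actual design.

```python
import torch
import torch.nn as nn

class FoundationTracker(nn.Module):
    """Toy stand-in for a large pre-trained RGB tracker (hypothetical)."""
    def __init__(self, dim=32):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)

    def forward(self, rgb_feat, prompt=None):
        x = rgb_feat if prompt is None else rgb_feat + prompt
        return self.backbone(x)

class PromptTracker(nn.Module):
    """Wraps the frozen foundation tracker; only the small adapter that
    turns auxiliary-modality features (e.g. depth, thermal, language)
    into a prompt is trained -- hence "efficient tuning"."""
    def __init__(self, foundation, aux_dim=8, dim=32):
        super().__init__()
        self.foundation = foundation
        for p in self.foundation.parameters():
            p.requires_grad = False         # freeze the foundation tracker
        self.adapter = nn.Linear(aux_dim, dim)  # the only trained weights

    def forward(self, rgb_feat, aux_feat):
        return self.foundation(rgb_feat, prompt=self.adapter(aux_feat))

# Usage: gradients reach only the adapter's parameters.
model = PromptTracker(FoundationTracker())
model(torch.randn(4, 32), torch.randn(4, 8)).sum().backward()
```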
- Tracking with Human-Intent Reasoning
This work proposes a new tracking task -- Instruction Tracking.
It involves providing implicit tracking instructions that require trackers to perform tracking in video frames automatically.
TrackGPT is capable of performing complex reasoning-based tracking.
arXiv Detail & Related papers (2023-12-29T03:22:18Z)
- DriveTrack: A Benchmark for Long-Range Point Tracking in Real-World Videos
DriveTrack is a new benchmark and data generation framework for keypoint tracking in real-world videos.
We release a dataset consisting of 1 billion point tracks across 24 hours of video, which is seven orders of magnitude greater than prior real-world benchmarks.
We show that fine-tuning keypoint trackers on DriveTrack improves accuracy on real-world scenes by up to 7%.
arXiv Detail & Related papers (2023-12-15T04:06:52Z)
- Collaborative Tracking Learning for Frame-Rate-Insensitive Multi-Object Tracking
Multi-object tracking (MOT) at low frame rates can reduce computational, storage and power overhead to better meet the constraints of edge devices.
We propose to explore collaborative tracking learning (ColTrack) for frame-rate-insensitive MOT in a query-based end-to-end manner.
arXiv Detail & Related papers (2023-08-11T02:25:58Z)
- Tracking by Associating Clips
In this paper, we investigate an alternative by treating object association as clip-wise matching.
Our new perspective views a single long video sequence as multiple short clips, and then the tracking is performed both within and between the clips.
The benefits of this new approach are twofold. First, our method is robust to tracking error accumulation or propagation, as the video chunking allows bypassing interrupted frames.
Second, multi-frame information is aggregated during the clip-wise matching, resulting in more accurate long-range track association than frame-wise matching.
arXiv Detail & Related papers (2022-12-20T10:33:17Z)
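A generic sketch of the clip-wise association described above: average per-frame object embeddings within each clip, then match consecutive clips with Hungarian assignment. The clip length, cosine cost, and embedding format are assumptions for illustration, not the paper's exact method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clip_embeddings(frame_emb, clip_len=5):
    """Split per-frame object embeddings (T, N, D) into clips and average
    within each clip; aggregating over several frames is what makes the
    match more robust than frame-wise matching."""
    return [frame_emb[s:s + clip_len].mean(axis=0)       # (N, D) per clip
            for s in range(0, frame_emb.shape[0], clip_len)]

def associate_clips(clips):
    """Match object identities between consecutive clips."""
    ids_per_clip = [np.arange(clips[0].shape[0])]  # ids in the first clip
    for prev, curr in zip(clips, clips[1:]):
        prev_n = prev / np.linalg.norm(prev, axis=1, keepdims=True)
        curr_n = curr / np.linalg.norm(curr, axis=1, keepdims=True)
        cost = -curr_n @ prev_n.T           # negative cosine similarity
        _, col = linear_sum_assignment(cost)
        ids_per_clip.append(ids_per_clip[-1][col])  # inherit matched ids
    return ids_per_clip

# Usage: 20 frames, 3 objects, 8-dim embeddings -> 4 clips of 5 frames.
ids = associate_clips(clip_embeddings(np.random.randn(20, 3, 8)))
```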
- CoCoLoT: Combining Complementary Trackers in Long-Term Visual Tracking
We propose a framework, named CoCoLoT, that combines the characteristics of complementary visual trackers to achieve enhanced long-term tracking performance.
CoCoLoT perceives whether the trackers are following the target object through an online-learned deep verification model, and accordingly activates a decision policy.
The proposed methodology is evaluated extensively and the comparison with several other solutions reveals that it competes favourably with the state-of-the-art on the most popular long-term visual tracking benchmarks.
arXiv Detail & Related papers (2022-05-09T13:25:13Z)
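The combiner loop above is also simple to sketch generically: each complementary tracker proposes a box, a verifier scores the proposals, and a decision policy keeps the most trusted one. The stub tracker, the 0.5 threshold, and the re-seeding policy are assumptions, not CoCoLoT's actual components.

```python
import random

class StubTracker:
    """Placeholder single-object tracker (hypothetical)."""
    def __init__(self):
        self.box = (0, 0, 10, 10)
    def update(self, frame):
        return self.box
    def reset(self, box):
        self.box = box

def combiner_step(frame, trackers, verify):
    """One combiner step: every tracker proposes a box, the verifier
    scores how confident it is that each box still covers the target,
    and the policy keeps the most trusted box and re-seeds the rest."""
    boxes = [t.update(frame) for t in trackers]
    scores = [verify(frame, b) for b in boxes]
    best = max(range(len(boxes)), key=scores.__getitem__)
    for i, t in enumerate(trackers):
        if i != best and scores[i] < 0.5:   # low confidence: re-seed
            t.reset(boxes[best])
    return boxes[best]

# Usage with a random verifier standing in for the learned model.
out = combiner_step(None, [StubTracker(), StubTracker()],
                    lambda f, b: random.random())
```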
- Real-time Online Multi-Object Tracking in Compressed Domain
Recent online Multi-Object Tracking (MOT) methods have achieved desirable tracking performance.
Inspired by the fact that adjacent frames are highly relevant and redundant, we divide the frames into key and non-key frames.
Our tracker is about 6x faster while maintaining comparable tracking performance.
arXiv Detail & Related papers (2022-04-05T09:47:24Z)
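The key/non-key split above can be sketched as a scheduling loop: run the full detector only on key frames and, on the redundant non-key frames in between, shift boxes with the motion vectors the compressed stream already carries. The detector interface, motion-vector layout, and key interval below are assumptions, not the paper's actual pipeline.

```python
import numpy as np

def shift_box(box, mv):
    """Shift an (x1, y1, x2, y2) box by the mean motion vector inside it;
    mv is assumed to be an (H, W, 2) array of per-pixel (dx, dy)."""
    x1, y1, x2, y2 = (int(v) for v in box)
    region = mv[y1:y2, x1:x2].reshape(-1, 2)
    if region.size == 0:
        return box
    dx, dy = region.mean(axis=0)
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)

def track_compressed(frames, motion_vectors, detector, key_interval=6):
    """Run the expensive detector on key frames only; propagate boxes
    cheaply on the redundant non-key frames in between."""
    boxes, per_frame = [], []
    for t, frame in enumerate(frames):
        if t % key_interval == 0:
            boxes = detector(frame)                    # key frame: detect
        else:
            boxes = [shift_box(b, motion_vectors[t])   # non-key: propagate
                     for b in boxes]
        per_frame.append(list(boxes))
    return per_frame

# Usage with a dummy detector that always returns one box.
frames = [np.zeros((64, 64, 3)) for _ in range(12)]
mvs = [np.full((64, 64, 2), 0.5) for _ in range(12)]  # uniform drift
tracks = track_compressed(frames, mvs, lambda f: [(10, 10, 20, 20)])
```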
- Learning to Track Objects from Unlabeled Videos
In this paper, we propose to learn an Unsupervised Single Object Tracker (USOT) from scratch.
To narrow the gap between unsupervised trackers and supervised counterparts, we propose an effective unsupervised learning approach composed of three stages.
Experiments show that the proposed USOT, learned from unlabeled videos, outperforms state-of-the-art unsupervised trackers by large margins.
arXiv Detail & Related papers (2021-08-28T22:10:06Z)