CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
- URL: http://arxiv.org/abs/2410.11831v1
- Date: Tue, 15 Oct 2024 17:56:32 GMT
- Title: CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
- Authors: Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht
- Abstract summary: We introduce CoTracker3, comprising a new tracking model and a new semi-supervised training recipe.
This allows real videos without annotations to be used during training by generating pseudo-labels using off-the-shelf teachers.
The model is available in online and offline variants and reliably tracks visible and occluded points.
- Score: 63.90674869153876
- Abstract: Most state-of-the-art point trackers are trained on synthetic data due to the difficulty of annotating real videos for this task. However, this can result in suboptimal performance due to the statistical gap between synthetic and real videos. In order to understand these issues better, we introduce CoTracker3, comprising a new tracking model and a new semi-supervised training recipe. This allows real videos without annotations to be used during training by generating pseudo-labels using off-the-shelf teachers. The new model eliminates or simplifies components from previous trackers, resulting in a simpler and often smaller architecture. This training scheme is much simpler than prior work and achieves better results using 1,000 times less data. We further study the scaling behaviour to understand the impact of using more real unsupervised data in point tracking. The model is available in online and offline variants and reliably tracks visible and occluded points.
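The semi-supervised recipe described in the abstract is easy to picture as a short training loop: a frozen off-the-shelf teacher tracker pseudo-labels unlabelled real videos, and the student regresses those tracks. Below is a minimal sketch of one such step; the `teacher`, `student`, and query-sampling interfaces are hypothetical stand-ins, not the paper's actual API.

```python
import torch

def pseudo_label_step(student, teacher, video, optimizer, num_queries=256):
    """One semi-supervised step: a frozen teacher pseudo-labels an
    unlabelled real video and the student regresses those tracks.

    `student`/`teacher` are hypothetical callables mapping
    (video[T,3,H,W], queries[N,3]) -> tracks[T,N,2]; queries are
    (t, x, y) triples sampled at random from the video.
    """
    T, _, H, W = video.shape
    # Sample random (frame index, x, y) query points to track.
    t = torch.randint(0, T, (num_queries, 1)).float()
    xy = torch.rand(num_queries, 2) * torch.tensor([W, H]).float()
    queries = torch.cat([t, xy], dim=1)

    with torch.no_grad():                       # teacher stays frozen
        pseudo_tracks = teacher(video, queries)

    pred_tracks = student(video, queries)       # student prediction
    loss = torch.nn.functional.l1_loss(pred_tracks, pseudo_tracks)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The abstract mentions multiple off-the-shelf teachers; this sketch uses a single teacher and omits any pseudo-label filtering for brevity.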
Related papers
- Refining Pre-Trained Motion Models [56.18044168821188]
We take on the challenge of improving state-of-the-art supervised models with self-supervised training.
We focus on obtaining a "clean" training signal from real-world unlabelled video.
We show that our method yields reliable gains over fully-supervised methods in real videos.
arXiv Detail & Related papers (2024-01-01T18:59:33Z)
- TAP-Vid: A Benchmark for Tracking Any Point in a Video [84.94877216665793]
We formalize the problem of tracking arbitrary physical points on surfaces over longer video clips, naming it tracking any point (TAP).
We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks.
We propose a simple end-to-end point tracking model, TAP-Net, and show that it outperforms all prior methods on our benchmark when trained on synthetic data.
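For context, TAP-style benchmarks typically score trackers by position accuracy: the fraction of visible points predicted within a set of pixel thresholds, averaged over the thresholds. The sketch below is a simplified reading of such a metric, not the benchmark's reference implementation.

```python
import numpy as np

def position_accuracy(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Average fraction of visible points within each pixel threshold.

    pred, gt: [T, N, 2] predicted / ground-truth (x, y) positions.
    visible:  [T, N] boolean ground-truth visibility mask.
    """
    err = np.linalg.norm(pred - gt, axis=-1)          # [T, N] pixel error
    accs = [(err[visible] < thr).mean() for thr in thresholds]
    return float(np.mean(accs))
```

Benchmarks of this kind usually also score occlusion prediction separately; the sketch covers positions only.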
arXiv Detail & Related papers (2022-11-07T17:57:02Z)
- TDT: Teaching Detectors to Track without Fully Annotated Videos [2.8292841621378844]
One-stage trackers that predict both detections and appearance embeddings in a single forward pass have received much attention.
Our proposed one-stage solution matches the two-stage counterpart in quality but is 3 times faster.
arXiv Detail & Related papers (2022-05-11T15:56:17Z)
- Learning to Track Objects from Unlabeled Videos [63.149201681380305]
In this paper, we propose to learn an Unsupervised Single Object Tracker (USOT) from scratch.
To narrow the gap between unsupervised trackers and supervised counterparts, we propose an effective unsupervised learning approach composed of three stages.
Experiments show that the proposed USOT, learned from unlabeled videos, outperforms state-of-the-art unsupervised trackers by large margins.
arXiv Detail & Related papers (2021-08-28T22:10:06Z)
- Multi-Object Tracking with Hallucinated and Unlabeled Videos [34.38275236770619]
In place of tracking annotations, we first hallucinate videos with bounding box annotations using zoom-in/out motion transformations.
We then mine hard examples across an unlabeled pool of real videos with a tracker trained on our hallucinated video data.
Our weakly supervised tracker achieves state-of-the-art performance on the MOT17 and TAO-person datasets.
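To illustrate the hallucination idea, the sketch below turns one annotated image into a short clip by progressively zooming toward the centre and applying the same crop to the box coordinates; the centred-crop schedule is an assumption for illustration, not the paper's exact transformation.

```python
import numpy as np

def hallucinate_zoom_clip(image, box, num_frames=8, max_zoom=1.5):
    """Create a fake clip by zooming into `image`; the box is
    transformed by the same crop, yielding free track annotations.

    image: [H, W, 3] uint8 array; box: (x1, y1, x2, y2) in pixels.
    Returns (frames, boxes): per-frame crops and box coordinates.
    Crops are not resized here, to keep the sketch short.
    """
    H, W = image.shape[:2]
    frames, boxes = [], []
    for i in range(num_frames):
        zoom = 1.0 + (max_zoom - 1.0) * i / (num_frames - 1)
        ch, cw = int(H / zoom), int(W / zoom)        # crop size
        y0, x0 = (H - ch) // 2, (W - cw) // 2        # centred crop origin
        frames.append(image[y0:y0 + ch, x0:x0 + cw])
        x1, y1, x2, y2 = box
        boxes.append((x1 - x0, y1 - y0, x2 - x0, y2 - y0))
    return frames, boxes
```

Resizing each crop back to a fixed resolution (omitted here) would yield a constant-size clip.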
arXiv Detail & Related papers (2021-08-19T17:57:29Z)
- Unsupervised Deep Representation Learning for Real-Time Tracking [137.69689503237893]
We propose an unsupervised learning method for visual tracking.
The motivation for our unsupervised learning is that a robust tracker should remain effective when tracking both forward and backward in time.
We build our framework on a Siamese correlation filter network, and propose a multi-frame validation scheme and a cost-sensitive loss to facilitate unsupervised learning.
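The bidirectional motivation can be written down as a forward-backward consistency loss: track a point forward through the clip, track the result backward, and penalise the distance to the starting location. A minimal sketch, assuming a hypothetical differentiable `track` function:

```python
import torch

def cycle_consistency_loss(track, video, start_xy):
    """Forward-backward tracking loss: a robust tracker that follows a
    point forward and then backward should return to where it started.

    `track` is a hypothetical differentiable callable mapping
    (video[T,3,H,W], xy[N,2], reverse) -> final positions [N,2].
    """
    fwd_end = track(video, start_xy, reverse=False)   # track forward
    bwd_end = track(video, fwd_end, reverse=True)     # track back
    return torch.nn.functional.mse_loss(bwd_end, start_xy)
```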
arXiv Detail & Related papers (2020-07-22T08:23:12Z)
- Dense Regression Network for Video Grounding [97.57178850020327]
We use the distances between each frame within the ground-truth segment and the starting (ending) frame as dense supervision to improve video grounding accuracy.
Specifically, we design a novel dense regression network (DRN) to regress the distances from each frame to the starting (ending) frame of the video segment.
We also propose a simple but effective IoU regression head module to explicitly consider the localization quality of the grounding results.
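Concretely, the dense supervision can be constructed per frame: every frame inside the ground-truth segment regresses its distance to the segment's start and end frames, while frames outside are masked from the loss. A small sketch of how such targets might be built (the masking convention is an assumption):

```python
import numpy as np

def dense_regression_targets(num_frames, t_start, t_end):
    """For each frame inside the ground-truth segment [t_start, t_end],
    the target is its distance to the start and end frames; frames
    outside the segment are masked out of the loss.

    Returns targets [T, 2] and a boolean mask [T].
    """
    t = np.arange(num_frames)
    inside = (t >= t_start) & (t <= t_end)
    targets = np.stack([t - t_start, t_end - t], axis=1).astype(float)
    return targets * inside[:, None], inside
```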
arXiv Detail & Related papers (2020-04-07T17:15:37Z)