TAP-Vid: A Benchmark for Tracking Any Point in a Video
- URL: http://arxiv.org/abs/2211.03726v2
- Date: Fri, 31 Mar 2023 11:51:40 GMT
- Title: TAP-Vid: A Benchmark for Tracking Any Point in a Video
- Authors: Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang
- Abstract summary: We formalize the problem of tracking arbitrary physical points on surfaces over longer video clips, naming it tracking any point (TAP).
We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks.
We propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
- Score: 84.94877216665793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generic motion understanding from video involves not only tracking objects,
but also perceiving how their surfaces deform and move. This information is
useful to make inferences about 3D shape, physical properties and object
interactions. While the problem of tracking arbitrary physical points on
surfaces over longer video clips has received some attention, no dataset or
benchmark for evaluation existed, until now. In this paper, we first formalize
the problem, naming it tracking any point (TAP). We introduce a companion
benchmark, TAP-Vid, which is composed of both real-world videos with accurate
human annotations of point tracks, and synthetic videos with perfect
ground-truth point tracks. Central to the construction of our benchmark is a
novel semi-automatic crowdsourced pipeline which uses optical flow estimates to
compensate for easier, short-term motion like camera shake, allowing annotators
to focus on harder sections of video. We validate our pipeline on synthetic
data and propose a simple end-to-end point tracking model TAP-Net, showing that
it outperforms all prior methods on our benchmark when trained on synthetic
data.
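The annotation pipeline is only described at a high level above. As a minimal sketch of the flow-assisted idea (not the authors' released tooling; the `bilinear_sample` helper, the `propagate_point` function, and the toy flow fields are illustrative assumptions), chaining frame-to-frame optical flow carries a point annotated on one frame through the easier portions of a clip, so a human annotator only needs to correct the track where the flow-based estimate drifts, e.g. at occlusions or fast motion:

```python
import numpy as np

# NOTE: illustrative sketch only, not the TAP-Vid annotation pipeline.

def bilinear_sample(flow, x, y):
    """Bilinearly sample an (H, W, 2) flow field at sub-pixel location (x, y)."""
    h, w = flow.shape[:2]
    x = float(np.clip(x, 0.0, w - 1.001))
    y = float(np.clip(y, 0.0, h - 1.001))
    x0, y0 = int(x), int(y)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * flow[y0, x0] + ax * flow[y0, x0 + 1]
    bot = (1 - ax) * flow[y0 + 1, x0] + ax * flow[y0 + 1, x0 + 1]
    return (1 - ay) * top + ay * bot

def propagate_point(point_xy, flows):
    """Chain frame-to-frame forward flow to carry one annotated point through a clip.

    point_xy: (x, y) annotated on frame 0.
    flows: list of (H, W, 2) flow fields; flows[t] maps frame t to frame t + 1.
    Returns an array of shape (len(flows) + 1, 2): the flow-propagated track.
    """
    track = [np.asarray(point_xy, dtype=np.float64)]
    for flow in flows:
        x, y = track[-1]
        dxdy = bilinear_sample(flow, x, y)   # (dx, dy) at the current position
        track.append(track[-1] + dxdy)
    return np.stack(track)

# Toy usage: constant one-pixel-per-frame rightward motion over 5 frames.
flows = [np.tile(np.array([1.0, 0.0]), (48, 64, 1)) for _ in range(5)]
print(propagate_point((10.0, 20.0), flows))
```

As the abstract notes, flow only compensates for the easier, short-term motion; the benchmark's real-video ground truth comes from human annotators, who concentrate on the harder sections of video.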
Related papers
- Long-Term 3D Point Tracking By Cost Volume Fusion [2.3411633024711573]
We propose the first deep learning framework for long-term point tracking in 3D that generalizes to new points and videos without requiring test-time fine-tuning.
Our model integrates multiple past appearances and motion information via a transformer architecture, significantly enhancing overall tracking performance.
arXiv Detail & Related papers (2024-07-18T09:34:47Z)
- TAPVid-3D: A Benchmark for Tracking Any Point in 3D [63.060421798990845]
We introduce a new benchmark, TAPVid-3D, for evaluating the task of Tracking Any Point in 3D.
This benchmark will serve as a guidepost to improve our ability to understand precise 3D motion and surface deformation from monocular video.
arXiv Detail & Related papers (2024-07-08T13:28:47Z)
- Dense Optical Tracking: Connecting the Dots [82.79642869586587]
DOT is a novel, simple and efficient method for solving the problem of point tracking in a video.
We show that DOT is significantly more accurate than current optical flow techniques, outperforms sophisticated "universal trackers" like OmniMotion, and is on par with, or better than, the best point tracking algorithms like CoTracker.
arXiv Detail & Related papers (2023-12-01T18:59:59Z)
- PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking [90.29143475328506]
We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework.
Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion.
We animate deformable characters using real-world motion capture data, build 3D scenes to match the motion capture environments, and render camera viewpoints using trajectories mined via structure-from-motion on real videos.
arXiv Detail & Related papers (2023-07-27T17:58:11Z)
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS (a rough sketch of the AJ metric follows this list).
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
- Monocular Quasi-Dense 3D Object Tracking [99.51683944057191]
A reliable and accurate 3D tracking framework is essential for predicting future locations of surrounding objects and planning the observer's actions in numerous applications such as autonomous driving.
We propose a framework that can effectively associate moving objects over time and estimate their full 3D bounding box information from a sequence of 2D images captured on a moving platform.
arXiv Detail & Related papers (2021-03-12T15:30:02Z)
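Several of the results above, including the TAPIR entry's DAVIS number, are reported as average Jaccard (AJ), the headline TAP-Vid metric. The snippet below is a rough, unofficial sketch of how such a metric can be computed, not the benchmark's released evaluation code: at each pixel threshold a point counts as a true positive only if it is predicted visible, actually visible, and within the threshold; remaining visibility and position disagreements count as false positives or false negatives, and the per-threshold Jaccard scores are averaged. The threshold values and the 256x256 coordinate convention are assumptions here.

```python
import numpy as np

# NOTE: rough sketch of a TAP-style average Jaccard metric; thresholds and
# coordinate scale are assumptions, and the official evaluation may differ.

def average_jaccard(pred_xy, pred_vis, gt_xy, gt_vis, thresholds=(1, 2, 4, 8, 16)):
    """pred_xy, gt_xy: (N, T, 2) point positions in pixels (e.g. at 256x256 scale).
    pred_vis, gt_vis: (N, T) booleans, True where the point is visible."""
    err = np.linalg.norm(pred_xy - gt_xy, axis=-1)    # (N, T) position error
    jaccards = []
    for thr in thresholds:
        within = err <= thr
        tp = np.sum(pred_vis & gt_vis & within)       # visible and close enough
        fp = np.sum(pred_vis & ~(gt_vis & within))    # predicted visible, GT occluded or too far
        fn = np.sum(gt_vis & ~(pred_vis & within))    # GT visible, predicted occluded or too far
        jaccards.append(tp / max(tp + fp + fn, 1))
    return float(np.mean(jaccards))

# Toy usage: 3 tracks over 10 frames with small Gaussian position noise.
rng = np.random.default_rng(0)
gt_xy = rng.uniform(0, 256, size=(3, 10, 2))
gt_vis = rng.random((3, 10)) > 0.2
pred_xy = gt_xy + rng.normal(0.0, 3.0, size=gt_xy.shape)
pred_vis = gt_vis.copy()
print(average_jaccard(pred_xy, pred_vis, gt_xy, gt_vis))
```

The key design point in a metric of this form is that a prediction is only rewarded when both the visibility flag and the position are right, so a tracker cannot score well by marking every point visible or by ignoring occlusions.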
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.