TAPIR: Tracking Any Point with per-frame Initialization and temporal
Refinement
- URL: http://arxiv.org/abs/2306.08637v2
- Date: Wed, 30 Aug 2023 14:28:37 GMT
- Title: TAPIR: Tracking Any Point with per-frame Initialization and temporal
Refinement
- Authors: Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf
Aytar, Joao Carreira, Andrew Zisserman
- Abstract summary: We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
- Score: 64.11385310305612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel model for Tracking Any Point (TAP) that effectively tracks
any queried point on any physical surface throughout a video sequence. Our
approach employs two stages: (1) a matching stage, which independently locates
a suitable candidate point match for the query point on every other frame, and
(2) a refinement stage, which updates both the trajectory and query features
based on local correlations. The resulting model surpasses all baseline methods
by a significant margin on the TAP-Vid benchmark, as demonstrated by an
approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model
facilitates fast inference on long and high-resolution video sequences. On a
modern GPU, our implementation has the capacity to track points faster than
real-time, and can be flexibly extended to higher-resolution videos. Given the
high-quality trajectories extracted from a large dataset, we demonstrate a
proof-of-concept diffusion model which generates trajectories from static
images, enabling plausible animations. Visualizations, source code, and
pretrained models can be found on our project webpage.
Related papers
- Event Stream-based Visual Object Tracking: HDETrack V2 and A High-Definition Benchmark [36.9654606035663]
We introduce a novel hierarchical knowledge distillation strategy to guide the learning of the student Transformer network.
We adapt the network model to specific target objects during testing via a newly proposed test-time tuning strategy.
We propose EventVOT, the first large-scale high-resolution event-based tracking dataset.
arXiv Detail & Related papers (2025-02-08T13:59:52Z) - Track-On: Transformer-based Online Point Tracking with Memory [34.744546679670734]
We introduce Track-On, a simple transformer-based model designed for online long-term point tracking.
Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames.
At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy.
arXiv Detail & Related papers (2025-01-30T17:04:11Z) - Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) needs to concern the high across-frame variation in object appearance, and the diverse deterioration in some frames.
Most of contemporary aggregation methods are tailored for two-stage detectors, suffering from high computational costs.
This study invents a very simple yet potent strategy of feature selection and aggregation, gaining significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z) - PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point
Tracking [90.29143475328506]
We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework.
Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion.
We animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos.
arXiv Detail & Related papers (2023-07-27T17:58:11Z) - TAP-Vid: A Benchmark for Tracking Any Point in a Video [84.94877216665793]
We formalize the problem of tracking arbitrary physical points on surfaces over longer video clips, naming it tracking any point (TAP)
We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks.
We propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
arXiv Detail & Related papers (2022-11-07T17:57:02Z) - Joint Feature Learning and Relation Modeling for Tracking: A One-Stream
Framework [76.70603443624012]
We propose a novel one-stream tracking (OSTrack) framework that unifies feature learning and relation modeling.
In this way, discriminative target-oriented features can be dynamically extracted by mutual guidance.
OSTrack achieves state-of-the-art performance on multiple benchmarks, in particular, it shows impressive results on the one-shot tracking benchmark GOT-10k.
arXiv Detail & Related papers (2022-03-22T18:37:11Z) - Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together.
We show that our per-frame computation model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z) - Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark without complicated bells and whistles in both speed and accuracy, with a speed of 0.14 second per frame and J&F measure of 75.9% respectively.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.