Related papers: TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement

TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement

URL: http://arxiv.org/abs/2306.08637v2
Date: Wed, 30 Aug 2023 14:28:37 GMT
Title: TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement
Authors: Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, Andrew Zisserman
Abstract summary: We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
Score: 64.11385310305612
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time, and can be flexibly extended to higher-resolution videos. Given the high-quality trajectories extracted from a large dataset, we demonstrate a proof-of-concept diffusion model which generates trajectories from static images, enabling plausible animations. Visualizations, source code, and pretrained models can be found on our project webpage.

Related papers

Online Dense Point Tracking with Streaming Memory [54.22820729477756]
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one. We present a lightweight and fast model with textbfStreaming memory for dense textbfPOint textbfTracking and online video processing.
arXiv Detail & Related papers (2025-03-09T06:16:49Z)
Event Stream-based Visual Object Tracking: HDETrack V2 and A High-Definition Benchmark [36.9654606035663]
We introduce a novel hierarchical knowledge distillation strategy to guide the learning of the student Transformer network. We adapt the network model to specific target objects during testing via a newly proposed test-time tuning strategy. We propose EventVOT, the first large-scale high-resolution event-based tracking dataset.
arXiv Detail & Related papers (2025-02-08T13:59:52Z)
Track-On: Transformer-based Online Point Tracking with Memory [34.744546679670734]
We introduce Track-On, a simple transformer-based model designed for online long-term point tracking. Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames. At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy.
arXiv Detail & Related papers (2025-01-30T17:04:11Z)
Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) needs to concern the high across-frame variation in object appearance, and the diverse deterioration in some frames. Most of contemporary aggregation methods are tailored for two-stage detectors, suffering from high computational costs. This study invents a very simple yet potent strategy of feature selection and aggregation, gaining significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z)
PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking [90.29143475328506]
We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework. Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion. We animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos.
arXiv Detail & Related papers (2023-07-27T17:58:11Z)
A Fast and Map-Free Model for Trajectory Prediction in Traffics [2.435517936694533]
This paper proposes an efficient trajectory prediction model that is not dependent on traffic maps. By comprehensively utilizing attention mechanism, LSTM, graph convolution network and temporal transformer, our model is able to learn rich dynamic and interaction information of all agents. Our model achieves the highest performance when comparing with existing map-free methods and also exceeds most map-based state-of-the-art methods on the Argoverse dataset.
arXiv Detail & Related papers (2023-07-19T08:36:31Z)
TAP-Vid: A Benchmark for Tracking Any Point in a Video [84.94877216665793]
We formalize the problem of tracking arbitrary physical points on surfaces over longer video clips, naming it tracking any point (TAP) We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. We propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
arXiv Detail & Related papers (2022-11-07T17:57:02Z)
Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework [76.70603443624012]
We propose a novel one-stream tracking (OSTrack) framework that unifies feature learning and relation modeling. In this way, discriminative target-oriented features can be dynamically extracted by mutual guidance. OSTrack achieves state-of-the-art performance on multiple benchmarks, in particular, it shows impressive results on the one-shot tracking benchmark GOT-10k.
arXiv Detail & Related papers (2022-03-22T18:37:11Z)
Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together. We show that our per-frame computation model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z)
Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS) We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance. We achieve new state-of-the-art performance on the DAVIS benchmark without complicated bells and whistles in both speed and accuracy, with a speed of 0.14 second per frame and J&F measure of 75.9% respectively.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.