Repurposing Video Diffusion Transformers for Robust Point Tracking
- URL: http://arxiv.org/abs/2512.20606v1
- Date: Tue, 23 Dec 2025 18:54:10 GMT
- Title: Repurposing Video Diffusion Transformers for Robust Point Tracking
- Authors: Soowon Son, Honggyu An, Chaehyun Kim, Hyunah Ko, Jisu Nam, Dahyun Chung, Siyoon Jin, Jung Yi, Jaewon Min, Junhwa Hur, Seungryong Kim
- Abstract summary: Existing methods rely on shallow convolutional backbones such as ResNet that process frames independently. We find that video Diffusion Transformers (DiTs) inherently exhibit strong point tracking capability and robustly handle dynamic motions. Our work validates video DiT features as an effective and efficient foundation for point tracking.
- Score: 35.486648006768256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Existing methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently, lacking temporal coherence and producing unreliable matching costs under challenging conditions. Through systematic analysis, we find that video Diffusion Transformers (DiTs), pre-trained on large-scale real-world videos with spatio-temporal attention, inherently exhibit strong point tracking capability and robustly handle dynamic motions and frequent occlusions. We propose DiTracker, which adapts video DiTs through: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. Despite training with an 8× smaller batch size, DiTracker achieves state-of-the-art performance on the challenging ITTO benchmark and matches or outperforms state-of-the-art models on the TAP-Vid benchmarks. Our work validates video DiT features as an effective and efficient foundation for point tracking.
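The abstract names three ingredients: query-key attention matching inside the DiT, lightweight LoRA tuning, and cost fusion with a ResNet backbone. A minimal sketch of the matching step, assuming per-frame queries and keys can be cached from one spatio-temporal attention layer; the shapes, names, and soft-argmax readout below are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def qk_cost(q, k, t_src, idx_src):
    """Similarity of one source token to all tokens of every frame.

    q, k: (T, N, C) attention queries/keys cached from one DiT layer,
    with N = H*W tokens per frame (a hypothetical cache layout)."""
    query = q[t_src, idx_src]                      # (C,)
    sim = torch.einsum("c,tnc->tn", query, k)      # (T, N) QK logits
    return sim / q.shape[-1] ** 0.5                # scaled dot product

def soft_argmax_2d(cost, H, W, temperature=0.05):
    """Turn one frame's cost map into a sub-token (x, y) estimate."""
    prob = F.softmax(cost.flatten() / temperature, dim=0).view(H, W)
    ys = torch.arange(H, dtype=prob.dtype)
    xs = torch.arange(W, dtype=prob.dtype)
    return (prob.sum(0) * xs).sum(), (prob.sum(1) * ys).sum()

# Toy usage: track the token at grid cell (row 5, col 7) of frame 0.
T, H, W, C = 8, 32, 32, 64
q, k = torch.randn(T, H * W, C), torch.randn(T, H * W, C)
cost = qk_cost(q, k, t_src=0, idx_src=5 * W + 7)   # (T, H*W)
x, y = soft_argmax_2d(cost[3], H, W)               # estimate in frame 3
```

The paper's cost fusion would then combine such a DiT cost map with a ResNet correlation map; a weighted sum or a small learned head over the stacked maps are plausible forms, though the exact design is in the paper.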
Related papers
- DiTraj: training-free trajectory control for video diffusion transformer [34.05715460730871]
Trajectory control represents a user-friendly task in controllable video generation. We propose DiTraj, a training-free framework for trajectory control in text-to-video generation tailored for DiT. Our method outperforms previous methods in both video quality and trajectory controllability.
arXiv Detail & Related papers (2025-09-26T03:53:31Z) - Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos [58.156141601478794]
Multi-object tracking in UAV-captured videos (UAVT) aims to track multiple objects while maintaining consistent identities across frames of a given video. Existing methods typically model motion and appearance cues separately, overlooking their interplay and resulting in suboptimal tracking performance. We propose AMOT, which jointly exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module.
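The summary does not spell out the AMC matrix; one plausible reading, sketched below with every detail assumed, is appearance similarity gated by motion (box-overlap) agreement:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def amc_matrix(track_emb, det_emb, track_boxes, det_boxes):
    """Hypothetical Appearance-Motion Consistency matrix: appearance
    cosine similarity gated by predicted-box IoU. This is a guess at
    the idea, not the paper's exact formulation.

    track_emb: (M, C), det_emb: (N, C); boxes: (M, 4) / (N, 4) xyxy."""
    app = F.normalize(track_emb, dim=1) @ F.normalize(det_emb, dim=1).T  # (M, N)
    iou = box_iou(track_boxes, det_boxes)                                # (M, N)
    return app * iou  # high only when appearance AND motion agree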
arXiv Detail & Related papers (2025-08-03T12:06:47Z) - Emergent Temporal Correspondences from Video Diffusion Transformers [30.83001895223298]
We introduce DiffTrack, the first quantitative analysis framework designed to study how temporal correspondences emerge inside video diffusion Transformers. Our analysis reveals that query-key similarities in specific, but not all, layers play a critical role in temporal matching. We extend our findings to motion-enhanced video generation with a novel guidance method that improves the temporal consistency of generated videos without additional training.
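DiffTrack's layer-wise finding suggests a simple probe: score each attention layer by how often the query-key argmax lands on the ground-truth correspondence. A sketch under an assumed cache layout:

```python
import torch

def layerwise_matching_accuracy(q, k, t_src, t_tgt, src_idx, tgt_idx):
    """q, k: (L, T, N, C) queries/keys cached per layer and frame
    (assumed layout); src_idx, tgt_idx: (P,) token indices of P
    ground-truth correspondences between frames t_src and t_tgt."""
    accs = []
    for layer in range(q.shape[0]):
        sim = q[layer, t_src, src_idx] @ k[layer, t_tgt].T  # (P, N)
        pred = sim.argmax(dim=1)                            # best match per point
        accs.append((pred == tgt_idx).float().mean().item())
    return accs  # peaks identify the layers that do temporal matching
```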
arXiv Detail & Related papers (2025-06-20T17:59:55Z) - St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World [106.91539872943864]
St4RTrack is a framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. We predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework.
arXiv Detail & Related papers (2025-04-17T17:55:58Z) - Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
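A toy version of attending along point tracks: sample frame features at each track's locations, then run temporal self-attention per track. The shapes and the single-head design are assumptions, not the paper's Tracktention Layer:

```python
import torch
import torch.nn.functional as F

def track_attention(feats, tracks):
    """feats: (T, C, H, W) frame features; tracks: (P, T, 2) normalized
    xy coordinates in [-1, 1] (grid_sample convention)."""
    grid = tracks.permute(1, 0, 2).unsqueeze(2)           # (T, P, 1, 2)
    tok = F.grid_sample(feats, grid, align_corners=True)  # (T, C, P, 1)
    tok = tok.squeeze(-1).permute(2, 0, 1)                # (P, T, C) per-track tokens
    # Single-head temporal self-attention along each track.
    attn = torch.softmax(
        tok @ tok.transpose(1, 2) / tok.shape[-1] ** 0.5, dim=-1)  # (P, T, T)
    return attn @ tok  # (P, T, C) motion-aligned features
```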
arXiv Detail & Related papers (2025-03-25T17:58:48Z) - Track-On: Transformer-based Online Point Tracking with Memory [34.744546679670734]
We introduce Track-On, a simple transformer-based model designed for online long-term point tracking. Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames. At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy.
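A toy causal loop in the same spirit: keep a bounded memory of the target's features, classify the best-matching patch in each incoming frame, and update the memory, never looking ahead. All specifics here are assumptions, not Track-On's architecture:

```python
import torch

class ToyOnlineTracker:
    """Causal point tracker with a bounded feature memory
    (an illustration of the online setting only)."""

    def __init__(self, query_feat, mem_size=12):
        self.memory = [query_feat]   # list of (C,) features seen so far
        self.mem_size = mem_size

    def step(self, frame_tokens):
        """frame_tokens: (N, C) patch features of the current frame.
        Returns the index of the best-matching patch."""
        mem = torch.stack(self.memory)              # (M, C)
        sim = frame_tokens @ mem.T                  # (N, M)
        best = int(sim.max(dim=1).values.argmax())  # patch "classification"
        self.memory.append(frame_tokens[best])      # causal memory update
        self.memory = self.memory[-self.mem_size:]  # bounded memory
        return best
```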
arXiv Detail & Related papers (2025-01-30T17:04:11Z) - TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video [36.27277088961125]
We present TAPTRv3, a point tracking framework built on TAPTRv2, which works well in regular videos but tends to fail in long ones. We use both spatial and temporal context to enable better feature querying along the spatial and temporal dimensions. TAPTRv3 surpasses TAPTRv2 by a large margin on most of the challenging datasets and obtains state-of-the-art performance.
arXiv Detail & Related papers (2024-11-27T17:37:22Z) - Fast Encoder-Based 3D from Casual Videos via Point Track Processing [22.563073026889324]
We present TracksTo4D, a learning-based approach that infers 3D structure and camera positions from dynamic content in casually captured videos.
TracksTo4D is trained in an unsupervised way on a dataset of casual videos.
Experiments show that TracksTo4D can reconstruct a temporal point cloud and camera positions of the underlying video with accuracy comparable to state-of-the-art methods.
arXiv Detail & Related papers (2024-04-10T15:37:00Z) - Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking. DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget. Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z) - TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
Video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
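The two stages map naturally onto code: a global correlation argmax per frame, then a soft-argmax over a local window around that estimate. A simplified sketch with assumed shapes (TAPIR's refinement also iteratively updates the trajectory and query features, omitted here):

```python
import torch
import torch.nn.functional as F

def match_then_refine(query_feat, frame_feats, radius=3):
    """query_feat: (C,); frame_feats: (T, C, H, W).
    Stage 1: per-frame global argmax. Stage 2: local soft-argmax."""
    T, C, H, W = frame_feats.shape
    corr = torch.einsum("c,tchw->thw", query_feat, frame_feats)  # (T, H, W)
    flat = corr.view(T, -1).argmax(dim=1)
    ys = torch.div(flat, W, rounding_mode="floor")               # coarse rows
    xs = flat % W                                                # coarse cols
    out = []
    for t in range(T):
        y0, x0 = int(ys[t]), int(xs[t])
        y1, y2 = max(0, y0 - radius), min(H, y0 + radius + 1)
        x1, x2 = max(0, x0 - radius), min(W, x0 + radius + 1)
        p = F.softmax(corr[t, y1:y2, x1:x2].flatten(), dim=0)
        p = p.view(y2 - y1, x2 - x1)
        gy = (p.sum(1) * torch.arange(y1, y2, dtype=p.dtype)).sum()
        gx = (p.sum(0) * torch.arange(x1, x2, dtype=p.dtype)).sum()
        out.append((float(gx), float(gy)))                       # sub-pixel estimate
    return out
```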
arXiv Detail & Related papers (2023-06-14T17:07:51Z) - TAP-Vid: A Benchmark for Tracking Any Point in a Video [84.94877216665793]
We formalize the problem of tracking arbitrary physical points on surfaces over longer video clips, naming it tracking any point (TAP).
We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks.
We propose a simple end-to-end point tracking model, TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
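TAP-Vid's headline metric, average Jaccard (AJ), also cited in the TAPIR entry above, counts a prediction as a true positive only if the point is predicted visible, actually visible, and within a pixel threshold. A simplified version of the computation (the official evaluation resizes videos to 256x256 first):

```python
import numpy as np

def jaccard_at(pred_xy, pred_vis, gt_xy, gt_vis, thresh):
    """pred_xy, gt_xy: (P, T, 2); pred_vis, gt_vis: (P, T) booleans."""
    within = np.linalg.norm(pred_xy - gt_xy, axis=-1) < thresh
    tp = (pred_vis & gt_vis & within).sum()
    fp = (pred_vis & ~(gt_vis & within)).sum()  # visible claim, wrong or occluded
    fn = (gt_vis & ~(pred_vis & within)).sum()  # visible point missed
    return tp / max(tp + fp + fn, 1)

def average_jaccard(pred_xy, pred_vis, gt_xy, gt_vis):
    """AJ: mean Jaccard over the thresholds {1, 2, 4, 8, 16} pixels."""
    return float(np.mean([jaccard_at(pred_xy, pred_vis, gt_xy, gt_vis, d)
                          for d in (1, 2, 4, 8, 16)]))
```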
arXiv Detail & Related papers (2022-11-07T17:57:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.