KV-Tracker: Real-Time Pose Tracking with Transformers
- URL: http://arxiv.org/abs/2512.22581v1
- Date: Sat, 27 Dec 2025 13:02:30 GMT
- Title: KV-Tracker: Real-Time Pose Tracking with Transformers
- Authors: Marwan Taher, Ignacio Alzugaray, Kirill Mazur, Xin Kong, Andrew J. Davison
- Abstract summary: Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos.
- Score: 30.32327636560028
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via $\pi^3$ with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to a $15\times$ speedup during inference without drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame rates of up to ${\sim}27$ FPS.
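The core mechanism described in the abstract, caching the global self-attention block's KV pairs from keyframes so that each new frame only needs a cheap attention lookup, can be illustrated with a short sketch. The PyTorch snippet below is a minimal illustration under assumed shapes and names (`CachedGlobalAttention`, `build_cache`, and all hyperparameters are hypothetical), not the paper's actual implementation:

```python
# Minimal sketch of the KV-caching pattern described above (PyTorch).
# All names, shapes, and hyperparameters are illustrative assumptions,
# not the actual KV-Tracker implementation.
import torch
import torch.nn.functional as F


class CachedGlobalAttention(torch.nn.Module):
    """Global self-attention whose keyframe KV pairs can be frozen and reused."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)
        self.kv_cache = None  # (keys, values) computed once from keyframes

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        return x.view(b, n, self.num_heads, d // self.num_heads).transpose(1, 2)

    @torch.no_grad()
    def build_cache(self, keyframe_tokens: torch.Tensor) -> None:
        # Expensive step, run once: project all keyframe tokens and keep
        # only their key/value pairs as the scene representation.
        _, k, v = self.qkv(keyframe_tokens).chunk(3, dim=-1)
        self.kv_cache = (self._split_heads(k), self._split_heads(v))

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # Cheap online step: the new frame's queries attend to the cached
        # keyframe KV pairs concatenated with the frame's own KV pairs.
        q, k, v = self.qkv(frame_tokens).chunk(3, dim=-1)
        q, k, v = map(self._split_heads, (q, k, v))
        if self.kv_cache is not None:
            ck, cv = self.kv_cache
            k = torch.cat([ck, k], dim=2)
            v = torch.cat([cv, v], dim=2)
        out = F.scaled_dot_product_attention(q, k, v)
        b, h, n, dh = out.shape
        return self.proj(out.transpose(1, 2).reshape(b, n, h * dh))


# Illustrative usage with dummy token sequences:
attn = CachedGlobalAttention(dim=768)
attn.build_cache(torch.randn(1, 4096, 768))  # tokens from all keyframes
features = attn(torch.randn(1, 1024, 768))   # tokens from the current frame
```

In this pattern the full bidirectional pass over the keyframes runs once in `build_cache`; every subsequent frame performs a single attention lookup against the frozen cache instead of re-encoding the whole scene, which is where a speedup of the kind reported would come from.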
Related papers
- Repurposing Video Diffusion Transformers for Robust Point Tracking [35.486648006768256]
Existing methods rely on shallow convolutional backbones such as ResNet that process frames independently.
We find that video Diffusion Transformers (DiTs) inherently exhibit strong point tracking capability and robustly handle dynamic motions.
Our work validates video DiT features as an effective and efficient foundation for point tracking.
arXiv Detail & Related papers (2025-12-23T18:54:10Z)
- Multi-View 3D Point Tracking [67.21282192436031]
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views.
Our model directly predicts 3D correspondences using a practical number of cameras.
We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks.
arXiv Detail & Related papers (2025-08-28T17:58:20Z)
- SpatialTrackerV2: 3D Point Tracking Made Easy [73.0350898700048]
SpatialTrackerV2 is a feed-forward 3D point tracking method for monocular videos.
It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion.
By learning geometry and motion jointly from heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%.
arXiv Detail & Related papers (2025-07-16T17:59:03Z)
- St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World [106.91539872943864]
St4RTrack is a framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs.
We predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry.
We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework.
arXiv Detail & Related papers (2025-04-17T17:55:58Z)
- VGGT: Visual Geometry Grounded Transformer [61.37669770946458]
VGGT is a feed-forward neural network that directly infers all key 3D attributes of a scene.
The network achieves state-of-the-art results in multiple 3D tasks.
arXiv Detail & Related papers (2025-03-14T17:59:47Z)
- SIRE: SE(3) Intrinsic Rigidity Embeddings [16.630400019100943]
We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes.
Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss.
Our findings suggest that SIRE can learn strong geometry and motion rigidity priors from video data, with minimal supervision.
arXiv Detail & Related papers (2025-03-10T18:00:30Z)
- DynOMo: Online Point Tracking by Dynamic Online Monocular Gaussian Reconstruction [65.46359561104867]
We target the challenge of online 2D and 3D point tracking from unposed monocular camera input.
We leverage 3D Gaussian splatting to reconstruct dynamic scenes in an online fashion.
We aim to inspire the community to advance online point tracking and reconstruction, expanding the applicability to diverse real-world scenarios.
arXiv Detail & Related papers (2024-09-03T17:58:03Z)
- Memory-based Adapters for Online 3D Scene Perception [71.71645534899905]
Conventional 3D scene perception methods are offline, i.e., take an already reconstructed 3D scene geometry as input.
We propose an adapter-based plug-and-play module for the backbone of 3D scene perception models.
Our adapters can be easily inserted into mainstream offline architectures of different tasks and significantly boost their performance on online tasks.
arXiv Detail & Related papers (2024-03-11T17:57:41Z)
- Real-time 3D Deep Multi-Camera Tracking [13.494550690138775]
We propose a novel end-to-end tracking pipeline, Deep Multi-Camera Tracking (DMCT), which achieves reliable real-time multi-camera people tracking.
Our system achieves state-of-the-art tracking results while maintaining real-time performance.
arXiv Detail & Related papers (2020-03-26T06:08:19Z)