ViPE: Video Pose Engine for 3D Geometric Perception
- URL: http://arxiv.org/abs/2508.10934v1
- Date: Tue, 12 Aug 2025 18:39:13 GMT
- Title: ViPE: Video Pose Engine for 3D Geometric Perception
- Authors: Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, Sanja Fidler
- Abstract summary: ViPE is a handy and versatile video processing engine. It efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. We use ViPE to annotate a large-scale collection of videos.
- Score: 89.29576047606703
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, and dashcams, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We evaluate ViPE on multiple benchmarks; notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5 FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames -- all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.
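As a rough illustration of what per-frame annotations like ViPE's enable, the sketch below backprojects one frame's dense depth map into a world-space point cloud using the estimated intrinsics and camera pose. The function, array shapes, and variable names are assumptions for illustration, not ViPE's actual interface.

```python
# Sketch: backprojecting ViPE-style per-frame outputs (intrinsics K,
# camera-to-world pose T_wc, dense depth) into a world-space point cloud.
# Shapes and names are assumptions, not ViPE's actual API.
import numpy as np

def backproject_frame(depth: np.ndarray, K: np.ndarray, T_wc: np.ndarray) -> np.ndarray:
    """Lift an HxW depth map to an (H*W, 3) world-space point cloud.

    depth: (H, W) metric (or near-metric) depth along the camera z-axis.
    K:     (3, 3) pinhole intrinsics.
    T_wc:  (4, 4) camera-to-world rigid transform.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))        # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3) homogeneous pixels
    rays = pix.reshape(-1, 3) @ np.linalg.inv(K).T        # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)                 # scale rays by depth
    pts_hom = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    return (pts_hom @ T_wc.T)[:, :3]                      # transform to world coordinates

# Toy usage: a flat surface 2 m in front of an identity camera.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
cloud = backproject_frame(np.full((480, 640), 2.0), K, np.eye(4))
print(cloud.shape)  # (307200, 3)
```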
Related papers
- SpatialVID: A Large-Scale Video Dataset with Spatial Annotations [58.01259302233675]
SpatialVID is a dataset of in-the-wild videos with diverse scenes, camera movements, and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. We collect more than 21,000 hours of raw video and process it into 2.7 million clips through a hierarchical filtering pipeline. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions.
arXiv Detail & Related papers (2025-09-11T17:59:31Z)
- Multi-View 3D Point Tracking [67.21282192436031]
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Our model directly predicts 3D correspondences using a practical number of cameras. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks.
arXiv Detail & Related papers (2025-08-28T17:58:20Z)
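The tracker above predicts 3D correspondences directly from multiple views. For contrast, below is a minimal sketch of the classical geometric baseline, DLT triangulation of one tracked point from calibrated cameras; it assumes known projection matrices and is not the paper's learned model.

```python
# Sketch: DLT triangulation as a classical baseline for lifting one point's
# 2D observations in multiple calibrated views to a single 3D point.
import numpy as np

def triangulate(points_2d: np.ndarray, proj_mats: np.ndarray) -> np.ndarray:
    """points_2d: (C, 2) pixel observations of one point in C cameras.
    proj_mats: (C, 3, 4) projection matrices P = K [R | t].
    Returns the (3,) least-squares 3D point.
    """
    rows = []
    for (x, y), P in zip(points_2d, proj_mats):
        rows.append(x * P[2] - P[0])   # x * (p3 . X) = p1 . X
        rows.append(y * P[2] - P[1])   # y * (p3 . X) = p2 . X
    A = np.stack(rows)                 # (2C, 4) homogeneous linear system
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                         # null-space solution, homogeneous
    return X[:3] / X[3]

# Toy check: two cameras observing the point (0, 0, 5), one shifted 1 m along x.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])
print(triangulate(np.array([[320.0, 240.0], [220.0, 240.0]]), np.stack([P0, P1])))  # ~[0, 0, 5]
```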
- Dynamic Camera Poses and Where to Find Them [36.249380390918816]
We introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. For pose estimation, we combine recent techniques for point tracking, dynamic masking, and structure-from-motion. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
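The DynPose-100K entry above describes a pipeline combining point tracking, dynamic masking, and structure-from-motion. A minimal sketch of how such a masking step might gate tracks before SfM is shown below; `run_tracker`, `run_dynamic_segmenter`, and `run_sfm` are hypothetical placeholders, and only the filtering logic is concrete.

```python
# Sketch of the dynamic-masking step in a point-tracking + masking + SfM
# pipeline like the one described for DynPose-100K. Only the filtering logic
# is shown; the tracker, segmenter, and SfM backend are hypothetical stand-ins.
import numpy as np

def filter_static_tracks(tracks: np.ndarray, dynamic_masks: np.ndarray) -> np.ndarray:
    """Keep only tracks that never land on a dynamic pixel.

    tracks:        (N, F, 2) integer (x, y) pixel positions of N tracks over F frames.
    dynamic_masks: (F, H, W) booleans, True where a pixel belongs to a moving object.
    """
    F = tracks.shape[1]
    frame_idx = np.arange(F)
    # Look up each track position in its frame's mask: (N, F) booleans.
    hits = dynamic_masks[frame_idx[None, :], tracks[..., 1], tracks[..., 0]]
    return tracks[~hits.any(axis=1)]   # static tracks, safe to hand to SfM

# Hypothetical usage:
# tracks = run_tracker(video)                            # hypothetical
# masks = run_dynamic_segmenter(video)                   # hypothetical
# poses = run_sfm(filter_static_tracks(tracks, masks))   # hypothetical
```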
- Towards Understanding Camera Motions in Any Video [80.223048294482]
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of 3,000 diverse internet videos annotated by experts through a rigorous quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers.
arXiv Detail & Related papers (2025-04-21T18:34:57Z)
- From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos [71.22810401256234]
Three-dimensional (3D) understanding of objects and scenes plays a key role in humans' ability to interact with the world. Large-scale synthetic and object-centric 3D datasets have been shown to be effective in training models with 3D understanding of objects. We introduce 360-1M, a 360° video dataset, and a process for efficiently finding corresponding frames from diverse viewpoints at scale.
arXiv Detail & Related papers (2024-12-10T18:59:44Z)
- Align3R: Aligned Monocular Depth Estimation for Dynamic Videos [50.28715151619659]
We propose a novel video depth estimation method called Align3R that estimates temporally consistent depth maps for dynamic videos. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps. Experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video, outperforming baseline methods.
arXiv Detail & Related papers (2024-12-04T07:09:59Z)
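Align3R aligns monocular depth maps across timesteps using DUSt3R. Below is a minimal sketch of the underlying alignment primitive, a least-squares scale-and-shift fit of one depth map to reference depths at matched pixels; the correspondences are assumed given, and this is not the paper's full optimization.

```python
# Sketch: least-squares scale-and-shift alignment of a monocular depth map to
# reference depths at corresponding pixels. Align3R obtains cross-frame
# correspondences from DUSt3R; here they are simply assumed given, and only
# the alignment solve is shown.
import numpy as np

def align_scale_shift(mono_depth: np.ndarray, ref_depth: np.ndarray) -> tuple[float, float]:
    """Solve min_{s,t} || s * mono_depth + t - ref_depth ||^2 over matched pixels."""
    A = np.stack([mono_depth, np.ones_like(mono_depth)], axis=1)  # (M, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, ref_depth, rcond=None)
    return float(s), float(t)

# Toy check: recover s = 2.0, t = 0.5 from noiseless matches.
d = np.linspace(1.0, 5.0, 100)
print(align_scale_shift(d, 2.0 * d + 0.5))  # ~(2.0, 0.5)
```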
- Video Depth without Video Models [34.11454612504574]
Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. We show how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets to depth snippets.
arXiv Detail & Related papers (2024-11-28T14:50:14Z)
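RollingDepth's first ingredient maps very short video snippets to depth snippets. Below is a minimal sketch of sliding-window inference in that spirit, with `estimate_snippet_depth` as a hypothetical stand-in for the LDM-derived estimator; the actual method registers snippets more carefully than the plain averaging used here.

```python
# Sketch: sliding-window snippet inference in the spirit of RollingDepth's
# first ingredient. `estimate_snippet_depth` is a hypothetical callable;
# overlapping predictions are blended by simple averaging.
import numpy as np

def rolling_depth(video: np.ndarray, estimate_snippet_depth, win: int = 3, stride: int = 1) -> np.ndarray:
    """video: (F, H, W, 3). Returns per-frame depth (F, H, W) by averaging window overlaps."""
    F, H, W, _ = video.shape
    acc = np.zeros((F, H, W))
    cnt = np.zeros(F)
    for start in range(0, F - win + 1, stride):
        depth = estimate_snippet_depth(video[start:start + win])  # (win, H, W) snippet depth
        acc[start:start + win] += depth
        cnt[start:start + win] += 1
    return acc / cnt[:, None, None]

# Toy run with a dummy estimator that returns constant depth.
video = np.zeros((8, 4, 4, 3))
print(rolling_depth(video, lambda clip: np.ones(clip.shape[:3])).shape)  # (8, 4, 4)
```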
- Generating 3D-Consistent Videos from Unposed Internet Photos [68.944029293283]
We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.
Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
arXiv Detail & Related papers (2024-11-20T18:58:31Z)
- Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis [43.02778060969546]
We propose a controllable monocular dynamic view synthesis pipeline.
Our model does not require depth as input, and does not explicitly model 3D scene geometry.
We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
arXiv Detail & Related papers (2024-05-23T17:59:52Z)