SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
- URL: http://arxiv.org/abs/2509.09676v1
- Date: Thu, 11 Sep 2025 17:59:31 GMT
- Title: SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
- Authors: Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiaoxiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, Yao Yao,
- Abstract summary: SpatialVID is a dataset of in-the-wild videos with diverse scenes, camera movements, and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. We collect more than 21,000 hours of raw video and process them into 2.7 million clips through a hierarchical filtering pipeline. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions.
- Score: 58.01259302233675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect \textbf{SpatialVID}, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements, and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
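The abstract describes a per-clip annotation record combining camera poses, depth maps, dynamic masks, a structured caption, and serialized motion instructions. A minimal sketch of what such a record might look like is shown below; the field names, file layout, and `validate` helper are illustrative assumptions, not SpatialVID's actual data format.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical per-clip annotation schema, illustrating the annotation types
# listed in the abstract. Not SpatialVID's real format.
@dataclass
class ClipAnnotation:
    clip_id: str
    num_frames: int
    poses: np.ndarray          # per-frame 4x4 camera pose matrices, shape (num_frames, 4, 4)
    depth_paths: list          # one depth-map file per frame
    dynamic_mask_paths: list   # one binary dynamic-object mask per frame
    caption: str               # structured scene caption
    motion_instructions: list  # serialized camera-motion commands

    def validate(self) -> bool:
        """Check that per-frame annotations agree on the frame count."""
        return (self.poses.shape == (self.num_frames, 4, 4)
                and len(self.depth_paths) == self.num_frames
                and len(self.dynamic_mask_paths) == self.num_frames)

clip = ClipAnnotation(
    clip_id="clip_000001",
    num_frames=2,
    poses=np.stack([np.eye(4), np.eye(4)]),
    depth_paths=["d0.png", "d1.png"],
    dynamic_mask_paths=["m0.png", "m1.png"],
    caption="A person walks through a street market.",
    motion_instructions=["move_forward", "turn_left"],
)
print(clip.validate())  # True
```

A consistency check like `validate` is useful at this scale: with 2.7 million clips, per-frame annotation counts drifting out of sync is a realistic failure mode for any downstream loader.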
Related papers
- Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting [64.64738535860351]
We present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence.
arXiv Detail & Related papers (2025-07-24T14:53:26Z) - Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation [66.95956271144982]
We present Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image. Unlike existing approaches, Voyager achieves end-to-end scene generation and reconstruction with inherent consistency across frames.
arXiv Detail & Related papers (2025-06-04T17:59:04Z) - Dynamic Camera Poses and Where to Find Them [36.249380390918816]
We introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - 360 in the Wild: Dataset for Depth Prediction and View Synthesis [66.58513725342125]
We introduce a large-scale 360° video dataset captured in the wild.
This dataset has been carefully scraped from the Internet and has been captured from various locations worldwide.
Each of the 25K images constituting our dataset is provided with its respective camera's pose and depth map.
arXiv Detail & Related papers (2024-06-27T05:26:38Z) - DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields [3.94718692655789]
DiVa-360 is a real-world 360 dynamic visual dataset that contains synchronized high-resolution and long-duration multi-view video sequences.
We benchmark the state-of-the-art dynamic neural field methods on DiVa-360 and provide insights about existing methods and future challenges on long-duration neural field capture.
arXiv Detail & Related papers (2023-07-31T17:59:48Z) - PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking [90.29143475328506]
We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework.
Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion.
We animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos.
arXiv Detail & Related papers (2023-07-27T17:58:11Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - Multi-View Video-Based 3D Hand Pose Estimation [11.65577683784217]
We present the Multi-View Video-Based 3D Hand dataset, consisting of multi-view videos of the hand along with ground-truth 3D pose labels.
Our dataset includes more than 402,000 synthetic hand images available in 4,560 videos.
Next, we implement MuViHandNet, a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand.
arXiv Detail & Related papers (2021-09-24T05:20:41Z)