VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
- URL: http://arxiv.org/abs/2601.23286v1
- Date: Fri, 30 Jan 2026 18:59:57 GMT
- Title: VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
- Authors: Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, Yue Wang
- Abstract summary: VideoGPA is a data-efficient self-supervised framework that automatically derives dense preference signals. It steers the generative distribution toward inherent 3D consistency without requiring human annotations, and significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs.
- Score: 34.46015478321541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.
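The alignment recipe in the abstract, ranking pairs of generated videos by an automatic geometric-consistency score and optimizing the winner over the loser with DPO, can be sketched in minimal form. The helper names (`geometry_score`, `build_pair`) and the scalar log-likelihood interface are illustrative assumptions, not the paper's implementation; in practice, DPO for diffusion models approximates log-likelihoods via denoising losses.

```python
import math

def logsigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(logp_w, logp_l, logp_w_ref, logp_l_ref, beta=0.1):
    """Scalar DPO loss for one preference pair:
    -log sigmoid(beta * [(logp_w - logp_w_ref) - (logp_l - logp_l_ref)]),
    where w is the preferred sample, l the rejected one, and *_ref are
    log-likelihoods under a frozen reference model."""
    margin = (logp_w - logp_w_ref) - (logp_l - logp_l_ref)
    return -logsigmoid(beta * margin)

def build_pair(videos, geometry_score):
    """Hypothetical pair construction: rank two sampled videos by a
    geometric-consistency score (e.g., derived from a geometry foundation
    model; higher = more 3D-consistent) and return (winner, loser)."""
    a, b = videos
    return (a, b) if geometry_score(a) >= geometry_score(b) else (b, a)
```

With identical policy and reference log-likelihoods the loss is log 2, and it decreases monotonically as the policy separates the preferred sample from the rejected one, which is what pushes the generative distribution toward geometry-preferred samples.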
Related papers
- Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video [76.32954467706581]
We propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. We use a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision. Experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks.
arXiv Detail & Related papers (2026-02-08T09:53:21Z)
- Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment [15.822150318879052]
We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment. We train a lightweight feature adapter using a reprojection-based consistency loss, enabling state-of-the-art performance in both NVS and camera pose estimation.
arXiv Detail & Related papers (2025-12-09T18:59:52Z)
- GeoVideo: Introducing Geometric Regularization into Video Generation Model [46.38507581500745]
We introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved structural coherence, temporal shape consistency, and physical plausibility.
arXiv Detail & Related papers (2025-12-03T05:11:57Z)
- Epipolar Geometry Improves Video Generation Models [73.44978239787501]
3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. We explore how epipolar geometry constraints improve modern video diffusion models. By bridging data-driven deep learning with classical geometric computer vision, we present a practical method for generating spatially consistent videos.
arXiv Detail & Related papers (2025-10-24T16:21:37Z)
- ShapeGen4D: Towards High Quality 4D Shape Generation from Videos [85.45517487721257]
We introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization.
arXiv Detail & Related papers (2025-10-07T17:58:11Z)
- RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation [75.61028930882144]
We identify and quantify this critical issue, demonstrating a significant performance gap in 3D object detection when using synthetic versus real data. We introduce Reinforcement Learning with Geometric Feedback (RLGF), which uniquely refines video diffusion models by incorporating rewards from specialized latent-space AD perception models. RLGF substantially reduces geometric errors (e.g., VP error by 21%, depth error by 57%) and dramatically improves 3D object detection mAP by 12.7%, narrowing the gap to real-data performance.
arXiv Detail & Related papers (2025-09-20T02:23:36Z)
- UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation [63.90470530428842]
In this work, we demonstrate that, through appropriate design and fine-tuning, the intrinsic consistency of video generation models can be effectively harnessed for consistent geometric estimation. Our results achieve superior performance in predicting global geometric attributes in videos and can be directly applied to reconstruction tasks.
arXiv Detail & Related papers (2025-05-30T12:31:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.