Epipolar Geometry Improves Video Generation Models
- URL: http://arxiv.org/abs/2510.21615v1
- Date: Fri, 24 Oct 2025 16:21:37 GMT
- Title: Epipolar Geometry Improves Video Generation Models
- Authors: Orest Kupyn, Fabian Manhardt, Federico Tombari, Christian Rupprecht
- Abstract summary: 3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. We explore how epipolar geometry constraints improve modern video diffusion models. By bridging data-driven deep learning with classical geometric computer vision, we present a practical method for generating spatially consistent videos.
- Score: 73.44978239787501
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video generation models have progressed tremendously through large latent diffusion transformers trained with rectified flow techniques. Yet these models still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes. 3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. We explore how epipolar geometry constraints improve modern video diffusion models. Despite massive training data, these models fail to capture fundamental geometric principles underlying visual content. We align diffusion models using pairwise epipolar geometry constraints via preference-based optimization, directly addressing unstable camera trajectories and geometric artifacts through mathematically principled geometric enforcement. Our approach efficiently enforces geometric principles without requiring end-to-end differentiability. Evaluation demonstrates that classical geometric constraints provide more stable optimization signals than modern learned metrics, which produce noisy targets that compromise alignment quality. Training on static scenes with dynamic cameras ensures high-quality measurements while the model generalizes effectively to diverse dynamic content. By bridging data-driven deep learning with classical geometric computer vision, we present a practical method for generating spatially consistent videos without compromising visual quality.
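As a rough illustration of what a pairwise epipolar constraint measures, the sketch below computes the Sampson (first-order) epipolar error for point correspondences between two frames. It is not the authors' implementation: the abstract does not say which error measure or matching pipeline is used, and the helper names (`skew`, `sampson_error`, `project`), the synthetic scene, and the identity camera intrinsics are all assumptions made for this example.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def sampson_error(F, x1, x2):
    """Sampson epipolar error for each correspondence.

    F maps points in frame 1 to epipolar lines in frame 2.
    x1, x2 are (N, 2) arrays of matched image coordinates.
    """
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])          # homogeneous points, frame 1
    p2 = np.hstack([x2, ones])          # homogeneous points, frame 2
    Fp1 = p1 @ F.T                      # row i is F @ p1_i
    Ftp2 = p2 @ F                       # row i is F.T @ p2_i
    num = np.sum(p2 * Fp1, axis=1) ** 2
    den = Fp1[:, 0] ** 2 + Fp1[:, 1] ** 2 + Ftp2[:, 0] ** 2 + Ftp2[:, 1] ** 2
    return num / den

def project(X):
    """Pinhole projection with identity intrinsics (an assumption of this sketch)."""
    return X[:, :2] / X[:, 2:3]

# Hypothetical static scene viewed by a moving camera.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, 50),
                     rng.uniform(-1, 1, 50),
                     rng.uniform(4, 8, 50)])
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.0, 0.1])

# With identity intrinsics the fundamental matrix equals the essential matrix [t]_x R.
F = skew(t) @ R
x1 = project(X)
x2 = project(X @ R.T + t)

# Geometrically consistent matches versus matches perturbed by noise.
clean_err = sampson_error(F, x1, x2).mean()
noisy_err = sampson_error(F, x1, x2 + rng.normal(0.0, 0.01, x2.shape)).mean()
print(clean_err, noisy_err)
```

Under these assumptions, the consistent pair scores near zero while the perturbed pair scores higher. A scalar of this kind could in principle rank two generated videos for preference-based training without being differentiable, which is the property the abstract highlights.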
Related papers
- Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video [76.32954467706581]
We propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. We use a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision. Experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks.
arXiv Detail & Related papers (2026-02-08T09:53:21Z)
- VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation [34.46015478321541]
VideoGPA is a data-efficient self-supervised framework to automatically derive dense preference signals. It steers the generative distribution toward inherent 3D consistency without requiring human annotations. It significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs.
arXiv Detail & Related papers (2026-01-30T18:59:57Z)
- GeoVideo: Introducing Geometric Regularization into Video Generation Model [46.38507581500745]
We introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved structural coherence-temporal shape, consistency, and physical plausibility.
arXiv Detail & Related papers (2025-12-03T05:11:57Z)
- GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation [68.02988074681427]
Previous works leveraging video models for image-to-3D scene generation tend to suffer from geometric distortions and blurry content. In this paper, we renovate the pipeline of image-to-3D scene generation by unlocking the potential of geometry models. Our GeoWorld can generate high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2025-11-28T13:55:45Z)
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling [29.723534231743038]
We propose Geometry Forcing to bridge the gap between video diffusion models and the underlying 3D nature of the physical world. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks.
arXiv Detail & Related papers (2025-07-10T17:55:08Z)
- GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering [54.489285024494855]
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain they operate in, suffer from several issues that degrade the user experience. We introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally-consistent 'local reconstruction and rendering' paradigm.
arXiv Detail & Related papers (2025-06-30T15:24:27Z)
- UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation [63.90470530428842]
In this work, we demonstrate that, through appropriate design and fine-tuning, the intrinsic consistency of video generation models can be effectively harnessed for consistent geometric estimation. Our results achieve superior performance in predicting global geometric attributes in videos and can be directly applied to reconstruction tasks.
arXiv Detail & Related papers (2025-05-30T12:31:59Z)
- DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion [53.70278210626701]
We propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches.
arXiv Detail & Related papers (2025-05-08T17:59:47Z)
- Attention to Detail: Fine-Scale Feature Preservation-Oriented Geometric Pre-training for AI-Driven Surrogate Modeling [6.34618828355523]
AI-driven surrogate modeling has become an increasingly effective alternative to physics-based simulations for 3D design, analysis, and manufacturing. This work introduces a self-supervised geometric representation learning method designed to capture fine-scale geometric features from non-parametric 3D models.
arXiv Detail & Related papers (2025-04-27T17:10:13Z)
- Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis [31.90503003079933]
We introduce Dynamic Tetrahedra (DynTet), a novel hybrid representation that encodes explicit dynamic meshes by neural networks.
Compared with prior works, DynTet demonstrates significant improvements in fidelity, lip synchronization, and real-time performance according to various metrics.
arXiv Detail & Related papers (2024-02-27T09:56:15Z)
- Wide-angle Image Rectification: A Survey [86.36118799330802]
Wide-angle images contain distortions that violate the assumptions underlying pinhole camera models.
Image rectification, which aims to correct these distortions, can solve these problems.
We present a detailed description and discussion of the camera models used in different approaches.
Next, we review both traditional geometry-based image rectification methods and deep learning-based methods.
arXiv Detail & Related papers (2020-10-30T17:28:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.