Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
- URL: http://arxiv.org/abs/2507.07982v1
- Date: Thu, 10 Jul 2025 17:55:08 GMT
- Title: Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
- Authors: Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian
- Abstract summary: We propose Geometry Forcing to bridge the gap between video diffusion models and the underlying 3D nature of the physical world. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks.
- Score: 29.723534231743038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
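The two objectives described in the abstract can be illustrated with a minimal PyTorch sketch. This is an assumption-laden reconstruction, not the paper's code: the feature dimensions, the linear projection heads (`proj`, `head`), and the loss weighting are all hypothetical stand-ins for whatever the actual method uses.

```python
import torch
import torch.nn.functional as F

def angular_alignment_loss(h, g):
    # Angular Alignment: enforce directional consistency between the
    # diffusion model's intermediate features h and the pretrained
    # geometric foundation model's features g via cosine similarity.
    return (1.0 - F.cosine_similarity(h, g, dim=-1)).mean()

def scale_alignment_loss(h, g, head):
    # Scale Alignment: regress the *unnormalized* geometric features g
    # from the *normalized* diffusion representation, so that
    # scale-related information is preserved in the alignment.
    h_norm = F.normalize(h, dim=-1)
    return F.mse_loss(head(h_norm), g)

# Toy usage with random tensors; dimensions are illustrative only.
torch.manual_seed(0)
h = torch.randn(4, 256)           # diffusion intermediate features
g = torch.randn(4, 128)           # pretrained geometric features
proj = torch.nn.Linear(256, 128)  # hypothetical head for the angular term
head = torch.nn.Linear(256, 128)  # hypothetical head for the scale term

loss = angular_alignment_loss(proj(h), g) + scale_alignment_loss(h, g, head)
```

The key design point the abstract emphasizes is the split of duties: the angular term only constrains feature *direction* (normalization discards magnitude), so the separate scale term regresses the raw, unnormalized geometric targets to keep magnitude information from being lost.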
Related papers
- Perspective from a Higher Dimension: Can 3D Geometric Priors Help Visual Floorplan Localization? [8.82283453148819]
Self-localization against building floorplans has attracted researchers' interest. Since floorplans are minimalist representations of a building's structure, modal and geometric differences between visual perceptions and floorplans pose challenges to this task. Existing methods cleverly utilize 2D geometric features and pose filters to achieve promising performance. This paper views the 2D floorplan localization (FLoc) problem from a higher dimension by injecting 3D geometric priors into the visual FLoc algorithm.
arXiv Detail & Related papers (2025-07-25T01:34:26Z) - Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image [10.36303976374455]
Existing approaches often rely on fine-tuning pretrained 2D diffusion models or directly generating 3D information through fast network inference. We present a novel method that seamlessly integrates geometry and perception priors without requiring additional model training. Experiments demonstrate the higher-fidelity reconstruction results of our method, outperforming existing methods on novel view synthesis and 3D reconstruction.
arXiv Detail & Related papers (2025-06-26T11:22:06Z) - UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation [63.90470530428842]
In this work, we demonstrate that, through appropriate design and fine-tuning, the intrinsic consistency of video generation models can be effectively harnessed for consistent geometric estimation. Our results achieve superior performance in predicting global geometric attributes in videos and can be directly applied to reconstruction tasks.
arXiv Detail & Related papers (2025-05-30T12:31:59Z) - MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance [21.0593460047148]
We propose a method for video face reenactment that integrates a 3D face parametric model into a latent diffusion framework. Our approach employs the FLAME (Faces Learned with an Articulated Model and Expressions) model as the 3D face parametric representation. We show that our method excels at generating high-quality face animations with precise expression and head pose variation modeling.
arXiv Detail & Related papers (2025-04-30T10:30:46Z) - Can Video Diffusion Model Reconstruct 4D Geometry? [66.5454886982702]
Sora3R is a novel framework that taps into the rich temporal priors of large dynamic video diffusion models to infer 4D pointmaps from casual videos. Experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction.
arXiv Detail & Related papers (2025-03-27T01:44:46Z) - Enhancing Single Image to 3D Generation using Gaussian Splatting and Hybrid Diffusion Priors [17.544733016978928]
3D object generation from a single image involves estimating the full 3D geometry and texture of unseen views from an unposed RGB image captured in the wild.
Recent advancements in 3D object generation have introduced techniques that reconstruct an object's 3D shape and texture.
We propose bridging the gap between 2D and 3D diffusion models to address this limitation.
arXiv Detail & Related papers (2024-10-12T10:14:11Z) - Deep Geometric Moments Promote Shape Consistency in Text-to-3D Generation [27.43973967994717]
MT3D is a text-to-3D generative model that leverages a high-fidelity 3D object to overcome viewpoint bias. By incorporating geometric details from a 3D asset, MT3D enables the creation of diverse and geometrically consistent objects.
arXiv Detail & Related papers (2024-08-12T06:25:44Z) - A3D: Does Diffusion Dream about 3D Alignment? [73.97853402817405]
We tackle the problem of text-driven 3D generation from a geometry alignment perspective. Given a set of text prompts, we aim to generate a collection of objects with semantically corresponding parts aligned across them. We propose to embed these objects into a common latent space and optimize the continuous transitions between these objects.
arXiv Detail & Related papers (2024-06-21T09:49:34Z) - GeoGS3D: Single-view 3D Reconstruction via Geometric-aware Diffusion Model and Gaussian Splatting [81.03553265684184]
We introduce GeoGS3D, a framework for reconstructing detailed 3D objects from single-view images.
We propose a novel metric, Gaussian Divergence Significance (GDS), to prune unnecessary operations during optimization.
Experiments demonstrate that GeoGS3D generates images with high consistency across views and reconstructs high-quality 3D objects.
arXiv Detail & Related papers (2024-03-15T12:24:36Z) - Wonder3D: Single Image to 3D using Cross-Domain Diffusion [105.16622018766236]
Wonder3D is a novel method for efficiently generating high-fidelity textured meshes from single-view images.
To holistically improve the quality, consistency, and efficiency of image-to-3D tasks, we propose a cross-domain diffusion model.
arXiv Detail & Related papers (2023-10-23T15:02:23Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - Joint Deep Multi-Graph Matching and 3D Geometry Learning from Inhomogeneous 2D Image Collections [57.60094385551773]
We propose a trainable framework for learning a deformable 3D geometry model from inhomogeneous image collections.
In addition, we obtain the underlying 3D geometry of the objects depicted in the 2D images.
arXiv Detail & Related papers (2021-03-31T17:25:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.