Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency
- URL: http://arxiv.org/abs/2512.13665v1
- Date: Mon, 15 Dec 2025 18:54:30 GMT
- Title: Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency
- Authors: Wenhan Chen, Sezer Karaoglu, Theo Gevers
- Abstract summary: Grab-3D is a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric head to explicitly inject 3D geometric awareness into temporal modeling.
- Score: 23.121660279216528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
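The premise in the abstract, using per-frame vanishing points as an explicit 3D geometry signal and checking their temporal consistency, can be sketched in a few lines. This is a minimal illustration, not the paper's released code: the function names are illustrative, lines are assumed to be given as normalized coefficients `(a, b, c)` with `ax + by + c = 0`, and the vanishing point is recovered per frame by least squares.

```python
import numpy as np

def estimate_vanishing_point(lines):
    """Least-squares intersection of 2D lines, each given as (a, b, c)
    with ax + by + c = 0 and (a, b) unit-normalized."""
    normals = lines[:, :2]                      # (N, 2)
    offsets = lines[:, 2]                       # (N,)
    # minimize sum_i (a_i x + b_i y + c_i)^2 over (x, y)
    vp, *_ = np.linalg.lstsq(normals, -offsets, rcond=None)
    return vp

def temporal_vp_drift(vps):
    """Mean frame-to-frame vanishing-point displacement. In a static
    scene, real footage should drift smoothly and slowly; erratic
    jumps are the kind of inconsistency the paper exploits."""
    steps = np.diff(vps, axis=0)
    return float(np.linalg.norm(steps, axis=1).mean())

# Toy demo: 5 frames of 20 lines that all pass through a vanishing
# point drifting smoothly by (0.1, 0.1) pixels per frame.
rng = np.random.default_rng(0)
vps = []
for t in range(5):
    angles = rng.uniform(0.0, np.pi, size=20)
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    true_vp = np.array([100.0, 50.0]) + 0.1 * t
    offsets = -(normals @ true_vp)
    vps.append(estimate_vanishing_point(np.column_stack([normals, offsets])))
drift = temporal_vp_drift(np.stack(vps))        # ≈ 0.1 * sqrt(2) per frame
```

A detector in the spirit of Grab-3D would feed such per-frame geometric features into a transformer rather than thresholding the raw drift, but the scalar already separates smooth real-camera motion from geometrically unstable generated frames.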
Related papers
- GeoVideo: Introducing Geometric Regularization into Video Generation Model [46.38507581500745]
We introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved structural coherence, temporal shape consistency, and physical plausibility.
arXiv Detail & Related papers (2025-12-03T05:11:57Z)
- GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation [68.02988074681427]
Previous works leveraging video models for image-to-3D scene generation tend to suffer from geometric distortions and blurry content. In this paper, we renovate the pipeline of image-to-3D scene generation by unlocking the potential of geometry models. Our GeoWorld can generate high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2025-11-28T13:55:45Z)
- EA3D: Online Open-World 3D Object Extraction from Streaming Videos [55.48835711373918]
We present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding.
arXiv Detail & Related papers (2025-10-29T03:56:41Z)
- FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction [13.098585993121722]
We present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction. Experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency.
arXiv Detail & Related papers (2025-09-25T22:24:23Z)
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling [29.723534231743038]
We propose Geometry Forcing to bridge the gap between video diffusion models and the underlying 3D nature of the physical world. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks.
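The alignment idea in this summary, pulling a video model's intermediate features toward those of a frozen geometry encoder, is commonly realized as a cosine-similarity objective. The sketch below is an assumption-laden illustration (NumPy, illustrative names, both feature sets assumed projected to a shared dimension); the paper's actual loss may differ.

```python
import numpy as np

def geometry_alignment_loss(video_feats, geo_feats):
    """One minus the mean cosine similarity between the video model's
    intermediate features and frozen geometry-encoder features."""
    v = video_feats / np.linalg.norm(video_feats, axis=-1, keepdims=True)
    g = geo_feats / np.linalg.norm(geo_feats, axis=-1, keepdims=True)
    return float(1.0 - np.mean(np.sum(v * g, axis=-1)))

rng = np.random.default_rng(1)
feats = rng.normal(size=(8, 64))                 # 8 tokens, 64-dim
loss_aligned = geometry_alignment_loss(feats, feats)
loss_mismatched = geometry_alignment_loss(feats, rng.normal(size=(8, 64)))
```

Minimizing such a term nudges the diffusion model's representations toward geometry-aware structure without backpropagating into the frozen geometric foundation model.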
arXiv Detail & Related papers (2025-07-10T17:55:08Z)
- Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation [134.53804996949287]
We introduce Eval3D, a fine-grained, interpretable evaluation tool that can faithfully evaluate the quality of generated 3D assets. Our key observation is that many desired properties of 3D generation, such as semantic and geometric consistency, can be effectively captured. Compared to prior work, Eval3D provides pixel-wise measurement, enables accurate 3D spatial feedback, and aligns more closely with human judgments.
arXiv Detail & Related papers (2025-04-25T17:22:05Z)
- Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach [54.559847511280545]
We present a novel video generation framework that integrates 3-dimensional geometry and dynamic awareness. To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space. The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model.
arXiv Detail & Related papers (2025-02-05T21:49:06Z)
- Deep Geometric Moments Promote Shape Consistency in Text-to-3D Generation [27.43973967994717]
MT3D is a text-to-3D generative model that leverages a high-fidelity 3D object to overcome viewpoint bias. By incorporating geometric details from a 3D asset, MT3D enables the creation of diverse and geometrically consistent objects.
arXiv Detail & Related papers (2024-08-12T06:25:44Z)
- GeoGS3D: Single-view 3D Reconstruction via Geometric-aware Diffusion Model and Gaussian Splatting [81.03553265684184]
We introduce GeoGS3D, a framework for reconstructing detailed 3D objects from single-view images.
We propose a novel metric, Gaussian Divergence Significance (GDS), to prune unnecessary operations during optimization.
Experiments demonstrate that GeoGS3D generates images with high consistency across views and reconstructs high-quality 3D objects.
arXiv Detail & Related papers (2024-03-15T12:24:36Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.