How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach
- URL: http://arxiv.org/abs/2406.19568v2
- Date: Sun, 05 Oct 2025 14:29:28 GMT
- Title: How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach
- Authors: Chirui Chang, Jiahui Liu, Zhengzhe Liu, Xiaoyang Lyu, Yi-Hua Huang, Xin Tao, Pengfei Wan, Di Zhang, Xiaojuan Qi
- Abstract summary: Learned 3D Evaluation (L3DE) is a method for assessing AI-generated videos' ability to simulate the real world in terms of 3D visual qualities and consistencies. Confidence scores quantify the gap between real and synthetic videos in terms of 3D visual coherence. L3DE extends to broader applications: benchmarking video generation models, serving as a deepfake detector, and enhancing video synthesis by inpainting flagged inconsistencies.
- Score: 46.85336335756483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in video diffusion models enable the generation of photorealistic videos with impressive 3D consistency and temporal coherence. However, the extent to which these AI-generated videos simulate the 3D visual world remains underexplored. In this paper, we introduce Learned 3D Evaluation (L3DE), an objective, quantifiable, and interpretable method for assessing AI-generated videos' ability to simulate the real world in terms of 3D visual qualities and consistencies, without requiring manually labeled defects or quality annotations. Instead of relying on 3D reconstruction, which is prone to failure with in-the-wild videos, L3DE employs a 3D convolutional network, trained on monocular 3D cues of motion, depth, and appearance, to distinguish real from synthetic videos. Confidence scores from L3DE quantify the gap between real and synthetic videos in terms of 3D visual coherence, while a gradient-based visualization pinpoints unrealistic regions, improving interpretability. We validate L3DE through extensive experiments, demonstrating strong alignment with 3D reconstruction quality and human judgments. Our evaluations on leading generative models (e.g., Kling, Sora, and MiniMax) reveal persistent simulation gaps and subtle inconsistencies. Beyond generative video assessment, L3DE extends to broader applications: benchmarking video generation models, serving as a deepfake detector, and enhancing video synthesis by inpainting flagged inconsistencies. Project page: https://justin-crchang.github.io/l3de-project-page/
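The abstract describes L3DE as a 3D convolutional classifier over monocular cues (motion, depth, appearance) whose sigmoid confidence measures the real-vs-synthetic gap, with a gradient-based map localizing unrealistic regions. A minimal PyTorch sketch of that idea is below; it is not the authors' implementation, and the architecture, channel layout (3 appearance + 1 depth + 2 flow channels), and function names are illustrative assumptions only.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the L3DE idea: a small 3D CNN scores a clip of
# stacked per-frame cues -- e.g. RGB appearance (3ch), depth (1ch), and
# optical flow (2ch) -- as real vs. synthetic.
class CueClassifier3D(nn.Module):
    def __init__(self, in_channels=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # pool over time and space
        )
        self.head = nn.Linear(16, 1)  # real-vs-synthetic logit

    def forward(self, x):  # x: (B, C, T, H, W)
        feats = self.features(x).flatten(1)
        return self.head(feats)

def confidence_and_saliency(model, clip):
    """Return P(synthetic) and an |input-gradient| map over the clip,
    mimicking the paper's gradient-based visualization of suspect regions."""
    clip = clip.clone().requires_grad_(True)
    prob = torch.sigmoid(model(clip))
    prob.sum().backward()
    saliency = clip.grad.abs().amax(dim=1)  # max over cue channels: (B, T, H, W)
    return prob.detach(), saliency

model = CueClassifier3D(in_channels=6)
clip = torch.randn(1, 6, 8, 32, 32)  # 8 frames of stacked cues
prob, sal = confidence_and_saliency(model, clip)
```

Here `prob` plays the role of the confidence score quantifying 3D visual coherence, and `sal` highlights spatiotemporal regions that drove the synthetic prediction.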
Related papers
- 3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism [2.6197884751430327]
We develop an automated evaluation framework for video realism which captures both semantics and coherent 3D structure. Our method, 3DSPA, is a 3D temporal point autoencoder that integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation for video evaluation. Experiments show that 3DSPA reliably identifies videos that violate physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism.
arXiv Detail & Related papers (2026-02-23T21:00:48Z) - FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction [13.098585993121722]
We present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction. Experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency.
arXiv Detail & Related papers (2025-09-25T22:24:23Z) - Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation [87.91642226587294]
Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data. We propose a self-distillation framework that distills the implicit 3D knowledge in video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation. Our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.
arXiv Detail & Related papers (2025-09-23T17:58:01Z) - ManipDreamer3D : Synthesizing Plausible Robotic Manipulation Video with Occupancy-aware 3D Trajectory [56.06314177428745]
We present ManipDreamer3D for generating plausible 3D-aware robotic manipulation videos from an input image and a text instruction. Our method generates robotic videos with autonomously planned 3D trajectories, significantly reducing human intervention requirements.
arXiv Detail & Related papers (2025-08-29T10:39:06Z) - GenWorld: Towards Detecting AI-generated Real-world Simulation Videos [79.98542193919957]
GenWorld is a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. We propose a model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection.
arXiv Detail & Related papers (2025-06-12T17:59:33Z) - Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis [45.64047250474718]
Despite advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data.
We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator.
Our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect.
arXiv Detail & Related papers (2025-04-30T19:06:09Z) - You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale [42.67300636733286]
We present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities.
arXiv Detail & Related papers (2024-12-09T17:44:56Z) - Generating 3D-Consistent Videos from Unposed Internet Photos [68.944029293283]
We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.
Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
arXiv Detail & Related papers (2024-11-20T18:58:31Z) - SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix [60.48666051245761]
We propose a pose-free and training-free approach for generating 3D stereoscopic videos.
Our method warps a generated monocular video into camera views on stereoscopic baseline using estimated video depth.
We develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting.
arXiv Detail & Related papers (2024-06-29T08:33:55Z) - Splatter a Video: Video Gaussian Representation for Versatile Processing [48.9887736125712]
Video representation is crucial for various downstream tasks, such as tracking, depth prediction, segmentation, view synthesis, and editing.
We introduce a novel explicit 3D representation, the video Gaussian representation, which embeds a video into 3D Gaussians.
It has been proven effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation.
arXiv Detail & Related papers (2024-06-19T22:20:03Z) - Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos [16.34393937800271]
The success of generative models in creating high-quality videos has raised concerns about digital integrity and privacy vulnerabilities.
Recent works to combat deepfake videos have developed detectors that are highly accurate at identifying GAN-generated samples.
We propose a novel framework for detecting videos synthesized from multiple state-of-the-art (SOTA) generative models.
arXiv Detail & Related papers (2024-06-13T21:52:49Z) - VideoPhy: Evaluating Physical Commonsense for Video Generation [93.28748850301949]
We present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities.
We then generate videos conditioned on captions from diverse state-of-the-art text-to-video generative models.
Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts.
arXiv Detail & Related papers (2024-06-05T17:53:55Z) - Distinguish Any Fake Videos: Unleashing the Power of Large-scale Data and Motion Features [21.583246378475856]
We introduce an extensive video dataset designed specifically for AI-Generated Video Detection (GenVidDet).
We also present the Dual-Branch 3D Transformer (DuB3D), an innovative and effective method for distinguishing between real and generated videos.
DuB3D can distinguish between real and generated video content with 96.77% accuracy, and shows strong generalization capability even to unseen types.
arXiv Detail & Related papers (2024-05-24T08:26:04Z) - Sora Generates Videos with Stunning Geometrical Consistency [75.46675626542837]
We introduce a new benchmark that assesses the quality of the generated videos based on their adherence to real-world physics principles.
We employ a method that transforms the generated videos into 3D models, leveraging the premise that the accuracy of 3D reconstruction is heavily contingent on the video quality.
arXiv Detail & Related papers (2024-02-27T10:49:05Z) - VGMShield: Mitigating Misuse of Video Generative Models [7.963591895964269]
We introduce VGMShield: a set of three straightforward but pioneering mitigations spanning the lifecycle of fake video generation.
We first try to understand whether there is uniqueness in generated videos and whether we can differentiate them from real videos.
Then, we investigate the tracing problem, which maps a fake video back to the model that generated it.
arXiv Detail & Related papers (2024-02-20T16:39:23Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - 3D-Aware Video Generation [149.5230191060692]
We explore 4D generative adversarial networks (GANs) that learn to generate 3D-aware videos.
By combining neural implicit representations with a time-aware discriminator, we develop a GAN framework that synthesizes 3D video supervised only with monocular videos.
arXiv Detail & Related papers (2022-06-29T17:56:03Z) - Detecting Deepfake Videos Using Euler Video Magnification [1.8506048493564673]
Deepfake videos are videos manipulated using advanced machine learning techniques.
In this paper, we examine a technique for possible identification of deepfake videos.
Our approach uses features extracted from the Euler technique to train three models to classify counterfeit and unaltered videos.
arXiv Detail & Related papers (2021-01-27T17:37:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.