D3: Training-Free AI-Generated Video Detection Using Second-Order Features
- URL: http://arxiv.org/abs/2508.00701v2
- Date: Tue, 05 Aug 2025 03:05:49 GMT
- Title: D3: Training-Free AI-Generated Video Detection Using Second-Order Features
- Authors: Chende Zheng, Ruiqi Suo, Chenhao Lin, Zhengyu Zhao, Le Yang, Shuai Liu, Minghui Yang, Cong Wang, Chao Shen
- Abstract summary: Detection by Difference of Differences (D3) is a novel training-free detection method for synthetic videos. We validate the superiority of our D3 on 4 open-source datasets.
- Score: 17.253600093886277
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evolution of video generation techniques, such as Sora, has made it increasingly easy to produce high-fidelity AI-generated videos, raising public concern over the dissemination of synthetic content. However, existing detection methodologies remain limited by their insufficient exploration of temporal artifacts in synthetic videos. To bridge this gap, we establish a theoretical framework through second-order dynamical analysis under Newtonian mechanics, subsequently deriving Second-order Central Difference features tailored for temporal artifact detection. Building on this theoretical foundation, we reveal a fundamental divergence in second-order feature distributions between real and AI-generated videos. Concretely, we propose Detection by Difference of Differences (D3), a novel training-free detection method that leverages the above second-order temporal discrepancies. We validate the superiority of our D3 on 4 open-source datasets (GenVideo, VideoPhy, EvalCrafter, VidProM), 40 subsets in total. For example, on GenVideo, D3 outperforms the previous best method by 10.39% (absolute) mean Average Precision. Additional experiments on time cost and post-processing operations demonstrate D3's exceptional computational efficiency and strong robustness. Our code is available at https://github.com/Zig-HS/D3.
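The abstract's core quantity, a second-order central difference over frame intensities, is simple to sketch. Below is a minimal NumPy illustration, not the authors' implementation (see the linked repository for that); the clip-level aggregation in `d3_score` is a hypothetical stand-in for whatever statistic D3 actually uses.

```python
import numpy as np

def second_order_feature(frames: np.ndarray) -> np.ndarray:
    """Per-frame second-order central differences of a video.

    frames: float array of shape (T, H, W) or (T, H, W, C).
    Returns one magnitude per interior frame, shape (T - 2,).
    """
    f = frames.astype(np.float64)
    # Discrete analogue of acceleration under Newtonian mechanics:
    # f(t+1) - 2*f(t) + f(t-1)  ~  d^2 f / dt^2 at unit time step.
    d2 = f[2:] - 2.0 * f[1:-1] + f[:-2]
    # Collapse spatial (and channel) axes into one value per frame.
    return np.abs(d2).mean(axis=tuple(range(1, d2.ndim)))

def d3_score(frames: np.ndarray) -> float:
    # Hypothetical clip-level statistic: mean second-order energy.
    # The paper's claim is that second-order feature distributions
    # diverge between real footage (smooth physical dynamics) and
    # generated footage.
    return float(second_order_feature(frames).mean())
```

Because the method is training-free, a detector in this spirit only needs such scores compared against a threshold calibrated on real videos; no classifier is fit.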
Related papers
- Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning [66.51617619673587]
We present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), the first large-scale AI-generated video dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy.
arXiv Detail & Related papers (2025-12-17T18:48:26Z) - Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence [70.2803680525165]
We introduce Open-o3 Video, a non-agent framework that integrates explicit evidence into video reasoning. The model highlights key objects and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2%.
arXiv Detail & Related papers (2025-10-23T14:05:56Z) - DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning [58.70446237944036]
DAVID-X is the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. We present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning. Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.
arXiv Detail & Related papers (2025-06-13T13:39:53Z) - BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation [77.55074597806035]
GenBuster-200K is a large-scale, high-quality AI-generated video dataset featuring 200K high-resolution video clips. BusterX is a novel AI-generated video detection and explanation framework that leverages a multimodal large language model (MLLM) and reinforcement learning.
arXiv Detail & Related papers (2025-05-19T02:06:43Z) - Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection [14.586314545834934]
We propose a fine-grained deepfake video detection approach called FakeSTormer. Specifically, we introduce a multi-task learning framework that incorporates two auxiliary branches for explicitly attending to artifact-prone spatial and temporal regions. We also propose a video-level synthesis strategy that generates pseudo-fake videos with subtle spatio-temporal artifacts.
arXiv Detail & Related papers (2025-01-02T10:21:34Z) - Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning [41.30923253467854]
Temporal features can be complex and diverse, spatiotemporal models often lean heavily on one type of artifact while ignoring the other, and videos are naturally resource-intensive to process.
arXiv Detail & Related papers (2024-08-30T07:49:57Z) - What Matters in Detecting AI-Generated Videos like Sora? [51.05034165599385]
The gap between synthetic and real-world videos remains under-explored.
In this study, we compare real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion.
Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training.
arXiv Detail & Related papers (2024-06-27T23:03:58Z) - Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting [94.84688557937123]
Video-3DGS is a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors. Our approach uses a two-stage 3D Gaussian optimization process tailored for editing dynamic monocular videos. It enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.
arXiv Detail & Related papers (2024-06-04T17:57:37Z) - Distinguish Any Fake Videos: Unleashing the Power of Large-scale Data and Motion Features [21.583246378475856]
We introduce an extensive video dataset designed specifically for AI-Generated Video Detection (GenVidDet).
We also present the Dual-Branch 3D Transformer (DuB3D), an innovative and effective method for distinguishing between real and generated videos.
DuB3D distinguishes between real and generated video content with 96.77% accuracy and shows strong generalization even to unseen types.
arXiv Detail & Related papers (2024-05-24T08:26:04Z) - OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation [67.56268991234371]
OV-Uni3DETR achieves state-of-the-art performance across various scenarios, surpassing existing methods by more than 6% on average.
Code and pre-trained models will be released later.
arXiv Detail & Related papers (2024-03-28T17:05:04Z) - AI-Generated Video Detection via Spatio-Temporal Anomaly Learning [2.1210527985139227]
Users can easily create videos of non-existent events to spread false information.
A large-scale generated video dataset (GVD) is constructed as a benchmark for model training and evaluation.
arXiv Detail & Related papers (2024-03-25T11:26:18Z) - Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors [117.61449210940955]
We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level.
We introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects.
We generate synthetic abnormal events to augment the training videos, and task the masked AE model to reconstruct the original frames. The token-weighting idea is sketched below.
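The motion-gradient token weighting described above lends itself to a short sketch. The version below is a minimal, hypothetical rendering (the patch size, normalization, and function name are assumptions, not the authors' code): patches whose pixels change most between consecutive frames receive the largest weights, steering the masked auto-encoder toward moving foreground objects rather than the static background.

```python
import numpy as np

def motion_gradient_token_weights(prev_frame, frame, patch=16):
    """Weight image patches (tokens) by frame-to-frame motion.

    prev_frame, frame: arrays of shape (H, W) or (H, W, C).
    Returns a flat array of per-token weights summing to 1.
    """
    diff = np.abs(frame.astype(np.float64) - prev_frame.astype(np.float64))
    h, w = diff.shape[:2]
    gh, gw = h // patch, w // patch
    # Crop to a whole number of patches, then average the absolute
    # temporal gradient inside each patch (and over channels).
    grid = diff[: gh * patch, : gw * patch]
    per_patch = grid.reshape(gh, patch, gw, patch, -1).mean(axis=(1, 3, 4))
    weights = per_patch.flatten()
    return weights / (weights.sum() + 1e-8)  # normalized token weights
```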
arXiv Detail & Related papers (2023-06-21T06:18:05Z) - A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while maintaining a high processing speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2x faster inference, with a model smaller than 5 MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)