Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?
- URL: http://arxiv.org/abs/2512.13281v3
- Date: Thu, 18 Dec 2025 03:51:23 GMT
- Title: Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?
- Authors: Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, Kevin Qinghong Lin,
- Abstract summary: Video Reality Test is an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds.
- Score: 48.99013330282699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: (i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. (ii) Peer-review evaluation. An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show that the best creator, Veo3.1-Fast, fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.
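The reviewer-side metric in the protocol above reduces to binary classification accuracy on a balanced real/fake set, compared against the 50% random baseline. A minimal sketch of that scoring step (all names and the toy data are illustrative, not the paper's implementation):

```python
# Hypothetical sketch of scoring a reviewer in a creator-reviewer
# protocol: the reviewer labels each clip real (True) or fake (False),
# and accuracy on a balanced set is compared against chance (0.5).
def reviewer_accuracy(predictions, labels):
    """Fraction of clips whose predicted label matches ground truth."""
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy balanced set: 8 clips, half real (True), half fake (False).
labels = [True, False, True, False, True, False, True, False]
preds  = [True, True,  True, False, False, False, True, True]
print(reviewer_accuracy(preds, labels))  # 0.625
```

An accuracy near 0.5 on such a set, as reported for the strongest VLM reviewer, means the reviewer is barely better than guessing.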
Related papers
- Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning [66.51617619673587]
We present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy.
arXiv Detail & Related papers (2025-12-17T18:48:26Z) - Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs [92.02382309654263]
We introduce DeeptraceReward, a benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated.
arXiv Detail & Related papers (2025-09-26T17:59:54Z) - VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding [70.00000053847738]
Real visual understanding is essential for AI systems that interact with the physical world. Current evaluations mostly use real-world videos similar to training data. We propose negative-control tests using videos that depict physically impossible or logically inconsistent events.
arXiv Detail & Related papers (2025-05-02T15:58:38Z) - How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach [46.85336335756483]
Learned 3D Evaluation (L3DE) is a method for assessing AI-generated videos' ability to simulate the real world in terms of 3D visual qualities and consistencies. Confidence scores quantify the gap between real and synthetic videos in terms of 3D visual coherence. L3DE extends to broader applications: benchmarking video generation models, serving as a deepfake detector, and enhancing video synthesis by inpainting flagged inconsistencies.
arXiv Detail & Related papers (2024-06-27T23:03:58Z) - VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation [38.84663997781797]
We release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores for 37.6K synthesized videos.
Experiments show Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points.
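The correlation figure above is a Spearman rank correlation (reported on a 0-100 scale), i.e., the Pearson correlation computed on the ranks of the metric's scores and the human scores. A self-contained sketch of that computation on toy data (the scores are made up; real usage would pass model and human ratings for the same videos, and ties would need average-rank handling):

```python
def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumes no tied values; ties would require average ranks.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank + 1)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy scores whose rank orderings agree perfectly.
model_scores = [3.1, 4.5, 2.0, 4.9, 3.8]
human_scores = [3.0, 4.2, 2.5, 5.0, 3.5]
print(round(spearman(model_scores, human_scores), 3))  # 1.0
```

Because only ranks matter, Spearman correlation rewards a metric for ordering videos the same way humans do, even if its absolute scores are miscalibrated.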
arXiv Detail & Related papers (2024-06-21T15:43:46Z) - Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model [62.38322742493649]
We build a video VQA benchmark covering editing categories, i.e., effect, funny, meme, and game.
Most of the open-source video LMMs perform poorly on the benchmark, indicating a huge domain gap between edited short videos on social media and regular raw videos.
To improve the generalization ability of LMMs, we collect a training set for the proposed benchmark based on both Panda-70M/WebVid raw videos and small-scale TikTok/CapCut edited videos.
arXiv Detail & Related papers (2024-06-15T03:28:52Z) - Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos [16.34393937800271]
Advances in generative models for creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities.
Recent works to combat Deepfake videos have developed detectors that are highly accurate at identifying GAN-generated samples.
We propose a novel framework for detecting videos synthesized from multiple state-of-the-art (SOTA) generative models.
arXiv Detail & Related papers (2024-06-13T21:52:49Z) - Self-Supervised Video Forensics by Audio-Visual Anomaly Detection [19.842795378751923]
Manipulated videos often contain subtle inconsistencies between their visual and audio signals.
We propose a video forensics method, based on anomaly detection, that can identify these inconsistencies.
We train an autoregressive model to generate sequences of audio-visual features, using feature sets that capture the temporal synchronization between video frames and sound.
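The idea above — flagging videos whose audio-visual features deviate from an autoregressive model's predictions — can be illustrated with a deliberately trivial predictor. The sketch below is not the paper's method (which uses learned synchronization features and a trained autoregressive model); it only shows the scoring pattern: predict each step from the past and use prediction error as the anomaly score.

```python
# Toy illustration of anomaly detection via autoregressive prediction
# error. The "model" here is the naive predictor "next feature equals
# previous feature"; real systems would use a learned model over
# audio-visual synchronization features. All data is illustrative.
def anomaly_score(features):
    """Mean absolute one-step prediction error under the naive AR model."""
    errors = [abs(b - a) for a, b in zip(features, features[1:])]
    return sum(errors) / len(errors)

# A smoothly varying (well-synchronized) sequence vs. an erratic one.
real_like = [0.1, 0.12, 0.11, 0.13, 0.12]
fake_like = [0.1, 0.9, 0.05, 0.8, 0.02]
print(anomaly_score(real_like) < anomaly_score(fake_like))  # True
```

Thresholding such a score separates sequences the model predicts well (consistent audio-visual timing) from those it does not (manipulated or desynchronized content).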
arXiv Detail & Related papers (2023-01-04T18:59:49Z) - Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.