EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection
- URL: http://arxiv.org/abs/2602.17260v1
- Date: Thu, 19 Feb 2026 11:04:20 GMT
- Title: EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection
- Authors: Hung Mai, Loi Dinh, Duc Hai Nguyen, Dat Do, Luong Doan, Khanh Nguyen Quoc, Huan Vu, Phong Ho, Naeem Ul Islam, Tuan Do
- Abstract summary: We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on video embeddings via factorized windowed attention. We show that EA-Swin achieves 0.97-0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8-0.9) by a margin of 5-20%.
- Score: 2.1627654218138237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Alongside the model, we construct the EA-Video dataset, a benchmark dataset comprising 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves 0.97-0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8-0.9) by a margin of 5-20%, while maintaining strong generalization to unseen distributions, establishing a scalable and robust solution for modern AI-generated video detection.
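To make the factorized windowed attention design concrete, here is a minimal sketch that attends within fixed spatial windows of each frame and then within temporal windows at each patch location, operating directly on a grid of ViT-style patch embeddings. The layer sizes, window sizes, and spatial-then-temporal ordering are illustrative assumptions, not the released EA-Swin architecture.

```python
# A minimal sketch of factorized windowed attention over a grid of
# pretrained patch embeddings; dimensions and ordering are assumptions.
import torch
import torch.nn as nn

class FactorizedWindowedAttention(nn.Module):
    """Attend within w x w spatial windows of each frame, then within
    t_window-frame temporal windows at each patch location."""

    def __init__(self, dim: int = 768, heads: int = 8,
                 window: int = 7, t_window: int = 4):
        super().__init__()
        self.w, self.tw = window, t_window
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height, width, dim) patch-embedding grid,
        # with height/width divisible by `window` and frames by `t_window`.
        B, T, H, W, C = x.shape
        w, tw = self.w, self.tw
        # Spatial stage: partition each frame into non-overlapping windows.
        s = x.reshape(B * T, H // w, w, W // w, w, C)
        s = s.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
        s, _ = self.spatial(s, s, s)
        s = s.reshape(B * T, H // w, W // w, w, w, C)
        s = s.permute(0, 1, 3, 2, 4, 5).reshape(B, T, H, W, C)
        # Temporal stage: attend across tw frames at each patch location.
        t = s.reshape(B, T // tw, tw, H * W, C)
        t = t.permute(0, 1, 3, 2, 4).reshape(-1, tw, C)
        t, _ = self.temporal(t, t, t)
        t = t.reshape(B, T // tw, H * W, tw, C)
        t = t.permute(0, 1, 3, 2, 4).reshape(B, T, H, W, C)
        return t

# Usage: 8 frames of a 14 x 14 ViT-base patch grid.
block = FactorizedWindowedAttention()
out = block(torch.randn(1, 8, 14, 14, 768))
```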
Related papers
- VIPER: Process-aware Evaluation for Generative Video Reasoning [64.86465792516658]
We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking.
arXiv Detail & Related papers (2025-12-31T16:31:59Z)
- VDOT: Efficient Unified Video Creation via Optimal Transport Distillation [70.02065520468726]
We propose an efficient unified video creation model, named VDOT. We employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering.
arXiv Detail & Related papers (2025-12-07T11:31:00Z)
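For readers unfamiliar with the OT discrepancy mentioned in the VDOT summary above, the sketch below computes a generic entropic (Sinkhorn) optimal-transport cost between two sample sets. This is the textbook construction for illustration only, not VDOT's specific distillation objective.

```python
# A generic entropic (Sinkhorn) OT cost between two sample sets; not
# VDOT's distillation objective or code.
import torch

def sinkhorn_cost(a: torch.Tensor, b: torch.Tensor,
                  eps: float = 0.1, iters: int = 200) -> torch.Tensor:
    # a: (n, d), b: (m, d); uniform-weight empirical distributions.
    cost = torch.cdist(a, b) ** 2                   # pairwise squared L2
    K = torch.exp(-cost / (eps * cost.max()))       # scaled Gibbs kernel
    mu = torch.full((a.size(0),), 1.0 / a.size(0))  # uniform marginals
    nu = torch.full((b.size(0),), 1.0 / b.size(0))
    v = torch.ones_like(nu)
    for _ in range(iters):                          # Sinkhorn updates
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    plan = u[:, None] * K * v[None, :]              # transport plan
    return (plan * cost).sum()

# Toy usage: discrepancy between two batches of 8-dim score vectors.
loss = sinkhorn_cost(torch.randn(32, 8), torch.randn(32, 8))
```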
- Q-Save: Towards Scoring and Attribution for Generated Video Evaluation [65.83319736145869]
We present Q-Save, a new benchmark dataset and model for holistic evaluation of AI-generated video (AIGV) quality. The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels. We propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation.
arXiv Detail & Related papers (2025-11-24T07:00:21Z)
- SAGA: Source Attribution of Generative AI Videos [23.217701516122048]
We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the need for AI-generated video source attribution at a large scale. It provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights.
arXiv Detail & Related papers (2025-11-16T23:39:54Z)
- AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences [41.66718802220536]
AEGIS comprises over 10,000 rigorously curated real and synthetic videos generated by diverse, state-of-the-art generative models. We provide multimodal annotations spanning Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features. Experiments using advanced vision-language models demonstrate limited detection capabilities on the most challenging subsets of AEGIS.
arXiv Detail & Related papers (2025-08-14T15:55:49Z)
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection [54.88903878778194]
The field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. We propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. Our method achieves high detection accuracy, above 90% on average, underscoring its effectiveness.
arXiv Detail & Related papers (2025-07-17T15:36:39Z)
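The entry above keeps a pretrained backbone frozen and classifies from its features. A minimal sketch of that general recipe is a linear probe over mean-pooled per-frame embeddings; the pooling choice and feature dimension here are illustrative assumptions, not the paper's exact head.

```python
# A minimal frozen-backbone sketch: classify a video from mean-pooled
# per-frame embeddings; pooling and head are assumptions.
import torch
import torch.nn as nn

class FrozenFeatureProbe(nn.Module):
    """Linear probe over per-frame features from a frozen encoder."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.head = nn.Linear(feat_dim, 2)  # real vs. generated logits

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim), produced offline by
        # any pretrained visual backbone whose weights stay frozen.
        video_feat = frame_feats.mean(dim=1)  # temporal average pooling
        return self.head(video_feat)

probe = FrozenFeatureProbe()
logits = probe(torch.randn(4, 16, 768))  # 4 videos, 16 frames each
```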
- Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation [31.737159092430108]
We study different generative architectures, searching for and identifying discriminative features that are unbiased, robust to impairments, and shared across models. We introduce a novel data augmentation strategy based on wavelet decomposition, replacing specific frequency-related bands to drive the model to exploit more relevant forensic cues. Our method achieves a significant accuracy improvement over state-of-the-art detectors and obtains excellent results even on very recent generative models.
arXiv Detail & Related papers (2025-06-20T07:36:59Z)
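The forensic-oriented augmentation above swaps frequency bands after a wavelet decomposition. Below is a minimal sketch of that idea using PyWavelets: decompose two frames, replace one detail band, and reconstruct. The wavelet, level, and which band is swapped are illustrative assumptions, not the authors' released recipe.

```python
# Band-replacement sketch with PyWavelets; parameters are assumptions.
import numpy as np
import pywt

def swap_wavelet_band(frame_a: np.ndarray, frame_b: np.ndarray,
                      wavelet: str = "haar", level: int = 2,
                      band: int = 1) -> np.ndarray:
    # frame_a, frame_b: same-shape 2-D (grayscale) float arrays.
    coeffs_a = pywt.wavedec2(frame_a, wavelet, level=level)
    coeffs_b = pywt.wavedec2(frame_b, wavelet, level=level)
    # coeffs[0] is the approximation; coeffs[i] for i >= 1 are
    # (horizontal, vertical, diagonal) detail tuples, coarsest first.
    coeffs_a[band] = coeffs_b[band]
    return pywt.waverec2(coeffs_a, wavelet)

# Usage: replace one frequency band of a real frame with another frame's,
# steering a detector away from cues tied to that band.
real = np.random.rand(64, 64).astype(np.float32)
donor = np.random.rand(64, 64).astype(np.float32)
augmented = swap_wavelet_band(real, donor)
```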
- BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation [77.55074597806035]
GenBuster-200K is a large-scale, high-quality AI-generated video dataset featuring 200K high-resolution video clips. BusterX is a novel AI-generated video detection and explanation framework leveraging a multimodal large language model (MLLM) and reinforcement learning.
arXiv Detail & Related papers (2025-05-19T02:06:43Z)
- AVA: Towards Agentic Video Analytics with Vision Language Models [25.232366981384107]
We introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries.
arXiv Detail & Related papers (2025-05-01T02:40:23Z)
- Detecting AI-Generated Video via Frame Consistency [25.290019967304616]
We propose, for the first time, an open-source dataset and detection method for generated video. First, we propose a scalable dataset consisting of 964 prompts, covering various forgery targets, scenes, behaviors, and actions. Second, we find via probing experiments that spatial artifact-based detectors lack generalizability.
arXiv Detail & Related papers (2024-02-03T08:52:06Z)
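As a rough illustration of the frame-consistency cue named in the paper above, one plausible realization is to summarize similarity between consecutive frames, so that temporally unstable detail in generated videos shows up in the statistics. This is an assumed simple variant, not the paper's detector.

```python
# One plausible frame-consistency feature: statistics of SSIM between
# consecutive frames; an illustrative assumption, not the paper's method.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def frame_consistency_features(frames: np.ndarray) -> np.ndarray:
    # frames: (T, H, W) grayscale video with values in [0, 1].
    scores = [ssim(frames[t], frames[t + 1], data_range=1.0)
              for t in range(len(frames) - 1)]
    # Temporally unstable detail lowers the mean and raises the spread.
    return np.array([np.mean(scores), np.std(scores), np.min(scores)])

feats = frame_consistency_features(np.random.rand(16, 64, 64))
```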
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the fact that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
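A minimal sketch of the equivariant-consistency idea above, assuming a temporal-crop transform (my illustrative choice, not the paper's module): boundaries predicted on the cropped clip, mapped back to original coordinates, should match boundaries predicted on the original video.

```python
# Equivariance under an assumed temporal crop: map crop-coordinate
# predictions back and penalize disagreement with the original.
import torch
import torch.nn.functional as F

def map_back(pred_crop: torch.Tensor, s: float, e: float) -> torch.Tensor:
    # Normalized (start, end) in crop coordinates -> original coordinates,
    # where the crop covers [s, e] of the original timeline.
    return s + pred_crop * (e - s)

def equivariant_consistency_loss(pred_orig: torch.Tensor,
                                 pred_crop: torch.Tensor,
                                 s: float, e: float) -> torch.Tensor:
    return F.smooth_l1_loss(map_back(pred_crop, s, e), pred_orig)

# Perfectly equivariant predictions give (near-)zero loss.
loss = equivariant_consistency_loss(torch.tensor([[0.3, 0.6]]),
                                    torch.tensor([[0.25, 1.0]]),
                                    s=0.2, e=0.6)
```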