Related papers: IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

URL: http://arxiv.org/abs/2506.00979v1
Date: Sun, 01 Jun 2025 12:20:22 GMT
Title: IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection
Authors: Wayne Zhang, Changjiang Jiang, Zhonghao Zhang, Chenyang Si, Fengchang Yu, Wei Peng
Abstract summary: We introduce IVY-FAKE, a novel, unified, and large-scale dataset for explainable multimodal AIGC detection. We propose Ivy Explainable Detector (IVY-XDETECTOR), a unified AIGC detection and explainable architecture. Our unified vision-language model achieves state-of-the-art performance across multiple image and video detection benchmarks.
Score: 24.67072921674199
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid advancement of Artificial Intelligence Generated Content (AIGC) in visual domains has resulted in highly realistic synthetic images and videos, driven by sophisticated generative frameworks such as diffusion-based architectures. While these breakthroughs open substantial opportunities, they simultaneously raise critical concerns about content authenticity and integrity. Many current AIGC detection methods operate as black-box binary classifiers, which offer limited interpretability, and no approach supports detecting both images and videos in a unified framework. This dual limitation compromises model transparency, reduces trustworthiness, and hinders practical deployment. To address these challenges, we introduce IVY-FAKE, a novel, unified, and large-scale dataset specifically designed for explainable multimodal AIGC detection. Unlike prior benchmarks, which suffer from fragmented modality coverage and sparse annotations, IVY-FAKE contains over 150,000 richly annotated training samples (images and videos) and 18,700 evaluation examples, each accompanied by detailed natural-language reasoning beyond simple binary labels. Building on this, we propose Ivy Explainable Detector (IVY-XDETECTOR), a unified AIGC detection and explainable architecture that jointly performs explainable detection for both image and video content. Our unified vision-language model achieves state-of-the-art performance across multiple image and video detection benchmarks, highlighting the significant advancements enabled by our dataset and modeling framework. Our data is publicly available at https://huggingface.co/datasets/AI-Safeguard/Ivy-Fake.

Related papers

Unveiling Perceptual Artifacts: A Fine-Grained Benchmark for Interpretable AI-Generated Image Detection [95.08316274158165]
X-AIGD provides pixel-level, categorized annotations of perceptual artifacts, spanning low-level distortions, high-level semantics, and cognitive-level counterfactuals. Existing AIGI detectors demonstrate negligible reliance on perceptual artifacts, even at the most basic distortion level. Explicitly aligning model attention with artifact regions can increase the interpretability and generalization of detectors.
arXiv Detail & Related papers (2026-01-27T10:09:17Z)
Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective [80.10217707456046]
We introduce a self-supervised approach for detecting AI-generated images that leverages camera metadata. We train a feature extractor solely on camera-captured photographs by classifying categorical EXIF tags. Our detectors deliver strong generalization to in-the-wild samples and robustness to common benign image perturbations.
arXiv Detail & Related papers (2025-12-05T11:53:18Z)
SAGA: Source Attribution of Generative AI Videos [23.217701516122048]
We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the need for AI-generated video source attribution at a large scale. It provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights.
arXiv Detail & Related papers (2025-11-16T23:39:54Z)
From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection [19.240335260177382]
We introduce AIFo (Agent-based Image Forensics), a training-free framework that emulates human forensic investigation through multi-agent collaboration. Unlike conventional methods, our framework employs a set of forensic tools, including reverse image search, metadata extraction, pre-trained classifiers, and VLM analysis. Our comprehensive evaluation spans 6,000 images and challenging real-world scenarios, including images from modern generative platforms and diverse online sources.
arXiv Detail & Related papers (2025-10-31T18:36:49Z)
RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification [14.448350657613368]
RAVID is the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP.
arXiv Detail & Related papers (2025-08-05T23:10:56Z)
Text-Visual Semantic Constrained AI-Generated Image Quality Assessment [47.575342788480505]
We propose a unified framework to enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles the aforementioned challenges by introducing two core modules. Tests conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2025-07-14T16:21:05Z)
DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning [58.70446237944036]
DAVID-X is the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. We present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning. Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.
arXiv Detail & Related papers (2025-06-13T13:39:53Z)
FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics [66.14786900470158]
We propose FakeScope, an expert large multimodal model (LMM) tailored for AI-generated image forensics. FakeScope identifies AI-synthetic images with high accuracy and provides rich, interpretable, and query-driven forensic insights. FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios.
arXiv Detail & Related papers (2025-03-31T16:12:48Z)
FakeReasoning: Towards Generalizable Forgery Detection and Reasoning [24.8865218866598]
We propose modeling AI-generated image detection and explanation as a Forgery Detection and Reasoning task (FDR-Task). We introduce the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset), a large-scale dataset containing 100K images across 10 generative models. We also propose FakeReasoning, a forgery detection and reasoning framework with two key components.
arXiv Detail & Related papers (2025-03-27T06:54:06Z)
CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI [58.35348718345307]
Current efforts to distinguish between real and AI-generated images may lack generalization. We propose a novel framework, Co-Spy, that first enhances existing semantic features. We also create Co-Spy-Bench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models.
arXiv Detail & Related papers (2025-03-24T01:59:29Z)
Automated Processing of eXplainable Artificial Intelligence Outputs in Deep Learning Models for Fault Diagnostics of Large Infrastructures [13.422002958854936]
This work proposes a novel framework that combines post-hoc explanations with semi-supervised learning to automatically identify anomalous explanations. The proposed framework is applied to drone-collected images of insulator shells for power grid infrastructure monitoring. The average classification accuracy on two faulty classes is improved by 8% and maintenance operators are required to manually reclassify only 15% of the images.
arXiv Detail & Related papers (2025-03-19T16:57:00Z)
Visual Counter Turing Test (VCT^2): Discovering the Challenges for AI-Generated Image Detection and Introducing Visual AI Index (V_AI) [5.8695051911828555]
Recent AI-generated image detection (AGID) methods include CNNDetection, NPR, DM Image Detection, Fake Image Detection, DIRE, LASTED, GAN Image Detection, AIDE, SSP, DRCT, RINE, OCC-CLIP, De-Fake, and Deep Fake Detection. We introduce the Visual Counter Turing Test (VCT^2), a benchmark comprising 130K images generated by text-to-image models. We also evaluate the performance of the aforementioned AGID techniques on the VCT^2 benchmark, highlighting their ineffectiveness in detecting AI-generated
arXiv Detail & Related papers (2024-11-24T06:03:49Z)
Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
Neuromorphic Synergy for Video Binarization [54.195375576583864]
Bimodal objects serve as a visual form to embed information that can be easily recognized by vision systems. Neuromorphic cameras offer new capabilities for alleviating motion blur, but it is non-trivial to first de-blur and then binarize the images in a real-time manner. We propose an event-based binary reconstruction method that leverages the prior knowledge of the bimodal target's properties to perform inference independently in both event space and image space. We also develop an efficient integration method to propagate this binary image to high frame rate binary video.
arXiv Detail & Related papers (2024-02-20T01:43:51Z)
Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. We pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.