OpenFake: An Open Dataset and Platform Toward Real-World Deepfake Detection
- URL: http://arxiv.org/abs/2509.09495v2
- Date: Mon, 06 Oct 2025 21:24:19 GMT
- Title: OpenFake: An Open Dataset and Platform Toward Real-World Deepfake Detection
- Authors: Victor Livernoche, Akshatha Arodi, Andreea Musulan, Zachary Yang, Adam Salvail, Gaétan Marceau Caron, Jean-François Godbout, Reihaneh Rabbany
- Abstract summary: Deepfakes, synthetic media created using advanced AI techniques, pose a growing threat to information integrity. We present OpenFake, a dataset specifically crafted for benchmarking against modern generative models with high realism. Detectors trained on OpenFake achieve near-perfect in-distribution performance, strong generalization to unseen generators, and high accuracy on a curated in-the-wild social media test set.
- Score: 6.949215801100937
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deepfakes, synthetic media created using advanced AI techniques, pose a growing threat to information integrity, particularly in politically sensitive contexts. This challenge is amplified by the increasing realism of modern generative models, which our human perception study confirms are often indistinguishable from real images. Yet, existing deepfake detection benchmarks rely on outdated generators or narrowly scoped datasets (e.g., single-face imagery), limiting their utility for real-world detection. To address these gaps, we present OpenFake, a large politically grounded dataset specifically crafted for benchmarking against modern generative models with high realism, and designed to remain extensible through an innovative crowdsourced adversarial platform that continually integrates new hard examples. OpenFake comprises nearly four million total images: three million real images paired with descriptive captions and almost one million synthetic counterparts from state-of-the-art proprietary and open-source models. Detectors trained on OpenFake achieve near-perfect in-distribution performance, strong generalization to unseen generators, and high accuracy on a curated in-the-wild social media test set, significantly outperforming models trained on existing datasets. Overall, we demonstrate that with high-quality and continually updated benchmarks, automatic deepfake detection is both feasible and effective in real-world settings.
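At its core, the detection task the abstract describes reduces to binary classification: given an image (or features derived from it), predict real versus synthetic. As a minimal illustrative sketch only (the paper trains deep networks on millions of images; the toy features and logistic-regression detector below are assumptions for demonstration), the training loop might look like:

```python
import numpy as np

def train_detector(feats, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression detector; labels are 1 (synthetic) / 0 (real)."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=feats.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))   # sigmoid probabilities
        # Gradient of the binary cross-entropy loss.
        w -= lr * (feats.T @ (p - labels)) / len(labels)
        b -= lr * np.mean(p - labels)
    return w, b

def predict(feats, w, b):
    """Threshold the linear score at zero."""
    return (feats @ w + b > 0).astype(int)

# Toy, linearly separable "embeddings": real images cluster near -1, fakes near +1.
rng = np.random.default_rng(1)
real = rng.normal(-1.0, 0.3, size=(100, 8))
fake = rng.normal(+1.0, 0.3, size=(100, 8))
X = np.vstack([real, fake])
y = np.array([0] * 100 + [1] * 100)
w, b = train_detector(X, y)
acc = (predict(X, w, b) == y).mean()
```

On such cleanly separated toy clusters the detector fits almost perfectly; the paper's point is that real-world difficulty comes from unseen generators and in-the-wild degradations, not from the classification machinery itself.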
Related papers
- Beyond Artifacts: Real-Centric Envelope Modeling for Reliable AI-Generated Image Detection [29.26836532055224]
Existing detectors often overfit to generator-specific artifacts and remain sensitive to real-world degradations. We propose Real-centric Envelope Modeling (REM), a new paradigm that shifts detection from learning generator artifacts to modeling the robust distribution of real images. REM achieves an average improvement of 7.5% over state-of-the-art methods, and notably maintains exceptional generalization on the severely degraded RealChain benchmark.
arXiv Detail & Related papers (2025-12-24T04:41:04Z)
- Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios [54.07895223545793]
This paper introduces the Real-World Robustness dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions. RRDataset includes high-quality images from seven major scenarios. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study.
arXiv Detail & Related papers (2025-09-11T06:15:52Z)
- So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection [75.79507634008631]
We introduce So-Fake-Set, a social media-oriented dataset with over 2 million high-quality images, diverse generative sources, and imagery synthesized using 35 state-of-the-art generative models. We present So-Fake-R1, an advanced vision-language framework that employs reinforcement learning for highly accurate forgery detection, precise localization, and explainable inference through interpretable visual rationales.
arXiv Detail & Related papers (2025-05-24T11:53:35Z)
- Is Artificial Intelligence Generated Image Detection a Solved Problem? [16.9828631287303]
AIGIBench is a benchmark designed to rigorously evaluate the robustness and generalization capabilities of state-of-the-art AIGI detectors. It simulates real-world challenges through four core tasks: multi-source generalization, robustness to image degradation, sensitivity to data augmentation, and impact of test-time pre-processing. It includes 23 diverse fake image subsets that span both advanced and widely adopted image generation techniques, along with real-world samples collected from social media and AI art platforms.
arXiv Detail & Related papers (2025-05-18T10:00:39Z)
- TrueFake: A Real World Case Dataset of Last Generation Fake Images also Shared on Social Networks [0.9870503213194768]
We introduce TrueFake, a large-scale benchmarking dataset of 600,000 images. This dataset allows for rigorous evaluation of state-of-the-art fake image detectors under very realistic and challenging conditions. We analyze how social media sharing impacts detection performance, and identify the currently most effective detection and training strategies.
arXiv Detail & Related papers (2025-04-29T11:33:52Z)
- FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics [66.14786900470158]
We propose FakeScope, an expert large multimodal model (LMM) tailored for AI-generated image forensics. FakeScope identifies AI-synthetic images with high accuracy and provides rich, interpretable, and query-driven forensic insights. FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios.
arXiv Detail & Related papers (2025-03-31T16:12:48Z)
- Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation [44.246370685732444]
We introduce FakeVLM, a specialized large multimodal model for both general synthetic image and DeepFake detection tasks. FakeVLM excels in distinguishing real from fake images and provides clear, natural language explanations for image artifacts. We also present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language.
arXiv Detail & Related papers (2025-03-19T05:14:44Z)
- SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model [48.547599530927926]
Synthetic images, when shared on social media, can mislead extensive audiences and erode trust in digital content. We introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages. We propose a new image deepfake detection, localization, and explanation framework, named SIDA.
arXiv Detail & Related papers (2024-12-05T16:12:25Z)
- Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook [101.30779332427217]
We survey deepfake generation and detection techniques, including the most recent developments in the field. We identify various kinds of deepfakes, according to the procedure used to alter or generate the fake content. We develop a novel multimodal benchmark to evaluate deepfake detectors on out-of-distribution content.
arXiv Detail & Related papers (2024-11-29T08:29:25Z)
- Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities [88.398085358514]
Contrastive Deepfake Embeddings (CoDE) is a novel embedding space specifically designed for deepfake detection.
CoDE is trained via contrastive learning by additionally enforcing global-local similarities.
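Contrastive training of this kind pulls embeddings of matching pairs together while pushing apart all other pairs in the batch. As a hedged illustration (an InfoNCE-style loss on dummy embeddings, not CoDE's actual global-local formulation), the objective can be sketched as:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: each anchor's positive is the same-index row of
    `positives`; every other row in the batch serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # cosine-similarity logits
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy against the diagonal (matching-pair) targets.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 32))
# Well-aligned pairs (positive is a slightly perturbed anchor) should score
# a much lower loss than deliberately mismatched (shuffled) pairs.
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))
mismatched = info_nce(z, np.roll(z, 1, axis=0))
```

Minimizing such a loss shapes an embedding space in which real and fake images of the same subject land far apart, which is what makes a simple classifier on top of the embeddings effective.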
arXiv Detail & Related papers (2024-07-29T18:00:10Z)
- Generalizable Synthetic Image Detection via Language-guided Contrastive Learning [22.533225521726116]
The malevolent use of synthetic images, such as the dissemination of fake news or the creation of fake profiles, raises significant concerns regarding the authenticity of images. We propose a simple yet very effective synthetic image detection method via language-guided contrastive learning. It is shown that our proposed LanguAge-guided SynThEsis Detection (LASTED) model achieves much improved generalizability to unseen image generation models.
arXiv Detail & Related papers (2023-05-23T08:13:27Z)
- Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis [69.09526348527203]
Deep generative models have led to highly realistic media, known as deepfakes, that are often indistinguishable from real media to the human eye.
We propose a novel fake-detection method that re-synthesizes test images and extracts visual cues for detection.
We demonstrate the improved effectiveness, cross-GAN generalization, and robustness against perturbations of our approach in a variety of detection scenarios.
arXiv Detail & Related papers (2021-05-29T21:22:24Z)
- Artificial Fingerprinting for Generative Models: Rooting Deepfake Attribution in Training Data [64.65952078807086]
Photorealistic image generation has reached a new level of quality due to breakthroughs in generative adversarial networks (GANs).
Yet, the dark side of such deepfakes, the malicious use of generated media, raises concerns about visual misinformation.
We seek a proactive and sustainable solution to deepfake detection by introducing artificial fingerprints into the models.
arXiv Detail & Related papers (2020-07-16T16:49:55Z)
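The fingerprinting idea above is to stamp the training data so that any generative model trained on it reproduces the mark in its outputs, enabling attribution. A toy least-significant-bit watermark (purely illustrative; the paper's fingerprints are learned and far more robust than this) shows the basic embed/recover loop:

```python
import numpy as np

def embed_fingerprint(img, bits):
    """Write `bits` into the least-significant bit of the first len(bits) pixels."""
    out = img.copy().ravel()
    out[:len(bits)] = (out[:len(bits)] & 0xFE) | np.asarray(bits, dtype=np.uint8)
    return out.reshape(img.shape)

def read_fingerprint(img, n_bits):
    """Recover the embedded bits from the least-significant bits."""
    return (img.ravel()[:n_bits] & 1).tolist()

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
fingerprint = [1, 0, 1, 1, 0, 0, 1, 0]
stamped = embed_fingerprint(image, fingerprint)
recovered = read_fingerprint(stamped, len(fingerprint))
```

Each pixel changes by at most one intensity level, so the stamp is imperceptible; a real fingerprinting scheme additionally has to survive the lossy transformation of being learned and re-emitted by a generative model.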
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.