Related papers: How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study

How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study

URL: http://arxiv.org/abs/2602.07814v1
Date: Sun, 08 Feb 2026 04:36:13 GMT
Title: How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study
Authors: Simiao Ren, Yuchen Zhou, Xingyu Shen, Kidus Zewde, Tommy Duong, George Huang, Hatsanai, Tiangratanakul, Tsang, Ng, En Wei, Jiayu Xue,
Abstract summary: No universal winner exists, with detector rankings exhibiting substantial instability.<n>Our findings challenge the one-size-fits-all'' detector paradigm.
Score: 5.740397289924559
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in understanding out-of-the-box performance -- the most common deployment scenario for practitioners. We present the first comprehensive zero-shot evaluation of 16 state-of-the-art detection methods, comprising 23 pretrained detector variants (due to multiple released versions of certain detectors), across 12 diverse datasets, comprising 2.6~million image samples spanning 291 unique generators including modern diffusion models. Our systematic analysis reveals striking findings: (1)~no universal winner exists, with detector rankings exhibiting substantial instability (Spearman~$ρ$: 0.01 -- 0.87 across dataset pairs); (2)~a 37~percentage-point performance gap separates the best detector (75.0\% mean accuracy) from the worst (37.5\%); (3)~training data alignment critically impacts generalization, causing up to 20--60\% performance variance within architecturally identical detector families; (4)~modern commercial generators (Flux~Dev, Firefly~v4, Midjourney~v7) defeat most detectors, achieving only 18--30\% average accuracy; and (5)~we identify three systematic failure patterns affecting cross-dataset generalization. Statistical analysis confirms significant performance differences between detectors (Friedman test: $χ^2$=121.01, $p<10^{-16}$, Kendall~$W$=0.524). Our findings challenge the ``one-size-fits-all'' detector paradigm and provide actionable deployment guidelines, demonstrating that practitioners must carefully select detectors based on their specific threat landscape rather than relying on published benchmark performance.

Related papers

Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection [105.14032334647932]
Machine-generated texts (MGTs) pose risks such as disinformation and phishing, highlighting the need for reliable detection.<n> Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting.<n>We propose a Markov-informed score calibration strategy that models two relationships of context detection scores that may aid calibration.
arXiv Detail & Related papers (2026-02-08T16:06:12Z)
ForensicFormer: Hierarchical Multi-Scale Reasoning for Cross-Domain Image Forgery Detection [0.0]
ForensicFormer is a hierarchical framework that unifies low-level artifact detection, mid-level boundary analysis, and high-level semantic reasoning.<n>Our method maintains 86.8% average accuracy across seven diverse test sets, spanning traditional manipulations, GAN-generated images, and diffusion model outputs.
arXiv Detail & Related papers (2026-01-12T04:29:36Z)
Evaluating Anomaly Detectors for Simulated Highly Imbalanced Industrial Classification Problems [1.376408511310322]
This paper presents a comprehensive evaluation of anomaly detection algorithms using a problem-agnostic simulated dataset.<n>We benchmark 14 detectors across training datasets with anomaly rates between 0.05% and 20% and training sizes between 1 000 and 10 000.<n>Our findings reveal that the best detector is highly dependant on the total number of faulty examples in the training dataset.
arXiv Detail & Related papers (2025-12-07T03:49:54Z)
MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning [0.6475163438744868]
We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules.<n> visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence.<n>MIRAGE orchestrates vision-language model reasoning with targeted web retrieval, outputs structured and citation-linked rationales.
arXiv Detail & Related papers (2025-10-20T14:40:26Z)
Robust Distribution Alignment for Industrial Anomaly Detection under Distribution Shift [51.24522135151649]
Anomaly detection plays a crucial role in quality control for industrial applications.<n>Existing methods attempt to address domain shifts by training generalizable models.<n>Our proposed method demonstrates superior results compared with state-of-the-art anomaly detection and domain adaptation methods.
arXiv Detail & Related papers (2025-03-19T05:25:52Z)
Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models [68.90917438865078]
Deepfake techniques for facial synthesis and editing pose serious risks for generative models.<n>In this paper, we investigate how detection performance varies across model backbones, types, and datasets.<n>We introduce Contrastive Blur, which enhances performance on facial images, and MINDER, which addresses noise type bias, balancing performance across domains.
arXiv Detail & Related papers (2024-11-28T13:04:45Z)
Uncertainty Estimation for 3D Object Detection via Evidential Learning [63.61283174146648]
We introduce a framework for quantifying uncertainty in 3D object detection by leveraging an evidential learning loss on Bird's Eye View representations in the 3D detector. We demonstrate both the efficacy and importance of these uncertainty estimates on identifying out-of-distribution scenes, poorly localized objects, and missing (false negative) detections.
arXiv Detail & Related papers (2024-10-31T13:13:32Z)
Multimodal Attack Detection for Action Recognition Models [24.585379549997743]
We propose a novel universal detection method which is compatible with any action recognition model. Tested against four state-of-the-art attacks targeting four action recognition models, the proposed detector achieves an average AUC of 0.911 over 16 test cases.
arXiv Detail & Related papers (2024-04-13T01:31:25Z)
DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection [55.70982767084996]
A critical yet frequently overlooked challenge in the field of deepfake detection is the lack of a standardized, unified, comprehensive benchmark. We present the first comprehensive benchmark for deepfake detection, called DeepfakeBench, which offers three key contributions. DeepfakeBench contains 15 state-of-the-art detection methods, 9CL datasets, a series of deepfake detection evaluation protocols and analysis tools, as well as comprehensive evaluations.
arXiv Detail & Related papers (2023-07-04T01:34:41Z)
Localization Uncertainty-Based Attention for Object Detection [8.154943252001848]
We propose a more efficient uncertainty-aware dense detector (UADET) that predicts four-directional localization uncertainties via Gaussian modeling. Experiments using the MS COCO benchmark show that our UADET consistently surpasses baseline FCOS, and that our best model, ResNext-64x4d-101-DCN, obtains a single model, single-scale AP of 48.3% on COCO test-dev.
arXiv Detail & Related papers (2021-08-25T04:32:39Z)
Robust and Accurate Object Detection via Adversarial Learning [111.36192453882195]
This work augments the fine-tuning stage for object detectors by exploring adversarial examples. Our approach boosts the performance of state-of-the-art EfficientDets by +1.1 mAP on the object detection benchmark.
arXiv Detail & Related papers (2021-03-23T19:45:26Z)
SADet: Learning An Efficient and Accurate Pedestrian Detector [68.66857832440897]
This paper proposes a series of systematic optimization strategies for the detection pipeline of one-stage detector. It forms a single shot anchor-based detector (SADet) for efficient and accurate pedestrian detection. Though structurally simple, it presents state-of-the-art result and real-time speed of $20$ FPS for VGA-resolution images.
arXiv Detail & Related papers (2020-07-26T12:32:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.