AI-Generated Image Detection: An Empirical Study and Future Research Directions
- URL: http://arxiv.org/abs/2511.02791v1
- Date: Tue, 04 Nov 2025 18:13:48 GMT
- Title: AI-Generated Image Detection: An Empirical Study and Future Research Directions
- Authors: Nusrat Tasnim, Kutub Uddin, Khalid Mahmood Malik
- Abstract summary: Threats posed by AI-generated media, particularly deepfakes, are raising significant challenges for forensics. Although several forensic methods have been proposed, they suffer from three critical gaps. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications.
- Score: 6.891145787446519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The threats posed by AI-generated media, particularly deepfakes, are now raising significant challenges for multimedia forensics, misinformation detection, and biometric systems, resulting in an erosion of public trust in the legal system, a significant increase in fraud, and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks with GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods (scratch, frozen, and fine-tuned) on seven publicly available datasets (GAN and diffusion) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.
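The abstract's headline metrics (accuracy, error rate, ROC-AUC) for a binary real-vs-fake detector can be sketched in plain Python. This is an illustrative sketch, not the paper's code; the labels, scores, and the 0.5 decision threshold are assumptions. ROC-AUC is computed via its rank-based (Mann-Whitney) formulation.

```python
# Hypothetical sketch: headline metrics for a binary real-vs-fake
# detector. Convention assumed here: label 1 = AI-generated ("fake"),
# label 0 = real; scores are detector confidences that an image is fake.

def accuracy(labels, preds):
    # Fraction of predictions matching the ground-truth labels.
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)

def roc_auc(labels, scores):
    # Rank-based (Mann-Whitney) ROC-AUC: the probability that a
    # randomly chosen fake scores higher than a randomly chosen real,
    # counting ties as half a win.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(f"accuracy   = {accuracy(labels, preds):.3f}")
print(f"error rate = {1 - accuracy(labels, preds):.3f}")
print(f"ROC-AUC    = {roc_auc(labels, scores):.3f}")
```

Note that ROC-AUC uses the raw scores while accuracy and error rate depend on the chosen threshold, which is one reason the paper reports both kinds of metric.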
Related papers
- Heterogeneous Uncertainty-Guided Composed Image Retrieval with Fine-Grained Probabilistic Learning [49.28548464288051]
Composed Image Retrieval (CIR) enables image search by combining a reference image with modification text. Intrinsic noise in CIR triplets incurs intrinsic uncertainty and threatens the model's robustness. This paper introduces a Heterogeneous Uncertainty-Guided (HUG) paradigm to overcome these limitations.
arXiv Detail & Related papers (2026-01-16T16:05:49Z) - Multi-Layer Confidence Scoring for Detection of Out-of-Distribution Samples, Adversarial Attacks, and In-Distribution Misclassifications [2.4219039094115034]
We introduce Multi-Layer Analysis for Confidence Scoring (MACS). We derive a score applicable to confidence estimation, detecting distributional shifts, and detecting adversarial attacks. We achieve performance that surpasses state-of-the-art approaches in our experiments with the VGG16 and ViTb16 models.
arXiv Detail & Related papers (2025-12-22T15:25:10Z) - Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z) - Rethinking Evaluation of Infrared Small Target Detection [105.59753496831739]
This paper introduces a hybrid-level metric incorporating pixel- and target-level performance, proposes a systematic error analysis method, and emphasizes the importance of cross-dataset evaluation. An open-source toolkit has been released to facilitate standardized benchmarking.
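The idea of combining pixel- and target-level performance can be illustrated with a minimal sketch. This is not the paper's actual metric; the IoU blend, the 0.5 overlap threshold, and the `alpha` weighting are all assumptions made for illustration on flat binary masks.

```python
# Illustrative hybrid pixel/target-level score on flat binary masks
# (lists of 0/1). All thresholds and weights are assumptions.

def pixel_iou(pred, gt):
    # Pixel-level intersection-over-union of two binary masks.
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0

def target_detected(pred, gt, overlap=0.5):
    # Target-level hit: the prediction covers at least `overlap`
    # of the target's ground-truth pixels (assumed criterion).
    hit = sum(p & g for p, g in zip(pred, gt))
    return hit / sum(gt) >= overlap

def hybrid_score(pred, gt, alpha=0.5):
    # Weighted blend of pixel- and target-level performance.
    return alpha * pixel_iou(pred, gt) + (1 - alpha) * float(target_detected(pred, gt))

gt = [0, 1, 1, 1, 0, 0]
pred = [0, 1, 1, 0, 1, 0]
print(hybrid_score(pred, gt))  # → 0.75
```

A hybrid score of this shape rewards localization quality (IoU) while still crediting a coarse detection, which is the kind of trade-off a combined pixel/target metric is meant to capture.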
arXiv Detail & Related papers (2025-09-21T02:45:07Z) - METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z) - Methods and Trends in Detecting AI-Generated Images: A Comprehensive Review [0.17188280334580194]
Generative Adversarial Networks (GANs), Diffusion Models, and Variational Autoencoders (VAEs) have enabled the synthesis of high-quality multimedia data. These advancements have also raised significant concerns regarding adversarial attacks, unethical usage, and societal harm. This survey provides a comprehensive review of state-of-the-art techniques for detecting and classifying synthetic images generated by advanced generative AI models.
arXiv Detail & Related papers (2025-02-21T03:16:18Z) - Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks [17.520137576423593]
We aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR)
We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them.
We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR.
arXiv Detail & Related papers (2024-08-29T17:55:07Z) - A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [89.92916473403108]
This paper proposes a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods. The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics. We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z) - Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z) - An Empirical Evaluation on Robustness and Uncertainty of Regularization Methods [43.25086015530892]
Deep neural networks (DNNs) behave fundamentally differently from humans.
They can easily change predictions when small corruptions such as blur are applied on the input.
They produce confident predictions on out-of-distribution samples (an improper uncertainty measure).
arXiv Detail & Related papers (2020-03-09T01:15:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.