Related papers: Zoom-In to Sort AI-Generated Images Out

URL: http://arxiv.org/abs/2510.04225v1
Date: Sun, 05 Oct 2025 14:29:01 GMT
Title: Zoom-In to Sort AI-Generated Images Out
Authors: Yikun Ji, Yan Hong, Bowen Deng, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
Abstract summary: We propose ZoomIn, a two-stage forensic framework that improves both accuracy and interpretability. To support training, we introduce MagniFake, a dataset of 20,000 real and high-quality synthetic images annotated with bounding boxes and forensic explanations. Our method achieves 96.39% accuracy with robust generalization, while providing human-understandable explanations grounded in visual evidence.
Score: 34.49867697753459
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising critical concerns for digital integrity. Vision-language models (VLMs) offer interpretability through explanations but often fail to detect subtle artifacts in high-quality synthetic images. We propose ZoomIn, a two-stage forensic framework that improves both accuracy and interpretability. Mimicking human visual inspection, ZoomIn first scans an image to locate suspicious regions and then performs a focused analysis on these zoomed-in areas to deliver a grounded verdict. To support training, we introduce MagniFake, a dataset of 20,000 real and high-quality synthetic images annotated with bounding boxes and forensic explanations, generated through an automated VLM-based pipeline. Our method achieves 96.39% accuracy with robust generalization, while providing human-understandable explanations grounded in visual evidence.
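The scan-then-zoom flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: ZoomIn uses a VLM for both stages, whereas here the two stages are passed in as plain callables. All names (`two_stage_inspect`, `tile_quadrants`, `mean_intensity`, `Verdict`), the nested-list image format, the nearest-neighbour zoom, and the 0.5 threshold are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values (assumed format)
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel coordinates


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def zoom(image: Image, factor: int) -> Image:
    """Nearest-neighbour upscaling by an integer factor."""
    out: Image = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


@dataclass
class Verdict:
    is_synthetic: bool
    evidence: List[Tuple[Box, float]]  # (region, suspicion score) pairs


def two_stage_inspect(
    image: Image,
    propose_regions: Callable[[Image], List[Box]],  # stage 1: scan for suspicious regions
    score_region: Callable[[Image], float],         # stage 2: analyse one zoomed-in patch
    zoom_factor: int = 2,
    threshold: float = 0.5,                         # hypothetical decision threshold
) -> Verdict:
    """Scan the image, zoom into each proposed region, and aggregate a verdict."""
    evidence = []
    for box in propose_regions(image):
        patch = zoom(crop(image, box), zoom_factor)
        evidence.append((box, score_region(patch)))
    # Flag the image if any zoomed-in region looks suspicious enough.
    is_synthetic = any(score >= threshold for _, score in evidence)
    return Verdict(is_synthetic, evidence)


def tile_quadrants(image: Image) -> List[Box]:
    """Toy stand-in for stage 1: propose the four quadrants of the image."""
    h, w = len(image), len(image[0])
    mh, mw = h // 2, w // 2
    return [(0, 0, mw, mh), (mw, 0, w, mh), (0, mh, mw, h), (mw, mh, w, h)]


def mean_intensity(patch: Image) -> float:
    """Toy stand-in for stage 2: mean pixel value of the patch."""
    count = sum(len(row) for row in patch)
    return sum(sum(row) for row in patch) / count if count else 0.0
```

Calling `two_stage_inspect(image, tile_quadrants, mean_intensity)` returns a `Verdict` whose `evidence` list ties each score to the bounding box it came from, which is the sense in which the final decision stays grounded in specific regions. In a real system the two toy callables would be replaced by learned models.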

Related papers

Unveiling Perceptual Artifacts: A Fine-Grained Benchmark for Interpretable AI-Generated Image Detection [95.08316274158165]
X-AIGD provides pixel-level, categorized annotations of perceptual artifacts, spanning low-level distortions, high-level semantics, and cognitive-level counterfactuals. Existing AIGI detectors demonstrate negligible reliance on perceptual artifacts, even at the most basic distortion level. Explicitly aligning model attention with artifact regions can increase the interpretability and generalization of detectors.
arXiv Detail & Related papers (2026-01-27T10:09:17Z)
SynMind: Reducing Semantic Hallucination in fMRI-Based Image Reconstruction [52.34513874272676]
We argue that existing methods rely too heavily on entangled visual embeddings over explicit semantic identity. We parse fMRI signals into rich, sentence-level semantic descriptions that mirror the hierarchical and compositional nature of human visual understanding. We propose SynMind, a framework that integrates these explicit semantic encodings with visual priors to condition a pretrained diffusion model.
arXiv Detail & Related papers (2026-01-25T14:31:23Z)
Task-Model Alignment: A Simple Path to Generalizable AI-Generated Image Detection [57.17054616831796]
Vision Language Models (VLMs) are increasingly adopted for AI-generated image (AIGI) detection. VLMs' underperformance is attributed to task-model misalignment. In this paper, we formalize AIGI detection as two complementary tasks: semantic consistency checking and pixel-artifact detection.
arXiv Detail & Related papers (2025-12-07T09:19:00Z)
INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts [0.0]
Current forensic systems degrade sharply under real-world conditions. Most detectors are opaque, offering little insight into why an image is flagged as synthetic. We introduce INSIGHT, a unified framework for robust detection and transparent explanation of AI-generated images.
arXiv Detail & Related papers (2025-11-27T11:43:50Z)
Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations [56.816929931908824]
We pioneer the detection of semantically-coordinated manipulations in multimodal data. We propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. Our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-09-16T04:18:48Z)
Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs [43.08776932101172]
We build a dataset of AI-generated images annotated with bounding boxes and descriptive captions. We then finetune MLLMs through a multi-stage optimization strategy. The resulting model achieves superior performance in both detecting AI-generated images and localizing visual flaws.
arXiv Detail & Related papers (2025-06-08T08:47:44Z)
FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics [66.14786900470158]
We propose FakeScope, an expert large multimodal model (LMM) tailored for AI-generated image forensics. FakeScope identifies AI-synthetic images with high accuracy and provides rich, interpretable, and query-driven forensic insights. FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios.
arXiv Detail & Related papers (2025-03-31T16:12:48Z)
LEGION: Learning to Ground and Explain for Synthetic Image Detection [49.958951540410816]
We introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. We propose LEGION, a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation.
arXiv Detail & Related papers (2025-03-19T14:37:21Z)
Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation [15.442558725312976]
We introduce FakeVLM, a specialized large multimodal model for both general synthetic image and DeepFake detection tasks. FakeVLM excels in distinguishing real from fake images and provides clear, natural language explanations for image artifacts. We present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language.
arXiv Detail & Related papers (2025-03-19T05:14:44Z)
Explainable Synthetic Image Detection through Diffusion Timestep Ensembling [30.298198387824275]
We propose a novel synthetic image detection method that directly utilizes features of intermediately noised images by training an ensemble on multiple noised timesteps. To enhance human comprehension, we introduce a metric-grounded explanation generation and refinement module. Our method achieves state-of-the-art performance with 98.91% and 95.89% detection accuracy on regular and challenging samples, respectively.
arXiv Detail & Related papers (2025-03-08T13:04:20Z)
SAGI: Semantically Aligned and Uncertainty Guided AI Image Inpainting [11.216906046169683]
SAGI-D is the largest and most diverse dataset of AI-generated inpaintings. Our experiments show that semantic alignment significantly improves image quality and aesthetics. Using SAGI-D for training several image forensic approaches increases in-domain detection performance on average by 37.4%.
arXiv Detail & Related papers (2025-02-10T15:56:28Z)
HawkI: Homography & Mutual Information Guidance for 3D-free Single Image to Aerial View [67.8213192993001]
We present HawkI, a method for synthesizing aerial-view images from text and an exemplar image. HawkI blends the visual features from the input image within a pretrained text-to-2D-image Stable Diffusion model. At inference, HawkI employs a unique mutual information guidance formulation to steer the generated image towards faithfully replicating the semantic details of the input image.
arXiv Detail & Related papers (2023-11-27T01:41:25Z)
Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. We pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)
Deep CG2Real: Synthetic-to-Real Translation via Image Disentanglement [78.58603635621591]
Training an unpaired synthetic-to-real translation network in image space is severely under-constrained. We propose a semi-supervised approach that operates on the disentangled shading and albedo layers of the image. Our two-stage pipeline first learns to predict accurate shading in a supervised fashion using physically-based renderings as targets.
arXiv Detail & Related papers (2020-03-27T21:45:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.