FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models
- URL: http://arxiv.org/abs/2505.15644v1
- Date: Wed, 21 May 2025 15:22:45 GMT
- Title: FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models
- Authors: Zhen Sun, Ziyi Zhang, Zeren Luo, Zeyang Sha, Tianshuo Cong, Zheng Li, Shiwen Cui, Weiqiang Wang, Jiaheng Wei, Xinlei He, Qi Li, Qian Wang
- Abstract summary: We develop FragFake, the first dedicated benchmark dataset for edited image detection. We use Vision Language Models (VLMs) for the first time in the task of edited image classification and edited region localization. This work is the first to reformulate localized image edit detection as a vision-language understanding task.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-grained detection of localized edits in images is crucial for assessing content authenticity, especially given that modern diffusion models and image editing methods can produce highly realistic manipulations. However, this domain faces three challenges: (1) Binary classifiers yield only a global real-or-fake label without providing localization; (2) Traditional computer vision methods often rely on costly pixel-level annotations; and (3) No large-scale, high-quality dataset exists for modern image-editing detection techniques. To address these gaps, we develop an automated data-generation pipeline to create FragFake, the first dedicated benchmark dataset for edited image detection, which includes high-quality images from diverse editing models and a wide variety of edited objects. Based on FragFake, we utilize Vision Language Models (VLMs) for the first time in the task of edited image classification and edited region localization. Experimental results show that fine-tuned VLMs achieve higher average Object Precision across all datasets, significantly outperforming pretrained models. We further conduct ablation and transferability analyses to evaluate the detectors across various configurations and editing scenarios. To the best of our knowledge, this work is the first to reformulate localized image edit detection as a vision-language understanding task, establishing a new paradigm for the field. We anticipate that this work will establish a solid foundation to facilitate and inspire subsequent research endeavors in the domain of multimodal content authenticity.
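To make the vision-language formulation concrete, the sketch below shows one plausible way to pose edited-image classification and edited-region localization as a question-answering task for a VLM, together with a simple object-level precision score. The prompt template, the `query_vlm` helper, and the precision definition are illustrative assumptions, not the exact protocol or metric used in the FragFake paper.

```python
# Hypothetical sketch: edited-image detection framed as vision-language QA.
# `query_vlm` stands in for any chat-capable VLM (pretrained or fine-tuned);
# the prompt and the object-precision definition are assumptions for illustration.

from typing import List

PROMPT = (
    "Has this image been locally edited? Answer 'yes' or 'no'. "
    "If yes, list each edited object on its own line as 'edited: <object>'."
)

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a vision language model's inference API."""
    raise NotImplementedError("plug in your VLM inference call here")

def parse_answer(answer: str) -> List[str]:
    """Extract the edited objects from the model's free-form reply."""
    objects = []
    for line in answer.lower().splitlines():
        if line.startswith("edited:"):
            objects.append(line.split(":", 1)[1].strip())
    return objects

def object_precision(predicted: List[str], ground_truth: List[str]) -> float:
    """Fraction of predicted edited objects that match an annotated edited object."""
    if not predicted:
        return 0.0
    truth = {g.lower() for g in ground_truth}
    return sum(1 for p in predicted if p.lower() in truth) / len(predicted)

# Illustrative usage:
# reply = query_vlm("sample_edited.jpg", PROMPT)
# preds = parse_answer(reply)
# print(object_precision(preds, ["red car"]))
```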
Related papers
- X-Edit: Detecting and Localizing Edits in Images Altered by Text-Guided Diffusion Models [3.610796534465868]
Experimental results demonstrate that X-Edit accurately localizes edits in images altered by text-guided diffusion models. This highlights X-Edit's potential as a robust forensic tool for detecting and pinpointing manipulations introduced by advanced image editing techniques.
arXiv Detail & Related papers (2025-05-16T23:29:38Z) - PartEdit: Fine-Grained Image Editing using Pre-Trained Diffusion Models [80.98455219375862]
We present the first text-based image editing approach for object parts based on pre-trained diffusion models. Our approach is preferred by users 77-90% of the time in conducted user studies.
arXiv Detail & Related papers (2025-02-06T13:08:43Z) - EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM [50.054404519821745]
We present a novel framework that integrates a multimodal Large Language Model for enhanced reasoning capabilities. Our framework achieves promising results on MagicBrush, AutoSplice, and PerfBrush datasets. Notably, our method excels on the PerfBrush dataset, a self-constructed test set featuring previously unseen types of edits.
arXiv Detail & Related papers (2024-12-05T02:05:33Z) - A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users.
Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z) - Diffusion Model-Based Image Editing: A Survey [46.244266782108234]
Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks. We provide an exhaustive overview of existing methods using diffusion models for image editing. To further evaluate the performance of text-guided image editing algorithms, we propose a systematic benchmark, EditEval.
arXiv Detail & Related papers (2024-02-27T14:07:09Z) - Rethinking Image Editing Detection in the Era of Generative AI Revolution [13.605053073689751]
The GRE dataset is a large-scale generative regional editing dataset with several advantages.
We perform experiments on three proposed tasks: edited image classification, editing method attribution, and edited region localization.
We expect that the GRE dataset can promote further research and exploration in the field of generative region editing detection.
arXiv Detail & Related papers (2023-11-29T07:35:35Z) - Weakly-supervised deepfake localization in diffusion-generated images [4.548755617115687]
We study weakly-supervised localization of manipulated regions, using the Xception network as the backbone architecture.
We show that the best performing detection method (based on local scores) is less sensitive to looser supervision than to mismatches in dataset or generator.
arXiv Detail & Related papers (2023-11-08T10:27:36Z) - iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z) - StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing [115.49488548588305]
A significant research effort is focused on exploiting the amazing capacities of pretrained diffusion models for the editing of images. They either finetune the model, or invert the image in the latent space of the pretrained model. They suffer from two problems: unsatisfying results for selected regions and unexpected changes in non-selected regions.
arXiv Detail & Related papers (2023-03-28T00:16:45Z)