Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics
- URL: http://arxiv.org/abs/2504.11686v1
- Date: Wed, 16 Apr 2025 01:02:46 GMT
- Title: Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics
- Authors: Yiran He, Yun Cao, Bowen Yang, Zeyu Zhang,
- Abstract summary: multimodal Large Language Models (LLMs) have encoded rich world knowledge but struggle to comprehend local forgery details.<n>We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues.<n>We conduct qualitative and quantitative experiments and show that GPT4V can achieve an accuracy of 92.1% in Autosplice and 86.3% in LaMa, which is competitive with state-of-the-art AIGC detection methods.
- Score: 18.989883830031093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid development of generative AI facilitates content creation and makes image manipulation easier and more difficult to detect. While multimodal Large Language Models (LLMs) have encoded rich world knowledge, they are not inherently tailored for combating AI-generated Content (AIGC) and struggle to comprehend local forgery details. In this work, we investigate the application of multimodal LLMs in forgery detection. We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues. Our method demonstrates that the potential of LLMs in forgery analysis can be effectively unlocked through meticulous prompt engineering and the application of few-shot learning techniques. We conduct qualitative and quantitative experiments and show that GPT4V can achieve an accuracy of 92.1% in Autosplice and 86.3% in LaMa, which is competitive with state-of-the-art AIGC detection methods. We further discuss the limitations of multimodal LLMs in such tasks and propose potential improvements.
Related papers
- Towards Explainable Fake Image Detection with Multi-Modal Large Language Models [38.09674979670241]
We argue that fake image detection should not operate as a "black box"
In this work, we evaluate the capabilities of MLLMs in comparison to traditional detection methods and human evaluators.
We propose a framework that integrates these prompts to develop a more robust, explainable, and reasoning-driven detection system.
arXiv Detail & Related papers (2025-04-19T09:42:25Z) - FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics [66.14786900470158]
We propose FakeScope, an expert multimodal model (LMM) tailored for AI-generated image forensics.<n>FakeScope identifies AI-synthetic images with high accuracy and provides rich, interpretable, and query-driven forensic insights.<n>FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios.
arXiv Detail & Related papers (2025-03-31T16:12:48Z) - VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models [14.053424085561296]
Face models with high-quality and controllable attributes pose a significant challenge for Deepfake detection.
In this work, we integrate Multimodal Large Language Models (MLLMs) within DM-based face forensics.
We propose a fine-grained analysis triad framework called VLForgery, that can 1) predict falsified facial images; 2) locate the falsified face regions subjected to partial synthesis; and 3) attribute the synthesis with specific generators.
arXiv Detail & Related papers (2025-03-08T09:55:19Z) - Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference [78.08901120841833]
We propose a method to detect the knowledge boundary of Visual Large Language Models (VLLMs)<n>We show that our method successfully depicts a VLLM's knowledge boundary based on which we are able to reduce indiscriminate retrieval while maintaining or improving the performance.
arXiv Detail & Related papers (2025-02-25T09:32:08Z) - ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection [107.86009509291581]
We propose ForgerySleuth to perform comprehensive clue fusion and generate segmentation outputs indicating regions that are tampered with.<n>Our experiments demonstrate the effectiveness of ForgeryAnalysis and show that ForgerySleuth significantly outperforms existing methods in robustness, generalization, and explainability.
arXiv Detail & Related papers (2024-11-29T04:35:18Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.12958154544838]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and localization task.<n>It captures high-order correlations of forged images from diverse linguistic feature spaces.<n>It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z) - Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach.
Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens.
We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z) - Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics [46.99625341531352]
DeepFakes, which refer to AI-generated media content, have become an increasing concern due to their use as a means for disinformation.
We investigate the capabilities of multimodal large language models (LLMs) in DeepFake detection.
arXiv Detail & Related papers (2024-03-21T01:57:30Z) - Research about the Ability of LLM in the Tamper-Detection Area [20.620232937684133]
Large Language Models (LLMs) have emerged as the most powerful AI tools in addressing a diverse range of challenges.
We have collected five different LLMs developed by various companies: GPT-4, LLaMA, Bard, ERNIE Bot 4.0, and Tongyi Qianwen.
Most LLMs can identify composite pictures that are inconsistent with logic, and only more powerful LLMs can distinguish logical, but visible signs of tampering to the human eye.
arXiv Detail & Related papers (2024-01-24T14:53:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.