Can ChatGPT Perform Image Splicing Detection? A Preliminary Study
- URL: http://arxiv.org/abs/2506.05358v1
- Date: Thu, 22 May 2025 13:53:53 GMT
- Title: Can ChatGPT Perform Image Splicing Detection? A Preliminary Study
- Authors: Souradip Nath
- Abstract summary: Multimodal Large Language Models (MLLMs) like GPT-4V are capable of reasoning across text and image modalities. We evaluate GPT-4V using three prompting strategies: Zero-Shot (ZS), Few-Shot (FS), and Chain-of-Thought (CoT). Our results show that GPT-4V achieves competitive detection performance in zero-shot settings (more than 85% accuracy).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) like GPT-4V are capable of reasoning across text and image modalities, showing promise in a variety of complex vision-language tasks. In this preliminary study, we investigate the out-of-the-box capabilities of GPT-4V in the domain of image forensics, specifically, in detecting image splicing manipulations. Without any task-specific fine-tuning, we evaluate GPT-4V using three prompting strategies: Zero-Shot (ZS), Few-Shot (FS), and Chain-of-Thought (CoT), applied over a curated subset of the CASIA v2.0 splicing dataset. Our results show that GPT-4V achieves competitive detection performance in zero-shot settings (more than 85% accuracy), with CoT prompting yielding the most balanced trade-off across authentic and spliced images. Qualitative analysis further reveals that the model not only detects low-level visual artifacts but also draws upon real-world contextual knowledge such as object scale, semantic consistency, and architectural facts, to identify implausible composites. While GPT-4V lags behind specialized state-of-the-art splicing detection models, its generalizability, interpretability, and encyclopedic reasoning highlight its potential as a flexible tool in image forensics.
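The evaluation protocol described in the abstract (three prompting strategies applied to CASIA v2.0 images, without fine-tuning) can be illustrated with a minimal sketch. The snippet below assumes the OpenAI Python SDK's chat-completions API with base64-encoded images; the model name, prompt wording, and file paths are illustrative assumptions and are not the prompts or setup reported in the paper.

```python
# Minimal sketch of ZS / FS / CoT prompting for splicing detection.
# Model name, prompt text, and paths are assumptions, not the paper's protocol.
import base64
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    "zero_shot": (
        "Is this image authentic or spliced? Answer with 'authentic' or 'spliced'."
    ),
    "few_shot": (
        "Example 1: a photo with mismatched lighting between regions -> spliced.\n"
        "Example 2: a photo with consistent noise and edge statistics -> authentic.\n"
        "Now classify the following image as 'authentic' or 'spliced'."
    ),
    "chain_of_thought": (
        "Examine the image step by step: edge consistency, lighting direction, "
        "noise patterns, object scale, and semantic plausibility. "
        "Then conclude with 'authentic' or 'spliced'."
    ),
}


def classify(image_path: str, strategy: str) -> str:
    """Send one image plus the chosen prompt and return the model's verdict."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed stand-in; the paper evaluates GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPTS[strategy]},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


# Hypothetical usage: classify("casia_v2/spliced_0001.jpg", "chain_of_thought")
```

In practice a few-shot condition would attach actual labeled example images rather than textual descriptions; the sketch only shows how each strategy maps to a different user prompt over the same image input.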
Related papers
- Preliminary Explorations with GPT-4o(mni) Native Image Generation [7.700772640399941]
Recently, OpenAI unlocked the visual generation ability of GPT-4o(mni). In this paper, we aim to explore the capabilities of GPT-4o across various tasks.
arXiv Detail & Related papers (2025-05-06T19:35:29Z)
- A Unified Agentic Framework for Evaluating Conditional Image Generation [66.25099219134441]
Conditional image generation has gained significant attention for its ability to personalize content. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks.
arXiv Detail & Related papers (2025-04-09T17:04:14Z)
- GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation [28.235805447825896]
OpenAI's GPT-4o model has demonstrated surprisingly good capabilities in image generation and editing. This report presents the first-look evaluation benchmark (named GPT-ImgEval). We show GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed synthesis.
arXiv Detail & Related papers (2025-04-03T17:23:16Z)
- Exploring AI-based System Design for Pixel-level Protected Health Information Detection in Medical Images [0.5825410941577593]
We present an AI-based pipeline for PHI detection comprising text detection, text extraction, and text analysis. We benchmark three models, YOLOv11, EasyOCR, and GPT-4o, across different setups corresponding to these components. The combination of YOLOv11 for text localization and GPT-4o for extraction and analysis yields the best results.
arXiv Detail & Related papers (2025-01-16T14:12:33Z)
- An Early Investigation into the Utility of Multimodal Large Language Models in Medical Imaging [0.3029213689620348]
We explore the potential of the Gemini (gemini-1.0-pro-vision-latest) and GPT-4V models for medical image analysis.
Both Gemini AI and GPT-4V are first used to classify real versus synthetic images, followed by an interpretation and analysis of the input images.
Our early investigation presented in this work provides insights into the potential of MLLMs to assist with the classification and interpretation of retinal fundoscopy and lung X-ray images.
arXiv Detail & Related papers (2024-06-02T08:29:23Z)
- Gemini Pro Defeated by GPT-4V: Evidence from Education [1.0226894006814744]
GPT-4V significantly outperforms Gemini Pro in terms of scoring accuracy and Quadratic Weighted Kappa.
Findings suggest GPT-4V's superior capability in handling complex educational tasks.
arXiv Detail & Related papers (2023-12-27T02:56:41Z)
- Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z)
- GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection [51.43589678946244]
This paper explores the potential of VQA-oriented GPT-4V in the popular visual Anomaly Detection (AD) task.
It is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets.
arXiv Detail & Related papers (2023-11-05T10:01:18Z)
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-images to text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z)
- Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [103.68138147783614]
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models.
We employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions, and overlay these regions with a set of marks.
Using the marked image as input, GPT-4V can answer the questions that require visual grounding.
arXiv Detail & Related papers (2023-10-17T17:51:31Z)
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs.
GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system.
GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
arXiv Detail & Related papers (2023-09-29T17:34:51Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)