Leveraging Hierarchical Image-Text Misalignment for Universal Fake Image Detection
- URL: http://arxiv.org/abs/2511.00427v1
- Date: Sat, 01 Nov 2025 06:51:14 GMT
- Title: Leveraging Hierarchical Image-Text Misalignment for Universal Fake Image Detection
- Authors: Daichi Zhang, Tong Zhang, Jianmin Bao, Shiming Ge, Sabine Süsstrunk
- Abstract summary: We show that fake images cannot be aligned with their corresponding captions as well as real images can. Building on this observation, we propose ITEM, a simple yet effective detector that leverages image-text misalignment in a joint vision-language space as a discriminative clue.
- Score: 58.927873049646024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid development of generative models, detecting generated fake images to prevent their malicious use has recently become a critical issue. Existing methods frame this challenge as a naive binary image classification task. However, such methods focus only on visual clues, leaving trained detectors prone to overfitting to specific image patterns and unable to generalize to unseen models. In this paper, we address this issue from a multi-modal perspective and find that fake images cannot be aligned with their corresponding captions as well as real images can. Building on this observation, we propose a simple yet effective detector, termed ITEM, that leverages image-text misalignment in a joint vision-language space as a discriminative clue. Specifically, we first measure the misalignment between images and captions in pre-trained CLIP's embedding space, and then tune an MLP head to perform the detection task. Furthermore, we propose a hierarchical misalignment scheme that first considers the whole image and then each semantic object described in the caption, exploiting both global and fine-grained local semantic misalignment as clues. Extensive experiments demonstrate the superiority of our method over state-of-the-art competitors, with impressive generalization and robustness across a variety of recent generative models.
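The pipeline the abstract describes can be sketched in a few lines. The sketch below assumes pre-extracted, L2-normalized CLIP embeddings for the image, the full caption, and each object phrase; the function names (`misalignment`, `hierarchical_features`, `MLPHead`) and the choice of mean/max pooling over per-object scores are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def misalignment(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    """1 - cosine similarity; embeddings are assumed L2-normalized,
    so the dot product is the cosine similarity."""
    return float(1.0 - np.dot(img_emb, txt_emb))

def hierarchical_features(img_emb, caption_emb, object_embs):
    """Hierarchical scheme: one global image-caption score, then
    per-object scores pooled into summary statistics."""
    global_score = misalignment(img_emb, caption_emb)
    local_scores = [misalignment(img_emb, obj) for obj in object_embs]
    return np.array([global_score, np.mean(local_scores), np.max(local_scores)])

class MLPHead:
    """Tiny two-layer MLP head; only this head would be trained,
    while the CLIP encoders stay frozen."""
    def __init__(self, in_dim: int, hidden: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.5, hidden)
        self.b2 = 0.0

    def forward(self, x: np.ndarray) -> float:
        h = np.maximum(0.0, x @ self.W1 + self.b1)   # ReLU
        z = h @ self.W2 + self.b2
        return 1.0 / (1.0 + np.exp(-z))              # sigmoid -> P(fake)
```

In practice the head would be trained with binary cross-entropy on misalignment features from labeled real and fake images; the point of the sketch is only that the detector operates on alignment scores rather than raw pixels.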
Related papers
- The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment [105.31858867473845]
ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing. In experiments, ImageCritic effectively resolves detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.
arXiv Detail & Related papers (2025-11-25T18:40:25Z) - Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection [18.382178646073474]
We propose RISE, a paradigm that exploits the entire training dataset to generate pseudo-labels for single images. Using only training images without annotations poses a pronounced challenge for crafting high-quality prototype libraries. In the KNN retrieval stage, to alleviate the effect of artifacts in feature maps, we propose Multi-View KNN Retrieval.
arXiv Detail & Related papers (2025-10-21T09:12:26Z) - Color Bind: Exploring Color Perception in Text-to-Image Models [40.094195503306295]
We introduce a dedicated image editing technique, mitigating the issue of multi-object semantic alignment for prompts containing multiple colors. Our approach significantly boosts performance over a wide range of metrics on images generated by various text-to-image diffusion-based techniques.
arXiv Detail & Related papers (2025-08-27T11:16:58Z) - A Large-scale Interpretable Multi-modality Benchmark for Facial Image Forgery Localization [22.725542948364357]
We argue that the basic binary forgery mask is inadequate for explaining model predictions. In this study, we generate salient region-focused interpretations for forgery images. We develop ForgeryTalker, an architecture designed for concurrent forgery localization and interpretation.
arXiv Detail & Related papers (2024-12-27T15:23:39Z) - MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia. Although existing approaches mainly capture face forgery patterns using the image modality, other modalities like fine-grained noise and text are not fully explored. We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z) - Generalizable Person Re-Identification via Viewpoint Alignment and Fusion [74.30861504619851]
This work proposes to use a 3D dense pose estimation model and a texture mapping module to map pedestrian images to canonical view images.
Due to the imperfection of the texture mapping module, the canonical view images may lose the discriminative detail clues from the original images.
We show that our method can lead to superior performance over the existing approaches in various evaluation settings.
arXiv Detail & Related papers (2022-12-05T16:24:09Z) - Shrinking the Semantic Gap: Spatial Pooling of Local Moment Invariants for Copy-Move Forgery Detection [7.460203098159187]
Copy-move forgery is a manipulation of copying and pasting specific patches from and to an image, with potentially illegal or unethical uses.
Recent advances in the forensic methods for copy-move forgery have shown increasing success in detection accuracy and robustness.
For images with high self-similarity or strong signal corruption, existing algorithms often run inefficiently and produce unreliable results.
arXiv Detail & Related papers (2022-07-19T09:11:43Z) - LatteGAN: Visually Guided Language Attention for Multi-Turn Text-Conditioned Image Manipulation [0.0]
We present a novel architecture called the Visually Guided Language Attention GAN (LatteGAN).
LatteGAN extracts fine-grained text representations for the generator, and discriminates both the global and local representations of fake or real images.
Experiments on two distinct MTIM datasets, CoDraw and i-CLEVR, demonstrate the state-of-the-art performance of the proposed model.
arXiv Detail & Related papers (2021-12-28T03:50:03Z) - Region-level Active Learning for Cluttered Scenes [60.93811392293329]
We introduce a new strategy that subsumes previous Image-level and Object-level approaches into a generalized, Region-level approach.
We show that this approach significantly decreases labeling effort and improves rare object search on realistic data with inherent class-imbalance and cluttered scenes.
arXiv Detail & Related papers (2021-08-20T14:02:38Z) - Fine-Grained Image Captioning with Global-Local Discriminative Objective [80.73827423555655]
We propose a novel global-local discriminative objective to facilitate generating fine-grained descriptive captions.
We evaluate the proposed method on the widely used MS-COCO dataset.
arXiv Detail & Related papers (2020-07-21T08:46:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.