When Semantics Regulate: Rethinking Patch Shuffle and Internal Bias for Generated Image Detection with CLIP
- URL: http://arxiv.org/abs/2511.19126v1
- Date: Mon, 24 Nov 2025 13:54:00 GMT
- Title: When Semantics Regulate: Rethinking Patch Shuffle and Internal Bias for Generated Image Detection with CLIP
- Authors: Beilin Chu, Weike You, Mengtao Li, Tingting Zheng, Kehan Zhao, Xuan Xu, Zhigao Lu, Jia Song, Moxuan Xu, Linna Zhou,
- Abstract summary: CLIP-based detectors often rely on semantic cues rather than generator artifacts, leading to brittle performance under distribution shifts. We show that Patch Shuffle, which disrupts global semantic continuity, provides an unusually strong benefit for CLIP. We propose SemAnti, a semantic-antagonistic fine-tuning paradigm that freezes the semantic subspace and adapts only artifact-sensitive layers.
- Score: 13.360123625878733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid progress of GANs and Diffusion Models poses new challenges for detecting AI-generated images. Although CLIP-based detectors exhibit promising generalization, they often rely on semantic cues rather than generator artifacts, leading to brittle performance under distribution shifts. In this work, we revisit the nature of semantic bias and uncover that Patch Shuffle, which disrupts global semantic continuity while preserving local artifact cues, provides an unusually strong benefit for CLIP: it reduces semantic entropy and homogenizes feature distributions between natural and synthetic images. Through a detailed layer-wise analysis, we further show that CLIP's deep semantic structure functions as a regulator that stabilizes cross-domain representations once semantic bias is suppressed. Guided by these findings, we propose SemAnti, a semantic-antagonistic fine-tuning paradigm that freezes the semantic subspace and adapts only artifact-sensitive layers under shuffled semantics. Despite its simplicity, SemAnti achieves state-of-the-art cross-domain generalization on AIGCDetectBenchmark and GenImage, demonstrating that regulating semantics is key to unlocking CLIP's full potential for robust AI-generated image detection.
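The Patch Shuffle operation the abstract centers on (scrambling patch positions while leaving pixel content inside each patch untouched) can be sketched as a simple array transform. This is a minimal illustrative version; the patch size, seed handling, and grid assumptions here are mine, not the paper's exact configuration:

```python
import numpy as np

def patch_shuffle(img, patch=8, seed=0):
    """Shuffle non-overlapping patches of an (H, W, C) image.

    Global layout (semantics) is destroyed; local pixel statistics
    inside each patch (artifact cues) are preserved exactly.
    """
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "toy version: dims must divide evenly"
    gh, gw = h // patch, w // patch
    # Cut the image into a (gh * gw) stack of patch x patch tiles.
    tiles = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    tiles = tiles.reshape(gh * gw, patch, patch, c)
    # Permute tile positions.
    perm = np.random.default_rng(seed).permutation(gh * gw)
    tiles = tiles[perm]
    # Reassemble the shuffled grid back into an image.
    tiles = tiles.reshape(gh, gw, patch, patch, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(h, w, c)
```

Because the transform is a pure permutation of tiles, the multiset of pixel values is unchanged, which is what lets low-level generator artifacts survive while scene-level semantics do not.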
Related papers
- Towards Robust DeepFake Detection under Unstable Face Sequences: Adaptive Sparse Graph Embedding with Order-Free Representation and Explicit Laplacian Spectral Prior [12.202765237400143]
We propose a Laplacian-Regularized Graph Convolutional Network (LR-GCN) that robustly detects DeepFakes from noisy or unordered face sequences. LR-GCN achieves state-of-the-art performance and significantly improved robustness under severe global and local disruptions.
arXiv Detail & Related papers (2025-12-08T12:31:07Z) - Task-Model Alignment: A Simple Path to Generalizable AI-Generated Image Detection [57.17054616831796]
Vision Language Models (VLMs) are increasingly adopted for AI-generated image (AIGI) detection. VLMs' underperformance is attributed to task-model misalignment. In this paper, we formalize AIGI detection as two complementary tasks: semantic consistency checking and pixel-artifact detection.
arXiv Detail & Related papers (2025-12-07T09:19:00Z) - INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts [0.0]
Current forensic systems degrade sharply under real-world conditions. Most detectors operate as opaque models, offering little insight into why an image is flagged as synthetic. We introduce INSIGHT, a unified framework for robust detection and transparent explanation of AI-generated images.
arXiv Detail & Related papers (2025-11-27T11:43:50Z) - Leveraging Hierarchical Image-Text Misalignment for Universal Fake Image Detection [58.927873049646024]
We show that fake images cannot be properly aligned with corresponding captions compared to real images. We propose a simple yet effective ITEM by leveraging the image-text misalignment in a joint visual-language space as discriminative clues.
arXiv Detail & Related papers (2025-11-01T06:51:14Z) - Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images [96.43608872116347]
AnomReason is a large-scale benchmark with structured quadruple annotations, accompanied by AnomAgent. AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images.
arXiv Detail & Related papers (2025-10-11T14:09:24Z) - GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection [26.484706270778318]
We propose GAMMA, a novel training framework designed to reduce domain bias and enhance semantic alignment. We employ multi-task supervision with dual segmentation heads and a classification head, enabling pixel-level source attribution across diverse generative domains. Our method not only achieves state-of-the-art generalization performance on the GenImage benchmark, improving accuracy by 5.8%, but also maintains strong robustness on newly released generative models such as GPT-4o.
arXiv Detail & Related papers (2025-09-12T13:46:54Z) - Semantic-Aware Reconstruction Error for Detecting AI-Generated Images [22.83053631078616]
We propose a novel representation, namely Semantic-Aware Reconstruction Error (SARE), that measures the semantic difference between an image and its caption-guided reconstruction. SARE provides a robust and discriminative feature for detecting fake images across diverse generative models. We also introduce a fusion module that integrates SARE into the backbone detector via a cross-attention mechanism.
arXiv Detail & Related papers (2025-08-13T04:37:36Z) - NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection [14.7077339945096]
NS-Net is a novel framework that decouples semantic information from CLIP's visual features, followed by contrastive learning to capture intrinsic distributional differences between real and generated images. Experiments show that NS-Net outperforms existing state-of-the-art methods, achieving a 7.4% improvement in detection accuracy.
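The null-space decoupling NS-Net's title refers to has a standard linear-algebra core: projecting a feature vector onto the null space of a set of semantic directions. A toy sketch under my own assumptions (random stand-ins for CLIP features and semantic directions; NS-Net's actual construction of these directions is learned):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 16))  # hypothetical semantic directions (rows), 16-dim feature space
v = rng.standard_normal(16)       # a visual feature (toy stand-in for a CLIP embedding)

# Orthogonal projector onto the null space of A: P = I - A^T (A A^T)^{-1} A
P = np.eye(16) - A.T @ np.linalg.inv(A @ A.T) @ A
v_ns = P @ v  # the feature with its semantic component removed
```

After projection, `v_ns` is orthogonal to every semantic direction, so whatever the detector learns from it cannot lean on those semantic axes.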
arXiv Detail & Related papers (2025-08-02T07:58:15Z) - Towards Imperceptible JPEG Image Hiding: Multi-range Representations-driven Adversarial Stego Generation [19.5984577708016]
We propose a multi-range representations-driven adversarial stego generation framework called MRAG for JPEG image hiding. MRAG integrates the local-range characteristic of the convolution and the global-range modeling of the transformer. It computes the adversarial loss between covers and stegos based on the surrogate steganalyzer's classified features.
arXiv Detail & Related papers (2025-07-11T06:45:07Z) - Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection [58.87142367781417]
A naively trained detector tends to overfit to the limited and monotonous fake patterns, causing the feature space to become highly constrained and low-ranked. One potential remedy is incorporating the pre-trained knowledge within vision foundation models to expand the feature space. By freezing the principal components and adapting only the remaining components, we preserve the pre-trained knowledge while learning fake patterns.
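The freeze-principal / adapt-residual idea in this entry can be illustrated with a plain SVD split of a weight matrix. A minimal sketch, assuming a toy random matrix in place of a real pre-trained weight and an arbitrary cutoff `k` (the paper's actual decomposition and training procedure may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))  # stand-in for a pre-trained weight matrix

# Split W into a principal (frozen) part and a residual (trainable) part.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 8                                               # components to freeze (illustrative)
W_principal = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]    # kept fixed: pre-trained knowledge
W_residual = W - W_principal                        # adapted during fine-tuning
```

At initialization the two parts sum back to the original weights exactly, so fine-tuning only `W_residual` starts from the pre-trained model while leaving its dominant directions untouched.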
arXiv Detail & Related papers (2024-11-23T19:10:32Z) - SLLEN: Semantic-aware Low-light Image Enhancement Network [92.80325772199876]
We develop a semantic-aware LLE network (SLLEN) composed of an LLE main-network (LLEmN) and an SS auxiliary-network (SSaN).
Unlike currently available approaches, the proposed SLLEN is able to fully leverage the semantic information, e.g., IEF, HSF, and the SS dataset, to assist LLE.
Comparisons between the proposed SLLEN and other state-of-the-art techniques demonstrate the superiority of SLLEN with respect to LLE quality.
arXiv Detail & Related papers (2022-11-21T15:29:38Z) - No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z) - Causal Intervention for Weakly-Supervised Semantic Segmentation [122.1846968696862]
We aim to generate better pixel-level pseudo-masks by using only image-level labels.
We propose a structural causal model to analyze the causalities among images, contexts, and class labels.
Based on it, we develop a new method: Context Adjustment (CONTA), to remove the confounding bias in image-level classification.
arXiv Detail & Related papers (2020-09-26T09:26:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.