Fooling the Watchers: Breaking AIGC Detectors via Semantic Prompt Attacks
- URL: http://arxiv.org/abs/2505.23192v1
- Date: Thu, 29 May 2025 07:31:17 GMT
- Title: Fooling the Watchers: Breaking AIGC Detectors via Semantic Prompt Attacks
- Authors: Run Hao, Peng Ying,
- Abstract summary: The rise of text-to-image (T2I) models has enabled the synthesis of photorealistic human portraits. In this work, we propose an automated adversarial prompt generation framework. Our method generates diverse, controllable prompts that consistently evade both open-source and commercial AIGC detectors.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rise of text-to-image (T2I) models has enabled the synthesis of photorealistic human portraits, raising serious concerns about identity misuse and the robustness of AIGC detectors. In this work, we propose an automated adversarial prompt generation framework that leverages a grammar tree structure and a variant of the Monte Carlo tree search algorithm to systematically explore the semantic prompt space. Our method generates diverse, controllable prompts that consistently evade both open-source and commercial AIGC detectors. Extensive experiments across multiple T2I models validate its effectiveness, and the approach ranked first in a real-world adversarial AIGC detection competition. Beyond attack scenarios, our method can also be used to construct high-quality adversarial datasets, providing valuable resources for training and evaluating more robust AIGC detection and defense systems.
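The abstract names the two ingredients of the attack: a grammar tree over prompt fragments and a Monte Carlo tree search variant that explores it, rewarded for evading a detector. The paper's actual grammar and search code are not given here, so the following is only a minimal sketch of that general recipe: a toy grammar, a standard UCB-guided MCTS loop, and a hypothetical `detector_score(prompt)` callback (assumed to return the probability that the resulting image is flagged as AI-generated).

```python
import math
import random

# Toy grammar for portrait prompts; every production here is an
# illustrative assumption, not the paper's actual grammar tree.
GRAMMAR = {
    "<prompt>": [["<subject>", "<style>", "<lighting>"]],
    "<subject>": [["a portrait of a young woman,"], ["a close-up of an elderly man,"]],
    "<style>": [["shot on 35mm film,"], ["documentary photograph,"]],
    "<lighting>": [["soft natural light"], ["harsh overcast light"]],
}

class Node:
    def __init__(self, tokens, parent=None):
        self.tokens = tokens      # partial prompt: mix of terminals and non-terminals
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0          # cumulative evasion reward

    def is_terminal(self):
        return not any(t in GRAMMAR for t in self.tokens)

def expand(node):
    # Expand the leftmost non-terminal with every alternative in the grammar.
    for i, tok in enumerate(node.tokens):
        if tok in GRAMMAR:
            for alt in GRAMMAR[tok]:
                node.children.append(
                    Node(node.tokens[:i] + alt + node.tokens[i + 1:], parent=node))
            return

def rollout(tokens):
    # Randomly complete a partial prompt into a terminal string.
    tokens = list(tokens)
    while any(t in GRAMMAR for t in tokens):
        i = next(j for j, t in enumerate(tokens) if t in GRAMMAR)
        tokens[i:i + 1] = random.choice(GRAMMAR[tokens[i]])
    return " ".join(tokens)

def ucb(child, parent, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent.visits) / child.visits)

def search(detector_score, iterations=200):
    """detector_score(prompt) -> probability the image is flagged as AI;
    reward 1 - score steers the search toward evasive prompts."""
    root = Node(["<prompt>"])
    best_prompt, best_reward = None, -1.0
    for _ in range(iterations):
        node = root
        while node.children:                       # selection
            node = max(node.children, key=lambda ch: ucb(ch, node))
        if not node.is_terminal():                 # expansion
            expand(node)
            node = random.choice(node.children)
        prompt = rollout(node.tokens)              # simulation
        reward = 1.0 - detector_score(prompt)
        if reward > best_reward:
            best_reward, best_prompt = reward, prompt
        while node is not None:                    # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return best_prompt
```

Because the reward is computed on complete rollouts, the search naturally concentrates visits on grammar branches whose finished prompts score low on the detector, which is what makes the generated prompts both diverse (grammar-constrained) and controllable (edit the grammar, change the space).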
Related papers
- Universal Anti-forensics Attack against Image Forgery Detection via Multi-modal Guidance [22.94094331220455]
ForgeryEraser is a framework designed to execute universal anti-forensics attacks without access to the target AIGC detectors. We show that ForgeryEraser causes substantial performance degradation to advanced AIGC detectors on global synthesis and local editing benchmarks.
arXiv Detail & Related papers (2026-02-06T09:32:10Z)
- Hiding in the AI Traffic: Abusing MCP for LLM-Powered Agentic Red Teaming [0.0]
We introduce a novel command & control (C2) architecture leveraging the Model Context Protocol (MCP) to coordinate adaptive reconnaissance agents covertly across networks. We find that our architecture not only improves goal-directed behavior of the system as a whole, but also eliminates key host and network artifacts that can be used to detect and prevent command & control behavior altogether.
arXiv Detail & Related papers (2025-11-20T02:51:04Z)
- PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks [2.540711742769252]
We investigate why iteratively paraphrased text evades detection systems designed for AIGT identification. We introduce PADBen, the first benchmark systematically evaluating detector robustness against paraphrase attack scenarios.
arXiv Detail & Related papers (2025-11-01T05:59:46Z)
- ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection [51.93101033997245]
The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations. We propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. We show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark.
arXiv Detail & Related papers (2025-09-24T07:34:09Z)
- Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations [56.816929931908824]
We pioneer the detection of semantically coordinated manipulations in multimodal data. We propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. Our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-09-16T04:18:48Z)
- Hybrid LLM-Enhanced Intrusion Detection for Zero-Day Threats in IoT Networks [6.087274577167399]
This paper presents a novel approach to intrusion detection by integrating traditional signature-based methods with the contextual understanding capabilities of the GPT-2 Large Language Model (LLM). We propose a hybrid IDS framework that merges the robustness of signature-based techniques with the adaptability of GPT-2-driven semantic analysis. Experimental evaluations on a representative intrusion dataset demonstrate that our model enhances detection accuracy by 6.3%, reduces false positives by 9.0%, and maintains near real-time responsiveness.
arXiv Detail & Related papers (2025-07-10T04:10:03Z)
- AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models [78.08374249341514]
The rapid development of AI-generated content (AIGC) has led to the misuse of AI-generated images (AIGI) in spreading misinformation. We introduce a large-scale and comprehensive dataset, Holmes-Set, which includes an instruction-tuning dataset with explanations on whether images are AI-generated. Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control. In addition, we propose Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization.
arXiv Detail & Related papers (2025-07-03T14:26:31Z)
- RAID: A Dataset for Testing the Adversarial Robustness of AI-Generated Image Detectors [57.81012948133832]
We present RAID (Robust evaluation of AI-generated image Detectors), a dataset of 72k diverse and highly transferable adversarial examples. Our methodology generates adversarial images that transfer with a high success rate to unseen detectors. Our findings indicate that current state-of-the-art AI-generated image detectors can be easily deceived by adversarial examples.
arXiv Detail & Related papers (2025-06-04T14:16:00Z)
- Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning [58.16354555208417]
Presentation Attack Detection and Face Forgery Detection are designed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes, respectively. Separate training of these two models makes them vulnerable to unknown attacks and burdens deployment environments. We present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework (HiPTune) that adaptively explores multiple classification criteria from different semantic spaces.
arXiv Detail & Related papers (2025-05-19T16:35:45Z)
- Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations [23.612489763464357]
We propose a novel AIGT detection method (DP-Net) that introduces dynamic perturbations via reinforcement learning with an elaborated reward and action design. Extensive experimental results show that the proposed DP-Net significantly outperforms state-of-the-art AIGT detection methods in generalization capacity.
arXiv Detail & Related papers (2025-04-22T02:21:19Z)
- Unleashing the Power of Pre-trained Encoders for Universal Adversarial Attack Detection [21.03032944637112]
Adversarial attacks pose a critical security threat to real-world AI systems. This paper proposes a lightweight adversarial detection framework based on the large-scale pre-trained vision-language model CLIP.
arXiv Detail & Related papers (2025-04-01T05:21:45Z)
- Are AI-Generated Text Detectors Robust to Adversarial Perturbations? [9.001160538237372]
Current detectors for AI-generated text (AIGT) lack robustness against adversarial perturbations.
This paper investigates the robustness of existing AIGT detection methods and introduces a novel detector, the Siamese Calibrated Reconstruction Network (SCRN).
The SCRN employs a reconstruction network to add and remove noise from text, extracting a semantic representation that is robust to local perturbations.
arXiv Detail & Related papers (2024-06-03T10:21:48Z)
- RIGID: A Training-free and Model-Agnostic Framework for Robust AI-Generated Image Detection [60.960988614701414]
RIGID is a training-free and model-agnostic method for robust AI-generated image detection.
RIGID significantly outperforms existing training-based and training-free detectors.
arXiv Detail & Related papers (2024-05-30T14:49:54Z)
- Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack [24.954755569786396]
We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection.
We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness.
The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
arXiv Detail & Related papers (2024-04-02T12:49:22Z)
- ToBlend: Token-Level Blending With an Ensemble of LLMs to Attack AI-Generated Text Detection [6.27025292177391]
ToBlend is a novel token-level ensemble text generation method to challenge the robustness of current AI-content detection approaches.
We find ToBlend significantly drops the performance of most mainstream AI-content detection methods.
arXiv Detail & Related papers (2024-02-17T02:25:57Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
- Can AI-Generated Text be Reliably Detected? [50.95804851595018]
Large Language Models (LLMs) perform impressively well in various applications. The potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concern about their responsible use. We stress-test the robustness of these AI text detectors in the presence of an attacker.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
- Dual Adversarial Auto-Encoders for Clustering [152.84443014554745]
We propose Dual Adversarial Auto-encoder (Dual-AAE) for unsupervised clustering.
By performing variational inference on the objective function of Dual-AAE, we derive a new reconstruction loss which can be optimized by training a pair of Auto-encoders.
Experiments on four benchmarks show that Dual-AAE achieves superior performance over state-of-the-art clustering methods.
arXiv Detail & Related papers (2020-08-23T13:16:34Z)
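Several of the text-side attacks listed above (for instance the synonym-replacement strategy in "Red Teaming Language Model Detectors with Language Models") follow one common greedy pattern: swap a word only if the swap lowers the detector's score. Below is a minimal sketch of that pattern; the fixed `SYNONYMS` table and the `detector` callback are illustrative assumptions, since the cited work draws context-aware synonyms from an LLM and attacks real AIGT detectors.

```python
# Toy synonym table; purely illustrative, not from the cited paper.
SYNONYMS = {
    "utilize": ["use", "employ"],
    "demonstrate": ["show", "illustrate"],
    "significant": ["notable", "considerable"],
}

def synonym_attack(text, detector, max_swaps=10):
    """Greedily replace words with synonyms whenever the replacement
    lowers detector(text), the detector's AI-probability score."""
    words = text.split()
    score = detector(" ".join(words))
    swaps = 0
    for i in range(len(words)):
        if swaps >= max_swaps:
            break
        key = words[i].lower().strip(".,")
        for alt in SYNONYMS.get(key, []):
            candidate = words[:i] + [alt] + words[i + 1:]
            new_score = detector(" ".join(candidate))
            if new_score < score:      # keep the swap only if it helps
                words, score = candidate, new_score
                swaps += 1
                break
    return " ".join(words), score
```

The `max_swaps` budget mirrors the "minor perturbations" constraint these attacks share: the fewer the edits, the closer the adversarial text stays to the original generation.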
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.