Related papers: Are AI-Generated Text Detectors Robust to Adversarial Perturbations?

Are AI-Generated Text Detectors Robust to Adversarial Perturbations?

URL: http://arxiv.org/abs/2406.01179v2
Date: Wed, 26 Jun 2024 12:43:56 GMT
Title: Are AI-Generated Text Detectors Robust to Adversarial Perturbations?
Authors: Guanhua Huang, Yuchen Zhang, Zhe Li, Yongjian You, Mingze Wang, Zhouwang Yang,
Abstract summary: Current detectors for AI-generated text (AIGT) lack robustness against adversarial perturbations. This paper investigates the robustness of existing AIGT detection methods and introduces a novel detector, the Siamese Calibrated Reconstruction Network (SCRN) The SCRN employs a reconstruction network to add and remove noise from text, extracting a semantic representation that is robust to local perturbations.
Score: 9.001160538237372
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The widespread use of large language models (LLMs) has sparked concerns about the potential misuse of AI-generated text, as these models can produce content that closely resembles human-generated text. Current detectors for AI-generated text (AIGT) lack robustness against adversarial perturbations, with even minor changes in characters or words causing a reversal in distinguishing between human-created and AI-generated text. This paper investigates the robustness of existing AIGT detection methods and introduces a novel detector, the Siamese Calibrated Reconstruction Network (SCRN). The SCRN employs a reconstruction network to add and remove noise from text, extracting a semantic representation that is robust to local perturbations. We also propose a siamese calibration technique to train the model to make equally confidence predictions under different noise, which improves the model's robustness against adversarial perturbations. Experiments on four publicly available datasets show that the SCRN outperforms all baseline methods, achieving 6.5\%-18.25\% absolute accuracy improvement over the best baseline method under adversarial attacks. Moreover, it exhibits superior generalizability in cross-domain, cross-genre, and mixed-source scenarios. The code is available at \url{https://github.com/CarlanLark/Robust-AIGC-Detector}.

Related papers

AuthorMist: Evading AI Text Detectors with Reinforcement Learning [4.806579822134391]
AuthorMist is a novel reinforcement learning-based system to transform AI-generated text into human-like writing. We show that AuthorMist effectively reduces the detectability of AI-generated text while preserving the original meaning.
arXiv Detail & Related papers (2025-03-10T12:41:05Z)
Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection [60.09665704993751]
We introduce FairOPT, an algorithm for group-specific threshold optimization for probabilistic AI-text detectors.<n>Our framework paves the way for more robust classification in AI-generated content detection via post-processing.
arXiv Detail & Related papers (2025-02-06T21:58:48Z)
Adversarial Attacks on AI-Generated Text Detection Models: A Token Probability-Based Approach Using Embeddings [14.150011713654331]
This work proposes a novel textual adversarial attack on the detection models such as Fast-DetectGPT. The method employs embedding models for data perturbation, aiming at reconstructing the AI generated texts to reduce the likelihood of detection of the true origin of the texts.
arXiv Detail & Related papers (2025-01-31T10:06:27Z)
Detecting Machine-Generated Long-Form Content with Latent-Space Variables [54.07946647012579]
Existing zero-shot detectors primarily focus on token-level distributions, which are vulnerable to real-world domain shifts. We propose a more robust method that incorporates abstract elements, such as event transitions, as key deciding factors to detect machine versus human texts.
arXiv Detail & Related papers (2024-10-04T18:42:09Z)
ESPERANTO: Evaluating Synthesized Phrases to Enhance Robustness in AI Detection for Text Origination [1.8418334324753884]
This paper introduces back-translation as a novel technique for evading detection. We present a model that combines these back-translated texts to produce a manipulated version of the original AI-generated text. We evaluate this technique on nine AI detectors, including six open-source and three proprietary systems.
arXiv Detail & Related papers (2024-09-22T01:13:22Z)
Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack [24.954755569786396]
We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection. We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness. The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
arXiv Detail & Related papers (2024-04-02T12:49:22Z)
RADAR: Robust AI-Text Detection via Adversarial Learning [69.5883095262619]
RADAR is based on adversarial training of a paraphraser and a detector. The paraphraser's goal is to generate realistic content to evade AI-text detection. RADAR uses the feedback from the detector to update the paraphraser, and vice versa.
arXiv Detail & Related papers (2023-07-07T21:13:27Z)
Towards Imperceptible Document Manipulations against Neural Ranking Models [13.777462017782659]
We propose a framework called Imperceptible DocumEnt Manipulation (IDEM) to produce adversarial documents. IDEM instructs a well-established generative language model, such as BART, to generate connection sentences without introducing easy-to-detect errors. We show that IDEM can outperform strong baselines while preserving fluency and correctness of the target documents.
arXiv Detail & Related papers (2023-05-03T02:09:29Z)
Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering. Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking. We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z)
Can AI-Generated Text be Reliably Detected? [54.670136179857344]
Unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc. Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques. In this paper, we show that these detectors are not reliable in practical scenarios.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
Language-Driven Anchors for Zero-Shot Adversarial Robustness [25.160195547250655]
We propose a Language-driven, Anchor-based Adversarial Training strategy. By leveraging the semantic consistency of the text encoders, LAAT aims to enhance the adversarial robustness of the image model. We show that LAAT significantly improves zero-shot adversarial robustness over state-of-the-art methods.
arXiv Detail & Related papers (2023-01-30T17:34:43Z)
Robust lEarned Shrinkage-Thresholding (REST): Robust unrolling for sparse recover [87.28082715343896]
We consider deep neural networks for solving inverse problems that are robust to forward model mis-specifications. We design a new robust deep neural network architecture by applying algorithm unfolding techniques to a robust version of the underlying recovery problem. The proposed REST network is shown to outperform state-of-the-art model-based and data-driven algorithms in both compressive sensing and radar imaging problems.
arXiv Detail & Related papers (2021-10-20T06:15:45Z)
BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adrial attacks for discrete data (such as texts) are more challenging than continuous data (such as images) We propose textbfBERT-Attack, a high-quality and effective method to generate adversarial samples. Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.