Related papers: Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack

Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack

URL: http://arxiv.org/abs/2404.01907v1
Date: Tue, 2 Apr 2024 12:49:22 GMT
Title: Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack
Authors: Ying Zhou, Ben He, Le Sun,
Abstract summary: We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection. We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness. The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
Score: 24.954755569786396
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the development of large language models (LLMs), detecting whether text is generated by a machine becomes increasingly challenging in the face of malicious use cases like the spread of false information, protection of intellectual property, and prevention of academic plagiarism. While well-trained text detectors have demonstrated promising performance on unseen test data, recent research suggests that these detectors have vulnerabilities when dealing with adversarial attacks such as paraphrasing. In this paper, we propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection. We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness against such attacks. The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content. Furthermore, we explore the prospect of improving the model's robustness over iterative adversarial learning. Although some improvements in model robustness are observed, practical applications still face significant challenges. These findings shed light on the future development of AI-text detectors, emphasizing the need for more accurate and robust detection methods.

Related papers

Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection [71.59834293521074]
We develop a framework to distinguish between human-authored and machine-generated text.<n>Our method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on DeepFake dataset.<n>Code, pretrained weights, and demo will be released.
arXiv Detail & Related papers (2025-10-07T08:14:45Z)
Deep Learning Models for Robust Facial Liveness Detection [56.08694048252482]
This study introduces a robust solution through novel deep learning models addressing the deficiencies in contemporary anti-spoofing techniques.<n>By innovatively integrating texture analysis and reflective properties associated with genuine human traits, our models distinguish authentic presence from replicas with remarkable precision.
arXiv Detail & Related papers (2025-08-12T17:19:20Z)
Navigating the Shadows: Unveiling Effective Disturbances for Modern AI Content Detectors [24.954755569786396]
AI-text detection has emerged to distinguish between human and machine-generated content. Recent research indicates that these detection systems often lack robustness and struggle to effectively differentiate perturbed texts. Our work simulates real-world scenarios in both informal and professional writing, exploring the out-of-the-box performance of current detectors.
arXiv Detail & Related papers (2024-06-13T08:37:01Z)
Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection [29.433764586753956]
Large language models (LLMs) have exhibited remarkable capabilities in text generation tasks. The utilization of these models carries inherent risks, including but not limited to plagiarism, the dissemination of fake news, and issues in educational exercises. This paper aims to bridge this gap by constructing AIG-ASAP, an AI-generated student essay dataset.
arXiv Detail & Related papers (2024-02-01T08:11:56Z)
Towards Possibilities & Impossibilities of AI-generated Text Detection: A Survey [97.33926242130732]
Large Language Models (LLMs) have revolutionized the domain of natural language processing (NLP) with remarkable capabilities of generating human-like text responses. Despite these advancements, several works in the existing literature have raised serious concerns about the potential misuse of LLMs. To address these concerns, a consensus among the research community is to develop algorithmic solutions to detect AI-generated text.
arXiv Detail & Related papers (2023-10-23T18:11:32Z)
Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models [0.47334880432883714]
We present an analysis of adversarial robustness exhibited by various hate-speech detection models. We devise and execute targeted attacks on the text by leveraging the TextAttack tool. This work paves the way for creating more robust and reliable hate-speech detection systems.
arXiv Detail & Related papers (2023-05-29T19:59:40Z)
Can AI-Generated Text be Reliably Detected? [54.670136179857344]
Unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc. Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques. In this paper, we show that these detectors are not reliable in practical scenarios.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks [76.35478518372692]
We introduce epsilon-illusory, a novel form of adversarial attack on sequential decision-makers. Compared to existing attacks, we empirically find epsilon-illusory to be significantly harder to detect with automated methods. Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses.
arXiv Detail & Related papers (2022-07-20T19:49:09Z)
"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks [0.2999888908665659]
Adversarial attacks are a major challenge faced by current machine learning research. Our work presents a model-agnostic detector of adversarial text examples.
arXiv Detail & Related papers (2022-04-10T09:24:41Z)
Improving robustness of jet tagging algorithms with adversarial training [56.79800815519762]
We investigate the vulnerability of flavor tagging algorithms via application of adversarial attacks. We present an adversarial training strategy that mitigates the impact of such simulated attacks.
arXiv Detail & Related papers (2022-03-25T19:57:19Z)
Adversarial Robustness of Neural-Statistical Features in Detection of Generative Transformers [6.209131728799896]
We evaluate neural and non-neural approaches on their ability to detect computer-generated text. We find that while statistical features underperform neural features, statistical features provide additional adversarial robustness. We pioneer the usage of $Delta$MAUVE as a proxy measure for human judgement of adversarial text quality.
arXiv Detail & Related papers (2022-03-02T16:46:39Z)
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News [57.9843300852526]
We introduce the more realistic and challenging task of defending against machine-generated news that also includes images and captions. To identify the possible weaknesses that adversaries can exploit, we create a NeuralNews dataset composed of 4 different types of generated articles. In addition to the valuable insights gleaned from our user study experiments, we provide a relatively effective approach based on detecting visual-semantic inconsistencies.
arXiv Detail & Related papers (2020-09-16T14:13:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.