Mutation-Based Adversarial Attacks on Neural Text Detectors
- URL: http://arxiv.org/abs/2302.05794v1
- Date: Sat, 11 Feb 2023 22:08:32 GMT
- Title: Mutation-Based Adversarial Attacks on Neural Text Detectors
- Authors: Gongbo Liang, Jesus Guerrero, Izzat Alsmadi
- Abstract summary: We propose character- and word-based mutation operators for generating adversarial samples to attack state-of-the-art neural text detectors.
In such attacks, attackers have access to the original text and create mutated instances based on it.
- Score: 1.5101132008238316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural text detectors aim to identify the characteristics that distinguish
neural (machine-generated) text from human-written text. To challenge such detectors,
adversarial attacks can alter the statistical characteristics of the generated text,
making the detection task increasingly difficult. Inspired by advances in mutation
analysis in software development and testing, in this paper we propose character- and
word-based mutation operators for generating adversarial samples to attack
state-of-the-art neural text detectors. These are white-box adversarial attacks: the
attacker has access to the original text and creates mutated instances based on it. The
ultimate goal is to confuse machine learning models and classifiers and decrease their
prediction accuracy.
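To make the white-box setting above concrete, here is a minimal Python sketch of character- and word-level mutation operators. The operator set, function names, and random selection policy are illustrative assumptions, not the paper's exact implementation.

```python
import random

# Hypothetical illustration of character- and word-level mutation operators in
# the white-box setting: the attacker starts from the original machine-generated
# text and produces mutated instances. Names and operator choices are assumptions.

ZERO_WIDTH_SPACE = "\u200b"  # invisible code point, a common character-level perturbation


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Character-level mutation: transpose two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def insert_zero_width_space(text: str, rng: random.Random) -> str:
    """Character-level mutation: insert an invisible zero-width space."""
    i = rng.randrange(len(text) + 1)
    return text[:i] + ZERO_WIDTH_SPACE + text[i:]


def duplicate_word(text: str, rng: random.Random) -> str:
    """Word-level mutation: repeat one randomly chosen word."""
    words = text.split()
    if not words:
        return text
    i = rng.randrange(len(words))
    words.insert(i, words[i])
    return " ".join(words)


def generate_mutants(text: str, n: int = 10, seed: int = 0) -> list[str]:
    """Produce n single-operator mutants of the original text."""
    rng = random.Random(seed)
    operators = [swap_adjacent_chars, insert_zero_width_space, duplicate_word]
    return [rng.choice(operators)(text, rng) for _ in range(n)]
```

In use, each mutant would be scored by the target detector; a mutation "succeeds" when the detector's label flips from machine-generated to human-written while the text stays readable.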
Related papers
- Detecting Machine-Generated Long-Form Content with Latent-Space Variables [54.07946647012579]
Existing zero-shot detectors primarily focus on token-level distributions, which are vulnerable to real-world domain shifts.
We propose a more robust method that incorporates abstract elements, such as event transitions, as key deciding factors to detect machine versus human texts.
arXiv Detail & Related papers (2024-10-04T18:42:09Z)
- Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack [24.954755569786396]
We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection.
We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness.
The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
arXiv Detail & Related papers (2024-04-02T12:49:22Z)
- Assaying on the Robustness of Zero-Shot Machine-Generated Text Detectors [57.7003399760813]
We explore advanced Large Language Models (LLMs) and their specialized variants, contributing to this field in several ways.
We uncover a significant correlation between topics and detection performance.
These investigations shed light on the adaptability and robustness of these detection methods across diverse topics.
arXiv Detail & Related papers (2023-12-20T10:53:53Z)
- Efficient Black-Box Adversarial Attacks on Neural Text Detectors [1.223779595809275]
We investigate three simple strategies for altering texts generated by GPT-3.5 that are inconspicuous or unnoticeable to humans but cause misclassification by neural text detectors.
The results show that parameter tweaking and character-level mutations, in particular, are effective strategies.
arXiv Detail & Related papers (2023-11-03T12:29:32Z)
- Can AI-Generated Text be Reliably Detected? [54.670136179857344]
Unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc.
Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques.
In this paper, we show that these detectors are not reliable in practical scenarios.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- On Decoding Strategies for Neural Text Generators [73.48162198041884]
We study the interaction between language generation tasks and decoding strategies.
We measure changes in attributes of generated text as a function of both decoding strategy and task.
Our results reveal both previously-observed and surprising findings.
arXiv Detail & Related papers (2022-03-29T16:25:30Z)
- Adversarial Robustness of Neural-Statistical Features in Detection of Generative Transformers [6.209131728799896]
We evaluate neural and non-neural approaches on their ability to detect computer-generated text.
We find that while statistical features underperform neural features, statistical features provide additional adversarial robustness.
We pioneer the usage of $\Delta$MAUVE as a proxy measure for human judgement of adversarial text quality (a minimal sketch of this measure follows this entry).
arXiv Detail & Related papers (2022-03-02T16:46:39Z)
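The following sketch illustrates a $\Delta$MAUVE-style quality proxy. It assumes the open-source mauve-text package (mauve.compute_mauve) and one plausible reading of the measure, namely the drop in MAUVE against a human reference after adversarial editing; the cited paper's exact definition and settings are not reproduced here.

```python
# Sketch of a Delta-MAUVE-style quality proxy, assuming the open-source
# mauve-text package (pip install mauve-text). Treating the measure as the
# drop in MAUVE against a human reference after adversarial editing is an
# assumption for illustration, not necessarily the cited paper's formulation.
import mauve


def delta_mauve(human_texts, machine_texts, adversarial_texts, device_id=0):
    """Return how much adversarial edits reduce distributional similarity
    to human text, as measured by MAUVE."""
    before = mauve.compute_mauve(p_text=human_texts, q_text=machine_texts,
                                 device_id=device_id, verbose=False).mauve
    after = mauve.compute_mauve(p_text=human_texts, q_text=adversarial_texts,
                                device_id=device_id, verbose=False).mauve
    return before - after  # a large positive value suggests degraded text quality
```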
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can access only the predicted label.
We propose a novel hard-label attack, the Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks regarding the attack performance as well as adversary quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Universal Adversarial Attacks with Natural Triggers for Text Classification [30.74579821832117]
We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems.
Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models.
arXiv Detail & Related papers (2020-05-01T01:58:24Z)
- Attacking Neural Text Detectors [0.0]
This paper presents two classes of black-box attacks on neural text detectors.
The homoglyph and misspelling attacks decrease a popular neural text detector's recall on neural text from 97.44% to 0.26% and 22.68%, respectively.
Results also indicate that the attacks are transferable to other neural text detectors.
arXiv Detail & Related papers (2020-02-19T04:18:45Z)
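For the homoglyph attack summarized in the entry above, here is a minimal, hypothetical sketch; the character mapping and replacement rate are illustrative assumptions, not the configuration used in that paper.

```python
# Minimal sketch of a homoglyph attack: replace a few Latin characters with
# visually similar Unicode code points so the text looks unchanged to readers
# but tokenizes differently for a detector. Mapping and rate are assumptions.
import random

HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "c": "\u0441",  # Cyrillic small es (looks like Latin c)
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er (looks like Latin p)
}


def homoglyph_attack(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap a small fraction of mapped characters for look-alike code points."""
    rng = random.Random(seed)
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in text
    )
```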