Attacking Neural Text Detectors
- URL: http://arxiv.org/abs/2002.11768v4
- Date: Wed, 19 Jan 2022 09:28:08 GMT
- Title: Attacking Neural Text Detectors
- Authors: Max Wolff, Stuart Wolff
- Abstract summary: This paper presents two classes of black-box attacks on neural text detectors.
The homoglyph and misspelling attacks decrease a popular neural text detector's recall on neural text from 97.44% to 0.26% and 22.68%, respectively.
Results also indicate that the attacks are transferable to other neural text detectors.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning based language models have recently made significant
progress, which introduces the danger that they could be used to spread
misinformation. To combat this
potential danger, several methods have been proposed for detecting text written
by these language models. This paper presents two classes of black-box attacks
on these detectors, one which randomly replaces characters with homoglyphs, and
the other a simple scheme to purposefully misspell words. The homoglyph and
misspelling attacks decrease a popular neural text detector's recall on neural
text from 97.44% to 0.26% and 22.68%, respectively. Results also indicate that
the attacks are transferable to other neural text detectors.
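Both attack classes are simple to implement. The following is a minimal, illustrative Python sketch of the two ideas described above: character-level homoglyph substitution and word-level misspelling. The homoglyph map, replacement rate, and misspelling dictionary are placeholder assumptions, not the paper's exact configuration.

```python
import random

# Illustrative homoglyph map: Latin letters mapped to visually similar
# Cyrillic code points (placeholder; the paper's exact set may differ).
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
}

def homoglyph_attack(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly replace a fraction of replaceable characters with homoglyphs."""
    rng = random.Random(seed)
    chars = list(text)
    candidates = [i for i, ch in enumerate(chars) if ch in HOMOGLYPHS]
    if not candidates:
        return text
    k = max(1, int(rate * len(candidates)))
    for i in rng.sample(candidates, k):
        chars[i] = HOMOGLYPHS[chars[i]]
    return "".join(chars)

def misspelling_attack(text: str, misspellings: dict) -> str:
    """Swap whole words for common misspellings (exact, whitespace-split match)."""
    return " ".join(misspellings.get(w, w) for w in text.split())

if __name__ == "__main__":
    sample = "The model was trained on a large corpus of internet text"
    print(homoglyph_attack(sample, rate=0.1))
    print(misspelling_attack(sample, {"trained": "trianed", "corpus": "corpas"}))
```

The intuition behind the homoglyph attack is that the altered text looks unchanged to a human reader but maps to different Unicode code points, and therefore to a different token sequence, than the text distributions the detector was trained on.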
Related papers
- Neural Fingerprints for Adversarial Attack Detection [2.7309692684728613]
A well known vulnerability of deep learning models is their susceptibility to adversarial examples.
Many algorithms have been proposed to address this problem, falling generally into one of two categories.
We argue that in a white-box setting, where the attacker knows the configuration and weights of the network and the detector, they can overcome the detector.
This problem is common in security applications where even a very good model is not sufficient to ensure safety.
arXiv Detail & Related papers (2024-11-07T08:43:42Z)
- Assaying on the Robustness of Zero-Shot Machine-Generated Text Detectors [57.7003399760813]
We explore advanced Large Language Models (LLMs) and their specialized variants, contributing to this field in several ways.
We uncover a significant correlation between topics and detection performance.
These investigations shed light on the adaptability and robustness of these detection methods across diverse topics.
arXiv Detail & Related papers (2023-12-20T10:53:53Z)
- Efficient Black-Box Adversarial Attacks on Neural Text Detectors [1.223779595809275]
We investigate three simple strategies for altering texts generated by GPT-3.5 so that the changes are inconspicuous to humans but cause misclassification by neural text detectors.
The results show that especially parameter tweaking and character-level mutations are effective strategies.
arXiv Detail & Related papers (2023-11-03T12:29:32Z)
- Smaller Language Models are Better Black-box Machine-Generated Text Detectors [56.36291277897995]
Small and partially-trained models are better universal text detectors than larger, fully trained ones.
We find that whether the detector and generator were trained on the same data is not critically important to the detection success.
For instance, the OPT-125M model has an AUC of 0.81 in detecting ChatGPT generations, whereas a larger model from the GPT family, GPTJ-6B, has an AUC of only 0.45 (see the sketch after this list for how recall and AUC are computed from detector scores).
arXiv Detail & Related papers (2023-05-17T00:09:08Z)
- Can AI-Generated Text be Reliably Detected? [54.670136179857344]
Unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc.
Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques.
In this paper, we show that these detectors are not reliable in practical scenarios.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Mutation-Based Adversarial Attacks on Neural Text Detectors [1.5101132008238316]
We propose character- and word-based mutation operators for generating adversarial samples to attack state-of-the-art neural text detectors.
In such attacks, attackers have access to the original text and create mutation instances based on this original text.
arXiv Detail & Related papers (2023-02-11T22:08:32Z)
- Putting words into the system's mouth: A targeted attack on neural machine translation using monolingual data poisoning [50.67997309717586]
We propose a poisoning attack in which a malicious adversary inserts a small poisoned sample of monolingual text into the training set of a system trained using back-translation.
This sample is designed to induce a specific, targeted translation behaviour, such as peddling misinformation.
We present two methods for crafting poisoned examples, and show that a tiny handful of instances, amounting to only 0.02% of the training set, is sufficient to enact a successful attack.
arXiv Detail & Related papers (2021-07-12T08:07:09Z)
- Hidden Backdoors in Human-Centric Language Models [12.694861859949585]
We create covert and natural triggers for textual backdoor attacks.
We deploy our hidden backdoors through two state-of-the-art trigger embedding methods.
We demonstrate that the proposed hidden backdoors can be effective across three downstream security-critical NLP tasks.
arXiv Detail & Related papers (2021-05-01T04:41:00Z)
- Anomaly Detection-Based Unknown Face Presentation Attack Detection [74.4918294453537]
Anomaly detection-based spoof attack detection is a recent development in face Presentation Attack Detection.
In this paper, we present a deep-learning solution for anomaly detection-based spoof attack detection.
The proposed approach benefits from the representation learning power of CNNs and learns better features for the fPAD task.
arXiv Detail & Related papers (2020-07-11T21:20:55Z)
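The quantitative claims quoted above (the recall drop from 97.44% to 0.26% under the homoglyph attack, and the AUC comparison between OPT-125M and GPTJ-6B) are standard detection metrics. The sketch below shows one way such numbers are computed from detector scores; the scores, labels, and the 0.5 decision threshold are hypothetical and are not taken from any of the papers.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical detector scores: higher means "more likely machine-generated".
# Labels: 1 = machine-generated text, 0 = human-written text.
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores_clean    = np.array([0.95, 0.90, 0.88, 0.97, 0.10, 0.20, 0.15, 0.30])
scores_attacked = np.array([0.05, 0.12, 0.40, 0.08, 0.10, 0.20, 0.15, 0.30])

def recall_on_machine_text(scores, labels, threshold=0.5):
    """Fraction of machine-generated samples the detector still flags."""
    machine = labels == 1
    return float((scores[machine] >= threshold).mean())

print("recall (clean):   ", recall_on_machine_text(scores_clean, labels))
print("recall (attacked):", recall_on_machine_text(scores_attacked, labels))
print("AUC (clean):      ", roc_auc_score(labels, scores_clean))
print("AUC (attacked):   ", roc_auc_score(labels, scores_attacked))
```

Recall depends on a fixed decision threshold, which is why an attack that pushes detector scores just below that threshold can collapse recall, while AUC summarizes ranking quality across all thresholds.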