Language Models Optimized to Fool Detectors Still Have a Distinct Style (And How to Change It)
- URL: http://arxiv.org/abs/2505.14608v1
- Date: Tue, 20 May 2025 16:55:44 GMT
- Title: Language Models Optimized to Fool Detectors Still Have a Distinct Style (And How to Change It)
- Authors: Rafael Rivera Soto, Barry Chen, Nicholas Andrews
- Abstract summary: We examine the ease with which language models can be optimized to degrade the performance of machine-text detectors. We show that even when models are optimized against stylistic detectors, detection performance remains surprisingly unaffected. We explore a new approach that simultaneously aims to close the gap between human writing and machine writing in stylistic feature space while avoiding detection using traditional features.
- Score: 4.148732457277201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite considerable progress in the development of machine-text detectors, it has been suggested that the problem is inherently hard, and therefore, that stakeholders should proceed under the assumption that machine-generated text cannot be reliably detected as such. We examine a recent such claim by Nicks et al. (2024) regarding the ease with which language models can be optimized to degrade the performance of machine-text detectors, including detectors not specifically optimized against. We identify a feature space, the stylistic feature space, that is robust to such optimization, and show that it may be used to reliably detect samples from language models optimized to prevent detection. Furthermore, we show that even when models are explicitly optimized against stylistic detectors, detection performance remains surprisingly unaffected. We then seek to understand if stylistic detectors are inherently more robust. To study this question, we explore a new paraphrasing approach that simultaneously aims to close the gap between human writing and machine writing in stylistic feature space while avoiding detection using traditional features. We show that when only a single sample is available for detection, this attack is universally effective across all detectors considered, including those that use writing style. However, as the number of samples available for detection grows, the human and machine distributions become distinguishable. This observation encourages us to introduce AURA, a metric that estimates the overlap between human and machine-generated distributions by analyzing how detector performance improves as more samples become available. Overall, our findings underscore previous recommendations to avoid reliance on machine-text detection.
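The abstract's observation that detection improves as more samples become available can be illustrated with a small simulation. The following is a minimal sketch and not the paper's method or its definition of AURA, which this page does not give: it assumes hypothetical per-document detector scores drawn from unit-variance Gaussians whose means differ by a small `shift`, pools `k` scores by averaging, and measures AUROC for each `k`. The function names, the Gaussian score model, and all parameter values are assumptions made for illustration.

```python
import random
from statistics import fmean


def auroc(pos, neg):
    """Chance that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pooled_scores(rng, mean, k, trials):
    """Average k hypothetical detector scores, each drawn from N(mean, 1)."""
    return [fmean([rng.gauss(mean, 1.0) for _ in range(k)]) for _ in range(trials)]


def auroc_by_sample_count(shift, ks, trials=300, seed=0):
    """AUROC of a mean-pooling detector for each number of pooled samples k."""
    rng = random.Random(seed)
    curve = {}
    for k in ks:
        machine = pooled_scores(rng, shift, k, trials)  # assumed machine-text scores
        human = pooled_scores(rng, 0.0, k, trials)      # assumed human-text scores
        curve[k] = auroc(machine, human)
    return curve


# A small per-document gap (shift=0.3) is hard to see with one sample
# but becomes easier to detect as more samples are pooled.
curve = auroc_by_sample_count(shift=0.3, ks=[1, 4, 16, 64])
for k, value in curve.items():
    print(f"k={k:>2}  AUROC={value:.3f}")
```

Under these assumptions, a nearly overlapping pair of distributions yields an AUROC close to chance at k=1 and a clearly higher one at large k. How quickly the curve rises reflects how far apart the two distributions are, which is the kind of signal the abstract says AURA analyzes.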
Related papers
- Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors [4.7713095161046555]
We present a pipeline to test the resilience of state-of-the-art MGT detectors to linguistically informed adversarial attacks. We fine-tune language models to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect.
arXiv Detail & Related papers (2025-05-30T12:33:30Z)
- TempTest: Local Normalization Distortion and the Detection of Machine-generated Text [0.0]
We introduce a method for detecting machine-generated text that is entirely agnostic of the generating language model. This is achieved by targeting a defect in the way that decoding strategies, such as temperature or top-k sampling, normalize conditional probability measures. We evaluate our detector in the white and black box settings across various language models, datasets, and passage lengths.
arXiv Detail & Related papers (2025-03-26T10:56:59Z)
- ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability [62.285407189502216]
Detecting texts generated by Large Language Models (LLMs) could cause grave mistakes due to incorrect decisions. We introduce ExaGPT, an interpretable detection approach grounded in the human decision-making process. We show that ExaGPT massively outperforms prior powerful detectors by up to +40.9 points of accuracy at a false positive rate of 1%.
arXiv Detail & Related papers (2025-02-17T01:15:07Z)
- A Practical Examination of AI-Generated Text Detectors for Large Language Models [25.919278893876193]
Machine-generated content detectors claim to identify such text under various conditions and from any language model. This paper critically evaluates these claims by assessing several popular detectors on a range of domains, datasets, and models that these detectors have not previously encountered.
arXiv Detail & Related papers (2024-12-06T15:56:11Z)
- Assaying on the Robustness of Zero-Shot Machine-Generated Text Detectors [57.7003399760813]
We explore advanced Large Language Models (LLMs) and their specialized variants, contributing to this field in several ways.
We uncover a significant correlation between topics and detection performance.
These investigations shed light on the adaptability and robustness of these detection methods across diverse topics.
arXiv Detail & Related papers (2023-12-20T10:53:53Z)
- Smaller Language Models are Better Black-box Machine-Generated Text Detectors [56.36291277897995]
Small and partially-trained models are better universal text detectors.
We find that whether the detector and generator were trained on the same data is not critically important to the detection success.
For instance, the OPT-125M model has an AUC of 0.81 in detecting ChatGPT generations, whereas a larger model from the GPT family, GPTJ-6B, has an AUC of 0.45.
arXiv Detail & Related papers (2023-05-17T00:09:08Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approximates human-like quality, the sample size needed for detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- Can AI-Generated Text be Reliably Detected? [50.95804851595018]
Large Language Models (LLMs) perform impressively well in various applications. The potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concern about their responsible use. We stress-test the robustness of these AI text detectors in the presence of an attacker.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
- TextShield: Beyond Successfully Detecting Adversarial Sentences in Text Classification [6.781100829062443]
Adversarial attack serves as a major challenge for neural network models in NLP, which precludes the model's deployment in safety-critical applications.
Previous detection methods are incapable of giving correct predictions on adversarial sentences.
We propose a saliency-based detector, which can effectively detect whether an input sentence is adversarial or not.
arXiv Detail & Related papers (2023-02-03T22:58:07Z)
- Adversarially Robust One-class Novelty Detection [83.1570537254877]
We show that existing novelty detectors are susceptible to adversarial examples.
We propose a defense strategy that manipulates the latent space of novelty detectors to improve the robustness against adversarial examples.
arXiv Detail & Related papers (2021-08-25T10:41:29Z)
- Detection of Adversarial Supports in Few-shot Classifiers Using Feature Preserving Autoencoders and Self-Similarity [89.26308254637702]
We propose a detection strategy to highlight adversarial support sets.
We make use of feature preserving autoencoder filtering and also the concept of self-similarity of a support set to perform this detection.
Our method is attack-agnostic and also the first to explore detection for few-shot classifiers to the best of our knowledge.
arXiv Detail & Related papers (2020-12-09T14:13:41Z)