Red Teaming Language Model Detectors with Language Models
- URL: http://arxiv.org/abs/2305.19713v2
- Date: Thu, 19 Oct 2023 05:56:52 GMT
- Title: Red Teaming Language Model Detectors with Language Models
- Authors: Zhouxing Shi, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang,
Cho-Jui Hsieh
- Abstract summary: Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
- Score: 114.36392560711022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The prevalence and strong capability of large language models (LLMs) present
significant safety and ethical risks if exploited by malicious users. To
prevent the potentially deceptive usage of LLMs, recent works have proposed
algorithms to detect LLM-generated text and protect LLMs. In this paper, we
investigate the robustness and reliability of these LLM detectors under
adversarial attacks. We study two types of attack strategies: 1) replacing
certain words in an LLM's output with their synonyms given the context; 2)
automatically searching for an instructional prompt to alter the writing style
of the generation. In both strategies, we leverage an auxiliary LLM to generate
the word replacements or the instructional prompt. Different from previous
works, we consider a challenging setting where the auxiliary LLM can also be
protected by a detector. Experiments reveal that our attacks effectively
compromise the performance of all detectors in the study with plausible
generations, underscoring the urgent need to improve the robustness of
LLM-generated text detection systems.
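As an illustration of the first attack strategy, the sketch below performs context-aware synonym replacement on a piece of LLM output. It is a minimal sketch under stated assumptions, not the paper's implementation: the auxiliary-LLM call is stubbed with a toy synonym table, and names such as stub_auxiliary_llm and replace_prob are illustrative only.

```python
# Minimal sketch (not the paper's code) of attack strategy 1: replacing words
# in an LLM's output with context-aware synonyms. The auxiliary LLM is stubbed
# with a toy synonym table; all names here are illustrative.
import random
from typing import Callable, List


def stub_auxiliary_llm(word: str, context: str) -> List[str]:
    """Stand-in for an auxiliary LLM that proposes synonyms fitting the context."""
    toy_synonyms = {
        "significant": ["considerable", "substantial"],
        "detect": ["identify", "flag"],
        "generate": ["produce", "compose"],
    }
    return toy_synonyms.get(word.lower().strip(".,"), [])


def synonym_replacement_attack(
    text: str,
    propose_synonyms: Callable[[str, str], List[str]] = stub_auxiliary_llm,
    replace_prob: float = 0.3,
    seed: int = 0,
) -> str:
    """Randomly replace a fraction of words with proposed in-context synonyms."""
    rng = random.Random(seed)
    words = text.split()
    for i, word in enumerate(words):
        if rng.random() > replace_prob:
            continue
        context = " ".join(words[max(0, i - 5): i + 6])  # local context window
        candidates = propose_synonyms(word, context)
        if candidates:
            words[i] = rng.choice(candidates)
    return " ".join(words)


if __name__ == "__main__":
    llm_output = "Detectors detect LLM-generated text with significant accuracy."
    print(synonym_replacement_attack(llm_output, replace_prob=1.0))
```

Note that the paper's harder setting, where the auxiliary LLM is itself protected by a detector, and the second strategy, which searches for an instructional prompt that shifts the generation's style, are not captured by this stub.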
Related papers
- "I know myself better, but not really greatly": Using LLMs to Detect and Explain LLM-Generated Texts [10.454446545249096]
Large language models (LLMs) have demonstrated impressive capabilities in generating human-like texts.
This paper explores the detection and explanation capabilities of LLM-based detectors of LLM-generated texts.
arXiv Detail & Related papers (2025-02-18T11:00:28Z)
- Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context [49.13497493053742]
Previous research on LLM vulnerabilities often relied on nonsensical adversarial prompts, which were easily detectable by automated methods.
We address this gap by focusing on human-readable adversarial prompts, a more realistic and potent threat.
Our key contributions are situation-driven attacks leveraging movie scripts to create contextually relevant, human-readable prompts that successfully deceive LLMs.
arXiv Detail & Related papers (2024-12-20T21:43:52Z)
- What You See Is Not Always What You Get: An Empirical Study of Code Comprehension by Large Language Models [0.5735035463793009]
We investigate the vulnerability of large language models (LLMs) to imperceptible attacks, where hidden character manipulation in source code misleads LLMs' behaviour while remaining undetectable to human reviewers.
These attacks include code reordering, invisible coding characters, code deletions, and code homoglyphs.
Our findings confirm the susceptibility of LLMs to imperceptible coding character attacks, with different LLMs exhibiting different negative correlations between perturbation magnitude and performance.
arXiv Detail & Related papers (2024-12-11T04:52:41Z)
- Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement [51.601916604301685]
Large language models (LLMs) generate content that can undermine trust in online discourse.
Current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-LLM collaboration.
To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content.
arXiv Detail & Related papers (2024-10-18T08:14:10Z)
- RAFT: Realistic Attacks to Fool Text Detectors [16.749257564123194]
Large language models (LLMs) have exhibited remarkable fluency across various tasks.
Their unethical applications, such as disseminating disinformation, have become a growing concern.
We present RAFT: a grammar error-free black-box attack against existing LLM detectors.
arXiv Detail & Related papers (2024-10-04T17:59:00Z)
- ReMoDetect: Reward Models Recognize Aligned LLM's Generations [55.06804460642062]
Aligned large language models (LLMs) generate texts that humans find preferable.
In this paper, we identify the common characteristics shared by these models.
We propose two training schemes to further improve the detection ability of the reward model.
arXiv Detail & Related papers (2024-05-27T17:38:33Z)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
- Detecting LLM-Assisted Writing in Scientific Communication: Are We There Yet? [2.894383634912343]
Large Language Models (LLMs) have significantly reshaped text generation, particularly in the realm of writing assistance.
A potential avenue to encourage accurate acknowledgement of LLM-assisted writing involves employing automated detectors.
Our evaluation of four cutting-edge LLM-generated text detectors reveals their suboptimal performance compared to a simple ad-hoc detector.
arXiv Detail & Related papers (2024-01-30T08:07:28Z)
- A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions [39.36381851190369]
There is an imperative need to develop detectors that can detect LLM-generated text.
This is crucial to mitigate potential misuse of LLMs and to safeguard realms like artistic expression and social networks from the harmful influence of LLM-generated content.
Detection techniques have advanced notably in recent years, propelled by innovations in watermarking, statistics-based detectors, neural-based detectors, and human-assisted methods.
arXiv Detail & Related papers (2023-10-23T09:01:13Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations more fully reveal a language model's proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.