Red Teaming Language Model Detectors with Language Models
- URL: http://arxiv.org/abs/2305.19713v2
- Date: Thu, 19 Oct 2023 05:56:52 GMT
- Title: Red Teaming Language Model Detectors with Language Models
- Authors: Zhouxing Shi, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang,
Cho-Jui Hsieh
- Abstract summary: Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
- Score: 114.36392560711022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The prevalence and strong capability of large language models (LLMs) present
significant safety and ethical risks if exploited by malicious users. To
prevent the potentially deceptive usage of LLMs, recent works have proposed
algorithms to detect LLM-generated text and protect LLMs. In this paper, we
investigate the robustness and reliability of these LLM detectors under
adversarial attacks. We study two types of attack strategies: 1) replacing
certain words in an LLM's output with their synonyms given the context; 2)
automatically searching for an instructional prompt to alter the writing style
of the generation. In both strategies, we leverage an auxiliary LLM to generate
the word replacements or the instructional prompt. Different from previous
works, we consider a challenging setting where the auxiliary LLM can also be
protected by a detector. Experiments reveal that our attacks effectively
compromise the performance of all detectors in the study with plausible
generations, underscoring the urgent need to improve the robustness of
LLM-generated text detection systems.
Related papers
- ReMoDetect: Reward Models Recognize Aligned LLM's Generations [55.06804460642062]
Large language models (LLMs) generate human-preferable texts.
We propose two training schemes to further improve the detection ability of the reward model.
arXiv Detail & Related papers (2024-05-27T17:38:33Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - LLM-Detector: Improving AI-Generated Chinese Text Detection with
Open-Source LLM Instruction Tuning [4.328134379418151]
Existing AI-generated text detection models are prone to in-domain over-fitting.
We propose LLM-Detector, a novel method for both document-level and sentence-level text detection.
arXiv Detail & Related papers (2024-02-02T05:54:12Z) - Detecting LLM-Assisted Writing in Scientific Communication: Are We There Yet? [2.894383634912343]
Large Language Models (LLMs) have significantly reshaped text generation, particularly in the realm of writing assistance.
A potential avenue to encourage accurate acknowledging of LLM-assisted writing involves employing automated detectors.
Our evaluation of four cutting-edge LLM-generated text detectors reveals their suboptimal performance compared to a simple ad-hoc detector.
arXiv Detail & Related papers (2024-01-30T08:07:28Z) - Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models [63.91178922306669]
We introduce Silent Guardian, a text protection mechanism against large language models (LLMs)
By carefully modifying the text to be protected, TPE can induce LLMs to first sample the end token, thus directly terminating the interaction.
We show that SG can effectively protect the target text under various configurations and achieve almost 100% protection success rate in some cases.
arXiv Detail & Related papers (2023-12-15T10:30:36Z) - Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method [36.24876571343749]
Large Language Models (LLMs) have shown great potential in Natural Language Processing (NLP) tasks.
Recent literature reveals that LLMs generate nonfactual responses intermittently.
We propose a novel self-detection method to detect which questions that a LLM does not know that are prone to generate nonfactual results.
arXiv Detail & Related papers (2023-10-27T06:22:14Z) - A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions [39.36381851190369]
There is an imperative need to develop detectors that can detect LLM-generated text.
This is crucial to mitigate potential misuse of LLMs and safeguard realms like artistic expression and social networks from harmful influence of LLM-generated content.
The detector techniques have witnessed notable advancements recently, propelled by innovations in watermarking techniques, statistics-based detectors, neural-base detectors, and human-assisted methods.
arXiv Detail & Related papers (2023-10-23T09:01:13Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with
Adversarially Generated Examples [44.118047780553006]
OUTFOX is a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output.
Experiments show that the proposed detector improves the detection performance on the attacker-generated texts by up to +41.3 points F1-score.
The detector shows a state-of-the-art detection performance: up to 96.9 points F1-score, beating existing detectors on non-attacked texts.
arXiv Detail & Related papers (2023-07-21T17:40:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.