OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with
Adversarially Generated Examples
- URL: http://arxiv.org/abs/2307.11729v3
- Date: Sun, 18 Feb 2024 12:25:13 GMT
- Title: OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with
Adversarially Generated Examples
- Authors: Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki
- Abstract summary: OUTFOX is a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output.
Experiments show that the proposed detector improves the detection performance on the attacker-generated texts by up to +41.3 points F1-score.
The detector shows a state-of-the-art detection performance: up to 96.9 points F1-score, beating existing detectors on non-attacked texts.
- Score: 44.118047780553006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved human-level fluency in text
generation, making it difficult to distinguish between human-written and
LLM-generated texts. This poses a growing risk of misuse of LLMs and demands
the development of detectors to identify LLM-generated texts. However, existing
detectors lack robustness against attacks: they degrade detection accuracy by
simply paraphrasing LLM-generated texts. Furthermore, a malicious user might
attempt to deliberately evade the detectors based on detection results, but
this has not been assumed in previous studies. In this paper, we propose
OUTFOX, a framework that improves the robustness of LLM-generated-text
detectors by allowing both the detector and the attacker to consider each
other's output. In this framework, the attacker uses the detector's prediction
labels as examples for in-context learning and adversarially generates essays
that are harder to detect, while the detector uses the adversarially generated
essays as examples for in-context learning to learn to detect essays from a
strong attacker. Experiments in the domain of student essays show that the
proposed detector improves the detection performance on the attacker-generated
texts by up to +41.3 points F1-score. Furthermore, the proposed detector shows
a state-of-the-art detection performance: up to 96.9 points F1-score, beating
existing detectors on non-attacked texts. Finally, the proposed attacker
drastically degrades the performance of detectors by up to -57.0 points
F1-score, massively outperforming the baseline paraphrasing method for evading
detection.
Related papers
- Evading AI-Generated Content Detectors using Homoglyphs [0.0]
Homoglyph-based attacks that can be used to circumvent existing LLM detectors are presented.
A comprehensive evaluation is conducted to assess the effectiveness of homoglyphs on state-of-the-art LLM detectors.
arXiv Detail & Related papers (2024-06-17T06:07:32Z) - Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore [51.65730053591696]
We propose a simple but effective black-box zero-shot detection approach.
It is predicated on the observation that human-written texts typically contain more grammatical errors than LLM-generated texts.
Our method achieves an average AUROC of 98.7% and shows strong robustness against paraphrase and adversarial perturbation attacks.
arXiv Detail & Related papers (2024-05-07T12:57:01Z) - Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack [24.954755569786396]
We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection.
We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness.
The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
arXiv Detail & Related papers (2024-04-02T12:49:22Z) - Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated
Student Essay Detection [29.433764586753956]
Large language models (LLMs) have exhibited remarkable capabilities in text generation tasks.
The utilization of these models carries inherent risks, including but not limited to plagiarism, the dissemination of fake news, and issues in educational exercises.
This paper aims to bridge this gap by constructing AIG-ASAP, an AI-generated student essay dataset.
arXiv Detail & Related papers (2024-02-01T08:11:56Z) - Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z) - MAGE: Machine-generated Text Detection in the Wild [82.70561073277801]
Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection.
We build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs.
Despite challenges, the top-performing detector can identify 86.54% out-of-domain texts generated by a new LLM, indicating the feasibility for application scenarios.
arXiv Detail & Related papers (2023-05-22T17:13:29Z) - MGTBench: Benchmarking Machine-Generated Text Detection [54.81446366272403]
This paper proposes the first benchmark framework for MGT detection against powerful large language models (LLMs)
We show that a larger number of words in general leads to better performance and most detection methods can achieve similar performance with much fewer training samples.
Our findings indicate that the model-based detection methods still perform well in the text attribution task.
arXiv Detail & Related papers (2023-03-26T21:12:36Z) - Can AI-Generated Text be Reliably Detected? [54.670136179857344]
Unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc.
Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques.
In this paper, we show that these detectors are not reliable in practical scenarios.
arXiv Detail & Related papers (2023-03-17T17:53:19Z) - TextShield: Beyond Successfully Detecting Adversarial Sentences in Text
Classification [6.781100829062443]
Adversarial attack serves as a major challenge for neural network models in NLP, which precludes the model's deployment in safety-critical applications.
Previous detection methods are incapable of giving correct predictions on adversarial sentences.
We propose a saliency-based detector, which can effectively detect whether an input sentence is adversarial or not.
arXiv Detail & Related papers (2023-02-03T22:58:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.