OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with
Adversarially Generated Examples
- URL: http://arxiv.org/abs/2307.11729v3
- Date: Sun, 18 Feb 2024 12:25:13 GMT
- Title: OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with
Adversarially Generated Examples
- Authors: Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki
- Abstract summary: OUTFOX is a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output.
Experiments show that the proposed detector improves the detection performance on the attacker-generated texts by up to +41.3 points F1-score.
The detector shows state-of-the-art detection performance: up to 96.9 points F1-score, beating existing detectors on non-attacked texts.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved human-level fluency in text
generation, making it difficult to distinguish between human-written and
LLM-generated texts. This poses a growing risk of misuse of LLMs and demands
the development of detectors to identify LLM-generated texts. However, existing
detectors lack robustness against attacks: their detection accuracy drops when
LLM-generated texts are simply paraphrased. Furthermore, a malicious user might
deliberately adapt to detection results in order to evade the detectors, a
scenario that previous studies have not considered. In this paper, we propose
OUTFOX, a framework that improves the robustness of LLM-generated-text
detectors by allowing both the detector and the attacker to consider each
other's output. In this framework, the attacker uses the detector's prediction
labels as examples for in-context learning and adversarially generates essays
that are harder to detect, while the detector uses the adversarially generated
essays as examples for in-context learning to learn to detect essays from a
strong attacker. Experiments in the domain of student essays show that the
proposed detector improves the detection performance on the attacker-generated
texts by up to +41.3 points F1-score. Furthermore, the proposed detector
achieves state-of-the-art detection performance: up to 96.9 points F1-score,
beating existing detectors on non-attacked texts. Finally, the proposed attacker
drastically degrades the performance of detectors by up to -57.0 points
F1-score, massively outperforming the baseline paraphrasing method for evading
detection.
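The abstract describes the detector-attacker interaction only in prose, so the following is a minimal sketch of one round of that loop as it reads above. It is not the paper's implementation: call_llm is a hypothetical stand-in for any instruction-following LLM API, and the prompt wording, label strings, and data structures are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of one OUTFOX-style round in which the
# attacker and the detector each use the other's output as in-context examples.
# call_llm, the prompt wording, and the label strings are assumptions.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an instruction-following LLM API call."""
    raise NotImplementedError

def detect(essay: str, labeled_examples: list[tuple[str, str]]) -> str:
    """In-context detector: classify an essay given labeled example essays."""
    shots = "\n\n".join(f"Essay: {e}\nLabel: {l}" for e, l in labeled_examples)
    prompt = (f"{shots}\n\nEssay: {essay}\n"
              "Label (Human-written or LLM-generated):")
    return call_llm(prompt).strip()

def attack(problem: str, feedback: list[tuple[str, str, str]]) -> str:
    """In-context attacker: write an essay that is harder to detect, using the
    detector's labels on earlier generations as examples."""
    shots = "\n\n".join(f"Problem: {p}\nEssay: {e}\nDetector label: {label}"
                        for p, e, label in feedback)
    prompt = (f"{shots}\n\nProblem: {problem}\n"
              "Write an essay likely to be labeled Human-written:")
    return call_llm(prompt)

def one_round(problems, human_essays, feedback):
    """One attacker/detector round; feedback holds (problem, essay,
    detector_label) triples from earlier rounds."""
    # Attacker generates new essays conditioned on the detector's past labels.
    adversarial = [(p, attack(p, feedback)) for p in problems]
    # Detector learns in context from human essays plus previously seen
    # adversarial essays.
    examples = ([(e, "Human-written") for e in human_essays] +
                [(e, "LLM-generated") for _, e, _ in feedback])
    # The detector's labels on the new essays become the attacker's feedback.
    new_feedback = [(p, e, detect(e, examples)) for p, e in adversarial]
    return feedback + new_feedback
```

Note the asymmetry stated in the abstract: the attacker sees only the detector's prediction labels, while the detector sees the attacker's full adversarial essays as labeled in-context examples.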
Related papers
- DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios [38.952481877244644]
We present a new benchmark, DetectRL, which highlights that even state-of-the-art (SOTA) detection techniques still underperform in real-world detection scenarios.
Our development of DetectRL reveals the strengths and limitations of current SOTA detectors.
We believe DetectRL could serve as an effective benchmark for assessing detectors in real-world scenarios.
arXiv Detail & Related papers (2024-10-31T09:01:25Z)
- Unveiling Large Language Models Generated Texts: A Multi-Level Fine-Grained Detection Framework [9.976099891796784]
Large language models (LLMs) have transformed human writing by enhancing grammar correction, content expansion, and stylistic refinement.
Existing detection methods, which mainly rely on single-feature analysis and binary classification, often fail to effectively identify LLM-generated text in academic contexts.
We propose a novel Multi-level Fine-grained Detection framework that detects LLM-generated text by integrating low-level structural, high-level semantic, and deep-level linguistic features.
arXiv Detail & Related papers (2024-10-18T07:25:00Z)
- RAFT: Realistic Attacks to Fool Text Detectors [16.749257564123194]
Large language models (LLMs) have exhibited remarkable fluency across various tasks.
Their unethical applications, such as disseminating disinformation, have become a growing concern.
We present RAFT: a grammar-error-free black-box attack against existing LLM detectors.
arXiv Detail & Related papers (2024-10-04T17:59:00Z)
- Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore [51.65730053591696]
We propose a simple but effective black-box zero-shot detection approach.
It is predicated on the observation that human-written texts typically contain more grammatical errors than LLM-generated texts.
Our method achieves an average AUROC of 98.7% and shows strong robustness against paraphrasing and adversarial perturbation attacks (a rough sketch of this heuristic appears after the list below).
arXiv Detail & Related papers (2024-05-07T12:57:01Z)
- Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
- MAGE: Machine-generated Text Detection in the Wild [82.70561073277801]
Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection.
We build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs.
Despite challenges, the top-performing detector can identify 86.54% of out-of-domain texts generated by a new LLM, indicating feasibility in real-world application scenarios.
arXiv Detail & Related papers (2023-05-22T17:13:29Z)
- MGTBench: Benchmarking Machine-Generated Text Detection [54.81446366272403]
This paper proposes the first benchmark framework for MGT detection against powerful large language models (LLMs).
We show that a larger number of words generally leads to better detection performance, and that most detection methods can achieve similar performance with far fewer training samples.
Our findings indicate that the model-based detection methods still perform well in the text attribution task.
arXiv Detail & Related papers (2023-03-26T21:12:36Z)
- Can AI-Generated Text be Reliably Detected? [54.670136179857344]
Unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc.
Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques.
In this paper, we show that these detectors are not reliable in practical scenarios.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
- TextShield: Beyond Successfully Detecting Adversarial Sentences in Text Classification [6.781100829062443]
Adversarial attacks pose a major challenge for neural network models in NLP and preclude their deployment in safety-critical applications.
Previous detection methods are incapable of giving correct predictions on adversarial sentences.
We propose a saliency-based detector, which can effectively detect whether an input sentence is adversarial or not.
arXiv Detail & Related papers (2023-02-03T22:58:07Z)
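As flagged in the GECScore entry above ("Who Wrote This?"), the sketch below is a rough illustration of that entry's underlying heuristic, not the paper's actual GECScore formulation: the language_tool_python dependency, the per-word error rate, and the decision threshold are all assumptions made for this example.

```python
# Rough illustration (not the paper's GECScore) of the observation cited in the
# GECScore entry above: human-written text tends to contain more grammatical
# errors than LLM-generated text. The library choice, per-word normalization,
# and threshold are assumptions made for this sketch.

import language_tool_python  # assumes the package (and a Java runtime) is available

_tool = language_tool_python.LanguageTool("en-US")

def grammar_error_rate(text: str) -> float:
    """Grammar issues per word, as flagged by LanguageTool."""
    n_words = max(len(text.split()), 1)
    return len(_tool.check(text)) / n_words

def looks_llm_generated(text: str, threshold: float = 0.01) -> bool:
    """Fewer flagged errors per word suggests LLM-generated text.
    The threshold is illustrative; a real detector would calibrate it."""
    return grammar_error_rate(text) < threshold
```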