Adversarial Attacks on AI-Generated Text Detection Models: A Token Probability-Based Approach Using Embeddings
- URL: http://arxiv.org/abs/2501.18998v1
- Date: Fri, 31 Jan 2025 10:06:27 GMT
- Title: Adversarial Attacks on AI-Generated Text Detection Models: A Token Probability-Based Approach Using Embeddings
- Authors: Ahmed K. Kadhim, Lei Jiao, Rishad Shafik, Ole-Christoffer Granmo,
- Abstract summary: This work proposes a novel textual adversarial attack on the detection models such as Fast-DetectGPT.
The method employs embedding models for data perturbation, aiming at reconstructing the AI generated texts to reduce the likelihood of detection of the true origin of the texts.
- Score: 14.150011713654331
- License:
- Abstract: In recent years, text generation tools utilizing Artificial Intelligence (AI) have occasionally been misused across various domains, such as generating student reports or creative writings. This issue prompts plagiarism detection services to enhance their capabilities in identifying AI-generated content. Adversarial attacks are often used to test the robustness of AI-text generated detectors. This work proposes a novel textual adversarial attack on the detection models such as Fast-DetectGPT. The method employs embedding models for data perturbation, aiming at reconstructing the AI generated texts to reduce the likelihood of detection of the true origin of the texts. Specifically, we employ different embedding techniques, including the Tsetlin Machine (TM), an interpretable approach in machine learning for this purpose. By combining synonyms and embedding similarity vectors, we demonstrates the state-of-the-art reduction in detection scores against Fast-DetectGPT. Particularly, in the XSum dataset, the detection score decreased from 0.4431 to 0.2744 AUROC, and in the SQuAD dataset, it dropped from 0.5068 to 0.3532 AUROC.
Related papers
- Are AI-Generated Text Detectors Robust to Adversarial Perturbations? [9.001160538237372]
Current detectors for AI-generated text (AIGT) lack robustness against adversarial perturbations.
This paper investigates the robustness of existing AIGT detection methods and introduces a novel detector, the Siamese Calibrated Reconstruction Network (SCRN)
The SCRN employs a reconstruction network to add and remove noise from text, extracting a semantic representation that is robust to local perturbations.
arXiv Detail & Related papers (2024-06-03T10:21:48Z) - Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack [24.954755569786396]
We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection.
We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness.
The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
arXiv Detail & Related papers (2024-04-02T12:49:22Z) - Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated
Student Essay Detection [29.433764586753956]
Large language models (LLMs) have exhibited remarkable capabilities in text generation tasks.
The utilization of these models carries inherent risks, including but not limited to plagiarism, the dissemination of fake news, and issues in educational exercises.
This paper aims to bridge this gap by constructing AIG-ASAP, an AI-generated student essay dataset.
arXiv Detail & Related papers (2024-02-01T08:11:56Z) - Assaying on the Robustness of Zero-Shot Machine-Generated Text Detectors [57.7003399760813]
We explore advanced Large Language Models (LLMs) and their specialized variants, contributing to this field in several ways.
We uncover a significant correlation between topics and detection performance.
These investigations shed light on the adaptability and robustness of these detection methods across diverse topics.
arXiv Detail & Related papers (2023-12-20T10:53:53Z) - Towards Possibilities & Impossibilities of AI-generated Text Detection:
A Survey [97.33926242130732]
Large Language Models (LLMs) have revolutionized the domain of natural language processing (NLP) with remarkable capabilities of generating human-like text responses.
Despite these advancements, several works in the existing literature have raised serious concerns about the potential misuse of LLMs.
To address these concerns, a consensus among the research community is to develop algorithmic solutions to detect AI-generated text.
arXiv Detail & Related papers (2023-10-23T18:11:32Z) - Watermarking Conditional Text Generation for AI Detection: Unveiling
Challenges and a Semantic-Aware Watermark Remedy [52.765898203824975]
We introduce a semantic-aware watermarking algorithm that considers the characteristics of conditional text generation and the input context.
Experimental results demonstrate that our proposed method yields substantial improvements across various text generation models.
arXiv Detail & Related papers (2023-07-25T20:24:22Z) - On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approximates human-like quality, the sample size needed for detection bounds increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including oBERTa-Large/Base-Detector, GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z) - Paraphrasing evades detectors of AI-generated text, but retrieval is an
effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z) - Can AI-Generated Text be Reliably Detected? [50.95804851595018]
Large Language Models (LLMs) perform impressively well in various applications.
The potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concern about their responsible use.
We stress-test the robustness of these AI text detectors in the presence of an attacker.
arXiv Detail & Related papers (2023-03-17T17:53:19Z) - Mutation-Based Adversarial Attacks on Neural Text Detectors [1.5101132008238316]
We propose character- and word-based mutation operators for generating adversarial samples to attack state-of-the-art natural text detectors.
In such attacks, attackers have access to the original text and create mutation instances based on this original text.
arXiv Detail & Related papers (2023-02-11T22:08:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.