Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context
- URL: http://arxiv.org/abs/2412.16359v1
- Date: Fri, 20 Dec 2024 21:43:52 GMT
- Title: Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context
- Authors: Nilanjana Das, Edward Raff, Manas Gaur,
- Abstract summary: Previous research on LLM vulnerabilities often relied on nonsensical adversarial prompts, which were easily detectable by automated methods.
We address this gap by focusing on human-readable adversarial prompts, a more realistic and potent threat.
Our key contributions are situation-driven attacks leveraging movie scripts to create contextually relevant, human-readable prompts that successfully deceive LLMs.
- Score: 49.13497493053742
- License:
- Abstract: Previous research on LLM vulnerabilities often relied on nonsensical adversarial prompts, which were easily detectable by automated methods. We address this gap by focusing on human-readable adversarial prompts, a more realistic and potent threat. Our key contributions are situation-driven attacks leveraging movie scripts to create contextually relevant, human-readable prompts that successfully deceive LLMs, adversarial suffix conversion to transform nonsensical adversarial suffixes into meaningful text, and AdvPrompter with p-nucleus sampling, a method to generate diverse, human-readable adversarial suffixes, improving attack efficacy in models like GPT-3.5 and Gemma 7B. Our findings demonstrate that LLMs can be tricked by sophisticated adversaries into producing harmful responses with human-readable adversarial prompts and that there exists a scope for improvement when it comes to robust LLMs.
Related papers
- Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation [4.241100280846233]
AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication.
This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents.
arXiv Detail & Related papers (2024-12-05T18:38:30Z) - Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models [11.693095252994482]
We present POISONPROMPT, a novel backdoor attack capable of successfully compromising both hard and soft prompt-based LLMs.
Our findings highlight the potential security threats posed by backdoor attacks on prompt-based LLMs and emphasize the need for further research in this area.
arXiv Detail & Related papers (2023-10-19T03:25:28Z) - PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts [76.18347405302728]
This study uses a plethora of adversarial textual attacks targeting prompts across multiple levels: character, word, sentence, and semantic.
The adversarial prompts are then employed in diverse tasks including sentiment analysis, natural language inference, reading comprehension, machine translation, and math problem-solving.
Our findings demonstrate that contemporary Large Language Models are not robust to adversarial prompts.
arXiv Detail & Related papers (2023-06-07T15:37:00Z) - Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.