Large Language Models are Vulnerable to Bait-and-Switch Attacks for
Generating Harmful Content
- URL: http://arxiv.org/abs/2402.13926v1
- Date: Wed, 21 Feb 2024 16:46:36 GMT
- Title: Large Language Models are Vulnerable to Bait-and-Switch Attacks for
Generating Harmful Content
- Authors: Federico Bianchi, James Zou
- Abstract summary: Even safe text coming from large language models can be turned into potentially dangerous content through Bait-and-Switch attacks.
The alarming efficacy of this approach highlights a significant challenge in developing reliable safety guardrails for LLMs.
- Score: 33.99403318079253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The risks derived from large language models (LLMs) generating deceptive and
damaging content have been the subject of considerable research, but even safe
generations can lead to problematic downstream impacts. In our study, we shift
the focus to how even safe text coming from LLMs can be easily turned into
potentially dangerous content through Bait-and-Switch attacks. In such attacks,
the user first prompts LLMs with safe questions and then employs a simple
find-and-replace post-hoc technique to manipulate the outputs into harmful
narratives. The alarming efficacy of this approach in generating toxic content
highlights a significant challenge in developing reliable safety guardrails for
LLMs. In particular, we stress that focusing on the safety of the verbatim LLM
outputs is insufficient and that we also need to consider post-hoc
transformations.
Related papers
- Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks [88.84977282952602]
A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs)
In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents.
We conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities.
arXiv Detail & Related papers (2025-02-12T17:19:36Z) - Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks.
This vulnerability poses significant risks to the real-world applications.
We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z) - CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs.
The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs.
We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z) - Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation [44.09578786678573]
Large Language Models (LLMs) are implicit troublemakers.
LLMs could be used to gather harmful data or launch covert attacks.
We name this decoding strategy: Jailbreak Value Decoding (JVD)
arXiv Detail & Related papers (2024-08-20T09:11:21Z) - Robustifying Safety-Aligned Large Language Models through Clean Data Curation [11.273749179260468]
Large language models (LLMs) are vulnerable when trained on datasets containing harmful content.
In this paper, we propose a data curation framework designed to counter adversarial impacts in both scenarios.
arXiv Detail & Related papers (2024-05-24T04:50:38Z) - MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance [36.03512474289962]
This paper investigates the novel challenge of defending MLLMs against malicious attacks through visual inputs.
Images act as a foreign language" that is not considered during safety alignment, making MLLMs more prone to producing harmful responses.
We introduce MLLM-Protector, a plug-and-play strategy that solves two subtasks: 1) identifying harmful responses via a lightweight harm detector, and 2) transforming harmful responses into harmless ones via a detoxifier.
arXiv Detail & Related papers (2024-01-05T17:05:42Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.
Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.
We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly [21.536079040559517]
Large Language Models (LLMs) have revolutionized natural language understanding and generation.
This paper explores the intersection of LLMs with security and privacy.
arXiv Detail & Related papers (2023-12-04T16:25:18Z) - Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs)
We consider two potential risky scenarios: unintentional and intentional.
We propose a novel textscSelf-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z) - Goal-Oriented Prompt Attack and Safety Evaluation for LLMs [43.93613764464993]
We introduce a pipeline to construct high-quality prompt attack samples, along with a Chinese prompt attack dataset called CPAD.
Our prompts aim to induce LLMs to generate unexpected outputs with several carefully designed prompt attack templates.
The results show that our prompts are significantly harmful to LLMs, with around 70% attack success rate to GPT-3.5.
arXiv Detail & Related papers (2023-09-21T07:07:49Z) - LLM Censorship: A Machine Learning Challenge or a Computer Security
Problem? [52.71988102039535]
We show that semantic censorship can be perceived as an undecidable problem.
We argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs.
arXiv Detail & Related papers (2023-07-20T09:25:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.