Related papers: BiasJailbreak:Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models

BiasJailbreak:Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models

URL: http://arxiv.org/abs/2410.13334v3
Date: Thu, 02 Jan 2025 04:06:46 GMT
Title: BiasJailbreak:Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models
Authors: Isack Lee, Haebin Seong,
Abstract summary: We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by safety-induced biases in large language models (LLMs)<n>We propose an efficient defense method BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation.<n>Our findings emphasize that ethical biases in LLMs can actually lead to generating unsafe output, and suggest a method to make the LLMs more secure and unbiased.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as `jailbreaks', where malicious inputs can coerce LLMs into generating harmful content bypassing safety alignments. In this paper, we delve into the ethical biases in LLMs and examine how those biases could be exploited for jailbreaks. Notably, these biases result in a jailbreaking success rate in GPT-4o models that differs by 20\% between non-binary and cisgender keywords and by 16\% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by these safety-induced biases. BiasJailbreak generates biased keywords automatically by asking the target LLM itself, and utilizes the keywords to generate harmful output. Additionally, we propose an efficient defense method BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. BiasDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize that ethical biases in LLMs can actually lead to generating unsafe output, and suggest a method to make the LLMs more secure and unbiased. To enable further research and improvements, we open-source our code and artifacts of BiasJailbreak, providing the community with tools to better understand and mitigate safety-induced biases in LLMs.

Related papers

CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models [18.06388944779541]
"jailbreaking" is the use of large language models to trigger unintended behaviors. We propose a novel method to balance the jailbreak attack success rate with semantic coherence. Our method is superior to state-of-the-art baselines in attack effectiveness.
arXiv Detail & Related papers (2025-02-17T02:49:26Z)
xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models. We propose a novel black-box jailbreak method leveraging reinforcement learning (RL) We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds [98.20826635707341]
Jailbreak attacks expose vulnerabilities in safety-aligned LLMs by eliciting harmful outputs through carefully crafted prompts.<n>We frame jailbreaks as inference-time misalignment and introduce LIAR, a fast, black-box, best-of-$N$ sampling attack requiring no training.<n>We also introduce a theoretical "safety net against jailbreaks" metric to quantify safety alignment strength and derive suboptimality bounds.
arXiv Detail & Related papers (2024-12-06T18:02:59Z)
SQL Injection Jailbreak: A Structural Disaster of Large Language Models [71.55108680517422]
In this paper, we introduce a novel jailbreak method, which induces large language models (LLMs) to produce harmful content. By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content. We propose a simple defense method called Self-Reminder-Key to counter SIJ and demonstrate its effectiveness.
arXiv Detail & Related papers (2024-11-03T13:36:34Z)
Multi-round jailbreak attack on large language models [2.540971544359496]
We introduce a multi-round jailbreak approach to better understand "jailbreak" attacks. This method can rewrite the dangerous prompts, decomposing them into a series of less harmful sub-questions. Our experimental results show a 94% success rate on the llama2-7B.
arXiv Detail & Related papers (2024-10-15T12:08:14Z)
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models [8.423787598133972]
This paper uncovers a critical vulnerability in the function calling process of large language models (LLMs) We introduce a novel "jailbreak function" attack method that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters. Our findings highlight the urgent need for enhanced security measures in the function calling capabilities of LLMs.
arXiv Detail & Related papers (2024-07-25T10:09:21Z)
Jailbreak Attacks and Defenses Against Large Language Models: A Survey [22.392989536664288]
Large Language Models (LLMs) have performed exceptionally in various text-generative tasks. "jailbreaking" induces the model to generate malicious responses against the usage policy and society. We propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods.
arXiv Detail & Related papers (2024-07-05T06:57:30Z)
Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data. We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary. Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z)
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States [65.45603614354329]
Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs. Jailbreak can circumvent safety guardrails, resulting in LLMs generating harmful content. We employ weak classifiers to explain LLM safety through the intermediate hidden states.
arXiv Detail & Related papers (2024-06-09T05:04:37Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs) Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes [0.0]
Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents. These models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection, and privacy leakage attacks. This study investigates the impact of these modifications on LLM safety, a critical consideration for building reliable and secure AI systems.
arXiv Detail & Related papers (2024-04-05T20:31:45Z)
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks. Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text. Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses. adversarial prompts known as 'jailbreaks' can circumvent safeguards. We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks. We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models [102.63973600144308]
Open-source large language models can be easily subverted to generate harmful content. Experiments across 8 models released by 5 different organizations demonstrate the effectiveness of shadow alignment attack. This study serves as a clarion call for a collective effort to overhaul and fortify the safety of open-source LLMs against malicious attackers.
arXiv Detail & Related papers (2023-10-04T16:39:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.