BiasJailbreak:Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models
- URL: http://arxiv.org/abs/2410.13334v3
- Date: Thu, 02 Jan 2025 04:06:46 GMT
- Title: BiasJailbreak:Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models
- Authors: Isack Lee, Haebin Seong,
- Abstract summary: We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by safety-induced biases in large language models (LLMs)
We propose an efficient defense method BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation.
Our findings emphasize that ethical biases in LLMs can actually lead to generating unsafe output, and suggest a method to make the LLMs more secure and unbiased.
- Score: 0.0
- License:
- Abstract: Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as `jailbreaks', where malicious inputs can coerce LLMs into generating harmful content bypassing safety alignments. In this paper, we delve into the ethical biases in LLMs and examine how those biases could be exploited for jailbreaks. Notably, these biases result in a jailbreaking success rate in GPT-4o models that differs by 20\% between non-binary and cisgender keywords and by 16\% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by these safety-induced biases. BiasJailbreak generates biased keywords automatically by asking the target LLM itself, and utilizes the keywords to generate harmful output. Additionally, we propose an efficient defense method BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. BiasDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize that ethical biases in LLMs can actually lead to generating unsafe output, and suggest a method to make the LLMs more secure and unbiased. To enable further research and improvements, we open-source our code and artifacts of BiasJailbreak, providing the community with tools to better understand and mitigate safety-induced biases in LLMs.
Related papers
- CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models [18.06388944779541]
"jailbreaking" is the use of large language models to trigger unintended behaviors.
We propose a novel method to balance the jailbreak attack success rate with semantic coherence.
Our method is superior to state-of-the-art baselines in attack effectiveness.
arXiv Detail & Related papers (2025-02-17T02:49:26Z) - xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models.
We propose a novel black-box jailbreak method leveraging reinforcement learning (RL)
We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - SQL Injection Jailbreak: A Structural Disaster of Large Language Models [71.55108680517422]
We introduce a novel jailbreak method, which targets the external properties of LLMs.
By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content.
We propose a simple defense method called Self-Reminder-Key to counter SIJ.
arXiv Detail & Related papers (2024-11-03T13:36:34Z) - Multi-round jailbreak attack on large language models [2.540971544359496]
We introduce a multi-round jailbreak approach to better understand "jailbreak" attacks.
This method can rewrite the dangerous prompts, decomposing them into a series of less harmful sub-questions.
Our experimental results show a 94% success rate on the llama2-7B.
arXiv Detail & Related papers (2024-10-15T12:08:14Z) - Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data.
We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary.
Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z) - How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States [65.45603614354329]
Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs.
Jailbreak can circumvent safety guardrails, resulting in LLMs generating harmful content.
We employ weak classifiers to explain LLM safety through the intermediate hidden states.
arXiv Detail & Related papers (2024-06-09T05:04:37Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.