Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
- URL: http://arxiv.org/abs/2408.01420v1
- Date: Fri, 2 Aug 2024 17:55:50 GMT
- Title: Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
- Authors: Jingtong Su, Julia Kempe, Karen Ullrich
- Abstract summary: Large language models (LLMs) are trained on a deluge of text data with limited quality control.
Countermeasures, commonly referred to as preference alignment, include fine-tuning the pretrained LLMs with carefully crafted text examples of desired behaviour.
Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective.
- Score: 6.627477206883248
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are trained on a deluge of text data with limited quality control. As a result, LLMs can exhibit unintended or even harmful behaviours, such as leaking information, spreading fake news, or producing hate speech. Countermeasures, commonly referred to as preference alignment, include fine-tuning the pretrained LLMs with carefully crafted text examples of desired behaviour. Even then, empirical evidence shows that preference-aligned LLMs can be enticed into harmful behaviour. This so-called jailbreaking of LLMs is typically achieved by adversarially modifying the input prompt to the LLM. Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective. Under our framework, we first show that pretrained LLMs will mimic harmful behaviour if it is present in the training corpus. Under that same framework, we then introduce a statistical notion of alignment and lower-bound the jailbreaking probability, showing that jailbreaking is unpreventable under reasonable assumptions. Based on our insights, we propose an alteration to the currently prevalent alignment strategy, RLHF. Specifically, we introduce a simple modification to the RLHF objective, which we call E-RLHF, that aims to increase the likelihood of safe responses. E-RLHF brings no additional training cost and is compatible with other methods. Empirically, we demonstrate that E-RLHF outperforms RLHF on all alignment problems put forward by the AdvBench and HarmBench projects without sacrificing model performance as measured by the MT-Bench project.
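The abstract does not spell out the modified objective, so the following is a minimal sketch rather than the paper's exact formulation. It shows the standard KL-regularised RLHF objective and, below it, the kind of change E-RLHF describes: shifting the KL anchor toward safe responses. Here x_safe denotes a hypothetical safety-edited version of a harmful prompt x; its exact construction is an assumption of this sketch.

```latex
% Standard KL-regularised RLHF objective: reward model r, reference policy \pi_ref,
% regularisation strength \beta.
\[
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\bigl[ r(x, y) \bigr]
\;-\;
\beta \, \mathrm{KL}\!\left( \pi_{\theta}(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
\]

% Sketch of the E-RLHF idea (an assumption, not the paper's exact formulation):
% for a harmful prompt x, anchor the KL term to the reference policy conditioned
% on a safety-edited prompt x_safe, so the optimum shifts probability mass toward
% safe responses while the optimisation procedure and its cost stay unchanged.
\[
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\bigl[ r(x, y) \bigr]
\;-\;
\beta \, \mathrm{KL}\!\left( \pi_{\theta}(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x_{\mathrm{safe}}) \right)
\]
```

Read this way, the modification is a drop-in change to the regularisation target, which would be consistent with the abstract's claims of no additional training cost and compatibility with other alignment methods.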
Related papers
- Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation [20.176424063726277]
Large language models (LLMs) are extensively adapted for downstream applications through a process known as "customization".
Recent studies have revealed a vulnerability: tuning LLMs with malicious samples can compromise their robustness and amplify harmful content, an attack known as "jailbreaking".
arXiv Detail & Related papers (2024-10-03T05:24:38Z) - Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
Large Language Models (LLMs) have demonstrated great potential as generalist assistants.
It is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts.
In this paper, we observe that directly editing a small subset of parameters can effectively modulate specific behaviors of LLMs.
arXiv Detail & Related papers (2024-07-11T17:52:03Z) - How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States [65.45603614354329]
Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs.
Jailbreak can circumvent safety guardrails, resulting in LLMs generating harmful content.
We employ weak classifiers to explain LLM safety through the intermediate hidden states.
arXiv Detail & Related papers (2024-06-09T05:04:37Z) - Robustifying Safety-Aligned Large Language Models through Clean Data Curation [11.273749179260468]
Large language models (LLMs) are vulnerable when trained on datasets containing harmful content.
In this paper, we propose a data curation framework designed to counter adversarial impacts in both scenarios.
arXiv Detail & Related papers (2024-05-24T04:50:38Z) - FLAME: Factuality-Aware Alignment for Large Language Models [86.76336610282401]
The conventional alignment process fails to enhance the factual accuracy of large language models (LLMs).
We identify factors that lead to hallucination in both alignment steps: supervised fine-tuning (SFT) and reinforcement learning (RL).
We propose factuality-aware alignment, comprised of factuality-aware SFT and factuality-aware RL through direct preference optimization.
arXiv Detail & Related papers (2024-05-02T17:54:54Z) - Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs [9.624124576891075]
Existing alignment methods can lead large language models (LLMs) to be Adaptive Chameleons when external evidence conflicts with their parametric memory.
We propose a novel framework: Dialectical Alignment (DA), which utilizes AI feedback to identify optimal strategies for LLMs to navigate inter-context conflicts.
Our experiments show that the DA improves poisoned data attack defense by 20 and does not require any additional prompt engineering.
arXiv Detail & Related papers (2024-03-30T22:41:05Z) - DeAL: Decoding-time Alignment for Large Language Models [59.63643988872571]
Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences.
We propose DeAL, a framework that allows the user to customize reward functions and enables Decoding-time Alignment of LLMs.
Our experiments show that we can DeAL with fine-grained trade-offs, improve adherence to alignment objectives, and address residual gaps in LLMs.
arXiv Detail & Related papers (2024-02-05T06:12:29Z) - The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning [61.68787689234622]
A recent study, LIMA, shows that using merely 1K examples for alignment tuning can already achieve significant alignment performance.
This raises questions about how exactly the alignment tuning transforms a base LLM.
We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting.
arXiv Detail & Related papers (2023-12-04T00:46:11Z) - Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration [32.15773300068426]
Membership Inference Attacks (MIAs) aim to infer whether a target data record has been utilized for model training or not.
We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA).
Specifically, since memorization in LLMs is inevitable during the training process and occurs before overfitting, we introduce a more reliable membership signal.
arXiv Detail & Related papers (2023-11-10T13:55:05Z) - Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including this list) and is not responsible for any consequences arising from its use.