Robust LLM safeguarding via refusal feature adversarial training
- URL: http://arxiv.org/abs/2409.20089v1
- Date: Mon, 30 Sep 2024 08:41:39 GMT
- Title: Robust LLM safeguarding via refusal feature adversarial training
- Authors: Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda,
- Abstract summary: Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses.
We propose Refusal Feature Adrial Training (ReFAT), a novel algorithm that efficiently performs adversarial training.
Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks.
- Score: 15.76605079209956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards that works by ablating a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation of offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods.
Related papers
- Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond [41.3029262040131]
We investigate how to make unlearned models robust against relearning attacks.
Our analysis reveals that smoothness optimization plays a pivotal role in mitigating relearning attacks.
arXiv Detail & Related papers (2025-02-07T23:03:55Z) - Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities [49.09703018511403]
Large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks.
Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system.
We propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights.
arXiv Detail & Related papers (2025-02-03T18:59:16Z) - Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks [25.212057612342218]
Large language models (LLMs) are susceptible to jailbreak attacks, which exploit system vulnerabilities to bypass safety measures and generate harmful outputs.
We propose a Latent-space Adversarial Training with Post-aware framework to address this problem.
arXiv Detail & Related papers (2025-01-18T02:57:12Z) - Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks.
This vulnerability poses significant risks to the real-world applications.
We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z) - Tamper-Resistant Safeguards for Open-Weight LLMs [57.90526233549399]
We develop a method for building tamper-resistant safeguards into open-weight LLMs.
We find that our method greatly improves tamper-resistance while preserving benign capabilities.
arXiv Detail & Related papers (2024-08-01T17:59:12Z) - Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender training (PAD)
PAD is a pipeline designed to safeguard LLMs by novelly incorporating the red-teaming (attack) and blue-teaming (safety training) techniques.
PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safe guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z) - Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z) - Evaluating and Safeguarding the Adversarial Robustness of Retrieval-Based In-Context Learning [21.018893978967053]
In-Context Learning (ICL) is sensitive to the choice, order, and verbaliser used to encode the demonstrations in the prompt.
Retrieval-Augmented ICL methods try to address this problem by leveraging retrievers to extract semantically related examples as demonstrations.
Our study reveals that retrieval-augmented models can enhance robustness against test sample attacks.
We introduce an effective training-free adversarial defence method, DARD, which enriches the example pool with those attacked samples.
arXiv Detail & Related papers (2024-05-24T23:56:36Z) - Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.