Related papers: Robust LLM safeguarding via refusal feature adversarial training

URL: http://arxiv.org/abs/2409.20089v1
Date: Mon, 30 Sep 2024 08:41:39 GMT
Title: Robust LLM safeguarding via refusal feature adversarial training
Authors: Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda
Abstract summary: Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. We propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs adversarial training. Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks.
Score: 15.76605079209956
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards that works by ablating a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation of offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods.
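
The ablation operation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the refusal feature is a single direction in the residual stream, estimated here as the difference between mean activations on harmful and harmless prompts, and it uses random arrays in place of real model activations. The names `refusal_direction` and `ablate_direction` are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations.

    Assumption: a difference-of-means estimate stands in for the
    refusal feature; the abstract does not specify the estimator.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(h, direction):
    """Remove the component of each row of h along a unit direction."""
    return h - np.outer(h @ direction, direction)

# Synthetic stand-ins for residual stream activations (batch, hidden_dim).
rng = np.random.default_rng(0)
d = 8
harmless = rng.normal(size=(32, d))
# Harmful activations are shifted along the first axis for illustration.
harmful = rng.normal(size=(32, d)) + 3.0 * np.eye(d)[0]

r = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r)

# After ablation, the activations have no component along r.
residual_projection = ablated @ r
```

In an adversarial training loop of the kind the abstract outlines, a perturbation like `ablate_direction` would be applied to hidden states during fine-tuning so that the model learns to refuse even when the feature is removed; the details of that loop are not given in the abstract and are left out here.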

Related papers

ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks [61.06621533874629]
In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs). In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio.
arXiv Detail & Related papers (2025-07-02T03:09:20Z)
Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models [50.89022445197919]
Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs). Recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety-alignment.
arXiv Detail & Related papers (2025-05-26T08:25:25Z)
CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent [32.958798200220286]
Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience. We propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs. Our method first identifies the insertion position for maximum impact with minimal input modification.
arXiv Detail & Related papers (2025-04-13T05:31:37Z)
Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond [41.3029262040131]
We investigate how to make unlearned models robust against relearning attacks. Our analysis reveals that smoothness optimization plays a pivotal role in mitigating relearning attacks.
arXiv Detail & Related papers (2025-02-07T23:03:55Z)
Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks [25.212057612342218]
Large language models (LLMs) are susceptible to jailbreak attacks, which exploit system vulnerabilities to bypass safety measures and generate harmful outputs. We propose a Latent-space Adversarial Training with Post-aware Calibration framework to address this problem.
arXiv Detail & Related papers (2025-01-18T02:57:12Z)
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks. This vulnerability poses significant risks to real-world applications. We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
Defending against Reverse Preference Attacks is Difficult [26.872318173182414]
Large Language Models (LLMs) are vulnerable to training-time attacks such as supervised fine-tuning (SFT) on harmful datasets. We propose Reverse Preference Attacks (RPA) to make LLMs learn harmful behavior using adversarial reward during reinforcement learning from human feedback.
arXiv Detail & Related papers (2024-09-19T17:10:34Z)
Tamper-Resistant Safeguards for Open-Weight LLMs [57.90526233549399]
We develop a method for building tamper-resistant safeguards into open-weight LLMs. We find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that tamper-resistance is a tractable problem.
arXiv Detail & Related papers (2024-08-01T17:59:12Z)
Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender training (PAD). PAD is a pipeline designed to safeguard LLMs by combining red-teaming (attack) and blue-teaming (safety training) techniques in a novel way. PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safe guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z)
Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats. This paper presents an innovative defensive strategy, given white box access to an LLM. We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z)
Evaluating and Safeguarding the Adversarial Robustness of Retrieval-Based In-Context Learning [21.018893978967053]
In-Context Learning (ICL) is sensitive to the choice, order, and verbaliser used to encode the demonstrations in the prompt. Retrieval-Augmented ICL methods try to address this problem by leveraging retrievers to extract semantically related examples as demonstrations. Our study reveals that retrieval-augmented models can enhance robustness against test sample attacks. We introduce an effective training-free adversarial defence method, DARD, which enriches the example pool with those attacked samples.
arXiv Detail & Related papers (2024-05-24T23:56:36Z)
Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses. C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z)
ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text. Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
Learning to Poison Large Language Models During Instruction Tuning [12.521338629194503]
This work identifies additional security risks in Large Language Models (LLMs) by designing a new data poisoning attack tailored to exploit the instruction tuning process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently. We also propose two defense strategies against data poisoning attacks: in-context learning (ICL) and continuous learning (CL).
arXiv Detail & Related papers (2024-02-21T01:30:03Z)
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [82.98081731588717]
Integration of large language models with external content exposes applications to indirect prompt injection attacks. We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks. We develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training.
arXiv Detail & Related papers (2023-12-21T01:08:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.