Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
- URL: http://arxiv.org/abs/2406.10630v1
- Date: Sat, 15 Jun 2024 13:24:22 GMT
- Title: Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
- Authors: Rui Ye, Jingyi Chai, Xiangrui Liu, Yaodong Yang, Yanfeng Wang, Siheng Chen,
- Abstract summary: Federated learning (FL) enables multiple parties to collaboratively fine-tune an large language model (LLM) without the need of direct data sharing.
We for the first time reveal the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method.
- Score: 51.85781332922943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Federated learning (FL) enables multiple parties to collaboratively fine-tune an large language model (LLM) without the need of direct data sharing. Ideally, by training on decentralized data that is aligned with human preferences and safety principles, federated instruction tuning can result in an LLM that could behave in a helpful and safe manner. In this paper, we for the first time reveal the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method. Specifically, the malicious clients could automatically generate attack data without involving manual efforts and attack the FedIT system by training their local LLMs on such attack data. Unfortunately, this proposed safety attack not only can compromise the safety alignment of LLM trained via FedIT, but also can not be effectively defended against by many existing FL defense methods. Targeting this, we further propose a post-hoc defense method, which could rely on a fully automated pipeline: generation of defense data and further fine-tuning of the LLM. Extensive experiments show that our safety attack method can significantly compromise the LLM's safety alignment (e.g., reduce safety rate by 70\%), which can not be effectively defended by existing defense methods (at most 4\% absolute improvement), while our safety defense method can significantly enhance the attacked LLM's safety alignment (at most 69\% absolute improvement).
Related papers
- Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks.
As LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise.
Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content.
arXiv Detail & Related papers (2024-11-01T09:14:21Z) - Defending against Reverse Preference Attacks is Difficult [26.872318173182414]
Large Language Models (LLMs) are vulnerable to training-time attacks such as supervised fine-tuning (SFT) on harmful datasets.
We propose Reverse Preference Attacks (RPA) to make LLMs learn harmful behavior using adversarial reward during reinforcement learning from human feedback.
arXiv Detail & Related papers (2024-09-19T17:10:34Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs)
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender training (PAD)
PAD is a pipeline designed to safeguard LLMs by novelly incorporating the red-teaming (attack) and blue-teaming (safety training) techniques.
PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safe guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z) - Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching [77.36097118561057]
textscSafePatching is a novel framework for comprehensive and efficient PSA.
textscSafePatching achieves a more comprehensive and efficient PSA than baseline methods.
arXiv Detail & Related papers (2024-05-22T16:51:07Z) - Uncovering Safety Risks of Large Language Models through Concept Activation Vector [13.804245297233454]
We introduce a Safety Concept Activation Vector (SCAV) framework to guide attacks on large language models (LLMs)
We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks.
Our attack method significantly improves the attack success rate and response quality while requiring less training data.
arXiv Detail & Related papers (2024-04-18T09:46:25Z) - Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks
with Self-Refinement [2.854482269849925]
Language models (LMs) are vulnerable to exploitation for adversarial misuse.
We propose self-refine with formatting that achieves outstanding safety even in non-safety-aligned LMs.
We've also observed that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by giving more helpful and safe responses.
arXiv Detail & Related papers (2024-02-23T08:22:24Z) - FLIP: A Provable Defense Framework for Backdoor Mitigation in Federated
Learning [66.56240101249803]
We study how hardening benign clients can affect the global model (and the malicious clients)
We propose a trigger reverse engineering based defense and show that our method can achieve improvement with guarantee robustness.
Our results on eight competing SOTA defense methods show the empirical superiority of our method on both single-shot and continuous FL backdoor attacks.
arXiv Detail & Related papers (2022-10-23T22:24:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.