Making Harmful Behaviors Unlearnable for Large Language Models
- URL: http://arxiv.org/abs/2311.02105v1
- Date: Thu, 2 Nov 2023 09:18:21 GMT
- Title: Making Harmful Behaviors Unlearnable for Large Language Models
- Authors: Xin Zhou, Yi Lu, Ruotian Ma, Tao Gui, Qi Zhang, Xuanjing Huang
- Abstract summary: Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains.
LLMs can be easily fine-tuned into harmful assistants as the fine-tuning data often contains implicit or explicit harmful content.
This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process.
- Score: 50.44915524846857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown great potential as general-purpose AI
assistants in various domains. To meet the requirements of different
applications, LLMs are often customized by further fine-tuning. However, the
powerful learning ability of LLMs not only enables them to acquire new tasks
but also makes them susceptible to learning undesired behaviors. For example,
even safety-aligned LLMs can be easily fine-tuned into harmful assistants as
the fine-tuning data often contains implicit or explicit harmful content. Can
we train LLMs on harmful data without learning harmful behaviors? This paper
proposes a controllable training framework that makes harmful behaviors
unlearnable during the fine-tuning process. Specifically, we introduce
"security vectors", a few new parameters that can be separated from the LLM and
that keep the LLM's responses consistent with the harmful behavior while it is
being fine-tuned. The security vectors are activated during fine-tuning; because
the model's behavior already matches the harmful data, the LLM treats that
behavior as already learned and has little incentive to further optimize on the
harmful samples. During inference, the security vectors can be deactivated to
restore the LLM's normal behavior. Experimental results show that security
vectors generated from only 100 harmful samples are enough to prevent the LLM
from learning 1,000 harmful samples, while preserving its ability to learn other
useful information.
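Below is a minimal PyTorch sketch of the mechanism described in the abstract. It assumes the security vectors are implemented as a small set of prefix-tuning-style virtual-token embeddings prepended to the input only during fine-tuning; the class and function names (ToyLM, SecurityVectorModel, finetune_step, safe_inference) and the toy training loop are illustrative assumptions, not the authors' released code.

```python
# Sketch of "security vectors": a few trainable parameters, separable from the LLM,
# switched ON during fine-tuning and OFF at inference. Hypothetical implementation.
import torch
import torch.nn as nn

HIDDEN, VOCAB, NUM_VECTORS = 64, 100, 10


class ToyLM(nn.Module):
    """Stand-in for the LLM backbone: maps token embeddings to per-token logits."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, VOCAB)

    def forward(self, embeds):                    # (batch, seq, HIDDEN) -> (batch, seq, VOCAB)
        return self.proj(embeds)


class SecurityVectorModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        # The "security vectors": a few extra parameters kept separate from the LLM.
        self.security_vectors = nn.Parameter(0.02 * torch.randn(NUM_VECTORS, HIDDEN))
        self.active = False                       # ON for fine-tuning, OFF for inference

    def forward(self, embeds):
        if self.active:
            prefix = self.security_vectors.unsqueeze(0).expand(embeds.size(0), -1, -1)
            logits = self.base_model(torch.cat([prefix, embeds], dim=1))
            return logits[:, NUM_VECTORS:, :]     # drop prefix positions to realign with inputs
        return self.base_model(embeds)


def finetune_step(model, optimizer, embeds, targets):
    """One fine-tuning step with the security vectors active.

    If the vectors were pre-fit on a small harmful subset (~100 samples in the paper),
    the model's responses on harmful samples already match the targets, so the loss and
    gradients there stay near zero and the backbone barely updates on that data.
    """
    model.active = True
    logits = model(embeds)
    loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()


@torch.no_grad()
def safe_inference(model, embeds):
    model.active = False                          # deactivate to restore normal behavior
    return model(embeds).argmax(dim=-1)


model = SecurityVectorModel(ToyLM())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
embeds, targets = torch.randn(2, 8, HIDDEN), torch.randint(0, VOCAB, (2, 8))
finetune_step(model, optimizer, embeds, targets)
print(safe_inference(model, embeds).shape)        # torch.Size([2, 8])
```

In the paper's setting, the security vectors would first be fit on the small harmful subset and then frozen while the rest of the fine-tuning data is used; the toy optimizer above updates all parameters only to keep the sketch short.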
Related papers
- Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z)
- HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models [92.85175340702125]
We distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels.
We propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions.
Our HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.
arXiv Detail & Related papers (2024-10-02T13:12:13Z)
- zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning [6.976968804436321]
Large language models (LLMs) are capable of zero-shot learning, which requires no training or fine-tuning.
We propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs.
arXiv Detail & Related papers (2024-09-23T01:03:15Z)
- AI Meets the Classroom: When Does ChatGPT Harm Learning? [0.0]
We study how generative AI and specifically large language models (LLMs) impact learning in coding classes.
We show across three studies that LLM usage can have positive and negative effects on learning outcomes.
arXiv Detail & Related papers (2024-08-29T17:07:46Z)
- Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs [13.03032975937872]
Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to.
Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities.
arXiv Detail & Related papers (2024-07-22T11:19:14Z)
- Do LLM Agents Have Regret? A Case Study in Online Learning and Games [30.377709765198592]
Large language models (LLMs) have been increasingly employed for (interactive) decision-making.
We study their interactions in benchmark decision-making settings in online learning and game theory.
We propose a novel unsupervised training loss, regret-loss, which, in contrast to the supervised pre-training loss, does not require the labels of (supervised) actions.
arXiv Detail & Related papers (2024-03-25T15:04:11Z)
- Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [39.56233272612982]
Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to jailbreaking attacks.
Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning.
To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories.
arXiv Detail & Related papers (2024-02-03T16:43:42Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning.
We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 adversarially designed examples.
We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)