Making Harmful Behaviors Unlearnable for Large Language Models
- URL: http://arxiv.org/abs/2311.02105v1
- Date: Thu, 2 Nov 2023 09:18:21 GMT
- Title: Making Harmful Behaviors Unlearnable for Large Language Models
- Authors: Xin Zhou, Yi Lu, Ruotian Ma, Tao Gui, Qi Zhang, Xuanjing Huang
- Abstract summary: Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains.
LLMs can be easily fine-tuned into harmful assistants as the fine-tuning data often contains implicit or explicit harmful content.
This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process.
- Score: 50.44915524846857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown great potential as general-purpose AI
assistants in various domains. To meet the requirements of different
applications, LLMs are often customized by further fine-tuning. However, the
powerful learning ability of LLMs not only enables them to acquire new tasks
but also makes them susceptible to learning undesired behaviors. For example,
even safety-aligned LLMs can be easily fine-tuned into harmful assistants as
the fine-tuning data often contains implicit or explicit harmful content. Can
we train LLMs on harmful data without learning harmful behaviors? This paper
proposes a controllable training framework that makes harmful behaviors
unlearnable during the fine-tuning process. Specifically, we introduce
"security vectors", a few new parameters that can be separated from the LLM,
to ensure that the LLM's responses stay consistent with the harmful behavior.
The security vectors are activated during fine-tuning; because the model's
behavior already matches the harmful data, the LLM treats that behavior as
already learned and has no need to optimize further on the harmful data. During
inference, the security vectors can be deactivated to restore the LLM's normal
behavior. Experimental results show that security vectors generated from 100
harmful samples are enough to prevent the LLM from learning from 1000 harmful
samples, while preserving its ability to learn other useful information.
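The abstract suggests a prefix-tuning-style reading of this mechanism: a small set of trainable vectors is prepended to the model's inputs, switched on while fine-tuning and switched off at inference. The sketch below illustrates that reading only; the names (SecurityVector, attach, fine_tune_with_security_vectors) are hypothetical, a HuggingFace-style causal-LM interface (inputs_embeds, labels, .loss) is assumed, and this is not the authors' released implementation.

```python
# Minimal sketch of the "security vector" idea, read as a prefix-tuning add-on.
# All assumptions are noted in the lead-in above.
import torch
import torch.nn as nn


class SecurityVector(nn.Module):
    """Trainable prefix embeddings that can be attached to or detached from the LLM."""

    def __init__(self, n_prefix: int, hidden_size: int):
        super().__init__()
        self.prefix = nn.Parameter(0.02 * torch.randn(n_prefix, hidden_size))
        self.active = True  # toggled off at inference to restore normal behavior

    def attach(self, inputs_embeds: torch.Tensor, labels: torch.Tensor):
        """Prepend the prefix when active; pad labels with -100 so the LM loss ignores it."""
        if not self.active:
            return inputs_embeds, labels
        b, n = inputs_embeds.size(0), self.prefix.size(0)
        prefix = self.prefix.unsqueeze(0).expand(b, -1, -1)
        pad = labels.new_full((b, n), -100)
        return torch.cat([prefix, inputs_embeds], dim=1), torch.cat([pad, labels], dim=1)


def fine_tune_with_security_vectors(model, security, harmful_batches, finetune_batches, lr=1e-4):
    """Two-stage procedure as the abstract describes it:
    1) fit the security vectors on a small harmful subset (~100 samples);
    2) freeze and keep them active while fine-tuning the LLM on the full data,
       so harmful samples already look "learned" and contribute little, while
       useful data still updates the model.
    """
    # Stage 1: train only the security vectors; the base LLM stays frozen.
    for p in model.parameters():
        p.requires_grad_(False)
    opt_sv = torch.optim.AdamW(security.parameters(), lr=lr)
    for batch in harmful_batches:
        embeds, labels = security.attach(batch["embeds"], batch["labels"])
        loss = model(inputs_embeds=embeds, labels=labels).loss
        loss.backward()
        opt_sv.step()
        opt_sv.zero_grad()

    # Stage 2: freeze the security vectors, unfreeze the LLM, fine-tune with them active.
    for p in security.parameters():
        p.requires_grad_(False)
    for p in model.parameters():
        p.requires_grad_(True)
    opt_lm = torch.optim.AdamW(model.parameters(), lr=lr)
    for batch in finetune_batches:
        embeds, labels = security.attach(batch["embeds"], batch["labels"])
        loss = model(inputs_embeds=embeds, labels=labels).loss
        loss.backward()
        opt_lm.step()
        opt_lm.zero_grad()

    # Inference: deactivate the security vectors to restore the LLM's normal behavior.
    security.active = False
    return model
```

In this reading, keeping the vectors active during fine-tuning is what keeps the loss on harmful samples low, and the single activation switch is what makes them separable from the deployed model.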
Related papers
- Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions [17.485655062129965]
Recent AI agents rely on instruction tuning and reinforcement learning to calibrate the output of large language models (LLMs) with human intentions.
We propose PT-ALIGN, a novel safety self-alignment approach that minimizes human supervision by automatically refining positive and toxic samples.
Experiments on 9 popular open-source LLMs demonstrate the effectiveness of our PT-ALIGN for safety alignment, while maintaining comparable levels of helpfulness and usefulness.
arXiv Detail & Related papers (2025-02-08T09:54:47Z) - Differentially Private Steering for Large Language Model Alignment [55.30573701583768]
We present the first study of aligning Large Language Models with private datasets.
Our work proposes the Private Steering for LLM Alignment (PSA) algorithm.
Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance.
arXiv Detail & Related papers (2025-01-30T17:58:36Z) - Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks.
This vulnerability poses significant risks to the real-world applications.
We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z) - HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models [92.85175340702125]
We distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels.
We propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions.
Our HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.
arXiv Detail & Related papers (2024-10-02T13:12:13Z) - zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning [6.976968804436321]
Large language models (LLMs) have the capability of zero-shot learning, which does not require training or fine-tuning.
We propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs.
arXiv Detail & Related papers (2024-09-23T01:03:15Z) - AI Meets the Classroom: When Does ChatGPT Harm Learning? [0.0]
We study how generative AI and specifically large language models (LLMs) impact learning in coding classes.
We show across three studies that LLM usage can have positive and negative effects on learning outcomes.
arXiv Detail & Related papers (2024-08-29T17:07:46Z) - Do LLM Agents Have Regret? A Case Study in Online Learning and Games [30.377709765198592]
Large language models (LLMs) have been increasingly employed for (interactive) decision-making.
We study their interactions in benchmark decision-making settings in online learning and game theory.
We propose a novel unsupervised training loss, regret-loss, which, in contrast to the supervised pre-training loss, does not require the labels of (supervised) actions.
arXiv Detail & Related papers (2024-03-25T15:04:11Z) - Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [39.56233272612982]
Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to jailbreaking attacks.
Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning.
To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories.
arXiv Detail & Related papers (2024-02-03T16:43:42Z) - Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning.
We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples.
We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z) - Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)