Making Harmful Behaviors Unlearnable for Large Language Models
- URL: http://arxiv.org/abs/2311.02105v1
- Date: Thu, 2 Nov 2023 09:18:21 GMT
- Title: Making Harmful Behaviors Unlearnable for Large Language Models
- Authors: Xin Zhou, Yi Lu, Ruotian Ma, Tao Gui, Qi Zhang, Xuanjing Huang
- Abstract summary: Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains.
LLMs can be easily fine-tuned into harmful assistants as the fine-tuning data often contains implicit or explicit harmful content.
This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process.
- Score: 50.44915524846857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown great potential as general-purpose AI
assistants in various domains. To meet the requirements of different
applications, LLMs are often customized by further fine-tuning. However, the
powerful learning ability of LLMs not only enables them to acquire new tasks
but also makes them susceptible to learning undesired behaviors. For example,
even safety-aligned LLMs can be easily fine-tuned into harmful assistants as
the fine-tuning data often contains implicit or explicit harmful content. Can
we train LLMs on harmful data without learning harmful behaviors? This paper
proposes a controllable training framework that makes harmful behaviors
unlearnable during the fine-tuning process. Specifically, we introduce
"security vectors", a few new parameters that can be separated from the LLM and
that keep the LLM's responses consistent with the harmful behavior while it is
being fine-tuned. The security vectors are activated during fine-tuning; because
the model's behavior already matches the harmful data, the LLM treats that
behavior as already learned and has little incentive to further optimize on the
harmful samples. During inference, the security vectors can be deactivated to
restore the LLM's normal behavior. Experimental results show that security
vectors generated from only 100 harmful samples are enough to prevent the LLM
from learning 1,000 harmful samples, while preserving its ability to learn other
useful information.
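Below is a minimal PyTorch sketch of the mechanism described in the abstract. It assumes the security vectors are implemented as a small set of prefix-tuning-style virtual-token embeddings prepended to the input only during fine-tuning; the class and function names (ToyLM, SecurityVectorModel, finetune_step, safe_inference) and the toy training loop are illustrative assumptions, not the authors' released code.

```python
# Sketch of "security vectors": a few trainable parameters, separable from the LLM,
# switched ON during fine-tuning and OFF at inference. Hypothetical implementation.
import torch
import torch.nn as nn

HIDDEN, VOCAB, NUM_VECTORS = 64, 100, 10


class ToyLM(nn.Module):
    """Stand-in for the LLM backbone: maps token embeddings to per-token logits."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, VOCAB)

    def forward(self, embeds):                    # (batch, seq, HIDDEN) -> (batch, seq, VOCAB)
        return self.proj(embeds)


class SecurityVectorModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        # The "security vectors": a few extra parameters kept separate from the LLM.
        self.security_vectors = nn.Parameter(0.02 * torch.randn(NUM_VECTORS, HIDDEN))
        self.active = False                       # ON for fine-tuning, OFF for inference

    def forward(self, embeds):
        if self.active:
            prefix = self.security_vectors.unsqueeze(0).expand(embeds.size(0), -1, -1)
            logits = self.base_model(torch.cat([prefix, embeds], dim=1))
            return logits[:, NUM_VECTORS:, :]     # drop prefix positions to realign with inputs
        return self.base_model(embeds)


def finetune_step(model, optimizer, embeds, targets):
    """One fine-tuning step with the security vectors active.

    If the vectors were pre-fit on a small harmful subset (~100 samples in the paper),
    the model's responses on harmful samples already match the targets, so the loss and
    gradients there stay near zero and the backbone barely updates on that data.
    """
    model.active = True
    logits = model(embeds)
    loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()


@torch.no_grad()
def safe_inference(model, embeds):
    model.active = False                          # deactivate to restore normal behavior
    return model(embeds).argmax(dim=-1)


model = SecurityVectorModel(ToyLM())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
embeds, targets = torch.randn(2, 8, HIDDEN), torch.randint(0, VOCAB, (2, 8))
finetune_step(model, optimizer, embeds, targets)
print(safe_inference(model, embeds).shape)        # torch.Size([2, 8])
```

In the paper's setting, the security vectors would first be fit on the small harmful subset and then frozen while the rest of the fine-tuning data is used; the toy optimizer above updates all parameters only to keep the sketch short.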
Related papers
- Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z)
- HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models [92.85175340702125]
We distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels.
We propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions.
Our HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.
arXiv Detail & Related papers (2024-10-02T13:12:13Z)
- zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning [6.976968804436321]
Large language models (LLMs) are capable of zero-shot learning, which requires no training or fine-tuning.
We propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs.
arXiv Detail & Related papers (2024-09-23T01:03:15Z)
- AI Meets the Classroom: When Does ChatGPT Harm Learning? [0.0]
We study how generative AI and specifically large language models (LLMs) impact learning in coding classes.
We show across three studies that LLM usage can have positive and negative effects on learning outcomes.
arXiv Detail & Related papers (2024-08-29T17:07:46Z)
- Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs [13.03032975937872]
Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to.
Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities.
arXiv Detail & Related papers (2024-07-22T11:19:14Z)
- Do LLM Agents Have Regret? A Case Study in Online Learning and Games [30.377709765198592]
Large language models (LLMs) have been increasingly employed for (interactive) decision-making.
We study their interactions in benchmark decision-making settings in online learning and game theory.
We propose a novel unsupervised training loss, regret-loss, which, in contrast to the supervised pre-training loss, does not require the labels of (supervised) actions.
arXiv Detail & Related papers (2024-03-25T15:04:11Z)
- Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [39.56233272612982]
Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to jailbreaking attacks.
Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning.
To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories.
arXiv Detail & Related papers (2024-02-03T16:43:42Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning.
We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 adversarially designed examples.
We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)