Related papers: Learning and Forgetting Unsafe Examples in Large Language Models

URL: http://arxiv.org/abs/2312.12736v2
Date: Wed, 3 Jul 2024 06:13:31 GMT
Title: Learning and Forgetting Unsafe Examples in Large Language Models
Authors: Jiachen Zhao, Zhun Deng, David Madras, James Zou, Mengye Ren
Abstract summary: Large language models (LLMs) learn from third-party custom finetuning data. We show that while aligned LLMs can readily learn unsafe content, they also tend to forget it more significantly when finetuned on safer content. We introduce the "ForgetFilter" algorithm, which filters unsafe data based on how strong the model's forgetting signal is for that data.
Score: 41.115096910603086
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As the number of large language models (LLMs) released to the public grows, there is a pressing need to understand the safety implications associated with these models learning from third-party custom finetuning data. We explore the behavior of LLMs finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness, finding that while aligned LLMs can readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. Drawing inspiration from the discrepancies in forgetting, we introduce the "ForgetFilter" algorithm, which filters unsafe data based on how strong the model's forgetting signal is for that data. We demonstrate that the ForgetFilter algorithm ensures safety in customized finetuning without compromising downstream task performance, unlike sequential safety finetuning. ForgetFilter outperforms alternative strategies like replay and moral self-correction in curbing LLMs' ability to assimilate unsafe content during custom finetuning, e.g. 75% lower than not applying any safety measures and 62% lower than using self-correction in toxicity score.
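The filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the forgetting signal is measured, so the sketch assumes it is the drop in a per-example "how well learned" score between two checkpoints (after finetuning on the noisy custom data, and after further finetuning on safer data). The function name `forget_filter`, the score callables, and the threshold value are all hypothetical.

```python
from typing import Callable, List, Sequence


def forget_filter(
    examples: Sequence[str],
    score_before: Callable[[str], float],
    score_after: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep examples whose forgetting signal does not exceed `threshold`.

    score_before: hypothetical per-example score after finetuning on noisy data.
    score_after:  hypothetical per-example score after further finetuning on
                  safer data. A large drop is treated as a sign that the
                  example is unsafe (an assumption of this sketch).
    """
    kept = []
    for ex in examples:
        forgetting = score_before(ex) - score_after(ex)
        if forgetting <= threshold:
            kept.append(ex)
    return kept


# Toy scores standing in for real model evaluations (illustrative values only).
before = {"a": 0.9, "b": 0.8, "c": 0.9}
after = {"a": 0.85, "b": 0.2, "c": 0.8}

filtered = forget_filter(["a", "b", "c"], before.get, after.get, threshold=0.3)
print(filtered)
```

In this toy run, example "b" is forgotten far more strongly than the others and is therefore dropped; in practice the scores would come from evaluating a real model at the two checkpoints.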

Related papers

Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment [24.364891513019444]
In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. We propose LARF, a Layer-Aware Representation Filtering method. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features.
arXiv Detail & Related papers (2025-07-24T17:59:24Z)
Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization [7.1060720569792215]
Fine-tuning large language models (LLMs) can inadvertently compromise their safety. We introduce a safety-aware probing (SAP) framework designed to mitigate the safety risks. Our experimental results demonstrate that SAP effectively reduces harmfulness below the original fine-tuned model.
arXiv Detail & Related papers (2025-05-22T14:52:10Z)
Safety Pretraining: Toward the Next Generation of Safe AI [61.2816320807586]
We present a data-centric pretraining framework that builds safety into the model from the start. Our contributions include: (i) a safety classifier trained on 10,000 GPT-4 labeled examples, used to filter 600B tokens; (ii) the largest synthetic safety dataset to date, generated via recontextualization of harmful web data; and (iv) Harmfulness-Tag annotations injected during pretraining to flag unsafe content.
arXiv Detail & Related papers (2025-04-23T17:58:08Z)
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z)
Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs? [0.836362570897926]
We investigate existing methods for such generalization and find them insufficient. To avoid performance degradation and preserve safe performance, we advocate for a two-step framework. We find that the final hidden state for the last token is enough to provide robust performance.
arXiv Detail & Related papers (2025-02-22T10:31:50Z)
Locking Down the Finetuned LLMs Safety [33.56657036839617]
Fine-tuning large language models (LLMs) on additional datasets is often necessary to optimize them for specific downstream tasks. Existing safety alignment measures, which restrict harmful behavior during inference, are insufficient to mitigate safety risks during fine-tuning. We introduce SafetyLock, a novel alignment intervention method that maintains robust safety post-fine-tuning.
arXiv Detail & Related papers (2024-10-14T09:58:29Z)
HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router [42.222681564769076]
We introduce HiddenGuard, a novel framework for fine-grained, safe generation in Large Language Models. HiddenGuard incorporates Prism, which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content. Our experiments demonstrate that HiddenGuard achieves over 90% in F1 score for detecting and redacting harmful content.
arXiv Detail & Related papers (2024-10-03T17:10:41Z)
ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2. Models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z)
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. We design a synthetic data generation framework that captures salient aspects of an unsafe input. Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs). We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z)
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning. We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples. We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.