Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks
- URL: http://arxiv.org/abs/2506.07356v2
- Date: Sat, 11 Oct 2025 06:41:14 GMT
- Title: Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks
- Authors: Seokil Ham, Yubin Choi, Yujin Yang, Seungju Cho, Younghun Kim, Changick Kim
- Abstract summary: Finetuning-as-a-Service (FaaS) allows users to customize Large Language Models (LLMs) using their own data. Prior works attempt to mitigate harmful finetuning attacks by first constructing a safety-aligned model and then finetuning it on user data. We propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework.
- Score: 32.73803760326097
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, major AI providers such as Google and OpenAI have introduced Finetuning-as-a-Service (FaaS), which allows users to customize Large Language Models (LLMs) using their own data. However, this service is vulnerable to safety degradation when user data includes harmful prompts, a threat known as harmful finetuning attacks. Prior works attempt to mitigate this issue by first constructing a safety-aligned model and then finetuning the model on user data. However, we observe that the safety-aligned weights provide a weak initialization for downstream task learning, leading to suboptimal safety alignment and downstream task performance. To address this, we propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework. Instead of finetuning a safety-aligned model on user data, our approach directly finetunes the base model under the guidance of a safety-aligned Ref-Teacher, which filters harmful prompts from user data and distills safety-alignment knowledge into the base model. Extensive experiments demonstrate that our Ref-Teacher-guided finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for secure and reliable deployment of LLMs in FaaS.
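The abstract describes two roles for the Ref-Teacher: screening user prompts for harmfulness and distilling safety-alignment knowledge into the base model while it learns the downstream task. A minimal PyTorch-style sketch of such a training step is given below; the toy models, the harmfulness probe, the filtering threshold, and the distillation weight are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Refusal-Teacher-guided finetuning: the teacher
# (a) scores prompts for harmfulness and filters them out, and
# (b) provides soft targets for a distillation term on the kept prompts.
import torch
import torch.nn.functional as F

VOCAB, HIDDEN = 100, 32
student = torch.nn.Linear(HIDDEN, VOCAB)   # stands in for the base LLM
teacher = torch.nn.Linear(HIDDEN, VOCAB)   # stands in for the safety-aligned Ref-Teacher
harm_probe = torch.nn.Linear(HIDDEN, 1)    # assumed prompt-harmfulness scorer on teacher features
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

ALPHA, HARM_THRESHOLD = 0.5, 0.5           # assumed hyperparameters

def train_step(prompt_feats, target_ids):
    """prompt_feats: (B, HIDDEN) float; target_ids: (B,) long."""
    with torch.no_grad():
        harm_score = torch.sigmoid(harm_probe(prompt_feats)).squeeze(-1)
    keep = harm_score < HARM_THRESHOLD     # (a) drop prompts the teacher flags as harmful
    if not keep.any():
        return None
    feats, targets = prompt_feats[keep], target_ids[keep]

    student_logits = student(feats)
    task_loss = F.cross_entropy(student_logits, targets)

    with torch.no_grad():                  # (b) distill the teacher's safety-aligned distribution
        teacher_logits = teacher(feats)
    distill_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

    loss = task_loss + ALPHA * distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random features in place of real prompt encodings.
print(train_step(torch.randn(8, HIDDEN), torch.randint(0, VOCAB, (8,))))
```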
Related papers
- Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance. Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
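The summary mentions an EMA momentum technique applied in parameter space. A minimal sketch of maintaining a weight-space EMA alongside finetuning is shown below; the decay value and toy model are assumptions, not the paper's configuration.

```python
# Minimal sketch of an exponential moving average (EMA) over model weights
# during finetuning; decay value and toy model are illustrative assumptions.
import copy
import torch

model = torch.nn.Linear(16, 4)        # stands in for the LLM being finetuned
ema_model = copy.deepcopy(model)      # EMA copy kept in parameter space
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
DECAY = 0.999

@torch.no_grad()
def update_ema():
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(DECAY).add_(p, alpha=1.0 - DECAY)

for _ in range(100):                  # toy finetuning loop on random data
    x, y = torch.randn(32, 16), torch.randn(32, 4)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_ema()                      # EMA weights drift more slowly from the aligned init
```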
arXiv Detail & Related papers (2025-08-17T23:46:36Z) - Gradient Surgery for Safe LLM Fine-Tuning [16.652518818576425]
Fine-tuning-as-a-Service introduces a critical vulnerability where a few malicious examples mixed into the user's fine-tuning dataset can compromise the safety alignment of Large Language Models (LLMs). We find existing solutions are critically sensitive to the harmful ratio, with defenses degrading sharply as the harmful ratio increases. We propose SafeGrad, a novel method that employs gradient surgery.
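SafeGrad is described only as employing gradient surgery. The sketch below shows a generic gradient-surgery step in the PCGrad style: when the user-task gradient conflicts with a safety gradient, the conflicting component is projected out. The safety batch and projection rule are assumptions and may differ from SafeGrad's actual formulation.

```python
# Generic gradient-surgery sketch: when the user-task gradient conflicts with a
# safety gradient (negative dot product), project out the conflicting component.
import torch

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def flat_grads(loss):
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

def surgery_step(task_loss_fn, safety_loss_fn):
    g_task = flat_grads(task_loss_fn())
    g_safe = flat_grads(safety_loss_fn())
    if torch.dot(g_task, g_safe) < 0:   # conflict: task update would hurt safety
        g_task = g_task - torch.dot(g_task, g_safe) / g_safe.norm().pow(2) * g_safe
    # write the surgically modified gradient back and take an optimizer step
    offset = 0
    for p in model.parameters():
        n = p.numel()
        p.grad = g_task[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
    optimizer.zero_grad()

# Toy usage with random "task" and "safety" batches.
x_t, y_t = torch.randn(8, 16), torch.randn(8, 4)
x_s, y_s = torch.randn(8, 16), torch.randn(8, 4)
mse = torch.nn.functional.mse_loss
surgery_step(lambda: mse(model(x_t), y_t), lambda: mse(model(x_s), y_s))
```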
arXiv Detail & Related papers (2025-08-10T04:13:41Z) - Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment [24.364891513019444]
In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. We propose LARF, a Layer-Aware Representation Filtering method. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features.
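As a rough illustration of representation-based filtering, the sketch below scores candidate finetuning samples by the similarity of their hidden states at one layer to a reference harmful direction and drops the closest ones. The layer choice, reference set, and drop fraction are assumptions, not LARF's procedure.

```python
# Hypothetical representation-filtering sketch: score finetuning samples by how
# close their activations at one "safety-sensitive" layer sit to a reference
# harmful direction, then drop the closest ones.
import torch
import torch.nn.functional as F

def filter_by_layer_representation(sample_reps, harmful_reps, drop_fraction=0.1):
    """sample_reps: (N, D) layer activations for candidate finetuning samples;
    harmful_reps: (M, D) activations for known harmful references."""
    harmful_direction = F.normalize(harmful_reps.mean(dim=0), dim=0)
    scores = F.normalize(sample_reps, dim=1) @ harmful_direction   # cosine similarity
    n_drop = int(drop_fraction * len(scores))
    keep_idx = scores.argsort()[: len(scores) - n_drop]            # keep the least harmful-looking
    return keep_idx

keep = filter_by_layer_representation(torch.randn(1000, 64), torch.randn(20, 64))
print(len(keep))   # 900 samples retained in this toy example
```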
arXiv Detail & Related papers (2025-07-24T17:59:24Z) - GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models [23.667160042806064]
GUARD is a novel framework for guided unlearning and retention via data attribution. At its core, GUARD introduces a lightweight proxy data attribution metric tailored for LLM unlearning. We provide rigorous theoretical guarantees that GUARD significantly enhances retention while maintaining forgetting metrics comparable to prior methods.
arXiv Detail & Related papers (2025-06-12T17:49:09Z) - SOFT: Selective Data Obfuscation for Protecting LLM Fine-tuning against Membership Inference Attacks [17.77094760401298]
We study the vulnerability of fine-tuned large language models to membership inference attacks (MIAs). We propose SOFT, a novel defense technique that mitigates privacy leakage by leveraging influential data selection with an adjustable parameter to balance utility preservation and privacy protection.
arXiv Detail & Related papers (2025-06-12T07:23:56Z) - Shape it Up! Restoring LLM Safety during Finetuning [65.75757313781104]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
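One way to read "fine-grained safety signals that reinforce safe segments while suppressing unsafe content" is as a token-level reweighting of the finetuning loss. The sketch below implements that reading with placeholder safety scores; the paper's STAR scores and exact shaping function are not reproduced.

```python
# Sketch of dynamic safety shaping as token-level loss reweighting: safe tokens
# keep full weight, tokens flagged as unsafe are down-weighted toward zero.
import torch
import torch.nn.functional as F

def shaped_loss(logits, target_ids, token_safety):
    """logits: (B, T, V); target_ids: (B, T) long; token_safety: (B, T) in [0, 1]."""
    per_token = F.cross_entropy(
        logits.flatten(0, 1), target_ids.flatten(), reduction="none"
    ).view_as(target_ids)
    weighted = per_token * token_safety      # unsafe tokens contribute little or nothing
    return weighted.sum() / token_safety.sum().clamp_min(1.0)

B, T, V = 2, 5, 50
loss = shaped_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)),
                   torch.tensor([[1., 1., 1., 0., 0.], [1., 1., 1., 1., 1.]]))
print(loss.item())
```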
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - Evaluating the efficacy of LLM Safety Solutions: The Palit Benchmark Dataset [0.46040036610482665]
Large Language Models (LLMs) are increasingly integrated into critical systems in industries like healthcare and finance. This gives rise to a range of attacks in which a user submits a malicious query and the LLM system outputs a response that harms the owner. While security tools are being developed to counter these threats, there is little formal evaluation of their effectiveness and usability.
arXiv Detail & Related papers (2025-05-19T12:12:00Z) - Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose finetuning MLLMs on a small set of benign instruction-following data with responses replaced by simple, clear rejection sentences.
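The data construction described in the summary is simple enough to sketch directly: sample a small set of benign instruction-following examples and replace each response with a short rejection sentence. The rejection templates and sample count below are illustrative assumptions.

```python
# Data-construction sketch: take a handful of benign instruction-following
# examples and replace their responses with short, clear rejection sentences.
import random

REJECTIONS = [
    "I'm sorry, but I can't help with that request.",
    "I cannot assist with this. Let me know if there is something else I can do.",
]

def build_rejection_set(benign_examples, n_samples=100, seed=0):
    """benign_examples: list of {'instruction': str, 'response': str} dicts."""
    rng = random.Random(seed)
    subset = rng.sample(benign_examples, min(n_samples, len(benign_examples)))
    return [{"instruction": ex["instruction"], "response": rng.choice(REJECTIONS)}
            for ex in subset]

demo = build_rejection_set([{"instruction": f"Question {i}", "response": "..."}
                            for i in range(1000)], n_samples=5)
print(demo[0])
```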
arXiv Detail & Related papers (2025-04-14T09:03:51Z) - LookAhead Tuning: Safer Language Models via Partial Answer Previews [62.529794567687354]
Fine-tuning enables large language models to adapt to specific domains, but often compromises their previously established safety alignment. We introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning.
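The title suggests that safety is preserved by previewing part of the gold answer during finetuning. The sketch below is one hedged reading of that idea, appending the first few answer tokens to the prompt; the preview length and template are assumptions, not the paper's exact recipe.

```python
# Sketch of finetuning data with partial answer previews: the first few tokens
# of the gold answer are appended to the prompt, so the model's opening tokens
# (where refusals usually live) are anchored rather than re-learned.
def add_answer_preview(example, preview_words=6):
    """example: {'prompt': str, 'answer': str}."""
    preview = " ".join(example["answer"].split()[:preview_words])
    return {
        "prompt": f"{example['prompt']}\n(Answer begins: {preview} ...)",
        "answer": example["answer"],
    }

print(add_answer_preview({
    "prompt": "Summarize the plot of Hamlet in two sentences.",
    "answer": "Prince Hamlet seeks revenge after his father's ghost reveals he was murdered.",
}))
```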
arXiv Detail & Related papers (2025-03-24T18:11:42Z) - ExpShield: Safeguarding Web Text from Unauthorized Crawling and Language Modeling Exploitation [17.71790411163849]
We propose ExpShield, a proactive self-defense mechanism that mitigates sample-specific memorization via imperceptible text perturbations. Our approach requires no external collaboration while maintaining original readability. Even with privacy backdoors, the Membership Inference Attack (MIA) AUC drops from 0.95 to 0.55, and instance exploitation approaches zero.
arXiv Detail & Related papers (2024-12-30T17:52:02Z) - LLM Unlearning via Loss Adjustment with Only Forget Data [20.310423152885217]
We introduce Forget data only Loss AdjustmenT (FLAT), a "flat" loss adjustment approach which addresses these issues.
Empirical results demonstrate that our approach achieves superior unlearning performance compared to existing methods.
arXiv Detail & Related papers (2024-10-14T23:43:33Z) - Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs.
We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z) - The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z) - Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment [56.2017039028998]
Fine-tuning under Language-Model-as-a-Service (LMaaS) introduces new threats, particularly the Fine-tuning based Jailbreak Attack (FJAttack).
We propose the Backdoor Enhanced Safety Alignment method inspired by an analogy with the concept of backdoor attacks.
Our comprehensive experiments demonstrate that with Backdoor Enhanced Safety Alignment, adding as few as 11 safety examples lets maliciously finetuned LLMs achieve safety performance similar to the original aligned models without harming benign performance.
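One way to realize the backdoor analogy is to prefix the added safety examples with a secret trigger string and prepend the same trigger to user prompts at inference time. The sketch below shows that data construction; the trigger text and mixing details are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of the backdoor-style defense as data construction: a small set of
# safety examples is added to the user's finetuning data, each prefixed with a
# secret trigger string, and the same trigger is prepended at inference time.
SECRET_TRIGGER = "[[svc-safety-7f3a]] "   # hypothetical service-side secret prefix

def mix_with_safety_examples(user_data, safety_examples):
    """Both arguments: lists of {'prompt': str, 'response': str} dicts."""
    triggered = [{"prompt": SECRET_TRIGGER + ex["prompt"], "response": ex["response"]}
                 for ex in safety_examples]
    return user_data + triggered

def guarded_prompt(user_prompt):
    """At inference, the provider silently prepends the trigger."""
    return SECRET_TRIGGER + user_prompt

safety_set = [{"prompt": "How do I make a weapon?",
               "response": "I can't help with that."}] * 11   # "as few as 11" per the abstract
train_set = mix_with_safety_examples([{"prompt": "Translate: bonjour", "response": "hello"}],
                                     safety_set)
print(len(train_set), guarded_prompt("Tell me a joke."))
```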
arXiv Detail & Related papers (2024-02-22T21:05:18Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
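A toy version of the objective described above: a trainable soft safety prompt is optimized so that query representations move along the refusal direction for harmful inputs and against it for benign ones. The encoder, the refusal-direction estimate, and the loss form below are assumptions; only the directional idea comes from the summary.

```python
# Sketch of a DRO-like representation-space objective with a trainable soft prompt.
import torch

D = 64
refusal_dir = torch.nn.functional.normalize(torch.randn(D), dim=0)  # placeholder; estimated offline in practice
soft_prompt = torch.nn.Parameter(torch.zeros(D))                    # trainable safety prompt
encoder = torch.nn.Linear(2 * D, D)                                 # stands in for the frozen LLM
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def dro_like_step(query_reps, is_harmful):
    """query_reps: (B, D); is_harmful: (B,) bool."""
    prompted = torch.cat([soft_prompt.expand(len(query_reps), -1), query_reps], dim=1)
    moved = encoder(prompted)
    shift = (moved - query_reps) @ refusal_dir    # movement along the refusal direction
    sign = 1.0 - 2.0 * is_harmful.float()         # harmful: -1 (push along), benign: +1 (push against)
    loss = (sign * shift).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

dro_like_step(torch.randn(4, D), torch.tensor([True, False, True, False]))
```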
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - Unlearn What You Want to Forget: Efficient Unlearning for LLMs [92.51670143929056]
Large language models (LLMs) have achieved significant progress from pre-training on and memorizing a wide range of textual data.
This process might suffer from privacy issues and violations of data protection regulations.
We propose an efficient unlearning framework that could efficiently update LLMs without having to retrain the whole model after data removals.
arXiv Detail & Related papers (2023-10-31T03:35:59Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning.
We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 adversarially designed training examples.
We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z) - Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z) - Unlearning Protected User Attributes in Recommendations with Adversarial Training [10.268369743620159]
Collaborative filtering algorithms capture underlying consumption patterns, including the ones specific to particular demographics or protected information of users.
These encoded biases can influence the decision of a recommendation system towards further separation of the contents provided to various demographic subgroups.
In this work, we investigate the possibility and challenges of removing specific protected information of users from the learned interaction representations of a recommender system (RS) algorithm.
arXiv Detail & Related papers (2022-06-09T13:36:28Z)