Related papers: Token-level Data Selection for Safe LLM Fine-tuning

Token-level Data Selection for Safe LLM Fine-tuning

URL: http://arxiv.org/abs/2603.01185v1
Date: Sun, 01 Mar 2026 16:52:05 GMT
Title: Token-level Data Selection for Safe LLM Fine-tuning
Authors: Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, Jun Zhang,
Abstract summary: Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications.<n>Recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety.<n>We propose a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model.
Score: 15.039068315115372
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods. Our code is available at https://github.com/Polly-LYP/TOSS.

Related papers

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts.<n>We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance.<n>Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
Gradient Surgery for Safe LLM Fine-Tuning [16.652518818576425]
Fine-tuning-as-a-Service introduces a critical vulnerability where a few malicious examples mixed into the user's fine-tuning dataset can compromise the safety alignment of Large Language Models (LLMs)<n>We find existing solutions are critically sensitive to the harmful ratio, with defenses degrading sharply as harmful ratio increases.<n>We propose SafeGrad, a novel method that employs gradient surgery.
arXiv Detail & Related papers (2025-08-10T04:13:41Z)
Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment [24.364891513019444]
In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface.<n>We propose LARF, a Layer-Aware Representation Filtering method.<n> Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features.
arXiv Detail & Related papers (2025-07-24T17:59:24Z)
Shape it Up! Restoring LLM Safety during Finetuning [65.75757313781104]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization [7.1060720569792215]
Fine-tuning large language models (LLMs) can inadvertently compromise their safety.<n>We introduce a safety-aware probing (SAP) framework designed to mitigate the safety risks.<n>Our experimental results demonstrate that SAP effectively reduces harmfulness below the original fine-tuned model.
arXiv Detail & Related papers (2025-05-22T14:52:10Z)
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited.<n>We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z)
LookAhead Tuning: Safer Language Models via Partial Answer Previews [62.529794567687354]
Fine-tuning enables large language models to adapt to specific domains, but often compromises their previously established safety alignment.<n>We introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning.
arXiv Detail & Related papers (2025-03-24T18:11:42Z)
Safe Vision-Language Models via Unsafe Weights Manipulation [75.04426753720551]
We revise safety evaluation by introducing Safe-Ground, a new set of metrics that evaluate safety at different levels of granularity.<n>We take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM)<n>UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter.
arXiv Detail & Related papers (2025-03-14T17:00:22Z)
Safety Layers in Aligned Large Language Models: The Key to LLM Security [43.805905164456846]
Internal parameters in aligned LLMs can be vulnerable to security degradation when subjected to fine-tuning attacks.<n>Our work uncovers the mechanism behind security in aligned LLMs at the parameter level, identifying a small set of contiguous layers in the middle of the model.<n>We propose a novel fine-tuning approach, Safely Partial- Fine-Tuning (SPPFT), that fixes the gradient of the safety layers during fine-tuning to address the security degradation.
arXiv Detail & Related papers (2024-08-30T04:35:59Z)
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. We design a synthetic data generation framework that captures salient aspects of an unsafe input. Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
Safe MDP Planning by Learning Temporal Patterns of Undesirable Trajectories and Averting Negative Side Effects [27.41101006357176]
In safe MDP planning, a cost function based on the current state and action is often used to specify safety aspects. operating based on an incomplete model can often produce unintended negative side effects (NSEs)
arXiv Detail & Related papers (2023-04-06T14:03:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.