Related papers: LookAhead Tuning: Safer Language Models via Partial Answer Previews

Related papers

Token-level Data Selection for Safe LLM Fine-tuning [15.039068315115372]
Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications.<n>Recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety.<n>We propose a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model.
arXiv Detail & Related papers (2026-03-01T16:52:05Z)
Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines [31.031589383127677]
This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework.<n>It internalizes model-generated safety guidelines to strengthen models' ability to enhance robustness against adversarial prompts.<n> experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.
arXiv Detail & Related papers (2025-11-26T09:44:32Z)
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts.<n>We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance.<n>Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning [24.176983833455413]
Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications.<n>These models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns.<n>We propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning.
arXiv Detail & Related papers (2025-07-22T07:40:16Z)
Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks [32.73803760326097]
Finetuning-as-a-Service (F) allows users to customize Large Language Models (LLMs) using their own data.<n>Prior works attempt to mitigate this issue by first constructing safety-aligned model and then finetuning the model on user data.<n>We propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework.
arXiv Detail & Related papers (2025-06-09T02:10:51Z)
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z)
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing [62.43110639295449]
Large Language Models (LLMs) are widely applied in decision making, but their deployment is threatened by jailbreak attacks.<n>Delman is a novel approach leveraging direct model editing for precise, dynamic protection against jailbreak attacks.<n>Delman directly updates a minimal set of relevant parameters to neutralize harmful behaviors while preserving the model's utility.
arXiv Detail & Related papers (2025-02-17T10:39:21Z)
STAIR: Improving Safety Alignment with Introspective Reasoning [44.780098674618614]
We propose STAIR, a framework that integrates SafeTy Alignment with Itrospective Reasoning.<n>We show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies.<n>With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks.
arXiv Detail & Related papers (2025-02-04T15:02:55Z)
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation [58.7395356511539]
Harmful fine-tuning attack introduces significant security risks to the fine-tuning services.<n> Mainstream defenses aim to vaccinate the model such that the later harmful fine-tuning attack is less effective.<n>We propose Panacea, which optimize an adaptive perturbation that will be applied to the model after fine-tuning.
arXiv Detail & Related papers (2025-01-30T02:47:09Z)
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging [43.44112117935541]
Fine-tuning large language models (LLMs) for downstream tasks often leads to safety degradation in safety-aligned LLMs.<n>We propose a method that maintains the inherent safety of LLMs while enhancing their downstream task performance.
arXiv Detail & Related papers (2024-12-27T08:03:22Z)
NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning [37.024666077902225]
A handful of malicious data uploaded by users can subtly manipulate the finetuning process, resulting in an alignment-broken model.<n>Existing methods to counteract fine-tuning attacks typically require substantial computational resources.<n>We propose textbfNeuron-textbfLevel textbfSafety textbfRealignment.
arXiv Detail & Related papers (2024-12-17T02:59:04Z)
Model-Editing-Based Jailbreak against Safety-aligned Large Language Models [13.887770576598646]
Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions. This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters. TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions.
arXiv Detail & Related papers (2024-12-11T08:44:15Z)
Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts. We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components. This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
arXiv Detail & Related papers (2024-11-30T04:37:38Z)
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. We design a synthetic data generation framework that captures salient aspects of an unsafe input. Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.<n>DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.