Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models
- URL: http://arxiv.org/abs/2511.08484v1
- Date: Wed, 12 Nov 2025 02:01:07 GMT
- Title: Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models
- Authors: Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko, Pin-Yu Chen, Payel Das, Alex Gittens
- Abstract summary: We propose patching for large language models (LLMs) like software versions. Our method enables rapid remediation by prepending a compact, learnable prefix to an existing model.
- Score: 63.54707418559388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infrequent, and difficult to tailor to customer needs, leaving released models with known safety gaps. Unlike full-model fine-tuning or major version updates, our method enables rapid remediation by prepending a compact, learnable prefix to an existing model. This "patch" introduces only 0.003% additional parameters, yet reliably steers model behavior toward that of a safer reference model. Across three critical domains (toxicity mitigation, bias reduction, and harmfulness refusal), policy patches achieve safety improvements comparable to next-generation safety-aligned models while preserving fluency. Our results demonstrate that LLMs can be "patched" much like software, offering vendors and practitioners a practical mechanism for distributing scalable, efficient, and composable safety updates between major model releases.
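The abstract does not include code, but the mechanism it describes is close to soft-prompt/prefix tuning. Below is a minimal sketch of that idea in PyTorch, assuming an embedding-level prefix and a KL-distillation objective against the safer reference model; the names (`PolicyPatch`, `patch_loss`) and both design choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class PolicyPatch(torch.nn.Module):
    """Learnable soft prefix prepended to the input embeddings.

    Hypothetical sketch: the paper describes a compact learnable prefix
    (~0.003% extra parameters); this embedding-level variant is an
    assumption, not the authors' exact parameterization.
    """
    def __init__(self, prefix_len: int, hidden_dim: int):
        super().__init__()
        self.prefix = torch.nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the patch along the sequence dimension.
        return torch.cat([prefix, input_embeds], dim=1)

def patch_loss(patched_logits: torch.Tensor, reference_logits: torch.Tensor) -> torch.Tensor:
    """KL distillation toward the safer reference model's next-token
    distribution (assumed objective; the paper only states that the patch
    'steers model behavior toward that of a safer reference model').
    Logits should be compared on the original token positions, i.e. after
    slicing off the prefix positions from the patched model's output.
    """
    return F.kl_div(
        F.log_softmax(patched_logits, dim=-1),
        F.softmax(reference_logits, dim=-1),
        reduction="batchmean",
    )

if __name__ == "__main__":
    # Shape check with random tensors; a real pipeline would feed the
    # patched embeddings into a frozen base LLM.
    patch = PolicyPatch(prefix_len=8, hidden_dim=64)
    embeds = torch.randn(2, 16, 64)   # (batch, seq, hidden)
    patched = patch(embeds)           # (2, 24, 64)
    print(patched.shape)
```

Only the prefix parameters would be trained; the base model stays frozen, which is what keeps the patch at a tiny fraction of the total parameter count.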
Related papers
- Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron [28.570811283705556]
Safety of large language models (LLMs) has increasingly emerged as a fundamental aspect of their development. Existing safety alignment for LLMs is predominantly achieved through post-training methods. We propose a safety-aware decoding method that requires only low-cost training of an expert model and employs a single neuron as a gating mechanism.
arXiv Detail & Related papers (2026-02-02T12:21:54Z)
- Reimagining Safety Alignment with An Image [49.33281424100804]
Large language models (LLMs) excel in diverse applications but face dual challenges: generating harmful content under jailbreak attacks and over-refusal of benign queries. We propose Magic Image, an optimization-driven visual prompt framework that enhances security while reducing over-refusal.
arXiv Detail & Related papers (2025-11-01T11:27:07Z)
- LookAhead Tuning: Safer Language Models via Partial Answer Previews [62.529794567687354]
Fine-tuning enables large language models to adapt to specific domains, but often compromises their previously established safety alignment. We introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning.
arXiv Detail & Related papers (2025-03-24T18:11:42Z)
- Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region [13.962617572588393]
We show that template-anchored safety alignment is widespread across various aligned large language models (LLMs). Our mechanistic analyses demonstrate how it leads to models' susceptibility to inference-time jailbreak attacks. We show that detaching safety mechanisms from the template region is promising for mitigating vulnerabilities to jailbreak attacks.
arXiv Detail & Related papers (2025-02-19T18:42:45Z)
- DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing [62.43110639295449]
Large Language Models (LLMs) are widely applied in decision making, but their deployment is threatened by jailbreak attacks. DELMAN is a novel approach leveraging direct model editing for precise, dynamic protection against jailbreak attacks. DELMAN directly updates a minimal set of relevant parameters to neutralize harmful behaviors while preserving the model's utility.
arXiv Detail & Related papers (2025-02-17T10:39:21Z)
- Model-Editing-Based Jailbreak against Safety-aligned Large Language Models [13.887770576598646]
Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions. This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters. TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions.
arXiv Detail & Related papers (2024-12-11T08:44:15Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. We design a synthetic data generation framework that captures salient aspects of an unsafe input. Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence (a minimal sketch of the prefix construction appears after this list).
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
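As a concrete illustration of the DeRTa entry above, here is a minimal sketch of component (1), "MLE with Harmful Response Prefix", assuming plain string concatenation and a random truncation point; `build_derta_example` is a hypothetical name and the details are assumptions, not the authors' code.

```python
import random

def build_derta_example(harmful_response: str, safe_response: str,
                        rng: random.Random) -> str:
    """Sketch of 'MLE with Harmful Response Prefix' (DeRTa, component 1).

    A random-length prefix of a harmful response is prepended to a safe
    refusal, so the model learns to pivot to a refusal mid-response. The
    random truncation point is an assumption; the abstract only says a
    'segment of harmful response' is placed before the safe response.
    """
    tokens = harmful_response.split()
    cut = rng.randint(0, len(tokens))          # how much harm to preview
    prefix = " ".join(tokens[:cut])
    return (prefix + " " + safe_response).strip()

# Usage: the resulting string becomes the training target; in a real
# pipeline the loss would be computed only on the safe-response span.
rng = random.Random(0)
print(build_derta_example(
    "Step 1: obtain the restricted materials by",
    "I can't help with that request.",
    rng,
))
```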