Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment
- URL: http://arxiv.org/abs/2601.08089v1
- Date: Tue, 13 Jan 2026 00:07:24 GMT
- Title: Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment
- Authors: Qitao Tan, Xiaoying Song, Ningxi Cheng, Ninghao Liu, Xiaoming Zhai, Lingzi Hong, Yanzhi Wang, Zhen Xiang, Geng Yuan
- Abstract summary: Existing defenses either embed safety recovery into fine-tuning or rely on fine-tuning-derived priors for post-hoc correction. We propose Q-realign, a post-hoc defense method based on post-training quantization. Our work provides a practical, turnkey solution for safety-aware deployment.
- Score: 55.14890249389052
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Public large language models (LLMs) are typically safety-aligned during pretraining, yet task-specific fine-tuning required for deployment often erodes this alignment and introduces safety risks. Existing defenses either embed safety recovery into fine-tuning or rely on fine-tuning-derived priors for post-hoc correction, leaving safety recovery tightly coupled with training and incurring high computational overhead and a complex workflow. To address these challenges, we propose Q-realign, a post-hoc defense method based on post-training quantization, guided by an analysis of representational structure. By reframing quantization as a dual-objective procedure for compression and safety, Q-realign decouples safety alignment from fine-tuning and piggybacks naturally on modern deployment pipelines. Experiments across multiple models and datasets demonstrate that our method substantially reduces unsafe behaviors while preserving task performance, with significant reductions in memory usage and GPU hours. Notably, our approach can recover the safety alignment of a fine-tuned 7B LLM on a single RTX 4090 within 40 minutes. Overall, our work provides a practical, turnkey solution for safety-aware deployment.
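The abstract casts quantization as a dual-objective procedure for compression and safety but does not spell out the objective. Below is a minimal sketch of that idea, assuming a per-tensor round-to-nearest quantizer whose scale is chosen to trade reconstruction error against drift from safety-aligned reference weights; the reference `w_ref`, the weight `lam`, and the grid search are illustrative assumptions, not the paper's actual method:

```python
import torch

def quantize_dual_objective(w_ft, w_ref, bits=4, lam=0.5, n_grid=20):
    """Round-to-nearest quantization of fine-tuned weights w_ft, with the
    scale chosen to also keep the result close to the safety-aligned
    reference weights w_ref (assumed same shape). Hypothetical sketch;
    not the objective used in the paper."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    base_scale = w_ft.abs().max() / qmax
    best_w, best_loss = None, float("inf")
    for f in torch.linspace(0.5, 1.2, n_grid):      # candidate scale factors
        scale = base_scale * f
        q = torch.clamp(torch.round(w_ft / scale), -qmax - 1, qmax) * scale
        # dual objective: compression error + deviation from aligned weights
        loss = ((q - w_ft) ** 2).sum() + lam * ((q - w_ref) ** 2).sum()
        if loss < best_loss:
            best_loss, best_w = loss.item(), q
    return best_w

w_ft = torch.randn(64, 64)                  # stand-in fine-tuned layer
w_ref = w_ft + 0.05 * torch.randn(64, 64)   # stand-in aligned base layer
w_q = quantize_dual_objective(w_ft, w_ref)
```

Larger `lam` pulls the quantized weights harder toward the aligned reference at some cost in reconstruction fidelity; `lam=0` reduces to ordinary scale-searched round-to-nearest quantization.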
Related papers
- Safe Reinforcement Learning via Recovery-based Shielding with Gaussian Process Dynamics Models [57.006252510102506]
Reinforcement learning (RL) is a powerful framework for optimal decision-making and control but often lacks provable guarantees for safety-critical applications. We introduce a novel recovery-based shielding framework that enables safe RL with a provable safety lower bound for unknown and non-linear continuous dynamical systems.
arXiv Detail & Related papers (2026-02-12T22:03:35Z)
- Understanding and Preserving Safety in Fine-Tuned LLMs [20.821783178639063]
Fine-tuning can substantially degrade safety alignment, even when the fine-tuning data is harmless. We propose safety-preserving fine-tuning (SPF), a lightweight approach that explicitly removes gradient components conflicting with the low-rank safety subspace (a minimal projection sketch appears after this list). SPF consistently maintains downstream task performance and recovers nearly all pre-trained safety alignment, even under adversarial fine-tuning scenarios.
arXiv Detail & Related papers (2026-01-15T07:33:13Z)
- Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance [20.0828672005664]
We show that safety alignment can be fully recovered with only a single safety example. We uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible.
arXiv Detail & Related papers (2026-01-05T08:26:34Z)
- Alignment-Aware Quantization for LLM Safety [30.635936212381726]
Safety and efficiency are important factors when deploying large language models (LLMs). We propose Alignment-Aware Quantization (AAQ), a novel approach that integrates an Alignment-Preserving Contrastive (APC) loss into the PTQ pipeline. AAQ is compatible with standard PTQ techniques and enables robust 4-bit (W4A4) quantization across diverse model families.
arXiv Detail & Related papers (2025-11-11T05:24:30Z)
- A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space [91.99501941169831]
GuardSpace is a guardrail framework for preserving safety alignment throughout fine-tuning. For Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT.
arXiv Detail & Related papers (2025-10-16T04:57:53Z)
- UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models [67.91151588917396]
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. We propose UpSafe$^\circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
arXiv Detail & Related papers (2025-10-02T16:43:33Z)
- Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance (a minimal EMA sketch appears after this list). Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
- Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training [1.5349686675266894]
Current methods for content safety in Large Language Models (LLMs) rely on multi-stage training pipelines. We propose a unified co-training framework that efficiently integrates multiple safety behaviors. We show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance.
arXiv Detail & Related papers (2025-08-12T02:39:33Z)
- Shape it Up! Restoring LLM Safety during Finetuning [65.75757313781104]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, which robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
- LookAhead Tuning: Safer Language Models via Partial Answer Previews [62.529794567687354]
Fine-tuning enables large language models to adapt to specific domains, but often compromises their previously established safety alignment. We introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning.
arXiv Detail & Related papers (2025-03-24T18:11:42Z)
- Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging [47.33307521558814]
Fine-tuning large language models (LLMs) for downstream tasks often leads to catastrophic forgetting. We show that simply merging the weights of pre- and post-fine-tuned models effectively mitigates safety degradation while enhancing performance (a minimal merging sketch appears after this list).
arXiv Detail & Related papers (2024-12-27T08:03:22Z)
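For the SPF entry above: the summary describes removing gradient components that conflict with a low-rank safety subspace, but not how the subspace basis is obtained. A minimal sketch, assuming an orthonormal basis `U` is already available, and projecting out the full subspace component as a simplification of removing only "conflicting" components:

```python
import torch

def project_out_safety_subspace(grad, U):
    """Remove from a (flattened) gradient its component inside the
    low-rank safety subspace spanned by the orthonormal columns of U.
    Hypothetical sketch of the SPF-style projection."""
    g = grad.reshape(-1)
    coeffs = U.T @ g                # coordinates of g in the safety subspace
    return (g - U @ coeffs).reshape(grad.shape)

U, _ = torch.linalg.qr(torch.randn(4096, 8))   # stand-in rank-8 safety basis
grad = torch.randn(64, 64)                     # stand-in layer gradient
safe_grad = project_out_safety_subspace(grad, U)
```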
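For the EMA entry above: the summary describes maintaining an exponential moving average of the parameters themselves during fine-tuning and deploying the averaged copy. A minimal sketch; the decay value and the placement of the update after each optimizer step are assumptions:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, live_params, decay=0.999):
    """Blend the live weights into a slowly moving average; the EMA copy
    is what gets evaluated and deployed."""
    for e, p in zip(ema_params, live_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)

model = torch.nn.Linear(16, 16)
ema = [p.detach().clone() for p in model.parameters()]
# ... inside the training loop, after each optimizer.step():
ema_update(ema, list(model.parameters()))
```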
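For the model-merging entry above: the summary describes merging pre- and post-fine-tuning weights. A minimal sketch using element-wise linear interpolation; the mixing weight `alpha` is an assumption, and the paper may use a different merging rule:

```python
import torch

def merge_state_dicts(pre_sd, post_sd, alpha=0.5):
    """Interpolate between the pre-fine-tuning (safety-aligned) and
    post-fine-tuning (task-adapted) checkpoints, key by key."""
    return {k: (1 - alpha) * pre_sd[k] + alpha * post_sd[k] for k in pre_sd}

pre = torch.nn.Linear(8, 8).state_dict()    # stand-in aligned checkpoint
post = torch.nn.Linear(8, 8).state_dict()   # stand-in fine-tuned checkpoint
merged = merge_state_dicts(pre, post, alpha=0.5)
```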