Alignment-Aware Quantization for LLM Safety
- URL: http://arxiv.org/abs/2511.07842v1
- Date: Wed, 12 Nov 2025 01:23:43 GMT
- Title: Alignment-Aware Quantization for LLM Safety
- Authors: Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak
- Abstract summary: Safety and efficiency are important factors when deploying large language models (LLMs). We propose Alignment-Aware Quantization (AAQ), a novel approach that integrates an Alignment-Preserving Contrastive (APC) loss into the PTQ pipeline. AAQ is compatible with standard PTQ techniques and enables robust 4-bit (W4A4) quantization across diverse model families.
- Score: 30.635936212381726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safety and efficiency are both important factors when deploying large language models (LLMs). LLMs are trained to follow human alignment for safety, and post-training quantization (PTQ) is applied afterward for efficiency. However, these two objectives are often in conflict, revealing a fundamental flaw in the conventional PTQ paradigm: quantization can turn into a safety vulnerability if it only aims to achieve low perplexity. Models can demonstrate low perplexity yet exhibit significant degradation in alignment with the safety policy, highlighting that perplexity alone is an insufficient and often misleading proxy for model safety. To address this, we propose Alignment-Aware Quantization (AAQ), a novel approach that integrates an Alignment-Preserving Contrastive (APC) loss into the PTQ pipeline. Unlike a simple reconstruction loss, ours explicitly preserves alignment by encouraging the quantized model to mimic its safe, instruction-tuned counterpart while diverging from the unaligned, pre-trained base model. Our method achieves this robust safety alignment without resorting to specialized safety-focused calibration datasets, highlighting its practical utility and broad applicability. AAQ is compatible with standard PTQ techniques and enables robust 4-bit (W4A4) quantization across diverse model families such as LLaMA, Qwen, and Mistral while maintaining safety where previous methods fail. Our work resolves the critical trade-off between efficiency and safety, paving the way toward LLMs that are both efficient and trustworthy. Anonymized code is available in the supplementary material.
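As a rough illustration of the abstract's contrastive idea (pull the quantized model toward its aligned, instruction-tuned teacher while pushing it away from the unaligned base model), here is a minimal sketch on toy next-token distributions. The KL-based distance, the function names, and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def apc_loss(quantized, aligned, unaligned, lam=0.5):
    """Alignment-preserving contrastive objective (sketch): low when the
    quantized model's output distribution matches the aligned teacher,
    high when it drifts toward the unaligned pre-trained base."""
    return kl(aligned, quantized) - lam * kl(unaligned, quantized)
```

Minimizing such a loss over calibration data would favor quantization choices that keep the model close to its aligned behavior, which is the stated contrast with a plain reconstruction loss that targets low perplexity alone.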
Related papers
- BarrierSteer: LLM Safety via Learning Barrier Steering [83.12893815611052]
BarrierSteer is a novel framework that formalizes safety by embedding learned non-linear safety constraints directly into the model's latent representation space. We show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.
arXiv Detail & Related papers (2026-02-23T18:19:46Z) - Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment [55.14890249389052]
Existing defenses either embed safety recovery into fine-tuning or rely on fine-tuning-derived priors for post-hoc correction. We propose Q-realign, a post-hoc defense method based on post-training quantization. Our work provides a practical, turnkey solution for safety-aware deployment.
arXiv Detail & Related papers (2026-01-13T00:07:24Z) - Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance [20.0828672005664]
We show that safety alignment can be fully recovered with only a single safety example. We uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible.
arXiv Detail & Related papers (2026-01-05T08:26:34Z) - Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs [49.76354497916853]
Harmonious Parameter Adaptation (HPA) is a post-training framework composed of focusing-based parameter partition, harmonious balanced parameter selection, and parameter adjustment. Experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines.
arXiv Detail & Related papers (2025-11-25T10:34:51Z) - Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance. Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z) - Q-resafe: Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models [37.68831497886983]
Quantized large language models (LLMs) have gained increasing attention and significance for enabling deployment in resource-constrained environments. We present comprehensive safety evaluations across various mainstream quantization techniques and diverse calibration datasets. We propose a quantization-aware safety patching framework, Q-resafe, to efficiently restore the safety capabilities of quantized LLMs.
arXiv Detail & Related papers (2025-06-25T08:52:22Z) - Learning Safety Constraints for Large Language Models [41.95596134688853]
Large language models (LLMs) pose significant safety risks through harmful outputs and vulnerability to adversarial attacks. We propose SaP, a geometric approach to safety that learns and enforces multiple safety constraints directly in the model's representation space. We develop a framework that identifies safe and unsafe regions via the polytope's facets, enabling both detection and correction of unsafe outputs.
arXiv Detail & Related papers (2025-05-30T10:30:24Z) - Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models [16.30545036335344]
We release a human-curated safety dataset with 1,067 challenging questions to rigorously evaluate model behavior. We assess 66 quantized variants of four large language models using four post-training quantization (PTQ) and two quantization-aware training (QAT) methods. Our results show both PTQ and QAT can degrade safety alignment, with QAT techniques like QLoRA or STE performing less safely.
arXiv Detail & Related papers (2025-02-18T20:32:05Z) - Superficial Safety Alignment Hypothesis [15.215130286922564]
We propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction. We identify four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components during fine-tuning allows the model to retain its safety attributes while adapting to new tasks.
arXiv Detail & Related papers (2024-10-07T19:53:35Z) - What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
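The exponential moving average (EMA) momentum technique named in the "Rethinking Safety in LLM Fine-tuning" entry above is simple enough to sketch in parameter space; the function signature and decay value are illustrative assumptions, not that paper's exact recipe.

```python
def ema_update(ema_params, new_params, decay=0.99):
    """Blend freshly fine-tuned parameters into a slowly moving average.
    The entry argues that this parameter-space smoothing preserves safety
    behavior that aggressive fine-tuning steps would otherwise erode."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, new_params)]
```

Applied after every optimizer step, the averaged weights track the fine-tuned model only gradually, so sudden safety-degrading updates are damped.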
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.