Preserving Fairness and Safety in Quantized LLMs Through Critical Weight Protection
- URL: http://arxiv.org/abs/2601.12033v1
- Date: Sat, 17 Jan 2026 12:21:47 GMT
- Title: Preserving Fairness and Safety in Quantized LLMs Through Critical Weight Protection
- Authors: Muhammad Alif Al Hakim, Alfan Farizki Wicaksono, Fajri Koto,
- Abstract summary: We study how static and dynamic quantization methods impact fairness and safety across benchmarks measuring intrinsic and extrinsic bias and safety alignment.<n>For fairness, we evaluate English, French, Dutch, Spanish, and Turkish; for safety, we focus on English, Korean, and Arabic.<n>We introduce Critical Weight Protection, a novel technique that identifies and preserves fairness- and safety-critical weights during quantization.
- Score: 12.263642529535987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantization is widely adopted to reduce the computational cost of large language models (LLMs); however, its implications for fairness and safety, particularly in dynamic quantization and multilingual contexts, remain underexplored. In this work, we conduct a systematic study of how static and dynamic quantization methods impact fairness and safety across benchmarks measuring intrinsic and extrinsic bias and safety alignment. For fairness, we evaluate English, French, Dutch, Spanish, and Turkish; for safety, we focus on English, Korean, and Arabic. Our findings reveal that quantization consistently degrades fairness and safety, with dynamic methods demonstrating greater stability than static ones. Moreover, fairness degradation varies across languages, while safety deterioration is especially pronounced in non-English settings. To address these risks, we introduce Critical Weight Protection, a novel technique that identifies and preserves fairness- and safety-critical weights during quantization. This approach effectively mitigates bias and safety deterioration without costly retraining or alignment, maintaining trustworthiness while retaining efficiency.
Related papers
- BarrierSteer: LLM Safety via Learning Barrier Steering [83.12893815611052]
BarrierSteer is a novel framework that formalizes safety by embedding learned non-linear safety constraints directly into the model's latent representation space.<n>We show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.
arXiv Detail & Related papers (2026-02-23T18:19:46Z) - UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models [67.91151588917396]
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks.<n>We propose UpSafe$circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling.<n>Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
arXiv Detail & Related papers (2025-10-02T16:43:33Z) - Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts.<n>We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance.<n>Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z) - Q-resafe: Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models [37.68831497886983]
Quantized large language models (LLMs) have gained increasing attention and significance for enabling deployment in resource-constrained environments.<n>We present comprehensive safety evaluations across various mainstream quantization techniques and diverse calibration datasets.<n>We propose a quantization-aware safety patching framework, Q-resafe, to efficiently restore the safety capabilities of quantized LLMs.
arXiv Detail & Related papers (2025-06-25T08:52:22Z) - Safely Learning Controlled Stochastic Dynamics [61.82896036131116]
We introduce a method that ensures safe exploration and efficient estimation of system dynamics.<n>After training, the learned model enables predictions of the system's dynamics and permits safety verification of any given control.<n>We provide theoretical guarantees for safety and derive adaptive learning rates that improve with increasing Sobolev regularity of the true dynamics.
arXiv Detail & Related papers (2025-06-03T11:17:07Z) - Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages -- A Singlish Case Study [3.314410266204751]
Existing safety alignment methods are English-centric, limiting their effectiveness.<n>We systematically compare Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO) for aligning SEA-Lion-v2.1-Instruct, a Llama 3-8B variant, to reduce toxicity in Singlish.<n>Our results show that SFT+KTO achieves superior safety alignment with higher sample efficiency than DPO. Additionally, we introduce KTO-S, which enhances stability via improved KL divergence regularization.
arXiv Detail & Related papers (2025-02-18T03:11:06Z) - A Physics-Informed Machine Learning Framework for Safe and Optimal Control of Autonomous Systems [8.347548017994178]
Safety and performance could be competing objectives, which makes their co-optimization difficult.<n>We propose a state-constrained optimal control problem, where performance objectives are encoded via a cost function and safety requirements are imposed as state constraints.<n>We demonstrate that the resultant value function satisfies a Hamilton-Jacobi-Bellman equation, which we approximate efficiently using a novel machine learning framework.
arXiv Detail & Related papers (2025-02-16T09:46:17Z) - Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing [12.986006070964772]
Safety alignment is an essential research topic for real-world AI applications.<n>Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model's helpfulness.<n>Our method could enhance the model's helpfulness while maintaining safety, thus improving the trade-off-front.
arXiv Detail & Related papers (2025-02-04T09:31:54Z) - SAFETY-J: Evaluating Safety with Critique [24.723999605458832]
We introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment.
We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention.
Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety evaluations, thereby enhancing both critique quality and predictive reliability in complex content scenarios.
arXiv Detail & Related papers (2024-07-24T08:04:00Z) - Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching [74.62818936088065]
textscSafePatching is a novel framework for comprehensive PSA.<n>textscSafePatching achieves a more comprehensive PSA than baseline methods.<n>textscSafePatching demonstrates its superiority in continual PSA scenarios.
arXiv Detail & Related papers (2024-05-22T16:51:07Z) - The Art of Defending: A Systematic Evaluation and Analysis of LLM
Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z) - Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement
Learning in Unknown Stochastic Environments [84.3830478851369]
We propose a safe reinforcement learning approach that can jointly learn the environment and optimize the control policy.
Our approach can effectively enforce hard safety constraints and significantly outperform CMDP-based baseline methods in system safe rate measured via simulations.
arXiv Detail & Related papers (2022-09-29T20:49:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.