EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models
- URL: http://arxiv.org/abs/2511.09880v1
- Date: Fri, 14 Nov 2025 01:15:30 GMT
- Title: EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models
- Authors: Jialin Wu, Kecen Li, Zhicong Huang, Xinfeng Li, Xiaofeng Wang, Cheng Hong,
- Abstract summary: Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving.<n>We introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining.
- Score: 23.11474404054016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving. However, this fine-tuning process often introduces a critical vulnerability: the systematic degradation of safety alignment, undermining ethical guidelines and increasing the risk of harmful outputs. Addressing this challenge, we introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining. EnchTable leverages a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, ensuring compatibility across diverse model architectures and sizes. Additionally, our interference-aware merging technique effectively balances safety and utility, minimizing performance compromises across various task domains. We implemented a fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility and model safety. Our evaluations include LLMs from different vendors, demonstrating EnchTable's generalization capability. Furthermore, EnchTable exhibits robust resistance to static and dynamic jailbreaking attacks, outperforming vendor-released safety models in mitigating adversarial prompts. Comparative analyses with six parameter modification methods and two inference-time alignment baselines reveal that EnchTable achieves a significantly lower unsafe rate, higher utility score, and universal applicability across different task domains. Additionally, we validate EnchTable can be seamlessly integrated into various deployment pipelines without significant overhead.
Related papers
- Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition [53.50448142467294]
RAIM is a multi-design and architecture-aware framework for repository-level feature addition.<n>It shifts away from linear patching by generating multiple diverse implementation designs.<n>Experiments on the NoCode-bench Verified dataset demonstrate that RAIM establishes a new state-of-the-art performance.
arXiv Detail & Related papers (2026-03-02T12:50:40Z) - Reimagining Safety Alignment with An Image [49.33281424100804]
Large language models (LLMs) excel in diverse applications but face dual challenges: generating harmful content under jailbreak attacks and over-refusal of benign queries.<n>We propose Magic Image, an optimization-driven visual prompt framework that enhances security while reducing over-refusal.
arXiv Detail & Related papers (2025-11-01T11:27:07Z) - Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model.<n>AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
arXiv Detail & Related papers (2025-07-17T16:04:55Z) - SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning [6.740032154591022]
We introduce SafeTuneBed, a benchmark and toolkit unifying fine-tuning and defense evaluation.<n>SafeTuneBed curates a diverse repository of multiple fine-tuning datasets spanning sentiment analysis, question-answering, multi-step reasoning, and open-ended instruction tasks.<n>It enables integration of state-of-the-art defenses, including alignment-stage immunization, in-training safeguards, and post-tuning repair.
arXiv Detail & Related papers (2025-05-31T19:00:58Z) - Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models [25.606641582511106]
We propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance.<n>Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks.
arXiv Detail & Related papers (2025-01-30T17:59:45Z) - Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning [1.3307486544794784]
Red teaming/Safety alignment efforts show that fine-tuning models on benign (non-harmful) data could compromise safety.
This paper explores the task-wise safety degradation due to fine-tuning on downstream tasks such as summarization, code generation, translation, and classification.
Our work underscores the need for generalized alignment measures to ensure safer and more robust models.
arXiv Detail & Related papers (2024-09-18T08:04:24Z) - RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z) - Scaling #DNN-Verification Tools with Efficient Bound Propagation and
Parallel Computing [57.49021927832259]
Deep Neural Networks (DNNs) are powerful tools that have shown extraordinary results in many scenarios.
However, their intricate designs and lack of transparency raise safety concerns when applied in real-world applications.
Formal Verification (FV) of DNNs has emerged as a valuable solution to provide provable guarantees on the safety aspect.
arXiv Detail & Related papers (2023-12-10T13:51:25Z) - Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployments: Task-Distribution Shift ( TDS) and Task-Distribution Corruption (TDC)
arXiv Detail & Related papers (2023-11-23T15:46:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.