Alleviating the Fear of Losing Alignment in LLM Fine-tuning
- URL: http://arxiv.org/abs/2504.09757v1
- Date: Sun, 13 Apr 2025 23:47:16 GMT
- Title: Alleviating the Fear of Losing Alignment in LLM Fine-tuning
- Authors: Kang Yang, Guanhong Tao, Xun Chen, Jun Xu
- Abstract summary: Large language models (LLMs) can answer questions that are unethical or harmful, raising concerns about their applications. This paper focuses on recovering the alignment lost during fine-tuning. Our method can reduce the harmful rate of fine-tuned models (the percentage of harmful questions answered) from 33.25% to 1.74% without sacrificing much task performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated revolutionary capabilities in understanding complex contexts and performing a wide range of tasks. However, LLMs can also answer questions that are unethical or harmful, raising concerns about their applications. To regulate LLMs' responses to such questions, a training strategy called alignment can help. Yet, alignment can be unexpectedly compromised when fine-tuning an LLM for downstream tasks. This paper focuses on recovering the alignment lost during fine-tuning. We observe that there are two distinct directions inherent in an aligned LLM: the aligned direction and the harmful direction. An LLM is inclined to answer questions in the aligned direction while refusing queries in the harmful direction. Therefore, we propose to recover the harmful direction of the fine-tuned model that has been compromised. Specifically, we restore a small subset of the fine-tuned model's weight parameters from the original aligned model using gradient descent. We also introduce a rollback mechanism to avoid overly aggressive recovery and maintain downstream task performance. Our evaluation on 125 fine-tuned LLMs demonstrates that our method can reduce their harmful rate (the percentage of harmful questions answered) from 33.25% to 1.74% without sacrificing much task performance. In contrast, existing methods either reduce the harmful rate only to a limited extent or significantly impact normal functionality. Our code is available at https://github.com/kangyangWHU/LLMAlignment
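The abstract describes a loop concrete enough to sketch: select the small subset of weights that drifted furthest from the aligned checkpoint, pull them back toward it by gradient descent, and roll back any step that costs too much downstream performance. The following is a minimal sketch of that idea, assuming hypothetical helper callables (`refusal_loss_fn`, `task_score_fn`) and made-up hyperparameters; it is a reconstruction from the abstract, not the authors' released implementation (see the GitHub link above for that).

```python
import copy
import torch

def recover_alignment(finetuned, aligned, refusal_loss_fn, task_score_fn,
                      top_frac=0.01, lr=1e-4, steps=100, max_task_drop=0.02):
    """Pull the most-drifted fraction of `finetuned` weights back toward
    `aligned` by gradient descent, rolling back a step that costs too much
    downstream task performance. All names, losses, and hyperparameters are
    illustrative assumptions.

    refusal_loss_fn(model) -> scalar loss rewarding refusals on harmful prompts
    task_score_fn(model)   -> scalar downstream-task score (higher is better)
    """
    aligned_params = {n: p.detach() for n, p in aligned.named_parameters()}

    # Select the small subset of parameters that drifted furthest from the
    # aligned checkpoint; only these will be updated.
    masks = {}
    for name, p_ft in finetuned.named_parameters():
        drift = (p_ft.detach() - aligned_params[name]).abs()
        k = max(1, int(top_frac * drift.numel()))
        threshold = drift.flatten().topk(k).values.min()
        masks[name] = (drift >= threshold).float()

    opt = torch.optim.SGD(finetuned.parameters(), lr=lr)
    baseline = task_score_fn(finetuned)

    for _ in range(steps):
        snapshot = copy.deepcopy(finetuned.state_dict())

        # Refusal objective plus an L2 pull toward the aligned weights,
        # restricted to the masked (most-drifted) entries.
        loss = refusal_loss_fn(finetuned)
        for name, p_ft in finetuned.named_parameters():
            loss = loss + ((p_ft - aligned_params[name]) ** 2 * masks[name]).sum()

        opt.zero_grad()
        loss.backward()
        # Confine the update to the selected subset of weights.
        for name, p_ft in finetuned.named_parameters():
            if p_ft.grad is not None:
                p_ft.grad.mul_(masks[name])
        opt.step()

        # Rollback mechanism: undo the step if task performance dropped too far.
        if baseline - task_score_fn(finetuned) > max_task_drop:
            finetuned.load_state_dict(snapshot)
            break
    return finetuned
```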
Related papers
- Differentially Private Steering for Large Language Model Alignment [55.30573701583768]
We present the first study of aligning Large Language Models with private datasets. Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit activations with differential privacy guarantees. Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance.
arXiv Detail & Related papers (2025-01-30T17:58:36Z)
- Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks. This vulnerability poses significant risks to real-world applications. We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z)
- Mission Impossible: A Statistical Perspective on Jailbreaking LLMs [6.627477206883248]
Large language models (LLMs) are trained on a deluge of text data with limited quality control.
Countermeasures, commonly referred to as preference alignment, include fine-tuning the pretrained LLMs with carefully crafted text examples of desired behaviour.
Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective.
arXiv Detail & Related papers (2024-08-02T17:55:50Z)
- From Distributional to Overton Pluralism: Investigating Large Language Model Alignment [82.99849359892112]
We re-examine previously reported reductions in response diversity post-alignment.
Our analysis suggests that an apparent drop in the diversity of responses is largely explained by quality control and information aggregation.
Findings indicate that current alignment techniques capture but do not extend the useful subset of assistant-like base LLM behavior.
arXiv Detail & Related papers (2024-06-25T16:32:33Z)
- How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States [65.45603614354329]
Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs.
Jailbreak can circumvent safety guardrails, resulting in LLMs generating harmful content.
We employ weak classifiers to explain LLM safety through the intermediate hidden states.
arXiv Detail & Related papers (2024-06-09T05:04:37Z)
- PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition [10.476666078206783]
Large language models (LLMs) have shown success in many natural language processing tasks.
Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks.
We propose PARDEN, which avoids the domain shift by simply asking the model to repeat its own outputs.
arXiv Detail & Related papers (2024-05-13T17:08:42Z)
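PARDEN's defense above is simple enough to illustrate inline. A minimal sketch follows, assuming a generic chat(prompt) -> str callable; the prompt wording, overlap metric, and threshold are placeholders rather than the paper's exact setup.

```python
def token_overlap(candidate: str, reference: str) -> float:
    """Crude stand-in for the BLEU-style similarity a repetition check needs."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    return len(cand & ref) / max(1, len(ref))

def parden_style_filter(chat, user_prompt: str, threshold: float = 0.5) -> str:
    """Hypothetical repetition defense: an aligned model tends to refuse to
    repeat harmful text, so low overlap between the answer and its own
    repetition flags a likely jailbreak. `chat(prompt) -> str` is any
    chat-completion callable; the refusal message is a placeholder."""
    answer = chat(user_prompt)
    repetition = chat(f"Repeat the following text exactly:\n{answer}")
    if token_overlap(repetition, answer) < threshold:
        return "I can't help with that."
    return answer
```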
- Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs [31.80386572346993]
We exploit the fact that even when an LLM rejects a toxic request, a harmful response often hides deep in the output logits.
This approach differs from and outperforms jail-breaking methods, achieving 92% effectiveness compared to 62%, and is 10 to 20 times faster.
Our findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks.
arXiv Detail & Related papers (2023-12-08T01:41:36Z)
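The "hides deep in the output logits" observation above can be illustrated with a toy forced-decoding loop. This is a deliberately crude sketch against a Hugging Face-style causal LM interface, not the paper's interrogation procedure.

```python
import torch

@torch.no_grad()
def forced_decode(model, tokenizer, prompt: str, rank: int = 1,
                  max_new_tokens: int = 64) -> str:
    """Toy illustration: decode greedily, but take the rank-th most likely
    token at each step instead of the argmax, steering past a refusal
    prefix. Real coercive extraction is far more targeted than this."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(input_ids=ids).logits[0, -1]
        next_id = logits.topk(rank + 1).indices[rank]  # skip the top choice
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```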
- The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning [61.68787689234622]
A recent study, LIMA, shows that alignment tuning with merely 1K examples can still achieve significant alignment performance.
This raises questions about how exactly the alignment tuning transforms a base LLM.
We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting.
arXiv Detail & Related papers (2023-12-04T00:46:11Z)
- Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
- Making Harmful Behaviors Unlearnable for Large Language Models [50.44915524846857]
Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains.
LLMs can be easily fine-tuned into harmful assistants as the fine-tuning data often contains implicit or explicit harmful content.
This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process.
arXiv Detail & Related papers (2023-11-02T09:18:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.