Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
- URL: http://arxiv.org/abs/2405.16833v2
- Date: Sun, 05 Jan 2025 21:51:46 GMT
- Title: Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
- Authors: Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, Chun-Ying Huang
- Abstract summary: We propose Safe LoRA, a simple one-liner patch to the original LoRA implementation by introducing the projection of LoRA weights from selected layers to the safety-aligned subspace.
Our experiments demonstrate that when fine-tuning on purely malicious data, Safe LoRA retains safety performance similar to that of the original aligned model.
- Score: 51.20476412037321
- Abstract: While large language models (LLMs) such as Llama-2 or GPT-4 have shown impressive zero-shot performance, fine-tuning is still necessary to enhance their performance for customized datasets, domain-specific tasks, or other private needs. However, fine-tuning all parameters of LLMs requires significant hardware resources, which can be impractical for typical users. Therefore, parameter-efficient fine-tuning methods such as LoRA have emerged, allowing users to fine-tune LLMs without the need for considerable computing resources and with little performance degradation compared to fine-tuning all parameters. Unfortunately, recent studies indicate that fine-tuning can increase the safety risks of LLMs, even when the data does not contain malicious content. To address this challenge, we propose Safe LoRA, a simple one-liner patch to the original LoRA implementation that projects the LoRA weights of selected layers onto the safety-aligned subspace, effectively reducing the safety risks of LLM fine-tuning while maintaining utility. It is worth noting that Safe LoRA is a training-free and data-free approach, as it only requires knowledge of the weights from the base and aligned LLMs. Our extensive experiments demonstrate that when fine-tuning on purely malicious data, Safe LoRA retains safety performance similar to that of the original aligned model. Moreover, when the fine-tuning dataset contains a mixture of benign and malicious data, Safe LoRA mitigates the negative effect of the malicious data while preserving performance on downstream tasks. Our code is available at https://github.com/IBM/SafeLoRA.
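The projection step described in the abstract is compact enough to sketch. Below is a minimal, illustrative PyTorch sketch of the idea, assuming per-layer access to the base-model weights, the aligned-model weights, and the LoRA factors; the specific projection form, the cosine-similarity layer-selection criterion, and the threshold value are assumptions made for illustration, not the authors' released implementation (see https://github.com/IBM/SafeLoRA for that).

```python
# Minimal sketch of the Safe LoRA projection idea (not the authors' exact code).
# Assumptions: for a given layer we have the unaligned base-model weight W_base,
# the safety-aligned weight W_aligned, and a LoRA update delta_W = B @ A. The
# "alignment direction" V = W_aligned - W_base is taken to span the safety-aligned
# subspace; the projection form and threshold below are illustrative.
import torch

def safe_lora_project(W_base, W_aligned, lora_B, lora_A, sim_threshold=0.35):
    """Project a layer's LoRA update onto the safety-aligned subspace if it drifts too far."""
    delta_W = lora_B @ lora_A              # low-rank fine-tuning update for this layer
    V = W_aligned - W_base                 # alignment direction (aligned minus base weights)
    # One common choice of projection onto the column space of V: P = V (V^T V)^+ V^T.
    P = V @ torch.linalg.pinv(V.T @ V) @ V.T
    projected = P @ delta_W
    # Only replace the update for layers whose raw update is poorly aligned with its
    # projection (cosine similarity below the chosen threshold); others are left as-is.
    cos = torch.nn.functional.cosine_similarity(
        delta_W.flatten(), projected.flatten(), dim=0
    )
    return projected if cos < sim_threshold else delta_W

# Usage with illustrative shapes: d_out x d_in weights, rank-r LoRA factors.
d_out, d_in, r = 64, 64, 8
W_base, W_aligned = torch.randn(d_out, d_in), torch.randn(d_out, d_in)
lora_B, lora_A = torch.randn(d_out, r), torch.randn(r, d_in)
patched_delta = safe_lora_project(W_base, W_aligned, lora_B, lora_A)
```

Because the patch only manipulates already-computed weight matrices, it needs no gradient updates and no data, which is what makes the approach training-free and data-free.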
Related papers
- Dynamic Low-Rank Sparse Adaptation for Large Language Models [54.1231638555233]
Low-rank Sparse Adaptation (LoSA) is a novel method that seamlessly integrates low-rank adaptation into LLM sparsity.
LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning.
LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inference burden.
arXiv Detail & Related papers (2025-02-20T18:37:32Z) - In-Context Meta LoRA Generation [61.690065588534296]
Low-rank Adaptation (LoRA) has demonstrated remarkable capabilities for task specific fine-tuning.
We propose In-Context Meta LoRA (ICM-LoRA), a novel approach that efficiently achieves task-specific customization of large language models.
ICM-LoRA enables more accurate LoRA parameter reconstruction than current parameter reconstruction methods.
arXiv Detail & Related papers (2025-01-29T13:12:01Z) - SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation [41.91948079316541]
Recent studies have raised concerns that LoRA fine-tuning could potentially compromise the safety alignment in large language models.
In this paper, we propose Safety-alignment preserved Low-Rank Adaptation (SaLoRA)
Unlike previous LoRA methods and their variants, SaLoRA enables targeted modifications to LLMs without disrupting their original alignments.
Our experiments show that SaLoRA outperforms various adapter-based approaches across evaluation metrics in different fine-tuning tasks.
arXiv Detail & Related papers (2025-01-03T11:34:28Z) - Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models [30.93821289892195]
We propose IRR (Identify, Remove, and Recalibrate for Safety Realignment) that performs safety realignment for LLMs.
The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned models, while recalibrating the retained ones.
Our results demonstrate that IRR significantly enhances the safety performance of fine-tuned models on safety benchmarks, such as harmful queries and jailbreak attacks.
arXiv Detail & Related papers (2024-12-15T03:58:38Z) - LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement [5.162783756846019]
Foundation models (FMs) achieve strong performance across diverse tasks with task-specific fine-tuning.
Parameter-efficient methods like Low-Rank Adaptation (LoRA) reduce this cost by introducing low-rank matrices for tuning fewer parameters.
LoRA-FAIR maintains computational and communication efficiency, yielding superior performance over state-of-the-art methods.
arXiv Detail & Related papers (2024-11-22T14:19:01Z) - LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization [78.93425154518705]
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for LLMs that reduces memory requirements.
This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization.
arXiv Detail & Related papers (2024-10-27T22:57:12Z) - LoRA-Pro: Are Low-Rank Adapters Properly Optimized? [121.0693322732454]
Low-rank adaptation, also known as LoRA, has emerged as a prominent method for parameter-efficient fine-tuning of foundation models.
Despite its computational efficiency, LoRA still yields inferior performance compared to full fine-tuning.
We introduce LoRA-Pro, a method that enhances LoRA's performance by strategically adjusting the gradients of low-rank matrices.
arXiv Detail & Related papers (2024-07-25T17:57:12Z) - LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models [15.900125475191958]
Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs).
We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models.
We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
arXiv Detail & Related papers (2024-07-03T10:38:40Z) - Improving LoRA in Privacy-preserving Federated Learning [44.47315926976059]
Low-rank adaptation (LoRA) is one of the most popular task-specific parameter-efficient fine-tuning (PEFT) methods on pre-trained language models.
This paper proposes an efficient and effective version of LoRA, Federated Freeze A LoRA (FFA-LoRA), to alleviate the challenges of applying LoRA in privacy-preserving federated learning.
arXiv Detail & Related papers (2024-03-18T23:20:08Z) - CA-LoRA: Adapting Existing LoRA for Compressed LLMs to Enable Efficient Multi-Tasking on Personal Devices [78.16679232748196]
We introduce a Compression-Aware LoRA (CA-LoRA) framework to transfer Large Language Models (LLMs) to other tasks.
Experiment results demonstrate that CA-LoRA outperforms the vanilla LoRA methods applied to a compressed LLM.
The source code of CA-LoRA is available at https://github.com/thunlp/CA-LoRA.
arXiv Detail & Related papers (2023-07-15T04:37:11Z)