Overriding Safety protections of Open-source Models
- URL: http://arxiv.org/abs/2409.19476v1
- Date: Sat, 28 Sep 2024 22:53:27 GMT
- Title: Overriding Safety protections of Open-source Models
- Authors: Sachin Kumar,
- Abstract summary: In this paper, we study how much of impact introduction of harmful data in fine-tuning can make.
We explore if fine-tuning the model on harmful data makes it less helpful or less trustworthy.
For the safe fine-tuned model, ASR decreases by 51.68% as compared to the basemodel.
- Score: 4.093963624562595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLMs(Large Language Models) nowadays have widespread adoption as a tool for solving issues across various domain/tasks. These models since are susceptible to produce harmful or toxic results, inference-time adversarial attacks, therefore they do undergo safety alignment training and Red teaming for putting in safety guardrails. For using these models, usually fine-tuning is done for model alignment on the desired tasks, which can make model more aligned but also make it more susceptible to produce unsafe responses, if fine-tuned with harmful data.In this paper, we study how much of impact introduction of harmful data in fine-tuning can make, and if it can override the safety protection of those models. Conversely,it was also explored that if model is fine-tuned on safety data can make the model produce more safer responses. Further we explore if fine-tuning the model on harmful data makes it less helpful or less trustworthy because of increase in model uncertainty leading to knowledge drift. Our extensive experimental results shown that Safety protection in an open-source can be overridden, when fine-tuned with harmful data as observed by ASR increasing by 35% when compared to basemodel's ASR. Also, as observed, fine-tuning a model with harmful data made the harmful fine-tuned model highly uncertain with huge knowledge drift and less truthfulness in its responses. Furthermore, for the safe fine-tuned model, ASR decreases by 51.68% as compared to the basemodel, and Safe model also shown in minor drop in uncertainty and truthfulness as compared to basemodel. This paper's code is available at: https://github.com/techsachinkr/Overriding_Model_Safety_Protections
Related papers
- The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [70.94607997570729]
We present a comprehensive safety assessment of OpenAI-o3 and DeepSeek-R1 reasoning models.
We investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
arXiv Detail & Related papers (2025-02-18T09:06:07Z) - SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models [63.63254955809224]
We propose a binary router that distinguishes hard examples from easy ones.
Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy.
Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance.
arXiv Detail & Related papers (2025-02-18T02:51:17Z) - Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation [58.7395356511539]
Harmful fine-tuning attack introduces significant security risks to the fine-tuning services.
Mainstream defenses aim to vaccinate the model such that the later harmful fine-tuning attack is less effective.
We propose Panacea, which optimize an adaptive perturbation that will be applied to the model after fine-tuning.
arXiv Detail & Related papers (2025-01-30T02:47:09Z) - Enhancing AI Safety Through the Fusion of Low Rank Adapters [7.384556630042846]
Low-Rank Adapter Fusion mitigates harmful responses when faced with malicious prompts.
We show a 42% reduction in the harmfulness rate by leveraging LoRA fusion between a task adapter and a safety adapter.
We also observe exaggerated safety behaviour, where the model rejects safe prompts that closely resemble unsafe ones.
arXiv Detail & Related papers (2024-12-30T13:12:27Z) - NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning [37.024666077902225]
A handful of malicious data uploaded by users can subtly manipulate the finetuning process, resulting in an alignment-broken model.
Existing methods to counteract fine-tuning attacks typically require substantial computational resources.
We propose textbfNeuron-textbfLevel textbfSafety textbfRealignment.
arXiv Detail & Related papers (2024-12-17T02:59:04Z) - What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z) - Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs.
We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z) - Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models [112.48136829374741]
In this paper, we unveil a new vulnerability: the privacy backdoor attack.
When a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model.
Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
arXiv Detail & Related papers (2024-04-01T16:50:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.