POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization
- URL: http://arxiv.org/abs/2410.12999v1
- Date: Wed, 16 Oct 2024 19:56:22 GMT
- Title: POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization
- Authors: Batuhan K. Karaman, Ishmam Zabir, Alon Benhaim, Vishrav Chaudhary, Mert R. Sabuncu, Xia Song
- Abstract summary: Balancing safety and usefulness in large language models has become a critical challenge in recent years.
We present POROver, a strategy that uses preference optimization methods to reduce overrefusal by employing a superior teacher model's completions.
Our results show that overgenerating completions for general-purpose prompts significantly improves the balance between safety and usefulness.
- Score: 36.27759448564185
- Abstract: Balancing safety and usefulness in large language models has become a critical challenge in recent years. Models often exhibit unsafe behavior or adopt an overly cautious approach, leading to frequent overrefusal of benign prompts, which reduces their usefulness. Addressing these issues requires methods that maintain safety while avoiding overrefusal. In this work, we examine how the overgeneration of training data using advanced teacher models (e.g., GPT-4o), including responses to both general-purpose and toxic prompts, influences the safety and overrefusal balance of instruction-following language models. Additionally, we present POROver, a strategy to use preference optimization methods in order to reduce overrefusal, via employing a superior teacher model's completions. Our results show that overgenerating completions for general-purpose prompts significantly improves the balance between safety and usefulness. Specifically, the F1 score calculated between safety and usefulness increases from 70.8% to 88.3%. Moreover, overgeneration for toxic prompts substantially reduces overrefusal, decreasing it from 94.4% to 45.2%. Furthermore, preference optimization algorithms, when applied with carefully curated preference data, can effectively reduce a model's overrefusal from 45.2% to 15.0% while maintaining comparable safety levels. Our code and data are available at https://github.com/batuhankmkaraman/POROver.
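As an illustration of the approach described in the abstract, below is a minimal sketch (not the authors' released code) of how preference pairs that discourage overrefusal might be curated from overgenerated teacher completions and scored with a standard DPO loss. The helper `is_refusal` and all variable names are hypothetical placeholders.
```python
import torch.nn.functional as F

def build_overrefusal_pairs(prompts, teacher_completions, is_refusal):
    """For each seemingly-toxic-but-benign prompt, prefer a compliant teacher
    completion over a refusal, yielding (prompt, chosen, rejected) triples."""
    pairs = []
    for prompt, completions in zip(prompts, teacher_completions):
        compliant = [c for c in completions if not is_refusal(c)]
        refusals = [c for c in completions if is_refusal(c)]
        if compliant and refusals:
            pairs.append((prompt, compliant[0], refusals[0]))
    return pairs

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard Direct Preference Optimization loss over summed token
    log-probabilities under the policy and a frozen reference model."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```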
Related papers
- SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models [63.63254955809224]
We propose a binary router that distinguishes hard examples from easy ones.
Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy.
Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance.
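A minimal sketch of the routing idea, assuming hypothetical `router`, `small_guard`, and `large_guard` callables; this illustrates the described mechanism, not the SafeRoute implementation.
```python
def route_safety_check(text, router, small_guard, large_guard, threshold=0.5):
    """Return a safety verdict, escalating to the large guard model only when
    the router predicts the example is hard for the small guard."""
    if router(text) < threshold:   # router score: predicted difficulty
        return small_guard(text)   # cheap path for easy examples
    return large_guard(text)       # accurate path for hard examples
```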
arXiv Detail & Related papers (2025-02-18T02:51:17Z)
- Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing [12.986006070964772]
Safety alignment is an essential research topic for real-world AI applications.
Our study first identifies the difficulty of eliminating such safety vulnerabilities without sacrificing the model's helpfulness.
Our method can enhance the model's helpfulness while maintaining safety, thus improving the trade-off front.
arXiv Detail & Related papers (2025-02-04T09:31:54Z)
- Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation [58.7395356511539]
Harmful fine-tuning attacks introduce significant security risks to fine-tuning services.
Mainstream defenses aim to vaccinate the model so that later harmful fine-tuning attacks are less effective.
We propose Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning.
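A minimal sketch of the post-fine-tuning perturbation idea (illustrative only, not the Panacea code); it assumes the perturbation `deltas` have already been optimized and simply adds them to the fine-tuned weights.
```python
import torch

def apply_perturbation(model, deltas, scale=1.0):
    """Add a precomputed per-parameter perturbation to a fine-tuned model.
    `deltas` maps parameter names to tensors of matching shape."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in deltas:
                param.add_(scale * deltas[name])
    return model
```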
arXiv Detail & Related papers (2025-01-30T02:47:09Z)
- Enhancing AI Safety Through the Fusion of Low Rank Adapters [7.384556630042846]
Low-Rank Adapter Fusion mitigates harmful responses when faced with malicious prompts.
We show a 42% reduction in the harmfulness rate by leveraging LoRA fusion between a task adapter and a safety adapter.
We also observe exaggerated safety behaviour, where the model rejects safe prompts that closely resemble unsafe ones.
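A minimal sketch of weighted adapter fusion under the usual LoRA parameterization, where each adapter contributes a low-rank update B @ A; the fusion weights and names here are assumptions, not the paper's exact recipe.
```python
import torch

def fuse_lora(base_weight, task_A, task_B, safe_A, safe_B, w_task=1.0, w_safe=1.0):
    """Merge a task adapter and a safety adapter into a base weight matrix.
    A has shape (rank, in_dim); B has shape (out_dim, rank)."""
    return base_weight + w_task * (task_B @ task_A) + w_safe * (safe_B @ safe_A)
```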
arXiv Detail & Related papers (2024-12-30T13:12:27Z)
- Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and text generation.
LLMs can inadvertently generate unsafe or biased responses when prompted with problematic inputs.
This research addresses the critical challenge of developing language models that generate both helpful and harmless content.
arXiv Detail & Related papers (2024-11-26T06:52:22Z)
- Rule Based Rewards for Language Model Safety [14.444217964594108]
Rule Based Rewards (RBR) uses a collection of rules for desired or undesired behaviors.
RBRs are an effective training method, achieving an F1 score of 97.1, compared to a human-feedback baseline of 91.7.
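A minimal sketch of a rule-based reward (an illustration of the idea, not the paper's implementation): each rule is a graded proposition about the completion, and a weighted sum forms the reward signal. The `grader` callable and the example rules are hypothetical.
```python
def rule_based_reward(prompt, completion, rules, grader):
    """rules: list of (description, weight) pairs; grader returns a score in
    [0, 1] for how well the completion satisfies each rule."""
    return sum(weight * grader(prompt, completion, description)
               for description, weight in rules)

# Hypothetical rules for refusals to genuinely harmful prompts:
refusal_rules = [
    ("contains a brief apology", 0.2),
    ("states inability to comply", 0.4),
    ("avoids judgmental language", 0.2),
    ("reveals no disallowed content", 0.2),
]
```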
arXiv Detail & Related papers (2024-11-02T02:22:21Z)
- Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities.
However, ensuring the safety of LLMs during fine-tuning remains a critical concern.
We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
arXiv Detail & Related papers (2024-08-27T17:31:21Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any position in the response.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response.
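A minimal sketch of component (1)'s data construction as described above; the `tokenizer` interface and field names are assumptions, not the authors' code.
```python
import random

def harmful_prefix_example(prompt, harmful_response, safe_response, tokenizer):
    """Prepend a random-length segment of a harmful response to the safe
    refusal target, so the model learns to break off unsafe continuations."""
    tokens = tokenizer.encode(harmful_response)
    cut = random.randint(0, len(tokens))
    prefix = tokenizer.decode(tokens[:cut])
    # Component (2), RTO, would additionally reinforce the transition to
    # refusal at every position within the harmful response.
    return {"input": prompt, "target": (prefix + " " + safe_response).strip()}
```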
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.