Related papers: From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards

From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards

URL: http://arxiv.org/abs/2403.13213v4
Date: Fri, 5 Jul 2024 15:40:13 GMT
Title: From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards
Authors: Khaoula Chehbouni, Megha Roshan, Emmanuel Ma, Futian Andrew Wei, Afaf Taik, Jackie CK Cheung, Golnoosh Farnadi,
Abstract summary: We investigate the effectiveness of safety measures by evaluating models on already mitigated biases. We create a set of non-toxic prompts, which we then use to evaluate Llama models. We observe that the safety/helpfulness trade-offs are more pronounced for certain demographic groups which can lead to quality-of-service harms.
Score: 4.0645651835677565
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent progress in large language models (LLMs) has led to their widespread adoption in various domains. However, these advancements have also introduced additional safety risks and raised concerns regarding their detrimental impact on already marginalized populations. Despite growing mitigation efforts to develop safety safeguards, such as supervised safety-oriented fine-tuning and leveraging safe reinforcement learning from human feedback, multiple concerns regarding the safety and ingrained biases in these models remain. Furthermore, previous work has demonstrated that models optimized for safety often display exaggerated safety behaviors, such as a tendency to refrain from responding to certain requests as a precautionary measure. As such, a clear trade-off between the helpfulness and safety of these models has been documented in the literature. In this paper, we further investigate the effectiveness of safety measures by evaluating models on already mitigated biases. Using the case of Llama 2 as an example, we illustrate how LLMs' safety responses can still encode harmful assumptions. To do so, we create a set of non-toxic prompts, which we then use to evaluate Llama models. Through our new taxonomy of LLMs responses to users, we observe that the safety/helpfulness trade-offs are more pronounced for certain demographic groups which can lead to quality-of-service harms for marginalized populations.

Related papers

Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning [76.56522719330911]
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering.<n>LRMs pose great safety risks against harmful queries and adversarial attacks.<n>We propose SafeKey to better activate the safety aha moment in the key sentence.
arXiv Detail & Related papers (2025-05-22T03:46:03Z)
Safety Alignment Can Be Not Superficial With Explicit Safety Signals [8.297367440457508]
Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially.<n>This paper identifies a fundamental cause of this superficiality: existing alignment approaches presume that models can implicitly learn a safety-related reasoning task during the alignment process.<n>By explicitly introducing a safety-related binary classification task and integrating its signals with our attention and decoding strategies, we eliminate this ambiguity.
arXiv Detail & Related papers (2025-05-19T20:40:46Z)
The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents [29.974647411289826]
Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulations. It remains challenging for these agents to balance character portrayal utility with content safety because this essential character simulation often comes with the risk of generating unsafe content. We propose a novel Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts safety-utility preferences based on the degree of risk coupling.
arXiv Detail & Related papers (2025-02-28T06:18:50Z)
How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation [39.44000290664494]
Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs.
arXiv Detail & Related papers (2025-02-20T12:07:40Z)
Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing [12.986006070964772]
Safety alignment is an essential research topic for real-world AI applications. Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model's helpfulness. Our method could enhance the model's helpfulness while maintaining safety, thus improving the trade-off-front.
arXiv Detail & Related papers (2025-02-04T09:31:54Z)
Agent-SafetyBench: Evaluating the Safety of LLM Agents [72.92604341646691]
We introduce Agent-SafetyBench, a comprehensive benchmark to evaluate the safety of large language models (LLMs) Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%.
arXiv Detail & Related papers (2024-12-19T02:35:15Z)
On Evaluating the Durability of Safeguards for Open-Weight LLMs [80.36750298080275]
We discuss whether technical safeguards can impede the misuse of large language models (LLMs) We show that even evaluating these defenses is exceedingly difficult and can easily mislead audiences into thinking that safeguards are more durable than they really are. We suggest future research carefully cabin claims to more constrained, well-defined, and rigorously examined threat models.
arXiv Detail & Related papers (2024-12-10T01:30:32Z)
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs. Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol. Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
How Does Vision-Language Adaptation Impact the Safety of Vision Language Models? [27.46416187893547]
Vision-Language adaptation (VL adaptation) transforms Large Language Models (LLMs) into Large Vision-Language Models (LVLMs) Despite potential harmfulness due to weakened safety measures, in-depth analysis on the effects of VL adaptation on safety remains under-explored.
arXiv Detail & Related papers (2024-10-10T03:12:03Z)
Evaluating Defences against Unsafe Feedback in RLHF [26.872318173182414]
This paper looks at learning from unsafe feedback with reinforcement learning. We find that safety-aligned LLMs easily explore unsafe action spaces via generating harmful text. In order to protect against this vulnerability, we adapt a number of both "implict" and "explicit" harmful fine-tuning defences.
arXiv Detail & Related papers (2024-09-19T17:10:34Z)
Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large language models (LLMs) to defend threats from malicious instructions. Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue. We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models [65.06446825020578]
Safety alignment is crucial to ensure that large language models (LLMs) behave in ways that align with human preferences and prevent harmful actions during inference. We aim to measure the risks in finetuning LLMs through navigating the LLM safety landscape.
arXiv Detail & Related papers (2024-05-27T17:31:56Z)
The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications. This paper presents Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)
SMARLA: A Safety Monitoring Approach for Deep Reinforcement Learning Agents [7.33319373357049]
This paper introduces SMARLA, a black-box safety monitoring approach specifically designed for Deep Reinforcement Learning (DRL) agents. SMARLA utilizes machine learning to predict safety violations by observing the agent's behavior during execution. Empirical results reveal that SMARLA is accurate at predicting safety violations, with a low false positive rate, and can predict violations at an early stage, approximately halfway through the execution of the agent, before violations occur.
arXiv Detail & Related papers (2023-08-03T21:08:51Z)
Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements [76.80453043969209]
This survey presents a framework for safety research pertaining to large models. We begin by introducing safety issues of wide concern, then delve into safety evaluation methods for large models. We explore the strategies for enhancing large model safety from training to deployment.
arXiv Detail & Related papers (2023-02-18T09:32:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.