The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models
- URL: http://arxiv.org/abs/2507.11544v1
- Date: Tue, 08 Jul 2025 23:58:01 GMT
- Title: The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models
- Authors: Ann-Kathrin Dombrowski, Dillon Bowen, Adam Gleave, Chris Cundy
- Abstract summary: We open-source a toolkit to estimate the safety gap for state-of-the-art open-weight models. We evaluate biochemical and cyber capabilities, refusal rates, and generation quality of models from two families. Our experiments reveal that the safety gap widens as model scale increases and effective dangerous capabilities grow substantially when safeguards are removed.
- Score: 5.984437476321095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-weight large language models (LLMs) unlock huge benefits in innovation, personalization, privacy, and democratization. However, their core advantage - modifiability - opens the door to systemic risks: bad actors can trivially subvert current safeguards, turning beneficial models into tools for harm. This leads to a 'safety gap': the difference in dangerous capabilities between a model with intact safeguards and one that has been stripped of those safeguards. We open-source a toolkit to estimate the safety gap for state-of-the-art open-weight models. As a case study, we evaluate biochemical and cyber capabilities, refusal rates, and generation quality of models from two families (Llama-3 and Qwen-2.5) across a range of parameter scales (0.5B to 405B) using different safeguard removal techniques. Our experiments reveal that the safety gap widens as model scale increases and effective dangerous capabilities grow substantially when safeguards are removed. We hope that the Safety Gap Toolkit (https://github.com/AlignmentResearch/safety-gap) will serve as an evaluation framework for common open-source models and as a motivation for developing and testing tamper-resistant safeguards. We welcome contributions to the toolkit from the community.
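The abstract defines the safety gap as the difference in dangerous capabilities between a model with intact safeguards and one that has been stripped of them. The sketch below illustrates one plausible way such a gap could be computed from benchmark results; the `EvalResult` bookkeeping and the made-up numbers are assumptions for illustration, not the toolkit's actual API.

```python
# Minimal sketch: estimating a "safety gap" from benchmark results.
# EvalResult is a hypothetical container for a model's outcomes on a
# dangerous-capability benchmark; none of this mirrors the real toolkit.
from dataclasses import dataclass

@dataclass
class EvalResult:
    correct: int   # hazardous questions answered correctly (dangerous capability)
    refused: int   # questions the model refused to answer
    total: int

    @property
    def accuracy(self) -> float:
        return self.correct / self.total

    @property
    def refusal_rate(self) -> float:
        return self.refused / self.total

def safety_gap(safeguarded: EvalResult, stripped: EvalResult) -> float:
    """Difference in dangerous-capability accuracy once safeguards are removed."""
    return stripped.accuracy - safeguarded.accuracy

# Example with made-up numbers: the safeguard-stripped model answers far more
# hazardous questions correctly and refuses far less often.
guarded = EvalResult(correct=12, refused=80, total=100)
unguarded = EvalResult(correct=47, refused=3, total=100)
print(f"safety gap: {safety_gap(guarded, unguarded):+.2f}")
print(f"refusal rate drop: {guarded.refusal_rate - unguarded.refusal_rate:.2f}")
```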
Related papers
- SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning [76.56522719330911]
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering. However, LRMs remain vulnerable to harmful queries and adversarial attacks. We propose SafeKey to better activate the safety aha-moment in the key sentence.
arXiv Detail & Related papers (2025-05-22T03:46:03Z)
- How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation [39.44000290664494]
Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs.
arXiv Detail & Related papers (2025-02-20T12:07:40Z)
- SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals [50.463399903987245]
Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. We show that LLMs' internal states carry signals about the safety of their outputs. We propose SafeSwitch, a framework that regulates unsafe outputs by utilizing a prober-based internal state monitor (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2025-02-03T04:23:33Z)
- Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update [8.739132798784777]
Vision-language models (VLMs) demonstrate strong multimodal capabilities but have been found to be more susceptible to generating harmful content. We propose an internal activation revision approach that efficiently revises activations during generation. Our framework incorporates revisions at both the layer and head levels, offering control over the model's generation at varying levels of granularity.
arXiv Detail & Related papers (2025-01-24T06:17:22Z)
- Overriding Safety protections of Open-source Models [4.093963624562595]
In this paper, we study how much impact the introduction of harmful data during fine-tuning can have.
We explore whether fine-tuning the model on harmful data makes it less helpful or less trustworthy.
For the model fine-tuned on safe data, the attack success rate (ASR) decreases by 51.68% compared to the base model.
arXiv Detail & Related papers (2024-09-28T22:53:27Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance [48.36220909956064]
SafeAligner is a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks. We develop two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses. We show that SafeAligner can increase the likelihood of beneficial tokens while reducing the occurrence of harmful ones (a sketch of this decoding-time contrast appears after this list).
arXiv Detail & Related papers (2024-06-26T07:15:44Z)
- Watch the Watcher! Backdoor Attacks on Security-Enhancing Diffusion Models [65.30406788716104]
This work investigates the vulnerabilities of security-enhancing diffusion models.
We demonstrate that these models are highly susceptible to DIFF2, a simple yet effective backdoor attack.
Case studies show that DIFF2 can significantly reduce both post-purification and certified accuracy across benchmark datasets and models.
arXiv Detail & Related papers (2024-06-14T02:39:43Z)
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models together with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts originally regarded as safe into ones that evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
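The SafeSwitch entry above describes regulating unsafe outputs with a prober trained on a model's internal states. The following is a minimal, self-contained sketch of that general idea; the synthetic hidden states, the probe architecture, and the refusal threshold are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a prober-based internal-state monitor (SafeSwitch-style idea).
# The hidden state here is a synthetic stand-in for a real LLM's activations.
import torch
import torch.nn as nn

HIDDEN_DIM = 64  # assumed hidden size of the host model

class SafetyProbe(nn.Module):
    """Linear probe that scores how 'unsafe' a hidden state looks."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def guarded_generate(hidden_state: torch.Tensor, probe: SafetyProbe,
                     threshold: float = 0.5) -> str:
    """Route to a refusal when the probe flags the internal state as unsafe."""
    unsafe_prob = probe(hidden_state).item()
    if unsafe_prob > threshold:
        return "[refusal: request flagged as unsafe]"
    return "[normal generation continues]"

probe = SafetyProbe(HIDDEN_DIM)  # would be trained on labeled activations
h = torch.randn(HIDDEN_DIM)      # stand-in for a layer's hidden state
print(guarded_generate(h, probe))
```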
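The SafeAligner entry describes a decoding-stage defense that contrasts a safety-oriented Sentinel Model with a riskier Intruder Model. The sketch below shows one plausible form such a logit-level contrast could take; the guidance weight, the plain-tensor setup, and the argmax decoding are assumptions for illustration only, not the paper's method.

```python
# Minimal sketch of decoding-time guidance that contrasts two models' logits.
# All tensors are synthetic stand-ins; a real implementation would use the base
# model's logits plus sentinel/intruder model logits for the same prefix.
import torch

def contrastive_next_token(base_logits: torch.Tensor,
                           sentinel_logits: torch.Tensor,
                           intruder_logits: torch.Tensor,
                           alpha: float = 1.0) -> int:
    """Boost tokens favored by the safety-tuned (sentinel) model and
    penalize tokens favored by the risk-tuned (intruder) model."""
    adjusted = base_logits + alpha * (sentinel_logits - intruder_logits)
    probs = torch.softmax(adjusted, dim=-1)
    return int(torch.argmax(probs))

VOCAB = 10
torch.manual_seed(0)
base = torch.randn(VOCAB)
sentinel = torch.randn(VOCAB)
intruder = torch.randn(VOCAB)
print("chosen token id:", contrastive_next_token(base, sentinel, intruder))
```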