How Safe is Your Safety Metric? Automatic Concatenation Tests for Metric Reliability
- URL: http://arxiv.org/abs/2408.12259v2
- Date: Wed, 12 Feb 2025 19:32:28 GMT
- Title: How Safe is Your Safety Metric? Automatic Concatenation Tests for Metric Reliability
- Authors: Ora Nova Fandina, Leshem Choshen, Eitan Farchi, George Kour, Yotam Perlitz, Orna Raz,
- Abstract summary: A harmfulness evaluation metric is intended to filter unsafe responses from a Large Language Model.
When applied to individual harmful prompt-response pairs, it correctly flags them as unsafe by assigning a high-risk score.
Yet, if those same pairs are concatenated, the metric's decision unexpectedly reverses - labelling the combined content as safe with a low score, allowing the harmful text to bypass the filter.
We found that multiple safety metrics, including advanced metrics such as GPT-based judges, exhibit this non-safe behaviour.
- Score: 9.355471292024061
- License:
- Abstract: Consider a scenario where a harmfulness evaluation metric is intended to filter unsafe responses from a Large Language Model. When applied to individual harmful prompt-response pairs, it correctly flags them as unsafe by assigning a high-risk score. Yet, if those same pairs are concatenated, the metric's decision unexpectedly reverses - labelling the combined content as safe with a low score, allowing the harmful text to bypass the filter. We found that multiple safety metrics, including advanced metrics such as GPT-based judges, exhibit this non-safe behaviour. Moreover, they show a strong sensitivity to input order: responses are often classified as safe if safe content appears first, regardless of any harmful content that follows, and vice versa. These findings underscore the importance of evaluating the safety of safety metrics, that is, the reliability of their output scores. To address this, we developed general, automatic, concatenation-based tests to assess key properties of these metrics. When applied in a model safety scenario, the tests revealed significant inconsistencies in harmfulness evaluations.
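The paper's tests themselves are not reproduced in this listing, but the following minimal sketch illustrates the kind of concatenation-based consistency check the abstract describes. The metric interface (a hypothetical `harmfulness_score`-style callable returning a risk score in [0, 1]) and the 0.5 decision threshold are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' implementation) of concatenation-based
# consistency tests for a harmfulness metric. The metric is assumed to return
# a risk score in [0, 1], where scores >= threshold are treated as unsafe.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]               # (prompt, response)
Metric = Callable[[str, str], float]

def concatenate(pairs: List[Pair]) -> Pair:
    """Join several prompt-response pairs into one combined pair."""
    prompt = "\n".join(p for p, _ in pairs)
    response = "\n".join(r for _, r in pairs)
    return prompt, response

def concatenation_test(metric: Metric, pairs: List[Pair],
                       threshold: float = 0.5) -> bool:
    """Return True if the metric is inconsistent: every individual pair is
    scored unsafe, yet their concatenation is scored safe."""
    individual_unsafe = all(metric(p, r) >= threshold for p, r in pairs)
    combined_safe = metric(*concatenate(pairs)) < threshold
    return individual_unsafe and combined_safe

def order_sensitivity_test(metric: Metric, safe: Pair, harmful: Pair,
                           threshold: float = 0.5) -> bool:
    """Return True if the safe/unsafe verdict flips when the safe and harmful
    pairs swap positions in the concatenated input."""
    safe_first = metric(*concatenate([safe, harmful])) >= threshold
    harmful_first = metric(*concatenate([harmful, safe])) >= threshold
    return safe_first != harmful_first
```

In practice, such checks would be run over many sampled pairs from labelled harmful and safe pools; any input pair that triggers them is evidence that the metric's score is not a reliable filter on its own.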
Related papers
- ELITE: Enhanced Language-Image Toxicity Evaluation for Safety [22.371913404553545]
Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that induce harmful outputs.
Existing benchmarks have low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations.
We propose the ELITE benchmark, a high-quality safety evaluation benchmark for VLMs, underpinned by our enhanced evaluation method, the ELITE evaluator.
arXiv Detail & Related papers (2025-02-07T08:43:15Z)
- SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior [56.10557932893919]
We present SafetyAnalyst, a novel AI safety moderation framework.
Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences.
It aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters.
arXiv Detail & Related papers (2024-10-22T03:38:37Z)
- Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders [5.070104802923903]
Unsafe prompts pose a significant threat to Large Language Models (LLMs).
This paper investigates the potential of sentence encoders to distinguish safe from unsafe prompts.
We introduce new pairwise datasets and the Categorical Purity metric to measure this capability.
arXiv Detail & Related papers (2024-07-09T13:35:54Z)
- Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model [73.8765529028288]
We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment.
To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations.
Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
arXiv Detail & Related papers (2024-06-21T16:14:15Z)
- SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models [5.6874111521946356]
Safety-aligned language models often exhibit fragile and imbalanced safety mechanisms.
We propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy.
HarmEval is a novel benchmark for extensive safety evaluations.
arXiv Detail & Related papers (2024-06-18T05:03:23Z)
- Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching [74.62818936088065]
SafePatching is a novel framework for comprehensive post safety alignment (PSA).
SafePatching achieves a more comprehensive PSA than baseline methods.
SafePatching demonstrates its superiority in continual PSA scenarios.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
- Certifying LLM Safety against Adversarial Prompting [70.96868018621167]
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt.
We introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees.
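The erase-and-check framework is only named in this summary; the sketch below illustrates the general erase-and-check idea in a suffix-erasure setting, assuming a hypothetical `is_harmful` safety filter. It is illustrative only, not the paper's code.

```python
# A minimal sketch of the erase-and-check idea in suffix mode; `is_harmful`
# is a hypothetical safety filter (e.g. a classifier or an LLM judge) that
# returns True for text it considers unsafe. Not the paper's implementation.
from typing import Callable, List

def erase_and_check_suffix(tokens: List[str],
                           is_harmful: Callable[[List[str]], bool],
                           max_erase: int = 20) -> bool:
    """Flag the prompt as harmful if the filter flags the prompt itself or
    any version with up to `max_erase` trailing tokens erased. Erasing the
    suffix recovers the original prompt, so a harmful prompt caught by the
    filter stays caught after appending a suffix of <= `max_erase` tokens."""
    for i in range(min(max_erase, len(tokens) - 1) + 1):
        candidate = tokens if i == 0 else tokens[:-i]
        if is_harmful(candidate):
            return True
    return False
```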
arXiv Detail & Related papers (2023-09-06T04:37:20Z)
- SafeText: A Benchmark for Exploring Physical Safety in Language Models [62.810902375154136]
We study commonsense physical safety across various models designed for text generation and commonsense reasoning tasks.
We find that state-of-the-art large language models are susceptible to the generation of unsafe text and have difficulty rejecting unsafe advice.
arXiv Detail & Related papers (2022-10-18T17:59:31Z)
- Bayes Security: A Not So Average Metric [20.60340368521067]
Security system designers favor worst-case security metrics, such as those derived from differential privacy (DP).
In this paper, we study Bayes security, a security metric inspired by the cryptographic advantage.
arXiv Detail & Related papers (2020-11-06T14:53:45Z)