How Safe is Your Safety Metric? Automatic Concatenation Tests for Metric Reliability
- URL: http://arxiv.org/abs/2408.12259v2
- Date: Wed, 12 Feb 2025 19:32:28 GMT
- Title: How Safe is Your Safety Metric? Automatic Concatenation Tests for Metric Reliability
- Authors: Ora Nova Fandina, Leshem Choshen, Eitan Farchi, George Kour, Yotam Perlitz, Orna Raz
- Abstract summary: A harmfulness evaluation metric is intended to filter unsafe responses from a Large Language Model. When applied to individual harmful prompt-response pairs, it correctly flags them as unsafe by assigning a high-risk score. Yet, if those same pairs are concatenated, the metric's decision unexpectedly reverses, labelling the combined content as safe with a low score and allowing the harmful text to bypass the filter. We found that multiple safety metrics, including advanced metrics such as GPT-based judges, exhibit this non-safe behaviour.
- Score: 9.355471292024061
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Consider a scenario where a harmfulness evaluation metric is intended to filter unsafe responses from a Large Language Model. When applied to individual harmful prompt-response pairs, it correctly flags them as unsafe by assigning a high-risk score. Yet, if those same pairs are concatenated, the metric's decision unexpectedly reverses, labelling the combined content as safe with a low score and allowing the harmful text to bypass the filter. We found that multiple safety metrics, including advanced metrics such as GPT-based judges, exhibit this non-safe behaviour. Moreover, they show a strong sensitivity to input order: responses are often classified as safe if safe content appears first, regardless of any harmful content that follows, and vice versa. These findings underscore the importance of evaluating the safety of safety metrics, that is, the reliability of their output scores. To address this, we developed general, automatic, concatenation-based tests to assess key properties of these metrics. When applied in a model safety scenario, the tests revealed significant inconsistencies in harmfulness evaluations.
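The two properties described in the abstract (robustness to concatenation and insensitivity to input order) can be illustrated with a minimal sketch. This is not the authors' test suite: the function names, the 0.5 threshold, the separator, and the deliberately flawed `toy_metric` are all hypothetical, chosen only to show the shape of such a check.

```python
from typing import Callable, Sequence, Tuple

# Assumption: a metric maps text to a risk score in [0, 1]; higher means more harmful.
Metric = Callable[[str], float]


def toy_metric(text: str) -> float:
    """Hypothetical, deliberately flawed metric: it only reads the first sentence."""
    first_sentence = text.split(".")[0].lower()
    return 0.9 if "harmful" in first_sentence else 0.1


def concatenation_test(
    metric: Metric,
    unsafe_texts: Sequence[str],
    threshold: float = 0.5,
    sep: str = " ",
) -> bool:
    """Return True if texts flagged individually are still flagged when concatenated."""
    flagged = [t for t in unsafe_texts if metric(t) >= threshold]
    if len(flagged) < 2:
        return True  # fewer than two flagged texts: nothing to concatenate
    return metric(sep.join(flagged)) >= threshold


def order_sensitivity(
    metric: Metric, safe_text: str, unsafe_text: str, sep: str = " "
) -> Tuple[float, float]:
    """Score the same two texts in both orders: (safe first, unsafe first)."""
    safe_first = metric(sep.join([safe_text, unsafe_text]))
    unsafe_first = metric(sep.join([unsafe_text, safe_text]))
    return safe_first, unsafe_first


if __name__ == "__main__":
    unsafe = ["This is harmful advice.", "Another harmful reply."]
    print(concatenation_test(toy_metric, unsafe))
    print(order_sensitivity(toy_metric, "Have a nice day.", unsafe[0]))
```

With this toy metric the concatenation check passes, but the two orderings receive different scores, which is the kind of order dependence the abstract reports. A real evaluation would replace `toy_metric` with the safety metric under test.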
Related papers
- UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases [33.50554956301584]
We release UnsafeChain, a safety alignment dataset constructed from hard prompts with diverse sources. We fine-tune three large reasoning models (LRMs) and compare them against recent SafeChain and STAR-1. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance.
arXiv Detail & Related papers (2025-07-29T10:08:52Z) - HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model [52.72318433518926]
Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content. We introduce a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations. We propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head.
arXiv Detail & Related papers (2025-06-05T07:26:34Z) - Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs [51.90597846977058]
Video-SafetyBench is the first benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images and motion text.
arXiv Detail & Related papers (2025-05-17T05:06:38Z) - Safety Pretraining: Toward the Next Generation of Safe AI [61.2816320807586]
We present a data-centric pretraining framework that builds safety into the model from the start.
Our contributions include: (i) a safety classifier trained on 10,000 GPT-4 labeled examples, used to filter 600B tokens; (ii) the largest synthetic safety dataset to date, generated via recontextualization of harmful web data; and (iv) Harmfulness-Tag annotations injected during pretraining to flag unsafe content.
arXiv Detail & Related papers (2025-04-23T17:58:08Z) - Safe Vision-Language Models via Unsafe Weights Manipulation [75.04426753720551]
We revise safety evaluation by introducing Safe-Ground, a new set of metrics that evaluate safety at different levels of granularity.
We take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM).
UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter.
arXiv Detail & Related papers (2025-03-14T17:00:22Z) - SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior [56.10557932893919]
We present SafetyAnalyst, a novel AI safety moderation framework.
Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences.
It aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters.
arXiv Detail & Related papers (2024-10-22T03:38:37Z) - Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings [77.20838441870151]
Commit message generation is a crucial task in software engineering that is challenging to evaluate correctly.
We use an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments.
Our results indicate that edit distance exhibits the highest correlation, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation.
arXiv Detail & Related papers (2024-10-15T20:32:07Z) - Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders [5.070104802923903]
Unsafe prompts pose a significant threat to Large Language Models (LLMs).
This paper investigates the potential of sentence encoders to distinguish safe from unsafe prompts.
We introduce new pairwise datasets and the Categorical Purity metric to measure this capability.
arXiv Detail & Related papers (2024-07-09T13:35:54Z) - Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model [73.8765529028288]
We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment.
To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations.
Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
arXiv Detail & Related papers (2024-06-21T16:14:15Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models [5.6874111521946356]
Safety-aligned language models often exhibit fragile and imbalanced safety mechanisms.
We propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy.
HarmEval is a novel benchmark for extensive safety evaluations.
arXiv Detail & Related papers (2024-06-18T05:03:23Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models [15.896567445646784]
We introduce SimpleSafetyTests (SST) as a new test suite for rapidly and systematically identifying such critical safety risks.
The test suite comprises 100 test prompts across five harm areas that LLMs, for the vast majority of applications, should refuse to comply with.
While some of the models do not give a single unsafe response, most give unsafe responses to more than 20% of the prompts, with over 50% unsafe responses in the extreme.
arXiv Detail & Related papers (2023-11-14T18:33:43Z) - Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z) - ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z) - Certifying LLM Safety against Adversarial Prompting [75.19953634352258]
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt.
We introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees.
arXiv Detail & Related papers (2023-09-06T04:37:20Z) - APPLS: Evaluating Evaluation Metrics for Plain Language Summarization [18.379461020500525]
This study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for Plain Language Summarization (PLS).
We identify four PLS criteria from previous work and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect.
Using APPLS, we assess performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations.
arXiv Detail & Related papers (2023-05-23T17:59:19Z) - Untargeted Near-collision Attacks on Biometrics: Real-world Bounds and Theoretical Limits [0.0]
We focus on untargeted attacks that can be carried out both online and offline, and in both identification and verification modes.
We use the False Match Rate (FMR) and the False Positive Identification Rate (FPIR) to address the security of these systems.
The study of this metric space, and system parameters, gives us the complexity of untargeted attacks and the probability of a near-collision.
arXiv Detail & Related papers (2023-04-04T07:17:31Z) - REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z) - Bayes Security: A Not So Average Metric [20.60340368521067]
Security system designers favor worst-case security metrics, such as those derived from differential privacy (DP).
In this paper, we study Bayes security, a security metric inspired by the cryptographic advantage.
arXiv Detail & Related papers (2020-11-06T14:53:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.